K-means is perhaps the most popular method for center-based clustering. It provides a fast algorithm based on a simple idea.

Starting from a random initialization of the centers of the groups, K-means assigns each data point to the group with the closest center. Then, the centers are recomputed from the new groups of points and this process is iterated until convergence.

This scheme can be shown to monotonically decrease the sum of squared distances from the points to their group centers, thereby producing concentrated groups of points with low variance.
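This monotone decrease can be checked numerically by tracking the cost across iterations (a minimal sketch in Python with NumPy; the synthetic data and the choice of three groups are arbitrary):

```python
import numpy as np

# Track the K-means cost J (sum of squared distances from each point
# to its assigned center) across iterations; it should never increase.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))   # synthetic 2-D data
n = 3                           # number of groups (arbitrary choice)
c = X[rng.choice(len(X), size=n, replace=False)].copy()

costs = []
for _ in range(20):
    # Assignment step: each point goes to its closest center.
    d2 = ((X[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
    y = d2.argmin(axis=1)
    costs.append(d2[np.arange(len(X)), y].sum())
    # Update step: each nonempty group's center becomes its mean.
    for j in range(n):
        if (y == j).any():
            c[j] = X[y == j].mean(axis=0)
# costs is now a nonincreasing sequence
```

Each of the two steps can only lower (or preserve) the cost: reassigning a point to a closer center cannot increase its squared distance, and the mean minimizes the sum of squared distances within a group.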

The simplicity of the two steps above leads to a very computationally efficient algorithm. However, the resulting clustering need not be optimal in terms of minimizing the sum of squared distances from points to their group-centers. In practice, the algorithm is typically restarted multiple times from different initializations and the best clustering (according to the sum of squared distances criterion) is retained.

This is another example of the curse of dimensionality: the higher the dimension, the harder it is to sample the initialization space sufficiently, and the less likely the algorithm is to find a global minimum.

*Interactive demo: an unlabeled data set to be clustered into groups. Click in the plot to add points and experiment with different numbers of groups.*

K-means can be interpreted as a block-coordinate descent method that alternately optimizes over the centers $\g c_j$ and the labels $y_i$. Specifically, it performs the following steps.

1. Randomly initialize the centers $\g c_j \in \R^d$, $j=1,\dots,n$, where $n$ is the chosen number of groups.
2. Classify the data points with $$ y_i = \arg\min_{j\in\{1,\dots,n\} } \ \| \g x_i - \g c_{j} \|_2^2 . $$
3. Recompute the group centers as the means of the points in the groups: $$ \g c_j = \frac{1}{\sum_{i=1}^N \I{y_i = j} } \sum_{i \,:\, y_i=j} \g x_i,\quad j=1,\dots,n . $$
4. Loop from step 2 until convergence, i.e., while changes in the classification occur or the updates of the $\g c_j$ are numerically significant.
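The steps above can be sketched as follows in Python with NumPy (the function name `kmeans` and the initialization of the centers as randomly chosen data points are illustrative choices, not prescribed by the description above):

```python
import numpy as np

def kmeans(X, n, rng=None, max_iter=100):
    """Cluster the (N, d) data matrix X into n groups.

    Sketch of the four steps above; the centers are initialized
    as n randomly chosen data points (one common choice).
    """
    rng = np.random.default_rng(rng)
    N = len(X)
    # Step 1: random initialization of the centers.
    c = X[rng.choice(N, size=n, replace=False)].copy()
    y = np.full(N, -1)
    for _ in range(max_iter):
        # Step 2: classify each point with the closest center.
        d2 = ((X[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
        y_new = d2.argmin(axis=1)
        # Step 4: stop when the classification no longer changes.
        if np.array_equal(y_new, y):
            break
        y = y_new
        # Step 3: recompute each nonempty group's center as the mean.
        for j in range(n):
            if (y == j).any():
                c[j] = X[y == j].mean(axis=0)
    return c, y
```

The pairwise squared distances are computed in one vectorized broadcast of shape $(N, n)$, so each iteration costs $O(Nnd)$ operations.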

Since the block-coordinate descent approach to the nonconvex optimization problem above is only guaranteed to reach a local minimum, a typical practice is to restart the algorithm multiple times from different initializations:

- Choose a number of restarts, e.g., $r=100$.
- For $k=1$ to $r$,
    - apply the algorithm above;
    - compute the clustering cost $$ J(k) = \sum_{i=1}^N \| \g x_i - \g c_{y_i} \|_2^2 . $$
- Return the centers and labels obtained at restart $k^*$ with $$ k^* = \arg\min_{k=1,\dots,r} J(k). $$
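The restart procedure can be sketched as follows (Python with NumPy; the single K-means run is inlined so the snippet is self-contained, and the function name `kmeans_best_of` is an assumption):

```python
import numpy as np

def kmeans_best_of(X, n, r=100, seed=0):
    """Run K-means r times and keep the clustering with the lowest cost J."""
    rng = np.random.default_rng(seed)
    best_J, best_c, best_y = np.inf, None, None
    for _ in range(r):
        # One run of K-means from a random initialization.
        c = X[rng.choice(len(X), size=n, replace=False)].copy()
        y = np.full(len(X), -1)
        for _ in range(100):
            d2 = ((X[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
            y_new = d2.argmin(axis=1)
            if np.array_equal(y_new, y):
                break
            y = y_new
            for j in range(n):
                if (y == j).any():
                    c[j] = X[y == j].mean(axis=0)
        # Clustering cost J(k): sum of squared distances to the centers.
        J = ((X - c[y]) ** 2).sum()
        if J < best_J:
            best_J, best_c, best_y = J, c, y
    return best_c, best_y, best_J
```

Only the best run's centers, labels, and cost are kept; the individual runs are independent, so this loop is also trivially parallelizable.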

A group can become empty during the iterations (no data point is classified in it), in which case the mean in step 3 is undefined. Possible remedies are:

- Terminate and return an error (or an infinite cost in a multiple-restart procedure);
- Reset the group center at random;
- Reset the group center to the point with maximal distance from its own group center;
- Drop the group and continue with $n\leftarrow n-1$.
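For instance, the random-reset remedy can be folded into the center-update step (a sketch in Python with NumPy; the helper name `update_centers` is an assumption):

```python
import numpy as np

def update_centers(X, y, c, rng):
    """Recompute the group centers as the group means; reset any
    empty group's center to a randomly chosen data point."""
    c = c.copy()
    for j in range(len(c)):
        members = X[y == j]
        if len(members) > 0:
            c[j] = members.mean(axis=0)
        else:
            # Empty group: its mean is undefined, so reset the
            # center at random (here, to a random data point).
            c[j] = X[rng.integers(len(X))]
    return c
```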