K-means is perhaps the most popular method for center-based clustering. It provides a fast algorithm based on a simple idea.

Starting from a random initialization of the centers of the groups, K-means assigns each data point to the group with the closest center. Then, the centers are recomputed from the new groups of points and this process is iterated until convergence.

This scheme can be shown to monotonically decrease the sum of squared distances from the points to their group centers, thereby producing concentrated groups of points with low variance.
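This monotone decrease can be checked numerically by tracking the cost across iterations (a minimal sketch in Python with NumPy; the synthetic data and the choice of three groups are arbitrary):

```python
import numpy as np

# Track the K-means cost J (sum of squared distances from each point
# to its assigned center) across iterations; it should never increase.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))   # synthetic 2-D data
n = 3                           # number of groups (arbitrary choice)
c = X[rng.choice(len(X), size=n, replace=False)].copy()

costs = []
for _ in range(20):
    # Assignment step: each point goes to its closest center.
    d2 = ((X[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
    y = d2.argmin(axis=1)
    costs.append(d2[np.arange(len(X)), y].sum())
    # Update step: each nonempty group's center becomes its mean.
    for j in range(n):
        if (y == j).any():
            c[j] = X[y == j].mean(axis=0)
# costs is now a nonincreasing sequence
```

Each of the two steps can only lower (or preserve) the cost: reassigning a point to a closer center cannot increase its squared distance, and the mean minimizes the sum of squared distances within a group.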

The simplicity of the two steps above leads to a very computationally efficient algorithm. However, the resulting clustering need not be optimal in terms of minimizing the sum of squared distances from points to their group-centers. In practice, the algorithm is typically restarted multiple times from different initializations and the best clustering (according to the sum of squared distances criterion) is retained.

This is another example of the curse of dimensionality: the higher the dimension, the harder it is to sample the initialization space sufficiently, and the less likely the algorithm is to find a global minimum.

*Interactive demo: an unlabeled data set to be clustered into groups. Click in the plot to add points and experiment with different numbers of groups.*

K-means can be interpreted as a block-coordinate descent method that alternately optimizes over the centers $\g c_j$ and the labels $y_i$. Specifically, it performs the following steps.

1. Randomly initialize the centers $\g c_j \in \R^d$, $j=1,\dots,n$, where $n$ is the chosen number of groups.
2. Classify the data points with $$ y_i = \arg\min_{j\in\{1,\dots,n\} } \ \| \g x_i - \g c_{j} \|_2^2 . $$
3. Recompute the group centers as the means of the points in the groups: $$ \g c_j = \frac{1}{\sum_{i=1}^N \I{y_i = j} } \sum_{i \,:\, y_i=j} \g x_i,\quad j=1,\dots,n . $$
4. Loop from step 2 until convergence, i.e., while changes in the classification occur or the updates of the $\g c_j$ are numerically significant.
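The steps above can be sketched as follows in Python with NumPy (the function name `kmeans` and the initialization of the centers as randomly chosen data points are illustrative choices, not prescribed by the description above):

```python
import numpy as np

def kmeans(X, n, rng=None, max_iter=100):
    """Cluster the (N, d) data matrix X into n groups.

    Sketch of the four steps above; the centers are initialized
    as n randomly chosen data points (one common choice).
    """
    rng = np.random.default_rng(rng)
    N = len(X)
    # Step 1: random initialization of the centers.
    c = X[rng.choice(N, size=n, replace=False)].copy()
    y = np.full(N, -1)
    for _ in range(max_iter):
        # Step 2: classify each point with the closest center.
        d2 = ((X[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
        y_new = d2.argmin(axis=1)
        # Step 4: stop when the classification no longer changes.
        if np.array_equal(y_new, y):
            break
        y = y_new
        # Step 3: recompute each nonempty group's center as the mean.
        for j in range(n):
            if (y == j).any():
                c[j] = X[y == j].mean(axis=0)
    return c, y
```

The pairwise squared distances are computed in one vectorized broadcast of shape $(N, n)$, so each iteration costs $O(Nnd)$ operations.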

Since the block-coordinate descent approach to the nonconvex optimization problem above is only guaranteed to reach a local minimum, a typical practice is to restart the algorithm multiple times from different initializations:

- Choose a number of restarts, e.g., $r=100$.
- For $k=1$ to $r$,
    - apply the algorithm above;
    - compute the clustering cost $$ J(k) = \sum_{i=1}^N \| \g x_i - \g c_{y_i} \|_2^2 . $$
- Return the centers and labels obtained at restart $k^*$ with $$ k^* = \arg\min_{k=1,\dots,r} J(k). $$
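The restart procedure can be sketched as follows (Python with NumPy; the single K-means run is inlined so the snippet is self-contained, and the function name `kmeans_best_of` is an assumption):

```python
import numpy as np

def kmeans_best_of(X, n, r=100, seed=0):
    """Run K-means r times and keep the clustering with the lowest cost J."""
    rng = np.random.default_rng(seed)
    best_J, best_c, best_y = np.inf, None, None
    for _ in range(r):
        # One run of K-means from a random initialization.
        c = X[rng.choice(len(X), size=n, replace=False)].copy()
        y = np.full(len(X), -1)
        for _ in range(100):
            d2 = ((X[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
            y_new = d2.argmin(axis=1)
            if np.array_equal(y_new, y):
                break
            y = y_new
            for j in range(n):
                if (y == j).any():
                    c[j] = X[y == j].mean(axis=0)
        # Clustering cost J(k): sum of squared distances to the centers.
        J = ((X - c[y]) ** 2).sum()
        if J < best_J:
            best_J, best_c, best_y = J, c, y
    return best_c, best_y, best_J
```

Only the best run's centers, labels, and cost are kept; the individual runs are independent, so this loop is also trivially parallelizable.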

A group can become empty during the iterations (no data point is classified in it), in which case the mean in step 3 is undefined. Possible remedies are:

- Terminate and return an error (or an infinite cost in a multiple-restart procedure);
- Reset the group center at random;
- Reset the group center to the point with maximal distance from its own group center;
- Drop the group and continue with $n\leftarrow n-1$.
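For instance, the random-reset remedy can be folded into the center-update step (a sketch in Python with NumPy; the helper name `update_centers` is an assumption):

```python
import numpy as np

def update_centers(X, y, c, rng):
    """Recompute the group centers as the group means; reset any
    empty group's center to a randomly chosen data point."""
    c = c.copy()
    for j in range(len(c)):
        members = X[y == j]
        if len(members) > 0:
            c[j] = members.mean(axis=0)
        else:
            # Empty group: its mean is undefined, so reset the
            # center at random (here, to a random data point).
            c[j] = X[rng.integers(len(X))]
    return c
```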