Spectral clustering

In words...

Spectral clustering is an efficient and versatile method for difficult or complex clustering problems. It differs from center-based methods in that it does not build local models of the groups. As a result, the clustering is not based on a predefined model structure, but rather on an assumption on the connectivity of the points: informally, any two points of a group should be linked by a path that jumps from point to point with only small jumps, while no such path should exist between points of different groups.

An important feature of spectral clustering is that it can easily separate two arbitrarily shaped groups of points as soon as these are well enough separated, meaning that the smallest distance between points of different groups is significantly larger than the largest distance between any two neighboring points of a group.

Spectral clustering computations are based on a symmetric affinity matrix containing a measure of similarity for every pair of points. While the default choice is a similarity measure based on the Euclidean distance, this design makes the method very versatile: it suffices to define a new similarity measure suited to a problem to dedicate the method to this problem and its data distribution.

Spectral clustering can be interpreted as a relaxation of the normalized-cut problem, a particularly difficult instance of graph-cut. As such, it offers a computational advantage as it relies exclusively on standard linear algebra operations.

In pictures...





In maths...

Algorithm

Given a data set $\{\g x_i\}_{i=1}^N$ to cluster into $n$ groups, spectral clustering applies the following procedure.

  1. Compute the symmetric affinity matrix $\g A \in \R^{N\times N}$ of entries $$ A_{ij} = A_{ji} = \begin{cases} s(\g x_i, \g x_j), & \mbox{if } i\neq j\\ 0, & \mbox{otherwise}, \end{cases} $$ where $s(\g x_i, \g x_j) \geq 0$ is a symmetric measure of similarity between $\g x_i$ and $\g x_j$.
  2. Let $\g D = diag(\g A\g 1)$ and $\g L = \g D - \g A$.
  3. Compute the $n$ eigenvectors $\g u_1, \dots, \g u_n$ of the matrix $\g L$ associated with its $n$ smallest eigenvalues.
  4. Apply a standard clustering method, such as $K$-means, to cluster the $N$ rows of $\g U = [\g u_1,\dots, \g u_n]$. This yields the final clustering of the data, as the $i$th row of $\g U$ represents the data point $\g x_i$ in the eigenspace spanned by these eigenvectors.
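
Below is a minimal sketch of this procedure in Python, assuming NumPy and scikit-learn are available (the function name spectral_clustering and the choice of scikit-learn's KMeans are illustrative, not part of the algorithm itself). The affinity matrix $\g A$ is assumed to be precomputed with one of the similarity measures described further below.

```python
import numpy as np
from sklearn.cluster import KMeans   # any K-means implementation would do

def spectral_clustering(A, n_clusters):
    """Unnormalized spectral clustering from a precomputed N x N affinity matrix A."""
    # Step 2: degree matrix D = diag(A 1) and graph Laplacian L = D - A.
    D = np.diag(A.sum(axis=1))
    L = D - A

    # Step 3: eigenvectors of L associated with the n smallest eigenvalues
    # (np.linalg.eigh returns eigenpairs sorted by ascending eigenvalue).
    _, eigvecs = np.linalg.eigh(L)
    U = eigvecs[:, :n_clusters]

    # Step 4: K-means on the N rows of U gives the final cluster labels.
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(U)
```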

Why it works

The affinity matrix $\g A$ can be interpreted as the adjacency matrix of the so-called similarity graph $G=(V,E)$, an undirected weighted graph of vertices $V = \{\g x_i\}_{i=1}^N$ and edges $E$. Points with 0 similarity appear disconnected in this graph, while positive similarities weight the edges between vertices representing the data points.

The second step of the algorithm computes the graph Laplacian $$ \g L = \g D - \g A , $$ where $\g D$ is the degree matrix, i.e., a diagonal matrix of entries $D_{ii}$ corresponding to the sum of the weights of the edges connected to the vertex $V_i$: $\g D = diag(\g A\g 1)$. The graph Laplacian has a number of important properties, including the following.

  1. For any vector $\g v\in\R^N$, $$ \g v^T \g L \g v = \frac{1}{2}\sum_{i=1}^N \sum_{j=1}^N A_{ij} ( v_i - v_j)^2. $$

    Proof: \begin{align*} \g v^T \g L \g v & = \g v^T \g D \g v - \g v^T \g A \g v \\ & = \sum_{i=1}^N D_{ii} v_i^2 - \sum_{i=1}^N \sum_{j=1}^N A_{ij} v_i v_j \\ & = \frac{1}{2}\sum_{i=1}^N D_{ii} v_i^2 + \frac{1}{2} \sum_{j=1}^N D_{jj} v_j^2 - \sum_{i=1}^N \sum_{j=1}^N A_{ij} v_i v_j\\ & = \frac{1}{2}\left(\sum_{i=1}^N \sum_{j=1}^N A_{ij} v_i^2 - 2\sum_{i=1}^N \sum_{j=1}^N A_{ij} v_i v_j + \sum_{j=1}^N \sum_{i=1}^N A_{ij} v_j^2 \right) & (\mbox{due to } D_{ii} = \sum_{j=1}^N A_{ij} ) \\ & = \frac{1}{2}\sum_{i=1}^N \sum_{j=1}^N A_{ij} \left( v_i^2 - 2 v_i v_j + v_j^2 \right) \\ & = \frac{1}{2}\sum_{i=1}^N \sum_{j=1}^N A_{ij} \left( v_i - v_j \right)^2 \end{align*}

  2. $\g L$ is symmetric and positive semidefinite, thus it has $N$ non-negative and real-valued eigenvalues.

    Proof: $\g L$ is symmetric since $\g A$ is symmetric and $\g D$ is diagonal. By definition of the affinity matrix, $A_{ij} \geq 0$, $i=1,\dots,N$, $j=1,\dots,N$. Thus, by Property 1 above, $\g v^T \g L \g v \geq 0$ for all $\g v\in\R^N$, i.e., $\g L$ is positive semidefinite. As a symmetric positive semidefinite matrix, it has $N$ real-valued and non-negative eigenvalues.

  3. The smallest eigenvalue of $\g L$ is $\lambda_1 = 0$ and the corresponding eigenvector is $\g u_1 = \g 1$.

    Proof: We can check that $\g u_1$ is an eigenvector with eigenvalue $\lambda_1$: $$\g L \g u_1 = \g L \g 1 = \g D \g 1 - \g A \g 1 = \g A \g 1 - \g A \g 1 = \g 0 = \lambda_1 \g u_1,$$ since $\g D \g 1$ is the vector of diagonal entries of $\g D = diag(\g A\g 1)$, i.e., $\g A \g 1$. In addition, by Property 2, there cannot be negative eigenvalues and thus $\lambda_1$ is the smallest one.
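
These three properties can be checked numerically; the sketch below, assuming NumPy, builds a small random symmetric affinity matrix and verifies Properties 1-3.

```python
import numpy as np

rng = np.random.default_rng(0)

# A random symmetric affinity matrix with non-negative entries and zero diagonal.
A = rng.random((6, 6))
A = (A + A.T) / 2
np.fill_diagonal(A, 0.0)

D = np.diag(A.sum(axis=1))
L = D - A
v = rng.standard_normal(6)

# Property 1: v^T L v equals the weighted sum of squared differences.
print(np.isclose(v @ L @ v, 0.5 * np.sum(A * (v[:, None] - v[None, :]) ** 2)))  # True

# Property 2: all eigenvalues are real and non-negative.
eigvals = np.linalg.eigvalsh(L)
print(np.all(eigvals >= -1e-10))  # True

# Property 3: the smallest eigenvalue is 0, with the constant vector as eigenvector.
print(np.isclose(eigvals[0], 0.0), np.allclose(L @ np.ones(6), 0.0))  # True True
```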

In addition, we have the following relationship between the number of connected components of the graph and the spectrum of $\g L$.

Let $G$ be an undirected weighted graph with non-negative weights, $A_{ij} \geq 0$. Then, the multiplicity $n$ of the eigenvalue 0 of its graph Laplacian $\g L$ equals the number of connected components $G_1 \subseteq V, \dots, G_n\subseteq V$ in the graph. In addition, the eigenspace of eigenvalue 0 is spanned by the indicator vectors $\g 1_{G_1}, \dots, \g 1_{G_n}$ of these components, where $(\g 1_S)_i = 1$ if $i\in S$ and 0 otherwise.

Proof: Consider first the case $n=1$, i.e., when $G$ forms a single connected component. Then, if $\g v\in \R^N$ is an eigenvector of $\g L$ with eigenvalue 0, $\g L \g v = \g 0$ and $\g v^T \g L \g v = 0$. Thus, by the properties above, $$ \frac{1}{2}\sum_{i=1}^N \sum_{j=1}^N A_{ij} ( v_i - v_j)^2 = 0. $$ For this to hold with non-negative weights $A_{ij}$, all the terms of the sum must be 0, and thus $v_i = v_j$ for all pairs $(i,j)$ such that $A_{ij} \neq 0$. Since all pairs of vertices are connected by a path in the graph, this makes $\g v$ a constant vector, i.e., proportional to $\g 1$. Thus, the eigenvalue 0 has multiplicity 1 as there is no other linearly independent eigenvector associated with this value.
For the case $n\geq 2$, assume without loss of generality that the vertices are ordered according to the connected component they belong to. Then, the adjacency matrix is block-diagonal since, for all $i$, $A_{ij} = 0$ for any $j$ such that $V_j$ is not connected to $V_i$. As a direct consequence, the Laplacian is also block-diagonal: $$ \g L = \begin{bmatrix} \g L_1 & & & \\ & \g L_2 & & \\ & & \ddots & \\ & & & \g L_n \end{bmatrix} $$ with one block for each connected component. As for any block-diagonal matrix, its eigenvalues are the eigenvalues of its blocks $\g L_i$ and its eigenvectors are the eigenvectors of the $\g L_i$ filled with zeros on the entries aligned with the other blocks. In addition, each block $\g L_i$ is itself a proper graph Laplacian for the subgraph corresponding to the $i$th connected component $G_i$, and as such the constant vector $\g 1$ (of the size of the block) is one of its eigenvectors with eigenvalue 0.
Therefore, $\g L$ has eigenvalue 0 with multiplicity $n$ and the corresponding eigenvectors are indicator vectors of the connected components.
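
This result can also be illustrated numerically. The sketch below, assuming NumPy and SciPy, builds the affinity matrix of a graph with 3 connected components and checks that the eigenvalue 0 of its Laplacian has multiplicity 3, with eigenvectors that are piecewise constant over the components.

```python
import numpy as np
from scipy.linalg import block_diag

def complete_component(k):
    """Affinity of a fully connected component of k vertices with unit weights."""
    return np.ones((k, k)) - np.eye(k)

# Block-diagonal affinity matrix: 3 connected components of sizes 4, 3 and 5.
A = block_diag(complete_component(4), complete_component(3), complete_component(5))
L = np.diag(A.sum(axis=1)) - A

eigvals, eigvecs = np.linalg.eigh(L)

# Multiplicity of the eigenvalue 0 = number of connected components.
print(np.sum(np.isclose(eigvals, 0.0)))   # 3

# The corresponding eigenvectors span the indicator vectors 1_{G_1}, 1_{G_2}, 1_{G_3},
# i.e., they are piecewise constant over the components.
print(np.round(eigvecs[:, :3], 2))
```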

These properties give us access to the indicator vectors of the graph components simply by computing a few eigenvectors of its graph Laplacian. This means that, if we can build the graph such that the connected components represent the desired clusters, we can efficiently recover the desired clustering of the data points.

Therefore, the most difficult part of applying spectral clustering consists in crafting a suitable similarity measure in order to build a relevant similarity graph with a clustered structure, or, equivalently, a block-diagonal affinity matrix. In most practical cases, the affinity matrix does not have this ideal block-diagonal structure, but it can nonetheless be close to it, meaning that it is the sum of a block-diagonal matrix and a matrix with small entries. In this case, $K$-means should still be able to recover the correct clustering in the eigenspace, even though the eigenvectors are not exactly piecewise constant over the clusters.
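
The sketch below illustrates this approximate setting on assumed toy data (two well-separated Gaussian blobs): the Gaussian RBF affinity matrix is only approximately block-diagonal, yet the relevant eigenvector of the Laplacian remains nearly piecewise constant over the groups and the clustering is recovered.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two well-separated groups of 20 points each; the Gaussian RBF affinity (sigma = 1)
# is only approximately block-diagonal (cross-group entries are tiny but nonzero).
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(5.0, 0.3, (20, 2))])
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
A = np.exp(-sq_dists / 2.0)
np.fill_diagonal(A, 0.0)

L = np.diag(A.sum(axis=1)) - A
_, eigvecs = np.linalg.eigh(L)

# The second eigenvector is approximately piecewise constant over the two groups,
# so even a simple sign threshold recovers the clustering here.
labels = (eigvecs[:, 1] > 0).astype(int)
print(labels)  # 20 points with one label followed by 20 points with the other
```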

Common similarity measures

The most common similarity measures $s(\g x_i, \g x_j)$ are based on the Euclidean distance $\|\g x_i - \g x_j\|$. However, in all the possibilities outlined below, one can easily substitute another metric for this distance, for instance to deal with particular point distributions in more complex problems.

$\epsilon$-neighborhood graph

An $\epsilon$-neighborhood graph is constructed by choosing $$ s(\g x_i, \g x_j) = \begin{cases} 1, & \mbox{if } \|\g x_i - \g x_j\| \leq \epsilon\\ 0, &\mbox{otherwise,} \end{cases} $$ where $\epsilon$ is a parameter tuning the size of the neighborhood within which points are considered as similar.
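
A minimal sketch of this construction, assuming NumPy (the function name epsilon_affinity is an illustrative choice):

```python
import numpy as np

def epsilon_affinity(X, eps):
    """Affinity matrix of the epsilon-neighborhood graph: 1 within distance eps, 0 otherwise."""
    dists = np.sqrt(np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
    A = (dists <= eps).astype(float)
    np.fill_diagonal(A, 0.0)   # zero diagonal, as in the definition of the affinity matrix
    return A
```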

$K$-nearest neighbors graph

A $K$-nearest neighbors graph is constructed by choosing $$ s(\g x_i, \g x_j) = \begin{cases} s^\prime (\g x_i, \g x_j), & \mbox{if } \g x_j \in \mathcal{N}_K(\g x_i) \mbox{ or } \g x_i \in \mathcal{N}_K(\g x_j) \\ 0, &\mbox{otherwise,} \end{cases} $$ where $\mathcal{N}_K(\g x)$ is the set of $K$-nearest neighbors of $\g x$ in the data set and $s^\prime (\g x_i, \g x_j)$ is a continuous similarity function such as the one below.

Another possibility based on $K$-nearest neighbors is to use $$ s(\g x_i, \g x_j) = \begin{cases} s^\prime (\g x_i, \g x_j), & \mbox{if } \g x_j \in \mathcal{N}_K(\g x_i) \mbox{ and } \g x_i \in \mathcal{N}_K(\g x_j) \\ 0, &\mbox{otherwise,} \end{cases} $$ which results in a mutual $K$-nearest neighbors graph.
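
A sketch of these two constructions, assuming NumPy; for brevity, $s^\prime$ is taken constant equal to 1, but any continuous similarity such as the Gaussian RBF below could be substituted.

```python
import numpy as np

def knn_affinity(X, K, mutual=False):
    """Affinity matrix of the (mutual) K-nearest neighbors graph, with s' = 1."""
    dists = np.sqrt(np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
    np.fill_diagonal(dists, np.inf)                    # a point is not its own neighbor
    # neighbors[i, j] is True when x_j is among the K nearest neighbors of x_i.
    ranks = np.argsort(np.argsort(dists, axis=1), axis=1)
    neighbors = ranks < K
    # "or" for the K-nearest neighbors graph, "and" for the mutual one.
    A = (neighbors & neighbors.T) if mutual else (neighbors | neighbors.T)
    return A.astype(float)
```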

Fully connected graph with the Gaussian RBF similarity function

A fully connected graph is obtained with the Gaussian RBF similarity function (or the Gaussian RBF kernel) $$ s(\g x_i, \g x_j) = \exp\left(\frac{-\|\g x_i - \g x_j\|^2}{2\sigma^2} \right), $$ where $\sigma$ is the parameter tuning the bandwidth of the Gaussian, i.e., the width of the neighborhood within which the points are considered as similar.

Despite resulting in a fully connected graph, this similarity measure is often used in practice, as most entries of the affinity matrix will be close to 0 and thus play a role similar to that of nonexistent edges.
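
A sketch of this similarity measure, assuming NumPy; its output can be fed directly to the spectral_clustering sketch given earlier.

```python
import numpy as np

def rbf_affinity(X, sigma):
    """Affinity matrix of the fully connected graph with the Gaussian RBF similarity."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    A = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(A, 0.0)   # zero diagonal, as in the definition of the affinity matrix
    return A

# Example use with the spectral_clustering sketch above (X is an N x d data matrix):
# labels = spectral_clustering(rbf_affinity(X, sigma=1.0), n_clusters=2)
```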

Variants

Variants of spectral clustering use the eigenvectors of normalized versions of the graph Laplacian:
  1. The symmetric normalized Laplacian: $\g L_{sym} = \g D^{-1/2} \g L \g D^{-1/2} = \g I -\g D^{-1/2} \g A \g D^{-1/2}$ .
  2. The nonsymmetric one: $\g L_{nonsym} = \g D^{-1} \g L = \g I - \g D^{-1}\g A$.
The symmetric normalization requires the addition of another normalization step in the algorithm: $K$-means is applied to $\g U_n$ built by normalizing the rows $\g U_i^T$, $i=1,\dots,N$, of $\g U = [\g u_1, \dots, \g u_n]$ as $$ \g U_n = \begin{bmatrix} \frac{\g U_1^T }{\| \g U_1\| } \\ \vdots\\ \frac{\g U_N^T }{\| \g U_N\| } \end{bmatrix} = \begin{bmatrix} \frac{u_{11}}{\sqrt{\sum_{j=1}^n u_{j1}^2 } } & \dots & \frac{u_{n1}}{\sqrt{\sum_{j=1}^n u_{j1}^2 } } \\ \vdots & & \vdots \\ \frac{u_{1N}}{\sqrt{\sum_{j=1}^n u_{jN}^2 } } & \dots & \frac{u_{nN}}{\sqrt{\sum_{j=1}^n u_{jN}^2 } } \end{bmatrix}, $$ where $u_{ji}$ denotes the $i$th entry of the eigenvector $\g u_j$.
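
A sketch of the variant based on the symmetric normalized Laplacian, assuming NumPy and scikit-learn (function and variable names are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def normalized_spectral_clustering(A, n_clusters):
    """Spectral clustering with the symmetric normalized Laplacian L_sym (minimal sketch)."""
    d = A.sum(axis=1)                                        # degrees, assumed nonzero
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt     # I - D^{-1/2} A D^{-1/2}

    # Eigenvectors of L_sym associated with the n smallest eigenvalues.
    _, eigvecs = np.linalg.eigh(L_sym)
    U = eigvecs[:, :n_clusters]

    # Additional step: normalize each row of U to unit norm before K-means.
    U_n = U / np.linalg.norm(U, axis=1, keepdims=True)
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(U_n)
```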

Notes

The variant with the symmetric normalized Laplacian was proposed by Ng, Jordan, and Weiss (2002), while the nonsymmetric one is discussed by Shi and Malik (2000). The tutorial on spectral clustering by von Luxburg (2007) provides historical notes on the original spectral clustering algorithm using the unnormalized Laplacian.