VC-dimension of hyperplanes

In words...

In binary classification, the VC-dimension is a measure of the capacity of a function class that can be used to derive error bounds even for infinite function classes.

The VC-dimension of the set of linear classifiers grows linearly with the data dimension and equals the number of parameters of the classifiers.

More precisely, if we consider the restricted class of linear classifiers with separating hyperplanes passing through the origin, its VC-dimension is precisely the dimension of the input space. If we assume the usual definition of linear classifiers and allow for an offset between the hyperplane and the origin, then the VC-dimension is the dimension of the input space plus one. Note that in this case the classifiers have an additional parameter (the offset) and thus the VC-dimension equals the number of parameters in both cases.

In pictures...

Linear classifiers with hyperplanes passing through the origin

Here, we illustrate the VC-dimension of the class of linear classifiers in $\R^2$ by showing how they can shatter a set of 2 points.
Here is the list of all possible labelings of these 2 points:
$$\begin{array}{cc} y_1 & y_2 \\ \hline +1 & +1 \\ +1 & -1 \\ -1 & +1 \\ -1 & -1 \end{array}$$

Exercise: For each labeling, try to find a linear classifier (i.e., a straight line passing through the origin) that produces this classification.

Since there is a linear classifier for every possible labeling of these 2 points, the VC-dimension is at least 2.
In addition, since no set of 3 points can be shattered by lines through the origin, the VC-dimension is less than 3 and thus equals 2.
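This shattering claim is easy to check numerically. Below is a minimal sketch, assuming NumPy; the random search over $\g w$ is a heuristic check, not a proof, and the helper name `can_realize` is illustrative:

```python
# Numerical check: brute-force search over random weight vectors w for a
# line through the origin realizing each of the four labelings of the two
# canonical basis points of R^2.
import itertools
import numpy as np

rng = np.random.default_rng(0)

def can_realize(points, labels, trials=10_000):
    """Heuristic search for w with sign(points @ w) == labels."""
    for _ in range(trials):
        w = rng.standard_normal(points.shape[1])
        if np.all(np.sign(points @ w) == labels):
            return True
    return False

X = np.eye(2)  # the two points x_1 = (1, 0) and x_2 = (0, 1)
for y in itertools.product([-1, 1], repeat=2):
    print(y, can_realize(X, np.array(y)))  # True for all 4 labelings
```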



Hyperplanes with an offset (affine classifiers)

Here, we illustrate the VC-dimension of the class of affine classifiers in $\R^2$, which correspond to hyperplanes with an offset. Such hyperplanes can shatter a set of 3 points, but no more.
Here is a list of all possible labelings of these 3 points:
$$\begin{array}{ccc} y_1 & y_2 & y_3 \\ \hline +1 & +1 & +1 \\ +1 & +1 & -1 \\ +1 & -1 & +1 \\ +1 & -1 & -1 \\ -1 & +1 & +1 \\ -1 & +1 & -1 \\ -1 & -1 & +1 \\ -1 & -1 & -1 \end{array}$$

Exercise: For each labeling, try to find an affine classifier (i.e., a line) that produces this classification.

Since there is an affine classifier for every possible combination of labels of these 3 points, the VC-dimension is at least 3.
In addition, no set of 4 points can be shattered: whatever the positions of 4 points, at least one labeling cannot be produced by an affine classifier. Thus the VC-dimension is less than 4 and equals 3.
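The same kind of numerical check can be run for the affine case. The sketch below, again assuming NumPy, verifies that three points are shattered, and that the classical XOR labeling of four points resists every random affine classifier it tries (a failed search suggests, but does not prove, unrealizability):

```python
# Heuristic check for affine classifiers sign(w @ x + b): three points
# are shattered, while the XOR labeling of four points is never found.
import itertools
import numpy as np

rng = np.random.default_rng(0)

def can_realize_affine(points, labels, trials=20_000):
    for _ in range(trials):
        w = rng.standard_normal(points.shape[1])
        b = rng.standard_normal()
        if np.all(np.sign(points @ w + b) == labels):
            return True
    return False

three = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
print(all(can_realize_affine(three, np.array(y))
          for y in itertools.product([-1, 1], repeat=3)))  # True

# The XOR configuration: diagonal pairs get opposite labels.
four = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
print(can_realize_affine(four, np.array([1, 1, -1, -1])))  # False
```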






In maths...

We will show the following result.

The VC-dimension of the set of linear classifiers in $\R^d$, $$ \F = \{ f\ :\ f(\g x) = \sign( \g w^T \g x ) , \g w \in \R^d\}, $$ is $$ \VC(\F) = d. $$

Step 1: lower bound on the VC-dimension

Consider the set of points $S = \{\g x_1,\dots, \g x_d\}$ made of the vectors of the canonical basis of $\R^d$, i.e., the $j$th component of $\g x_i$ is given by $x_{ij} = \delta_{i,j}$, where the Kronecker symbol $\delta_{i,j}$ is 1 when $i=j$ and $0$ otherwise: $$ \g x_1 = \begin{bmatrix}1\\0\\\vdots\\0\end{bmatrix},\quad \g x_2 = \begin{bmatrix}0\\1\\0\\\vdots\\0\end{bmatrix},\quad \dots,\ \g x_d = \begin{bmatrix}0\\\vdots\\0\\1\end{bmatrix}. $$ This set of points can be shattered by $\F$, since all possible labelings $(y_1,\dots,y_d) \in \{-1,+1\}^d$ of $S$ can be produced by the linear classifiers $$ f(\g x) = \sign\left( \sum_{i=1}^d y_i \g x_i^T \g x \right) $$ using $\g w = \sum_{i=1}^d y_i \g x_i$. To see this, note that $\g x_i^T \g x_j = 0$ if $i\neq j$ and $1$ otherwise, which implies $f(\g x_j) = \sign(y_j) = y_j$. (see the pictures above for an illustration of these classifiers on this particular data set).
Since there is a set of $d$ points shattered by $\F$, we have $$ \VC(\F) \geq d. $$
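The explicit construction of Step 1 can be checked directly in code. A minimal sketch, assuming NumPy, with $d$ kept small so that all $2^d$ labelings can be enumerated:

```python
# Step 1 in code: with the canonical basis as data, the explicit choice
# w = sum_i y_i x_i realizes every labeling y in {-1,+1}^d.
import itertools
import numpy as np

d = 4
X = np.eye(d)  # rows are the canonical basis vectors x_1, ..., x_d
for y in itertools.product([-1, 1], repeat=d):
    y = np.array(y)
    w = X.T @ y  # w = sum_i y_i x_i
    assert np.all(np.sign(X @ w) == y)
print("all", 2 ** d, "labelings realized")
```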

Step 2: upper bound

Assume now that there is a set $S = \{\g x_1,\dots, \g x_{d+1}\}$ of $d+1$ points shattered by $\F$. This implies that for each of the $2^{d+1}$ possible labelings there is a parameter vector $\g w_k$ such that $f_k(\g x) = \sign(\g w_k^T\g x)$ produces this labeling. Consider the matrix concatenating all the real-valued outputs of these classifiers on $S$: $$ \g H = \begin{bmatrix} \g w_1^T \g x_1 & \g w_2^T \g x_1 & \dots & \g w_{2^{d+1}}^T \g x_1 \\ \vdots & \vdots & \ddots & \vdots \\ \g w_1^T \g x_{d+1} & \g w_2^T \g x_{d+1} & \dots & \g w_{2^{d+1}}^T \g x_{d+1} \end{bmatrix} = \g X \g W $$ where we introduced $\g X = [\g x_1, \dots, \g x_{d+1}]^T$ and $\g W = [\g w_1,\dots, \g w_{2^{d+1}}]$.
On the one hand, every labeling, i.e., every combination of signs, appears as a column of $\sign(\g H)$. On the other hand, the rows of $\g H$ are linearly independent, since there is no vector $\g a\neq \g 0$ such that $\g a^T \g H = \g 0^T$.
To see this, note that the $k$th entry of $\g a^T \g H$ equals $\g a^T \g X \g w_k$, and that there is a $k$ for which $\sign(\g X\g w_k)$ matches $\sign(\g a)$ on every nonzero entry of $\g a$, i.e., the two vectors of the dot product agree in sign. As a result, $\g a^T \g X \g w_k$ is a sum of nonnegative terms with at least one positive term (since $\g a \neq \g 0$), and thus cannot be 0.

The linear independence of the $d+1$ rows of $\g H$ implies that $rank(\g H) = d+1$. But since $\g H = \g X \g W$, we also have $$ rank(\g H) \leq \min\{ rank(\g X), rank(\g W) \} \leq d $$ because $\g X$ has only $d$ columns and the rank of a matrix cannot exceed its smallest dimension. These two statements about the rank of $\g H$ contradict each other, so the assumption cannot hold: there is no set of $d+1$ points shattered by $\F$ and $$ \VC(\F) \leq d. $$ Since we already proved that $\VC(\F) \geq d$, we conclude that $$ \VC(\F) = d. $$
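The counting consequence of this argument can also be observed numerically: for $d+1$ points in $\R^d$, sampling many weight vectors never produces all $2^{d+1}$ sign patterns. A small illustration with $d = 2$, assuming NumPy (for three generic points in $\R^2$, the lines $\g w^T \g x_i = 0$ cut the parameter space into 6 cones, hence 6 reachable patterns):

```python
# For 3 points in R^2, homogeneous linear classifiers reach strictly
# fewer than 2^3 = 8 sign patterns, consistent with VC(F) = 2 < 3.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 2))  # three generic points in R^2
patterns = {tuple(np.sign(X @ rng.standard_normal(2)).astype(int))
            for _ in range(100_000)}
print(len(patterns))  # 6 for points in general position, never 8
```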

Affine classifiers

By a slight abuse of terminology, the term "linear classifiers" is usually employed to refer to affine classifiers of the class $$ \F^\prime = \{ f\ :\ f(\g x) = \sign( \g w^T \g x + b ) , \g w \in \R^d,\ b\in\R\}, $$ with an additional parameter introducing an offset between the separating hyperplane and the origin. For this class, we have the following result.

The VC-dimension of the set of affine classifiers $\F^\prime$ is $$ \VC(\F^\prime) = d+1. $$

In order to compute the VC-dimension of this class, we could first lower bound it as $\VC(\F^\prime) \geq \VC(\F) = d$ simply by noting that $\F \subset \F^\prime$. But in fact, more is true. In particular, we can show that $\VC(\F^\prime) \geq d+1$.
Consider again the set of points $S = \{\g x_1,\dots, \g x_d\}$ made of the vectors from the canonical basis of $\R^d$, and extend it with the origin: $S^\prime = S\cup\{\g x_{d+1}\}$ with $\g x_{d+1} = \g 0$. For any labeling $\g y=(y_1,\dots,y_{d+1})$ of $S^\prime$, choose the affine classifier $f^\prime\in\F^\prime$ as $$ f^\prime(\g x) = \sign\left( \sum_{i=1}^d ( y_i - y_{d+1} )\g x_i^T \g x + y_{d+1} \right), $$ i.e., with $\g w = \sum_{i=1}^d ( y_i - y_{d+1} )\g x_i$ and $b = y_{d+1}$. By computing the output of this classifier on $S^\prime$, we see that it produces the labeling $\g y$: $$ f^\prime(\g x_j) = \sign\left(\sum_{i=1}^d ( y_i - y_{d+1}) \g x_i^T \g x_j + y_{d+1} \right) = \begin{cases} \sign\left((y_j - y_{d+1}) + y_{d+1}\right) = y_j,& \mbox{if } j \leq d\\ \sign\left(y_{d+1} \right) = y_{d+1},& \mbox{if } j =d+1 . \end{cases} $$ (see the pictures above for an illustration of these classifications on this particular data set). Thus, there is a set of $d+1$ points, $S^\prime$, shattered by $\F^\prime$ and $$ \VC(\F^\prime) \geq d+1. $$
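As with Step 1, this construction is short enough to verify in code. A sketch, assuming NumPy, with a small $d$:

```python
# Affine lower bound in code: on the canonical basis plus the origin,
# w = sum_i (y_i - y_{d+1}) x_i and b = y_{d+1} realize every labeling.
import itertools
import numpy as np

d = 3
S = np.vstack([np.eye(d), np.zeros(d)])  # x_1, ..., x_d and x_{d+1} = 0
for y in itertools.product([-1, 1], repeat=d + 1):
    y = np.array(y)
    w = y[:d] - y[-1]  # = sum_i (y_i - y_{d+1}) x_i for the canonical basis
    b = y[-1]
    assert np.all(np.sign(S @ w + b) == y)
print("all", 2 ** (d + 1), "labelings realized")
```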

The upper bound on the VC-dimension is a straightforward consequence of the VC-dimension of linear classifiers and the fact that an affine classifier $f^\prime$ in $\R^d$ is a linear classifier $f$ operating in $\R^{d+1}$ on lifted inputs: $$ \forall \g x^{\prime}\in\R^d,\quad f^\prime(\g x^\prime) = \sign( {\g w^\prime}^T \g x^\prime + b^\prime ) = \sign( \g w^T \g x ) = f(\g x) $$ with $$ \g x = \begin{bmatrix}\g x^\prime\\1\end{bmatrix} \in\R^{d+1} ,\quad \mbox{ and } \quad \g w = \begin{bmatrix}\g w^\prime\\b^\prime\end{bmatrix} \in \R^{d+1} . $$ Since the VC-dimension of linear classifiers in $\R^{d+1}$ is $d+1$, there is no set of $d+2$ points in $\R^{d+1}$ shattered by linear classifiers, and thus no set of $d+2$ points shattered by affine classifiers in $\R^d$. Therefore, $$ \VC(\F^\prime) \leq d+1 . $$ Combining the lower and upper bounds yields the final result.
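Finally, the lifting trick itself is worth seeing in code. A minimal sketch, assuming NumPy (variable names are illustrative): appending a constant 1 to the input and the offset to the weight vector leaves the classifier's output unchanged.

```python
# The lifting trick: an affine classifier in R^d equals a homogeneous
# linear classifier in R^{d+1} applied to the lifted input [x'; 1].
import numpy as np

rng = np.random.default_rng(0)
d = 5
x_prime = rng.standard_normal(d)
w_prime = rng.standard_normal(d)
b_prime = rng.standard_normal()

x = np.append(x_prime, 1.0)      # x = [x'; 1]
w = np.append(w_prime, b_prime)  # w = [w'; b']

assert np.sign(w_prime @ x_prime + b_prime) == np.sign(w @ x)
print("outputs agree")
```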