Fat-shattering dimension for linear classifiers

In words...

In binary classification, the fat-shattering dimension is a capacity measure that generalizes the Vapnik-Chervonenkis (VC) dimension to classes of real-valued functions, such as the score functions used by classifiers, and that can be used to derive margin-based error bounds.

The fat-shattering dimension of a set of linear functions used by linear classifiers can be bounded by the square of the ratio between the radius of the ball enclosing the data and the scale at which the dimension is computed.

Interestingly, the bound on the fat-shattering dimension of linear classifiers does not involve the dimension of the input space. Thus, contrary to the VC dimension of linear classifiers, the fat-shattering dimension of linear functions remains finite in infinite-dimensional Hilbert spaces.

In pictures...

Fat-shattering a set of points by linear classifiers

In maths...

Consider the set of real-valued functions at the core of linear classifiers, restricted to parameter vectors of bounded norm: $$ \G = \{ g \in \R^{\X} : g(\g x) = \inner{\g w}{\g x},\quad \|\g w\| \leq R_{\g w} \} . $$ Then, its fat-shattering dimension is bounded as $$ \fat_{\gamma}(\G) \leq \frac{ R_{\g w}^2 R_{\g x}^2 }{\gamma^2} $$ where $R_{\g x}$ is the radius of the data-enclosing ball (i.e., $\max_{\g x\in \X} \|\g x\| \leq R_{\g x}$) and $\gamma$ is the scale at which the fat-shattering dimension is computed.
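As a quick numerical illustration, the bound above can be evaluated directly from $R_{\g w}$, $R_{\g x}$ and $\gamma$. The function name and example values below are illustrative choices, not part of the original statement; since the dimension is an integer, taking the floor of the bound is valid.

```python
import math

def fat_shattering_bound(R_w, R_x, gamma):
    """Upper bound on the gamma-fat-shattering dimension of
    {x -> <w, x> : ||w|| <= R_w} over data satisfying ||x|| <= R_x."""
    return math.floor((R_w * R_x / gamma) ** 2)

# Example: unit-norm weight vectors, data in the unit ball, scale 0.5
print(fat_shattering_bound(1.0, 1.0, 0.5))  # at most 4 points can be 0.5-shattered
```

Note that the function takes no argument for the dimension of the input space, reflecting the dimension-free nature of the bound.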

Proof:

  1. If a set $\{\g x_i\}_{i=1}^n$ is $\gamma$-shattered by $\G$, then, for every subset $I_0\subseteq I=\{1,\dots,n\}$, we have $\|\sum_{i\in I_0} \g x_i - \sum_{i\in I\setminus I_0} \g x_i\| \geq n\gamma/ R_{\g w}$.

    If $\{\g x_i\}_{i=1}^n$ is $\gamma$-shattered by $\G$, say with a witness $\{r_i\}_{i=1}^n$, then for every labeling $\g y \in \{-1,+1\}^{n}$, there is a $\g w$ with $\|\g w \| \leq R_{\g w}$ such that $$ y_i( \inner{\g w}{\g x_i} - r_i) \geq \gamma,\ i=1,\dots,n. $$ Assume first that $I_0$ is such that $$ \sum_{i\in I_0} r_i \geq \sum_{i\in I\setminus I_0} r_i $$ and consider the labeling with $y_i = 1$ if $i \in I_0$ and $y_i= -1$ otherwise. Then, we have $$ \begin{cases} \inner{\g w}{\g x_i} \geq r_i + \gamma,& \mbox{if } i \in I_0, \\ \inner{\g w}{\g x_i} \leq r_i - \gamma,& \mbox{if } i \in I\setminus I_0, \end{cases} $$ which leads to $$ \inner{\g w}{\sum_{i\in I_0} \g x_i} = \sum_{i\in I_0} \inner{\g w}{ \g x_i} \geq \sum_{i\in I_0} r_i + |I_0|\gamma $$ and $$ \inner{\g w}{\sum_{i\in I\setminus I_0} \g x_i} = \sum_{i\in I\setminus I_0} \inner{\g w}{ \g x_i} \leq \sum_{i\in I\setminus I_0} r_i - (n-|I_0|) \gamma . $$ Thus, by using the assumption above, $$ \inner{\g w}{\sum_{i\in I_0} \g x_i - \sum_{i\in I\setminus I_0} \g x_i} \geq \sum_{i\in I_0} r_i -\sum_{i\in I\setminus I_0} r_i + |I_0|\gamma + (n-|I_0|) \gamma \geq n\gamma . $$ By the Cauchy-Schwarz inequality, $\inner{\g w}{\sum_{i\in I_0} \g x_i - \sum_{i\in I\setminus I_0} \g x_i} \leq \|\g w\| \left\|\sum_{i\in I_0} \g x_i - \sum_{i\in I\setminus I_0} \g x_i \right\| \leq R_{\g w} \left\|\sum_{i\in I_0} \g x_i - \sum_{i\in I\setminus I_0} \g x_i\right\|$, so that $\left\|\sum_{i\in I_0} \g x_i - \sum_{i\in I\setminus I_0} \g x_i\right\|$ must be at least $n\gamma / R_{\g w}$ for the above to hold.
    Since we obtain a similar result in the case $\sum_{i\in I_0} r_i < \sum_{i\in I\setminus I_0} r_i$ by considering the opposite labeling, where $y_i=1$ if and only if $i\in I\setminus I_0$, the statement holds in all cases.

  2. For every set $\{\g x_i\}_{i=1}^n$ such that $\|\g x_i\| \leq R_{\g x},\ i=1,\dots,n$, there is some index subset $I_0 \subset I$ that satisfies $\|\sum_{ i\in I_0} \g x_i - \sum_{i\in I\setminus I_0} \g x_i\| \leq \sqrt{n} R_{\g x}$.

    Given the set of indexes $I=\{1,\dots,n\}$, we let the labels $y_i$ be independent and uniform random variables taking values in $\{-1,+1\}$ and consider the random subset $I_0=\{ i\in I : y_i = 1\}$. Then, we compute the following expectation: \begin{align} \E_{\g y} \left\|\sum_{ i\in I_0} \g x_i - \sum_{i\in I\setminus I_0} \g x_i \right\|^2 &= \E_{\g y} \left\| \sum_{i\in I} y_i \g x_i \right\|^2 \\ &= \E_{\g y} \inner{\sum_{i\in I} y_i \g x_i}{\sum_{j\in I} y_j \g x_j} \\ &= \E_{\g y} \sum_{i\in I} \inner{ y_i \g x_i}{\sum_{j\in I} y_j \g x_j} \\ &= \sum_{i\in I} \E_{\g y}\inner{ y_i \g x_i}{\sum_{j\in I} y_j \g x_j} \\ &= \sum_{i\in I} \sum_{j\in I} \E_{y_i,y_j} \inner{ y_i \g x_i}{ y_j \g x_j} \\ &= \sum_{i\in I} \left( \sum_{j\neq i}\E_{y_i,y_j} \inner{y_i \g x_i}{ y_j \g x_j} + \E_{y_i}\inner{ y_i \g x_i}{ y_i \g x_i} \right)\\ &= \sum_{i\in I} \left( \sum_{j\neq i} \E_{y_i,y_j}\, y_i y_j \inner{\g x_i}{ \g x_j} + \E_{y_i}\| y_i \g x_i\|^2 \right) \\ &= \sum_{i\in I} \|\g x_i\|^2 & ( \mbox{since } y_i, y_j \mbox{ are independent and centered, and } y_i^2 = 1 )\\ & \leq n R_{\g x}^2 . \end{align} Since the expectation over all labelings is at most $ n R_{\g x}^2$, there must be at least one labeling (and thus one set $I_0$) for which $\left\|\sum_{ i\in I_0} \g x_i - \sum_{i\in I\setminus I_0} \g x_i \right\|^2 \leq n R_{\g x}^2$.

  3. Conclusion: by combining steps 1 and 2, we obtain that for every set $\{\g x_i\}_{i=1}^n$ that is $\gamma$-shattered by $\G$, there is some index subset $I_0$ such that $$ \frac{n \gamma}{ R_{\g w} } \leq \left\|\sum_{i\in I_0} \g x_i - \sum_{i\in I\setminus I_0} \g x_i\right\| \leq \sqrt{n}R_{\g x} $$ and thus that $$ \sqrt{n} \leq \frac{ R_{\g w} R_{\g x} }{\gamma} , $$ i.e., $n \leq R_{\g w}^2 R_{\g x}^2 / \gamma^2$, which concludes the proof.
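The key identity of step 2, $\E_{\g y} \|\sum_{i\in I} y_i \g x_i\|^2 = \sum_{i\in I} \|\g x_i\|^2 \leq n R_{\g x}^2$, can be checked numerically. The sketch below (the variable names and random data are illustrative, not from the text) compares a Monte Carlo estimate of the expectation over uniform labelings with the exact sum of squared norms:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random points in a ball of radius R_x (the dimension d plays no role in the bound)
n, d, R_x = 10, 5, 2.0
X = rng.normal(size=(n, d))
X *= R_x * rng.uniform(size=(n, 1)) / np.linalg.norm(X, axis=1, keepdims=True)

# Monte Carlo estimate of E_y || sum_i y_i x_i ||^2 over uniform labels in {-1,+1}
trials = 200_000
Y = rng.choice([-1.0, 1.0], size=(trials, n))
estimate = np.mean(np.linalg.norm(Y @ X, axis=1) ** 2)

# Exact value: the cross terms vanish in expectation, leaving sum_i ||x_i||^2
exact = np.sum(np.linalg.norm(X, axis=1) ** 2)
print(estimate, exact, n * R_x ** 2)
assert abs(estimate - exact) / exact < 0.05  # estimate matches sum_i ||x_i||^2
assert exact <= n * R_x ** 2 + 1e-9          # step-2 bound: at most n * R_x^2
```

Since the average over labelings equals $\sum_i \|\g x_i\|^2$, at least one labeling attains a value no larger than $n R_{\g x}^2$, which is exactly the existence argument used in step 2.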