# Sauer's Lemma

## In words...

Sauer's Lemma provides a polynomial bound on the growth function of a class of binary classifiers based on its VC-dimension.

## In pictures...

Below is a plot of Sauer's bound on the growth function. The plot is in log-scale (a linear curve denotes an exponential growth).

The plot shows that the growth function grows exponentially with the number of points as $2^n$ only until this number reaches the VC-dimension, since it is bounded by a polynomial function when the number of points is larger than the VC-dimension.

## In maths...

In a binary classification setting, we have the following combinatorial result.

Given a function class $\F$ of VC-dimension $\VC(\F)$, its growth function is bounded for all $n > \VC(\F)$ by $$\Pi_{\F}(n) \leq \sum_{k=0}^{\VC(\F)} \begin{pmatrix}n\\k\end{pmatrix} \leq \left( \frac{e n}{\VC(\F)}\right)^{\VC(\F)} .$$

Note: $e=\exp(1)$ in the above.

Proof: The proof works by induction simultaneously on both the VC-dimension and the set size $n$, i.e., at a constant sum $n+\VC(\F) = k$. We use the extended definition of the binomial coefficients: $$\begin{pmatrix}n\\k\end{pmatrix} = \begin{cases} \frac{n!}{k! (n-k)!}, &\mbox{if } 0\leq k \leq n \\ \\ 0, &\mbox{otherwise}. \end{cases}$$

1. Base cases: At $n=0$, for any $\VC(\F)$, we have $\Pi_{\F}(0) = 0$ while $$\sum_{k=0}^{\VC(\F)} \begin{pmatrix}n\\k\end{pmatrix} = \begin{pmatrix}0\\0\end{pmatrix} + \sum_{k=1}^{\VC(\F)} \begin{pmatrix}0\\k\end{pmatrix} = 1 + 0 = 0$$ For $\VC(\F)=0$, there is no point that can be shattered, which means that the predicted label is fixed at all points of $\X$ whatever the classifier $f\in\F$. In other words, there is only one possible labeling of any set of points and thus, for any $n$, we have $\Pi_{\F}(n) = 1$, which satisfies the bound since $$\sum_{k=0}^{0} \begin{pmatrix}n\\k\end{pmatrix} = \begin{pmatrix}n\\0\end{pmatrix} = 1$$
2. Another way for Base cases: For all $n\leq \VC(\F)$, we have $$\Pi_{\F}(n) \leq 2^n = \sum_{k=0}^{n} \begin{pmatrix}n\\k\end{pmatrix} = \sum_{k=0}^{\VC(\F)} \begin{pmatrix}n\\k\end{pmatrix}$$
3. Inductive step: choose a set of points $S=\{x_1,\dots,x_n\}$ such that each of the $\Pi_{\F}(n)$ labelings of these points corresponds to a different classifier $f\in\F$. Let $F$ denote the set of these classifiers. Let $S^\prime = S \setminus \{x_1\}$ and $F^{\prime}\subseteq F$ be the smallest subset of $F$ that can label $S^\prime$ in the maximum number of ways.
Note that $F^\prime$ cannot contain 2 classifiers that produce the same labeling of $S^\prime$ (since it is the "smallest subset" of $F$). Thus, if 2 classifiers of $F$ produce the same labeling of $S^\prime$, one is in $F^\prime$ and the other is in $F\setminus F^\prime$. This results in a partition of $F$: $F= F^\prime \cup (F\setminus F^\prime)$ with the relation $|F|= |F^\prime| + |F\setminus F^\prime|$.
Note that since we chose $F$ with one function for each labeling of $S$, $|F^\prime|$ and $|F\setminus F^\prime|$ also represent the number of labelings of $S^\prime$ produced by $F^\prime$ and $F\setminus F^\prime$, respectively. Thus: $$|F^\prime|\leq \Pi_{F^\prime}(n-1) ,\quad \mbox{ and }\quad |F\setminus F^\prime| \leq \Pi_{F\setminus F^\prime}(n-1)$$
We will now bound the VC-dimensions of these sets. Since $F^\prime \subseteq F\subseteq \F$, we have $\VC(F^\prime) \leq \VC(F) \leq \VC(\F)$.
If there is a set $T\subseteq S^\prime$ shattered by $F\setminus F^\prime$, then $F^\prime$ also shatters $T$ since $F^\prime$ can produce the maximal number of labelings on $S^\prime$ and thus on all its subsets. This means that for all labelings of $T$, there is a function $f\in F\setminus F^\prime$ and another $f^\prime\in F^\prime$ that produce this labeling. Since $F$ contains a single function for each labeling of $S$, this means that $f$ and $f^\prime$ must disagree on $S\setminus T$. Whatever the output of $f$ on $S^\prime$ is, there is an $f^\prime\in F^\prime$ that can produce this labeling. Thus $f$ and $f^\prime$ must actually disagree on $S\setminus S^\prime = \{x_1\}$: $f(x_1) \neq f^\prime(x_1)$. Since all these pairs of functions are in $F$, we have that $F$ shatters $T\cup \{x_1\}$ (by alternatively considering a function from $F^\prime$ and one from $F\setminus F^\prime$).
If there is a set $T\subseteq S^\prime$ shattered by $F\setminus F^\prime$, then $\VC(F\setminus F^\prime) \geq |T|$. But we also have that $F$ shatters $T\cup \{ x_1\}$. Thus $\VC(F) > \VC(F\setminus F^\prime)$ and $\VC(F\setminus F^\prime) \leq \VC(\F)-1$.
4. Inductive step: Assume the Lemma to hold for a number of points $m = n-1$. This implies the following bounds for $F^\prime$ and $F\setminus F^\prime$: $$\Pi_{F^\prime}(n-1) \leq \sum_{k=0}^{\small\VC(F^\prime)} \begin{pmatrix}n-1\\k\end{pmatrix} \leq \sum_{k=0}^{\small\VC(\F)} \begin{pmatrix}n-1\\k\end{pmatrix}$$ and $$\Pi_{F\setminus F^\prime}(n-1) \leq \sum_{k=0}^{\small\VC(F\setminus F^\prime)} \begin{pmatrix}n-1\\k\end{pmatrix} \leq \sum_{k=0}^{\small\VC(\F)-1} \begin{pmatrix}n-1\\k\end{pmatrix}$$ Finally, we have \begin{align*} \Pi_{\F}(n) = |F| &= |F^\prime| + |F\setminus F^\prime| \\ &\leq \Pi_{F^\prime}(n-1) + \Pi_{F\setminus F^\prime}(n-1) \\ &\leq \sum_{k=0}^{\small\VC(\F)} \begin{pmatrix}n-1\\k\end{pmatrix} + \sum_{k=0}^{\small\VC(\F)-1} \begin{pmatrix}n-1\\k\end{pmatrix} \\ &\leq \sum_{k=0}^{\small\VC(\F)} \begin{pmatrix}n-1\\k\end{pmatrix} + \sum_{k=1}^{\small\VC(\F)} \begin{pmatrix}n-1\\k-1\end{pmatrix} = \sum_{k=0}^{\small\VC(\F)} \begin{pmatrix}n\\k\end{pmatrix} \end{align*} where we used the identity $$\begin{pmatrix}n\\k\end{pmatrix} = \begin{pmatrix}n-1\\k\end{pmatrix} + \begin{pmatrix}n-1\\k-1\end{pmatrix} .$$ This identity characterizes the number of ways we can choose $k$ items from $n$ by computing it in two steps. In the first step, we leave the first item aside and count the number of ways we can choose $k$ items from $n-1$. In the second step, we start by choosing the first item and count the number of ways we can choose $k-1$ items from the $n-1$ remaining ones. The number of ways we can choose $k$ items from $n$ is the sum of the numbers obtained in the two steps.
5. Polynomial bound: For any $n > \VC(\F)$, the combinatorial result below leads to $$\sum_{k=0}^{\small\VC(\F)} \begin{pmatrix}n\\k\end{pmatrix} \leq \left( \frac{en}{\VC(\F)} \right)^{\small\VC(\F)} .$$

### Combinatorial result

A useful bound for sums of binomial coefficients is the following.

For $d < n$, $$\sum_{k=0}^d \begin{pmatrix}n\\k\end{pmatrix} \leq \left( \frac{en}{d} \right)^{d}$$

Proof: With $d < n$, we have $n/d > 1$ and, for all $i\leq d$, $(n/d)^{d-i} \geq 1$. Thus, \begin{align} \sum_{k=0}^d \begin{pmatrix}n\\k\end{pmatrix} & \leq \sum_{k=0}^d \begin{pmatrix}n\\k\end{pmatrix} \left(\frac{n}{d}\right)^{d-i} \\ & = \left(\frac{n}{d}\right)^{d} \sum_{k=0}^d \begin{pmatrix}n\\k\end{pmatrix} \left(\frac{d}{n}\right)^{i} \\ &\leq \left(\frac{n}{d}\right)^{d} \sum_{k=0}^n \begin{pmatrix}n\\k\end{pmatrix} \left(\frac{d}{n}\right)^{i} \\ &= \left(\frac{n}{d}\right)^{d} \left(1 + \frac{d}{n}\right)^{n} & \mbox{(due to the binomial theorem)}\\ & \leq \left(\frac{n}{d}\right)^{d} \left(e^{d/n} \right)^n & \mbox{(since } (1+x) \leq e^x)\\ & = \left(\frac{n}{d}\right)^{d} e^d \\ & = \left( \frac{en}{d} \right)^{d} \end{align}

Sauer's lemma was independently derived in , and .