Generalized Sauer's Lemma

In words...

Sauer's Lemma provides a polynomial bound on the growth function of a class of binary classifiers based on its VC-dimension.

This result can be generalized to apply to classes of functions taking values in a finite set. By a quantization argument, this can also be used for real-valued function classes, as in proving margin-based error bounds.

In pictures...

In maths...

Given a function class $\F$ of functions from $\X \rightarrow \{1,\dots,m\}$ with fat-shattering dimension at unit scale no greater than $d$, $\fat_{1}(\F) \leq d$, the uniform packing number at scale $\epsilon=2$ of $\F$ with respect to the $\ell_{\infty}$ pseudo-metric is bounded by $$ \mathcal{M}_{\infty}(2,\F,n) < 2(n m^2)^{\lceil \log_2 p \rceil} $$ with $$ p = \sum_{k=0}^{d} \begin{pmatrix}n\\k\end{pmatrix} m^k $$ or $$ \mathcal{M}_{\infty}(2,\F,n) \leq 2(n m^2)^{d \log_2 \left(\frac{e nm}{d} \right) } $$

Proof:

Step 0: $m\leq 2 \Rightarrow \mathcal{M}_{\infty}(2,\F,n) = 1 < 2(n m^2)^{\lceil \log_2 p\rceil}$

For $m\leq 2$, the $\ell_{\infty}$ distance between any two functions cannot be greater than 1 and no pair of functions can be separated, thus $\mathcal{M}_{\infty}(2,\F,n)=1$ and the result is obvious.
Step 1: For $m\geq 3$, $t(h,n) \geq p \Rightarrow \mathcal{M}_{\infty}(2,\F_{\g x_n},n) < h$

For $m\geq 3$, we first reason with a fixed $\g x_n \in \X^n$ and the $\ell_{\infty}$-pseudo metric $d_{\ell_{\infty}(\g x_n)}$. We also focus on the restriction $\F_{\g x_n}$ of $\F$ to the set of points $\g x_n$: $\F_{\g x_n} = \{ f_{\g x_n} : \g x_n \rightarrow B : f_{\g x_n}(x_i) = f(x_i),\ f\in\F \}$. Thus, the class $\F_{\g x_n}$ contains the same functions as $\F$ for which we only consider the output on a finite set of inputs $\g x_n$. Note that, since the $\ell_{\infty}$ pseudo-metric only takes into account the function value at these inputs, the transformation does not influence the packing numbers: $$ \mathcal{M}(2, \F_{\g x_n}, d_{\ell_{\infty}(\g x_n)}) = \mathcal{M}(2, \F, d_{\ell_{\infty}(\g x_n)}). $$ For all integers $h\geq 1$ and $n\geq 1$, let $t(h,n)$ denote the maximum number $t$ such that for every $2$-separated set $F \subseteq\F_{\g x_n}$ of $h=|F|$ functions, there are at least $t$ pairs $(X,\g r)$ of nonempty sets $X\subseteq \g x_n$ and witness $\g r$ 1-shattered by $F$, i.e., such that $$ \forall \g y \in \{-1,+1\}^{|X|},\quad \exists f\in F, \mbox{ such that } y_i( f(x_i) - r_i) \geq 1,\ i=1,\dots,|X|, $$ If no set $F$ of $h$ functions can be $2$-separated, then $t(h,n)=\infty$.

The number of sets $X\subseteq \g x_n$ with cardinality $|X| = k$ is $\binom{n}{k}$ and for these, there are strictly less than $m^k$ possible choices for $\g r \in B^k$ to satisfy the fat-shattering above.

The number of different $\g r\in B^k$ is $m^k$. However, $r_i=1$ cannot be chosen, since in this case no function with value in $B$ can be such that $f(x_i) \leq r_i - 1$ and the fat-shattering cannot be satisfied for $y_i=-1$. Similarly, $r_i$ cannot be equal to $m$. Therefore, in order to satisfy the fat-shattering, we must choose $\g r$ in $B\setminus \{1,m\}$, thus with $(m-2)^k < m^k$ possible values.
Therefore, the number of $(X,\g r)$ pairs with $|X|\leq d$ is less than $p=\sum_{k=1}^d \binom{n}{k} m^k$. And since $F\subseteq \F_{\g x_n} \Rightarrow \ \fat_{1}(F) \leq \fat_{1}(\F) \leq d$, there is no such pair with $|X| > d$. Hence, $t(h,n) < p$ or $t(h,n)=\infty$.
As a consequence, if $t(h,n) \geq p$ for some $h$, then $t(h,n)=\infty$ and there is no set of $h$ functions that are $2$-separated and $\mathcal{M}_{\infty}(2,\F_{\g x_n},n) < h$. Therefore, it remains to prove that $t(2(n m^2)^{\log_2 p}, n) \geq p $ for all $d\geq 1$ and $n\geq 1$.
Step 2: $t(2,n) \geq 1$ and, for $K>1$, $t( 2 K n m^2, n) \geq 2 t(2K, n-1)$

For any set $F=\{f_1,f_2\}$ of two 2-separated functions, there is an $x\in\g x_n$ such that $|f_1(x) - f_2(x)|\geq 2$. By taking $r=\lfloor (f_1(x) + f_2(x)) / 2 \rfloor$ and assuming that $f_1(x) > f_2(x)$ (otherwise simply swap the functions), we see that $f_1(x) \geq r + 1$ and $f_2(x) \leq r - 1$, and thus that $F$ 1-shatters the set $X=\{x\}$ with a single point. This implies that $t(2,n) \geq 1$ for all $n\geq 1$.
For $h= 2 K n m^2 > 2$, if there is no $2$-separated set of $h$ functions, then $t(h,n)=\infty \geq 2t(2K, n-1)$. If there is such a $2$-separated set $F$, then we split it into $Knm^2$ pairs $(f_k,g_k)$. For each $k$, we select an $x_i$ among the $n$ points in $\g x_n$ such that $|f_k(x_i) - g_k(x_i) | \geq 2$. By the pigeonhole principle, the same $x_i$ must be selected for at least $Km^2$ pairs. For this set of pairs, the number of possible unordered function value pairs $\{f_k(x_i),g_k(x_i)\} \in B^2$ is less than $\binom{m}{2} = m(m-1)/2$. Thus, by the pigeonhole principle again, at least $$ \frac{Km^2}{\binom{m}{2}} = \frac{2Km^2 }{ m(m-1)} = \frac{2Km}{m-1} > 2K $$ of these pairs lead to the same pair of function values, say $\{a,b\}$, on $x_i$.
Assume without loss of generality that $a < b$ and divide the functions from these more than $2K$ pairs into the set $\overline{F}_1\subset F$ of functions with value $a$ at $x_i$ and the set $\overline{F}_2\subset F$ of functions with value $b$ at $x_i$. Since the functions in $\overline{F}_1$ and $\overline{F}_2$ go by pairs, we have $|\overline{F}_1|=|\overline{F}_2|>2K$. In fact, we can subsample $\overline{F}_1$ and $\overline{F}_2$ to get $F_1\subset \overline{F}_1$ and $F_2\subset \overline{F}_2$ with $|F_1|=|F_2|=2K$. Since $F$ is $2$-separated with respect to $d_{\ell_{\infty}(\g x_n)}$, any subset of $F$ is also $2$-separated with respect to $d_{\ell_{\infty}(\g x_n)}$. In addition, for all $(f,g)\in F_1^2$, $f(x_i) = g(x_i) = a$, and thus the functions must be separated on some other point, which implies that $d_{\ell_{\infty}(\g x_n \setminus \{ x_i\} )}(f,g) \geq 2$ and that $F_1$ is $2$-separated with respect to $d_{\ell_{\infty}(\g x_n \setminus \{ x_i\})}$. A similar result holds for $F_2$.
By definition of the function $t$, any $2$-separated set of cardinality $2K$, and in particular $F_1$ and $F_2$, must $1$-shatter at least $t(2K, n-1)$ pairs $(X,\g r)$ with $X\subseteq \g x_n\setminus \{ x_i\}$. Since $F_1$ and $F_2$ are subsets of $F$, all the pairs shattered by $F_1$ or $F_2$ are also shattered by $F$. In addition, if a pair $(X,\g r)$ is shattered by both $F_1$ and $F_2$, $F$ shatters the pair $( (X, x_i), (\g r, \lfloor a+b/2\rfloor ) )$. Thus, the number of pairs shattered by $F$ is at least $2 t(2K, n-1)$ and $t( |F|, n)$ is also at least this number. Since $|F|=h= 2 K n m^2$, we obtain $$ t( 2 K n m^2, n) \geq 2 t(2K, n-1). $$
Step 3: recursion yields $t(2(n m^2)^{\lceil \log_2 p \rceil}, n) \geq p$

For $0\leq j\leq k < n$, let $K_j = \prod_{l=n-k+1}^{n- j-1} lm^2$, $n_j = n-j$ and $h = 2(nm^2)( ( n-1)m^2 ) ... ( ( n-k+1) m^2) = 2 K_0 n_0 m^2$. Then, for all $j$ such that $0 \leq j \leq k$, the result of Step 2 leads to $$ t( 2 K_j n_j m^2, n_j) \geq 2 t(2K_j, n_j-1) = 2 t(2K_{j+1}(n-j-1)m^2 , n_{j+1} ) = 2 t(2K_{j+1}n_{j+1} m^2 , n_{j+1} ) . $$ Therefore, $t_j = t( 2 K_j n_j m^2, n_j)\geq 2 t_{j+1}$ and, by recursion, $$ t(h,n) = t_0 \geq 2^k t_k \geq 2^k t(2 ( n-k+1) m^2 , n-k) \geq 2^k t(2, n-k) \geq 2^k , $$ where we used the fact that $t$ is a nondecreasing function of its first argument and $t(2, n-k) \geq 1$ from Step 2.
In addition, the nondecreasing nature of $t$ with the fact that $2(nm^2)^k \geq h$ yields $$ t(2(n m^2)^k, n) \geq t(h,n) \geq 2^k. $$

Let $k = \lceil \log_2 p \rceil$. Then, if $k < n$, the above gives $$ t(2(n m^2)^{\lceil \log_2 p \rceil}, n) \geq 2^{\lceil \log_2 p \rceil } \geq 2^{\log_2 p} = p , $$ while if $k\geq n$, $m^{2k} \geq m^n$ and $2(nm^2)^k > m^n$. But since the total number of functions from $\g x_n$ to $B$ is $m^n$, there is no $2$-separated set of $2(nm^2)^k$ functions and $t(2(nm^2)^k, n) = \infty \geq p$.
Final Step:

Finally, we use Step 1 with $h =2(n m^2)^{\lceil \log_2 p \rceil}$ to obtain that $$ \mathcal{M}(2, \F, d_{\ell_{\infty}(\g x_n)}) = \mathcal{M}(2, \F_{\g x_n}, d_{\ell_{\infty}(\g x_n)}) < 2(n m^2)^{\lceil \log_2 p \rceil} . $$ Since this holds for any set $\g x_n \in \X^n$, we also have $$ \mathcal{M}_{\infty}(2, \F, n) = \max_{\g x_n \in \X^n} \mathcal{M}(2, \F, d_{\ell_{\infty}(\g x_n)}) < 2(n m^2)^{\lceil \log_2 p \rceil} . $$

Pigeonhole principle

The pigeonhole principle states that when dividing a set of $n$ objects into $m$ subsets with $n> k m$, at least one subset must contain more than $k$ objects.

Notes

The original proof of this lemma can be found in .