My first VC bound

In words...

The symmetrization Lemma bounds the probability of having a large error between the empirical and expected risks via the probability of having a large error between the empirical risk computed on the training sample and the one computed on a ghost sample. Then, an error bound can be obtained by bounding the latter via the growth function. Such bounds are called Vapnik-Chervonenkis bounds or VC bounds.

The bound obtained here can be difficult to use due to the dependence of the growth function on the training sample size. However, the bound can be reformulated in terms of the VC-dimension, which is independent of the sample size.

In pictures...

In maths...

In a binary classification setting with a given function class $\F$, we have the following bound on the risk $R(f)$ of any classifier $f\in\F$.

For all $\delta \in (0,1]$, the bound $$ \forall f\in\F,\quad R(f) \leq R_{emp}(f) + 2\sqrt{2\frac{\log \Pi_{\F}(2N) + \log\frac{2}{\delta}}{N}} $$ holds with probability at least $1-\delta$.
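To get a feel for the bound, the confidence term can be evaluated numerically. This is a minimal sketch, assuming the value of $\Pi_{\F}(2N)$ is known (for a finite class, $\Pi_{\F}(2N) \leq |\F|$); the function name and the numerical values below are purely illustrative.

```python
import math

def vc_confidence_term(growth_2N, N, delta):
    """Confidence term 2*sqrt(2*(log Pi_F(2N) + log(2/delta)) / N)
    of the VC bound, where growth_2N stands for Pi_F(2N)."""
    return 2 * math.sqrt(2 * (math.log(growth_2N) + math.log(2 / delta)) / N)

# Illustrative values only; here the growth function is (unrealistically)
# held fixed at 1000, so the term shrinks roughly as 1/sqrt(N).
for N in (100, 1000, 10000):
    print(N, round(vc_confidence_term(growth_2N=1000.0, N=N, delta=0.05), 3))
```

Note that for a meaningful guarantee the confidence term should be well below 1, which requires $N$ to be large compared with $\log \Pi_{\F}(2N)$.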

Proof: Introducing a ghost sample $D^\prime$ and applying the symmetrization Lemma yields $$ P^N\left\{ \sup_{f\in\F} ( R(f) - R_{emp}(f) ) \geq \epsilon\right\} \leq 2 P^{2N}\left\{ \sup_{f\in\F} ( R_{emp}^\prime(f) - R_{emp}(f) )\geq \frac{\epsilon}{2} \right\} , $$ provided that $N\epsilon^2 \geq 2$. Note that the quantity $$ R_{emp}^\prime(f) - R_{emp}(f) = \frac{1}{N}\sum_{i=1}^N \I{f(X_i^\prime)\neq Y_i^\prime} - \frac{1}{N}\sum_{i=1}^N \I{f(X_i)\neq Y_i} = \frac{1}{N}\sum_{i=1}^N ( \I{f(X_i^\prime)\neq Y_i^\prime} - \I{f(X_i)\neq Y_i} ) $$ only depends on the values $\mathbf{f}=(f_1,\dots,f_N, f_1^\prime,\dots,f_N^\prime)$ taken by the function $f$ on the training sample $D$ and the ghost sample $D^\prime$. This will help us to quantify the probability above in terms of the growth function and the projection $\F_{DD^\prime}$ of $\F$ onto the concatenation $DD^\prime$ of the two samples.
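As a quick illustration of this observation, the gap $R_{emp}^\prime(f) - R_{emp}(f)$ can be computed from the values of $f$ on the two samples alone; the inputs $X_i$ themselves are never needed. The labels and predicted values below are made up for the sketch.

```python
# Toy illustration: R'_emp(f) - R_emp(f) depends only on the values
# taken by f on the training and ghost samples (all data made up).
f_on_D  = [0, 1, 1, 0]   # f(X_1), ..., f(X_4) on the training sample D
y_D     = [0, 0, 1, 1]   # labels Y_i
f_on_Dp = [1, 1, 0, 0]   # f(X'_1), ..., f(X'_4) on the ghost sample D'
y_Dp    = [0, 0, 0, 1]   # labels Y'_i

N = len(y_D)
# Delta_i(f) = 1{f(X'_i) != Y'_i} - 1{f(X_i) != Y_i}, each in {-1, 0, +1}
delta = [(fp != yp) - (f != y)
         for f, y, fp, yp in zip(f_on_D, y_D, f_on_Dp, y_Dp)]
gap = sum(delta) / N     # equals R'_emp(f) - R_emp(f)
print(delta, gap)        # -> [1, 0, 0, 0] 0.25
```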

Formally, we use another kind of symmetrization by introducing random signs $\sigma_i\in\{-1,+1\}$ with $P(\sigma_i=1)= P( \sigma_i=-1) = 1/2$ that do not alter the distribution of the $\Delta_i(f) = \I{f(X_i^\prime)\neq Y_i^\prime} - \I{f(X_i)\neq Y_i}$. To see this, first note that $\Delta_i(f) \in\{-1,0,+1\}$ has a symmetric distribution: \begin{align*} P(\Delta_i(f) = 1) &=P(f(X_i')\neq Y_i', f(X_i)=Y_i) \\ &= P(f(X_i')\neq Y_i') P( f(X_i)=Y_i) &&\text{because $(X_i',Y_i')$ is independent of $(X_i,Y_i)$}\\ &= P(f(X_i)\neq Y_i) P( f(X_i')=Y_i')&&\text{because $(X_i',Y_i')$ and $(X_i,Y_i)$ are identically distributed}\\ &=P(f(X_i)\neq Y_i, f(X_i')=Y_i') \\ &= P(\Delta_i(f)=-1). \end{align*} Thus, $$ P(\sigma_i \Delta_i(f) = 1) = \begin{cases} P(\Delta_i(f) = 1) ,&\text{if } \sigma_i=1\\ P(\Delta_i(f) = -1) ,&\text{if } \sigma_i=-1\\ \end{cases} = P(\Delta_i(f) = 1)\quad\text{in both cases,} $$ and, similarly, $P(\sigma_i \Delta_i(f) = -1) = P(\Delta_i(f) = -1)$, while $P(\sigma_i \Delta_i(f) = 0) = P(\Delta_i(f) =0)$. Therefore, $\Delta_i(f)$ and $\sigma_i\Delta_i(f)$ share the same distribution and we can rewrite the probability of interest as \begin{align*} P^{2N} \left\{ \sup_{f\in\F} (R_{emp}^\prime(f) - R_{emp}(f))\geq \frac{\epsilon}{2}\right\} &= P^{2N} \left\{ \sup_{f\in\F}\frac{1}{N}\sum_{i=1}^N \Delta_i(f)\geq \frac{\epsilon}{2} \right\} = P \left\{ \sup_{f\in\F}\frac{1}{N}\sum_{i=1}^N \sigma_i \Delta_i(f)\geq \frac{\epsilon}{2} \right\}, \end{align*} where the last probability is now over $DD'$ and all the $\sigma_i$'s.
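The invariance under random signs can also be checked empirically. The sketch below draws a symmetric $\{-1,0,+1\}$-valued variable playing the role of $\Delta_i(f)$ (its distribution is made up for the illustration) and compares its frequencies with those of the sign-flipped version.

```python
import random
random.seed(0)

# Monte Carlo check: multiplying a symmetric {-1,0,+1}-valued variable
# by an independent random sign leaves its distribution unchanged.
def sample_delta():
    # symmetric by construction: P(+1) = P(-1) = 0.3, P(0) = 0.4
    # (weights chosen arbitrarily for the illustration)
    return random.choices([-1, 0, 1], weights=[0.3, 0.4, 0.3])[0]

n = 200_000
deltas = [sample_delta() for _ in range(n)]
signed = [random.choice([-1, 1]) * sample_delta() for _ in range(n)]

for v in (-1, 0, 1):
    p_plain = deltas.count(v) / n
    p_signed = signed.count(v) / n
    print(v, round(p_plain, 3), round(p_signed, 3))  # frequencies agree
```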

Using the law of total expectation and conditioning on $DD'$, the right-hand side can be written as $$ P \left\{ \sup_{f\in\F}\frac{1}{N}\sum_{i=1}^N \sigma_i \Delta_i(f)\geq \frac{\epsilon}{2} \right\} = \E_{DD'} P\left\{\left. \sup_{f\in\F}\frac{1}{N}\sum_{i=1}^N \sigma_i \Delta_i(f)\geq \frac{\epsilon}{2} \right| DD' \right\} . $$ Introducing now the notation $\Delta_i(\g f) = \I{\g f_{N+i}\neq Y_i'}- \I{\g f_i\neq Y_i}$ for $\g f=(f(X_1),\dots, f(X_N), f(X_1'),\dots, f(X_N') )$, this can be reshaped to $$ \E_{DD'} P \left\{\left. \exists \g f\in \F_{DD^\prime} \ :\ \frac{1}{N}\sum_{i=1}^N \sigma_i \Delta_i(\g f)\geq \frac{\epsilon}{2} \right| DD' \right\} , $$ where, by the union bound, $$ P \left\{\left. \exists \g f\in \F_{DD^\prime} \ :\ \frac{1}{N}\sum_{i=1}^N \sigma_i \Delta_i(\g f)\geq \frac{\epsilon}{2} \right| DD' \right\} \leq \sum_{\g f\in \F_{DD^\prime}} P \left\{\left. \frac{1}{N}\sum_{i=1}^N \sigma_i \Delta_i(\g f)\geq \frac{\epsilon}{2} \right| DD' \right\} . $$ Calling now upon Hoeffding's inequality applied to the bounded random variables $Z_i = \sigma_i \Delta_i(\g f) \in [-1,1]$ with mean $\E [Z_i | DD'] = \Delta_i(\g f) \E \sigma_i = 0$, each term in the sum above can be bounded by $$ P \left\{\left. \frac{1}{N}\sum_{i=1}^N \sigma_i \Delta_i(\g f)\geq \frac{\epsilon}{2} \right| DD' \right\} \leq \exp\left(\frac{-2N (\epsilon/2)^2}{ 2^2}\right) = \exp\left(\frac{-N \epsilon^2}{ 8}\right). $$ Gathering all the results, and using the definition of the growth function, we have $$ P^{2N}\left\{ \sup_{f\in\F} ( R_{emp}^\prime(f) - R_{emp}(f) )\geq \frac{\epsilon}{2} \right\} \leq \E_{DD'} | \F_{DD^\prime}| \exp\left(\frac{-N \epsilon^2}{ 8}\right) \leq \sup_{DD'} | \F_{DD^\prime}| \exp\left(\frac{-N \epsilon^2}{ 8}\right) = \Pi_{\F}(2N)\exp\left(\frac{-N \epsilon^2}{ 8}\right) $$ and thus $$ P^N\left\{ \sup_{f\in\F} ( R(f) - R_{emp}(f) ) \geq \epsilon\right\} \leq 2\Pi_{\F}(2N)\exp\left(\frac{-N \epsilon^2}{ 8}\right). $$
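The Hoeffding step can be illustrated by simulation for a single fixed $\g f$. Conditionally on $DD'$, the $Z_i = \sigma_i \Delta_i(\g f)$ are independent, zero-mean and $[-1,1]$-valued; the sketch below takes the (made-up) worst case $\Delta_i(\g f) = 1$ for all $i$, so each $Z_i$ is a random sign, and compares the empirical tail probability with the bound $\exp(-N\epsilon^2/8)$.

```python
import math
import random
random.seed(1)

# Monte Carlo check of the Hoeffding step for one fixed f-vector,
# in the worst case Delta_i = 1 for all i (illustrative choice).
N, eps, trials = 100, 0.5, 20_000
hits = 0
for _ in range(trials):
    mean = sum(random.choice([-1, 1]) for _ in range(N)) / N
    if mean >= eps / 2:
        hits += 1

empirical = hits / trials
bound = math.exp(-N * eps**2 / 8)
# the empirical tail probability stays below the (loose) bound
print(round(empirical, 4), "<=", round(bound, 4))
```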

Now, let us define $$ \delta = 2 \Pi_{\F}(2N)\exp \left(\frac{-N\epsilon^2}{8}\right) . $$ This leads to $\log \frac{\delta}{2\Pi_{\F}(2N)} = \frac{-N\epsilon^2}{8}$ and $$ \epsilon = \sqrt{8\frac{\log \frac{2\Pi_{\F}(2N)}{\delta}}{N}} = 2 \sqrt{2\frac{\log \Pi_{\F}(2N) + \log \frac{2}{\delta}}{N}}. $$ Substituting this value of $\epsilon$ in the probability above implies that $$ P^N\left\{ \sup_{f\in\F} ( R(f) - R_{emp}(f) ) \geq 2 \sqrt{2\frac{\log \Pi_{\F}(2N) + \log \frac{2}{\delta}}{N}} \right\} \leq \delta , $$ which corresponds to the desired result by the probability inversion principle and the fact that, for all $f\in\F$, $$ R(f) \leq R_{emp}(f) + \sup_{f\in\F} ( R(f) - R_{emp}(f) ). $$
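The inversion between $\delta$ and $\epsilon$ can be verified numerically: plugging the derived $\epsilon$ back into the definition of $\delta$ recovers the original value. The values of $\Pi_{\F}(2N)$, $N$ and $\delta$ below are made up for the check.

```python
import math

def eps_from_delta(growth_2N, N, delta):
    # epsilon = 2*sqrt(2*(log Pi_F(2N) + log(2/delta)) / N)
    return 2 * math.sqrt(2 * (math.log(growth_2N) + math.log(2 / delta)) / N)

def delta_from_eps(growth_2N, N, eps):
    # delta = 2*Pi_F(2N)*exp(-N*eps^2/8)
    return 2 * growth_2N * math.exp(-N * eps**2 / 8)

growth_2N, N, delta = 500.0, 1000, 0.05   # illustrative values
eps = eps_from_delta(growth_2N, N, delta)
print(round(delta_from_eps(growth_2N, N, eps), 10))  # recovers delta = 0.05
```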

To conclude, we have to check the condition $N\epsilon^2 \geq 2$ that allowed us to apply the symmetrization Lemma. This condition boils down to a condition on $\delta$: $$ \delta \leq 2 \Pi_{\F}(2N)\exp \left(\frac{-1}{4}\right), $$ which trivially holds for $\delta\leq 1$ due to $$ 2 \Pi_{\F}(2N)\exp \left(\frac{-1}{4}\right) \geq \Pi_{\F}(2N) \geq 1 $$ (unless $\F$ is empty), where the first inequality uses $2\exp(-1/4) \approx 1.56 \geq 1$.