Symmetrization (where we start seeing ghost samples)

In words...

The symmetrization Lemma is one of the major tools needed to derive VC error bounds. It is based on the introduction of a ghost sample, i.e., an additional but virtual sample of data. With this ghost sample, we can bound the probability of having a large error between the empirical and expected risks via the probability of having a large error between the empirical risk computed on the training sample and the one computed on the ghost sample.

Therefore, after applying the symmetrization Lemma, one is left with the task of bounding a quantity that only depends on the output of classifiers on two samples, i.e., for which capacity measures based on the projection of the function class on the sample can be used.

In pictures...

In maths...

Consider a ghost sample $$ D^\prime = \{ (X_i^\prime, Y_i^\prime)\}_{i=1}^N $$ of $N$ independent and identically distributed copies of $(X,Y)$. Then, the symmetrization Lemma states that

For $N\epsilon^2 \geq 2$, $$ P^N \left\{ \sup_{f\in\F} |R_{emp}(f) - R(f)| \geq \epsilon \right\} \leq 2 P^{2N} \left\{ \sup_{f\in\F} |R_{emp}(f) - R_{emp}^\prime(f)|\geq \frac{\epsilon}{2} \right\} , $$ and $$ P^N \left\{ \sup_{f\in\F} (R(f) - R_{emp}(f) ) \geq \epsilon \right\} \leq 2 P^{2N} \left\{ \sup_{f\in\F} ( R_{emp}^\prime(f) - R_{emp}(f) )\geq \frac{\epsilon}{2} \right\} , $$ where $R_{emp}^\prime(f) = \frac{1}{N} \sum_{i=1}^N \I{f(X_i^\prime) \neq Y_i^\prime}$ is the emprical risk computed on the ghost sample $D^\prime$.

Here, $P^N$ refers to the product probability over the $N$ random pairs of the training sample $D = \{(X_i, Y_i)\}_{i=1}^N$, while $P^{2N}$ refers to the product probability over the $2N$ random pairs of the sample $DD^\prime$ obtained by concatenation of the training and ghost samples.

The proof relies on conditioning with respect to the training sample $D$ and on the concentration of the error on the ghost sample arount the true risk.

Proof of the first inequality (with absolute values): Let $f^*$ denote the function for which the supremum is obtained (and theat depends solely on $D$: $$ |R_{emp}(f^*) - R(f^*)| = \sup_{f\in\F} |R_{emp}(f) - R(f)| . $$ This allows us to write the quantity of interest as $$ P^N \left\{ \sup_{f\in\F} |R_{emp}(f) - R(f)| \geq \epsilon \right\} = P^N \left\{ |R_{emp}(f^*) - R(f^*)| \geq \epsilon \right\} . $$ Given that $f^*$ is a particular function of $\F$, we also have $\sup_{f\in\F} |R_{emp}(f) - R_{emp}^\prime(f)|\geq |R_{emp}(f^*) - R_{emp}^\prime(f^*)|$, which implies $$ P^{2N} \left\{ \sup_{f\in\F} |R_{emp}(f) - R_{emp}^\prime(f)|\geq \frac{\epsilon}{2} \right\} \geq P^{2N} \left\{ |R_{emp}(f^*) - R_{emp}^\prime(f^*)|\geq \frac{\epsilon}{2} \right\} . $$ To bound the right-hand side of this inequality, we use the fact that, for any $(a,b,c)\in\R^3$, $P(|a-b| \geq \epsilon/2) \geq P(|a-c|\geq \epsilon, |b-c|\leq \epsilon/2)$, since $\{|a-c|\geq \epsilon$ and $|b-c|\leq \epsilon/2\}$ characterizes one of the particular cases in which $|a-b| \geq \epsilon/2$.
With $a=R_{emp}(f^*)$, $b=R_{emp}^\prime(f^*)$ and $c=R(f^*)$, this leads to $$ P^{2N} \left\{ |R_{emp}(f^*) - R_{emp}^\prime(f^*)|\geq \frac{\epsilon}{2} \right\} \geq P^{2N} \left\{ |R_{emp}(f^*) - R(f^*)|\geq \epsilon \ ,\ |R_{emp}^\prime(f^*) - R(f^*)|\leq \frac{\epsilon}{2}\right\} $$ Using the properties of the indicator function, i.e., $\E \I{A} = P(A)$ and $\I{A\ and\ B} = \I{A}\I{B}$, we get $$ P^{2N} \left\{ |R_{emp}(f^*) - R_{emp}^\prime(f^*)|\geq \frac{\epsilon}{2} \right\} \geq \E \I{}\left\{ |R_{emp}(f^*) - R(f^*)|\geq \epsilon \right\}\ \I{}\left\{ |R_{emp}^\prime(f^*) - R(f^*)|\leq \frac{\epsilon}{2}\right\} $$ Then, using the law of total expectation, the right-hand side can be replaced by $$ \E\ \E\left[\left. \I{}\left\{ |R_{emp}(f^*) - R(f^*)|\geq \epsilon \right\}\ \I{}\left\{ |R_{emp}^\prime(f^*) - R(f^*)|\leq \frac{\epsilon}{2}\right\} \ \right| D \right] . $$ The event $|R_{emp}(f^*) - R(f^*)|\geq \epsilon$ only depends on the training sample $D$ and is thus fixed given $D$. Therefore, it can be extracted from the conditional expectation to yield \begin{align*} \E\ \I{}&\left\{ |R_{emp}(f^*) - R(f^*)|\geq \epsilon \right\}\ \E\left[\left. \I{}\left\{ |R_{emp}^\prime(f^*) - R(f^*)|\leq \frac{\epsilon}{2}\right\} \ \right| D \right] \\ &= \E\ \I{}\left\{ |R_{emp}(f^*) - R(f^*)|\geq \epsilon \right\}\ \left. P^N\left\{\left. |R_{emp}^\prime(f^*) - R(f^*)|\leq \frac{\epsilon}{2}\right\} \ \right| D \right\} , \end{align*} where the probability can be rewritten using the inversion principle as $$ P^N \left\{ \left. |R_{emp}^\prime(f^*) - R(f^*)|\leq \frac{\epsilon}{2} \right| D \right\} = 1 - P^N \left\{\left. |R_{emp}^\prime(f^*) - R(f^*)|> \frac{\epsilon}{2}\right| D \right\} $$ In addition, with Chebyshev's inequality and the bound on the variance of the bounded random variable $\I{f(X) \neq Y} \in[0,1]$, we have $$ P^N \left\{\left. |R_{emp}^\prime(f^*) - R(f^*)|> \frac{\epsilon}{2} \right| D\right\} \leq \frac{4 Var[ \I{f(X) \neq Y} | D ]}{N\epsilon^2} \leq \frac{1}{N\epsilon^2} \leq \frac{1}{2}, $$ where we used the assumption $N\epsilon^2 \geq 2$ for the last inequality.
Gathering these results, we have $$ P^N \left\{ \left. |R_{emp}^\prime(f^*) - R(f^*)|\leq \frac{\epsilon}{2} \right| D\right\} \geq 1 - \frac{1}{2} =\frac{1}{2} $$ and $$ P^{2N} \left\{ |R_{emp}(f^*) - R_{emp}^\prime(f^*)|\geq \frac{\epsilon}{2} \right\} \geq \frac{1}{2} \E\ \I{}\left\{ |R_{emp}(f^*) - R(f^*)|\geq \epsilon \right\} = \frac{1}{2} P^N \left\{ |R_{emp}(f^*) - R(f^*)|\geq \epsilon \right\} , $$ which leads to the desired result $$ P^{2N} \left\{ \sup_{f\in\F} |R_{emp}(f) - R_{emp}^\prime(f)|\geq \frac{\epsilon}{2} \right\} \geq \frac{1}{2} P^N \left\{ \sup_{f\in\F} |R_{emp}(f) - R(f)| \geq \epsilon \right\} . $$

Proof of the second inequality (without absolute values): We can reproduce the same steps to prove the result without absolute values with the particular choice of $f^*$ such that $$ R(f^*) - R_{emp}(f^*) = \sup_{f\in\F} ( R(f) - R_{emp}(f) ) . $$ However, we must pay attention to a few details. In particular, we now consider the difference $R_{emp}^\prime(f^*) - R_{emp}(f^*)$ instead of $|R_{emp}(f^*) - R_{emp}^\prime(f^*)|$ (note the reversed order of the terms). For instance, this means that we use the fact $$ (c-a)\geq\epsilon \ \wedge\ (b-c)\geq \frac{-\epsilon}{2}\quad \Rightarrow\quad b-a \geq \frac{\epsilon}{2}, $$ and thus that $P(b-a \geq \epsilon/2) \geq P(c-a\geq \epsilon,\ b-c\geq -\epsilon/2)$, with $a= R_{emp}(f^*)$, $b=R_{emp}^\prime(f^*)$ and $c=R(f^*)$ to obtain $$ P^{2N} \left\{ R_{emp}^\prime(f^*) - R_{emp}(f^*) \geq \frac{\epsilon}{2} \right\} \geq P^{2N} \left\{ R(f^*) - R_{emp}(f^*) \geq \epsilon ,\ R_{emp}^\prime(f^*) - R(f^*) \geq \frac{-\epsilon}{2}\right\} , $$ before conditioning on $D$. Then, to lower bound the remaining conditional probability by $1/2$, we need a few more computations: \begin{align*} P^{N}\left\{\left. R_{emp}^\prime(f^*) - R(f^*) \geq \frac{-\epsilon}{2} \right| D\right\} &= P^{N}\left\{\left. R_{emp}^\prime(f^*) - R(f^*) \in \left[ \frac{-\epsilon}{2}, \frac{\epsilon}{2}\right] \right| D\right\} + P^{N}\left\{\left. R_{emp}^\prime(f^*) - R(f^*) > \frac{\epsilon}{2}\right| D \right\} \\ &\geq P^{N}\left\{\left. | R_{emp}^\prime(f^*) - R(f^*) | \leq \frac{\epsilon}{2} \right| D\right\} \\ &\geq \frac{1}{2} \end{align*} where the last line is obtained via Chebyshev's inequality, as in the previous proof. The rest of the proof is similar.