Bounding deviations with Rademacher averages

In words...

Rademacher averages provide a powerful tool for bounding the error made when estimating the mean of a bounded function by an empirical average. In particular, given a set of bounded functions of some random variable, the maximal deviation between the empirical average and the true mean over the set cannot be larger than twice the Rademacher average of the set of functions plus a term that goes to zero as the number of samples used to compute the empirical average increases.

The proof of this result is based on the bounded differences inequality, a symmetrization trick and the properties of Rademacher averages.

Using this general result, we can derive generalization error bounds for binary classification.
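As a brief sketch of this application (the set $\mathcal{H}$, the risk $L(h)$ and the empirical risk $\hat{L}_n(h)$ below are notation introduced only for this example): given a set $\mathcal{H}$ of binary classifiers, apply the result below to the random pair $(X,Y)$ and to the class $\F$ of 0-1 loss functions $f_h(X,Y) = \mathbb{1}\{h(X)\neq Y\}$, $h\in\mathcal{H}$, which take values in $[0,1]$ (so that $M=1$). Writing $L(h) = \E\, \mathbb{1}\{h(X)\neq Y\}$ and $\hat{L}_n(h) = \frac{1}{n}\sum_{i=1}^n \mathbb{1}\{h(X_i)\neq Y_i\}$, the bound then reads, with probability at least $1-\delta$ and uniformly over $h\in\mathcal{H}$, $$ L(h) \leq \hat{L}_n(h) + 2\mathcal{R}_n(\F) + \sqrt{\frac{\log \frac{1}{\delta}}{2 n}} . $$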

In pictures...

...

In maths...

The following result bounds the deviation between the empirical and true mean of a bounded function of a random variable, uniformly over all functions of a given function class, in terms of the Rademacher average of the class.

Let $(X_i)_{1\leq i\leq n}$ be a sequence of independent copies of the random variable $X$. Let $[a,b]\subset\R$ denote an interval of width $b-a=M$. Given a function class $\F \subset \X \rightarrow [a,b]$ with Rademacher average $\mathcal{R}_n(\F)$ and any confidence level $\delta\in(0,1)$, the bound $$ \sup_{f\in\F}\left( \E_{X} f(X) - \frac{1}{n} \sum_{i=1}^n f(X_i)\right) \leq 2 \mathcal{R}_n(\F) + M\sqrt{\frac{\log \frac{1}{\delta}}{2 n}} $$ holds with probability at least $1-\delta$.
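To get a sense of the magnitude of the second term of the bound, consider for instance functions taking values in $[0,1]$ (so that $M=1$), $n=1000$ samples and a confidence level $\delta = 0.05$: then $$ M\sqrt{\frac{\log \frac{1}{\delta}}{2 n}} = \sqrt{\frac{\log 20}{2000}} \approx 0.039 , $$ so that, with probability at least $0.95$, the maximal deviation exceeds twice the Rademacher average by at most about $0.04$.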

The proof is as follows.

First error bound via bounded differences: Consider the function $$ g(X_1,\dots,X_n) = \sup_{f\in\F} \left( \E_{X} f(X) - \frac{1}{n} \sum_{i=1}^n f(X_i) \right) $$ defined from the bounded functions $f : \X \rightarrow [a,b]$ and note that, when changing the value of a single variable $X_i$ to $X_i^\prime$, the value of $g$ cannot change by more than $M/n$.

Indeed, the value of the quantity $\tilde{g}(f,X_1,\dots,X_n) = \E_{X} f(X) - \frac{1}{n} \sum_{i=1}^n f(X_i)$ cannot change by more than $M/n$ as \begin{align} \tilde{g}(f,X_1,\dots,X_n) - \tilde{g}(f, X_1,\dots,X_i^\prime, \dots, X_n) &= \left(\E_{X} f(X) - \frac{1}{n} \sum_{j=1}^n f(X_j) \right)- \left( \E_{X} f(X) - \frac{f(X_i^\prime)}{n} - \frac{1}{n} \sum_{j\neq i}^n f(X_j)\right) \\ &= \frac{f(X_i^\prime) - f(X_i)}{n} \in \left[\frac{-M}{n}, \frac{M}{n}\right] . \end{align} To see the effect on $g$, let $f_0$ be the function at which the supremum is attained for a set of values of $(X_1,\dots,X_n)$: $g(X_1,\dots,X_n) = \E_{X} f_0(X) - \frac{1}{n} \sum_{i=1}^n f_0(X_i) = \tilde{g}(f_0,X_1,\dots,X_n)$ and $f_i$ be the function at which the supremum is attained when changing the value of $X_i$ to $X_i^\prime$: $g(X_1,\dots,X_i^\prime, \dots, X_n) =\E_{X} f_i(X) -\frac{f_i(X_i^\prime)}{n} - \frac{1}{n} \sum_{j\neq i}^n f_i(X_j) = \tilde{g}(f_i,X_1,\dots,X_i^\prime, \dots, X_n)$. Then, we have $$ \tilde{g}(f_0,X_1,\dots,X_n) \geq \tilde{g}(f_i,X_1,\dots,X_n) $$ and $$ \tilde{g}(f_0,X_1,\dots,X_i^\prime, \dots, X_n) \leq \tilde{g}(f_i,X_1,\dots,X_i^\prime, \dots, X_n) . $$ Thus, by the bounded differences of $\tilde{g}$, we also have $$ \tilde{g}(f_0,X_1,\dots,X_n) \leq \tilde{g}(f_0, X_1,\dots,X_i^\prime, \dots, X_n) + \frac{M}{n} \leq \tilde{g}(f_i,X_1,\dots,X_i^\prime, \dots, X_n) + \frac{M}{n} , $$ $$ \tilde{g}(f_0,X_1,\dots,X_n) \geq \tilde{g}(f_i,X_1,\dots,X_n) \geq \tilde{g}(f_i, X_1,\dots,X_i^\prime, \dots, X_n) - \frac{M}{n} $$ and finally $$ g(X_1,\dots,X_n) - g(X_1,\dots,X_i^\prime, \dots, X_n) = \tilde{g}(f_0,X_1,\dots,X_n) - \tilde{g}(f_i, X_1,\dots,X_i^\prime, \dots, X_n) \in \left[\frac{-M}{n}, \frac{M}{n}\right] . $$

Therefore, we can apply the bounded differences inequality to $g$ with constants $c_i = M/n$, $i=1,\dots,n$, and obtain that $$ P\left\{ g(X_1,\dots, X_n) - \E_{\g X_n} g(X_1,\dots, X_n) \geq \epsilon \right\} \leq \exp\left(\frac{-2n\epsilon^2}{M^2}\right), $$ or, by invoking the inversion principle with $\delta = \exp\left(\frac{-2n\epsilon^2}{M^2}\right)$, that $$ \sup_{f\in\F}\left( \E_{X} f(X) -\frac{1}{n} \sum_{i=1}^n f(X_i) \right) \leq \E_{\g X_n} \sup_{f\in\F} \left( \E_{X} f(X) - \frac{1}{n} \sum_{i=1}^n f(X_i) \right) + \sqrt{\frac{M^2 \log \frac{1}{\delta}}{2n}} $$ holds with probability at least $1-\delta$. Thus, in order to bound the supremum, we need to bound its expectation, i.e., the quantity $$ \E_{\g X_n} \sup_{f\in\F} \left( \E_{X} f(X) - \frac{1}{n} \sum_{i=1}^n f(X_i) \right). $$
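For completeness, the inversion amounts to solving $\delta = \exp\left(\frac{-2n\epsilon^2}{M^2}\right)$ for $\epsilon$: $$ \log \frac{1}{\delta} = \frac{2n\epsilon^2}{M^2} \quad\Leftrightarrow\quad \epsilon = \sqrt{\frac{M^2 \log \frac{1}{\delta}}{2n}} , $$ which is exactly the deviation term appearing in the bound above.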

Symmetrization step: Let $\g X_n = (X_i)_{1\leq i\leq n}$ be a sequence of $n$ independent and identically distributed random variables considered as copies of $X$ and introduce a ghost sample $\g X_n^\prime = (X_i^\prime)_{1\leq i\leq n}$ made of $n$ additional independent and identically distributed copies of $X$. Then, by the linearity of the expectation, we have $$ \E_{\g X_n^\prime} \left[ \frac{1}{n} \sum_{i=1}^n f(X_i^\prime) \right] = \frac{1}{n} \sum_{i=1}^n \E_{X_i^\prime} f(X_i^\prime) = \frac{1}{n} \sum_{i=1}^n \E_{X} f(X) = \E_{X} f(X) $$ and thus \begin{align} \E_{\g X_n} \sup_{f\in\F} \left( \E_{X} f(X) - \frac{1}{n} \sum_{i=1}^n f(X_i) \right) &= \E_{\g X_n} \sup_{f\in\F} \left(\E_{\g X_n^\prime} \left[ \frac{1}{n} \sum_{i=1}^n f(X_i^\prime) \right] - \frac{1}{n} \sum_{i=1}^n f(X_i) \right) \\ &= \E_{\g X_n} \sup_{f\in\F} \left( \E_{\g X_n^\prime} \left[ \left. \frac{1}{n} \sum_{i=1}^n f(X_i^\prime) - \frac{1}{n} \sum_{i=1}^n f(X_i) \ \right| \ \g X_n \right] \right) & f(X_i) \mbox{ is constant wrt. } \g X_n^\prime\\ &= \E_{\g X_n} \sup_{f\in\F} \left( \E_{\g X_n^\prime} \left[ \left. \frac{1}{n} \sum_{i=1}^n ( f(X_i^\prime) - f(X_i)) \ \right| \ \g X_n \right] \right) \\ &\leq \E_{\g X_n}\E_{\g X_n^\prime} \left[ \left. \sup_{f\in\F} \left(\frac{1}{n} \sum_{i=1}^n ( f(X_i^\prime) - f(X_i)) \right) \ \right| \ \g X_n \right] & \mbox{(by } \href{jenseninequality.html}{\mbox{Jensen's inequality}}) \\ &= \E_{\g X_n \g X_n^\prime} \sup_{f\in\F} \left(\frac{1}{n} \sum_{i=1}^n ( f(X_i^\prime) - f(X_i)) \right) & \mbox{(by the }\href{totalprobability.html}{\mbox{law of total expectation}}) \end{align}

Introducing the Rademacher average: Next, let $\g \sigma_n = (\sigma_i)_{1\leq i\leq n}$ denote a sequence of independent Rademacher variables, uniformly distributed in $\{-1,+1\}$ and independent of $\g X_n$ and $\g X_n^\prime$. Since $X_i$ and $X_i^\prime$ are independent and identically distributed, $D_i = f(X_i^\prime) - f(X_i)$ is a symmetric random variable with the same distribution as $-D_i = f(X_i) - f(X_i^\prime)$. Thus, the random variable $\sigma_i D_i = \pm D_i$ also shares this distribution and we can replace $D_i$ by $\sigma_i D_i$ in the expectation above. By the same argument, the random variables $\sigma_i f(X_i^\prime)$ and $-\sigma_i f(X_i)$ are both distributed as $\sigma_i f(X_i)$. This leads to \begin{align} \E_{\g X_n \g X_n^\prime} & \sup_{f\in\F} \left(\frac{1}{n} \sum_{i=1}^n ( f(X_i^\prime) - f(X_i)) \right) \\ &= \E_{\g X_n \g X_n^\prime \g \sigma_n} \sup_{f\in\F} \left(\frac{1}{n} \sum_{i=1}^n \sigma_i (f(X_i^\prime) - f(X_i) )\right) \\ &\leq \E_{\g X_n \g X_n^\prime \g \sigma_n} \left[\sup_{f\in\F} \frac{1}{n} \sum_{i=1}^n \sigma_i f(X_i^\prime) + \sup_{f\in\F} \frac{1}{n} \sum_{i=1}^n -\sigma_i f(X_i)\right] & (\sup_{f} (a_f+b_f) \leq \sup_{f} a_f + \sup_{f} b_f) \\ &= \E_{\g X_n \g X_n^\prime \g \sigma_n} \left[\sup_{f\in\F} \frac{1}{n} \sum_{i=1}^n \sigma_i f(X_i^\prime) \right] + \E_{\g X_n \g X_n^\prime \g \sigma_n} \left[\sup_{f\in\F} \frac{1}{n} \sum_{i=1}^n -\sigma_i f(X_i) \right] & \mbox{(by the } \href{expectation.html}{\mbox{linearity of } \E} )\\ & = \E_{\g X_n \g \sigma_n} \left[\sup_{f\in\F} \frac{1}{n} \sum_{i=1}^n \sigma_i f(X_i) \right] + \E_{\g X_n \g \sigma_n} \left[\sup_{f\in\F} \frac{1}{n} \sum_{i=1}^n \sigma_i f(X_i)\right] & (\sigma_if(X_i^\prime) \mbox{ and } -\sigma_i f(X_i) \mbox{ are distributed as } \sigma_if(X_i)) \\ & = 2 \E_{\g X_n \g \sigma_n} \sup_{f\in\F} \frac{1}{n} \sum_{i=1}^n \sigma_i f(X_i) = 2 \mathcal{R}_n(\F) \end{align}

Therefore, $$ \E_{\g X_n} \sup_{f\in\F} \left(\E_{X} f(X) - \frac{1}{n} \sum_{i=1}^n f(X_i) \right) \leq 2 \mathcal{R}_n(\F) , $$ which, once injected into the bounded differences bound, yields that $$ \sup_{f\in\F}\left( \E_{X} f(X) - \frac{1}{n} \sum_{i=1}^n f(X_i)\right) \leq 2 \mathcal{R}_n(\F) + \sqrt{\frac{M^2\log \frac{1}{\delta}}{2 n}} $$ holds with probability at least $1-\delta$.
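As a quick sanity check, consider a singleton class $\F = \{f\}$ (an example not treated above): the supremum disappears and, since the $\sigma_i$ are centered and independent of the $X_i$, $$ \mathcal{R}_n(\{f\}) = \E_{\g X_n \g \sigma_n} \frac{1}{n} \sum_{i=1}^n \sigma_i f(X_i) = \frac{1}{n} \sum_{i=1}^n \E[\sigma_i]\, \E[f(X_i)] = 0 , $$ so that the bound reduces to $\E_{X} f(X) - \frac{1}{n}\sum_{i=1}^n f(X_i) \leq M\sqrt{\frac{\log \frac{1}{\delta}}{2n}}$ with probability at least $1-\delta$, i.e., one side of Hoeffding's inequality for a single bounded function.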

Data-dependent bound

The data-dependent bound $$ \sup_{f\in\F}\left( \E_{X} f(X)- \frac{1}{n} \sum_{i=1}^n f(X_i) \right) \leq 2 \hat{\mathcal{R}}_n(\F) + 3M\sqrt{\frac{\log \frac{2}{\delta}}{2n}} , $$ which involves the empirical Rademacher average, $\hat{\mathcal{R}}_n(\F)$, instead of the Rademacher average, also holds with probability at least $1-\delta$.
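Here, the empirical Rademacher average denotes the conditional version computed on the observed sample, $$ \hat{\mathcal{R}}_n(\F) = \E_{\g \sigma_n} \left[ \left. \sup_{f\in\F} \frac{1}{n} \sum_{i=1}^n \sigma_i f(X_i) \ \right| \ \g X_n \right] , \qquad \mathcal{R}_n(\F) = \E_{\g X_n} \hat{\mathcal{R}}_n(\F) , $$ which can be estimated from the data alone, whereas $\mathcal{R}_n(\F)$ requires knowledge of the distribution of $X$.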

Proof: Applying the (second) bounded differences inequality again, but this time to $\hat{\mathcal{R}}_n(\F)$, which satisfies the bounded differences assumption with constants $c_i=M/n$, yields that $$ \mathcal{R}_n(\F) = \E_{\g X_n} \hat{\mathcal{R}}_n(\F) \leq \hat{\mathcal{R}}_n(\F) + M\sqrt{\frac{\log \frac{1}{\delta_2}}{2n}} $$ holds with probability at least $1-\delta_2$.
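To see that $\hat{\mathcal{R}}_n(\F)$ indeed satisfies this assumption, let $\hat{\mathcal{R}}_n^\prime(\F)$ (a notation used only here) denote its value after replacing $X_i$ by $X_i^\prime$. For every fixed $\g \sigma_n$, each term inside the supremum changes by at most $|\sigma_i|\,|f(X_i^\prime) - f(X_i)|/n \leq M/n$, so the supremum itself changes by at most $M/n$, and taking the expectation over $\g \sigma_n$ yields $$ \left| \hat{\mathcal{R}}_n(\F) - \hat{\mathcal{R}}_n^\prime(\F) \right| \leq \frac{M}{n} . $$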

In order to combine the two bounds into a result that holds with probability at least $1-\delta$, we use $\delta_2=\delta/2$ and also $\delta_1=\delta/2$ in the bound derived in the previous section, such that $$ \sup_{f\in\F}\left( \E_{X} f(X) - \frac{1}{n} \sum_{i=1}^n f(X_i)\right) \leq 2 \mathcal{R}_n(\F) + M\sqrt{\frac{\log \frac{1}{\delta_1}}{2 n}} $$ holds with probability at least $1-\delta_1$. Then, their combination leads to \begin{align} \sup_{f\in\F}\left( \E_{X} f(X)- \frac{1}{n} \sum_{i=1}^n f(X_i) \right) &\leq 2 \hat{\mathcal{R}}_n(\F) + 2M\sqrt{\frac{\log \frac{2}{\delta}}{2n}} + M\sqrt{\frac{\log \frac{2}{\delta}}{2 n}} \\ &= 2 \hat{\mathcal{R}}_n(\F) + 3M\sqrt{\frac{\log \frac{2}{\delta}}{2 n}} . \end{align} Since each of the two bounds can fail with probability at most $\delta_1$ and $\delta_2$ respectively, the union bound guarantees that they hold simultaneously, and hence that their combination holds, with probability at least $$ 1 - \delta_1 - \delta_2 = 1 - \frac{\delta}{2} - \frac{\delta}{2} = 1 - \delta . $$
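Continuing the numerical illustration used above ($M=1$, $n=1000$, $\delta=0.05$), the price to pay for replacing $\mathcal{R}_n(\F)$ by its empirical counterpart is a slightly larger confidence term: $$ 3M\sqrt{\frac{\log \frac{2}{\delta}}{2n}} = 3\sqrt{\frac{\log 40}{2000}} \approx 0.13 \quad \mbox{instead of} \quad M\sqrt{\frac{\log \frac{1}{\delta}}{2n}} \approx 0.039 . $$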