Rademacher averages

In words...

The Rademacher average of a class of functions is a capacity measure that can be interpreted as the ability of the functions in the class to fit random noise.

Given a function class, Rademacher averages provide sharp estimates of the maximal deviation of the empirical average from the mean over the functions in the class, and are used when deriving error bounds.

In maths...

A Rademacher random variable $\sigma$ is a random variable taking values in $\{-1,+1\}$ with probabilities $$ P(\sigma = 1) = P(\sigma = -1) = 1/2. $$
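Rademacher variables are easy to simulate. Here is a minimal sketch in Python using NumPy (all names are chosen here purely for illustration):

```python
import numpy as np

# Draw independent Rademacher variables: map {0, 1} to {-1, +1}.
rng = np.random.default_rng(0)
sigma = rng.integers(0, 2, size=1000) * 2 - 1

print(sigma.mean())          # close to 0, since E[sigma] = 0
print(np.abs(sigma).mean())  # exactly 1, since |sigma| = 1
```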

Given a function class $\F$ and a sequence $\g X_n = (X_i)_{1\leq i\leq n}$ of $n$ independent and identically distributed random variables, the Rademacher average of $\F$ is defined as $$ \mathcal{R}_n(\F) = \E_{\g X_n, \g \sigma_n} \sup_{f \in \F } \frac{1}{n} \sum_{i=1}^n \sigma_i f(X_i ) , $$ where $\g \sigma_n = (\sigma_i)_{1\leq i\leq n}$ is a sequence of independent Rademacher random variables that are also independent of $\g X_n$.

In the definition above, the expectation is taken over both the Rademacher variables $\g \sigma_n$ and the sequence $\g X_n$. The empirical (or conditional) Rademacher average of $\F$, $\hat{\mathcal{R}}_n(\F)$, is defined similarly except that the sequence $\g X_n$ is fixed and the expectation is taken only over the Rademacher variables: $$ \hat{\mathcal{R}}_n(\F) = \E_{\g \sigma_n} \left[ \left. \sup_{f \in \F } \frac{1}{n} \sum_{i=1}^n \sigma_i f(X_i )\ \right|\ \g X_n \right] . $$

By the law of total expectation, the two quantities defined above are directly related by $$ \mathcal{R}_n(\F) = \E_{\g X_n} \hat{\mathcal{R}}_n(\F) . $$
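The expectations above can be approximated by Monte Carlo sampling. Below is a minimal sketch for a finite class of threshold functions $f_c(x) = \mathbb{1}\{x \leq c\}$, chosen purely for illustration: it estimates $\hat{\mathcal{R}}_n(\F)$ for a fixed sample, and averaging the result over fresh draws of $\g X_n$ would estimate $\mathcal{R}_n(\F)$ (the snippets further down the page reuse `empirical_rademacher` and `F_values`).

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_rademacher(F_values, n_mc=10_000, rng=rng):
    """Monte Carlo estimate of the empirical Rademacher average.

    F_values: array of shape (N, n) whose rows are the vectors
    (f(X_1), ..., f(X_n)) for the N functions of a finite class F,
    with the sample X_n held fixed.
    """
    _, n = F_values.shape
    # Draw n_mc independent Rademacher sequences in {-1, +1}^n.
    sigma = rng.integers(0, 2, size=(n_mc, n)) * 2 - 1
    # For each sequence, take the sup over the class of
    # (1/n) * sum_i sigma_i f(X_i), then average over the draws.
    sups = (sigma @ F_values.T / n).max(axis=1)
    return sups.mean()

# Illustrative finite class: thresholds f_c(x) = 1{x <= c} on a fixed sample.
n = 100
X = rng.uniform(0.0, 1.0, size=n)
thresholds = np.linspace(0.0, 1.0, 51)
F_values = (X[None, :] <= thresholds[:, None]).astype(float)

print(empirical_rademacher(F_values))  # estimate of the empirical average
```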

Properties of Rademacher averages

Rademacher averages of a finite class $\F = \{f_1,\dots,f_N\}$ can be bounded via Massart's finite class lemma: $$ \hat{\mathcal{R}}_n(\F) \leq \max_{1\leq j\leq N} \|\g f_j\|\, \frac{\sqrt{2\ln N}}{n} , $$ where $\g f_j = (f_j(X_i))_{1\leq i\leq n}$.
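Continuing the Monte Carlo sketch above, this bound can be checked numerically for the threshold class:

```python
# Compare the Monte Carlo estimate with Massart's bound
# max_j ||f_j|| * sqrt(2 ln N) / n for the finite class above.
norms = np.linalg.norm(F_values, axis=1)  # ||(f_j(X_1), ..., f_j(X_n))||
massart = norms.max() * np.sqrt(2 * np.log(F_values.shape[0])) / n
print(empirical_rademacher(F_values), "<=", massart)
```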

Rademacher averages are invariant under translation, i.e., for any scalar $a\in\R$, $$ \mathcal{R}_n(a+\F) = \E_{\g X_n, \g \sigma_n} \sup_{f \in \F } \frac{1}{n} \sum_{i=1}^n \sigma_i (a+f(X_i )) = \mathcal{R}_n(\F), $$ where $a+\F$ denotes the class of functions $a+f$ with $f\in\F$.

Proof: We have \begin{align} \mathcal{R}_n(a+\F) &= \E_{\g X_n, \g \sigma_n} \sup_{f \in \F } \frac{1}{n} \sum_{i=1}^n \sigma_i (a+f(X_i )) \\ &= \E_{\g X_n, \g \sigma_n} \sup_{f \in \F } \left( \frac{1}{n} \sum_{i=1}^n \sigma_i a+ \frac{1}{n} \sum_{i=1}^n \sigma_i f(X_i ) \right) \\ &= \E_{\g X_n, \g \sigma_n}\left[ \frac{1}{n} \sum_{i=1}^n \sigma_i a + \sup_{f \in \F } \frac{1}{n} \sum_{i=1}^n \sigma_i f(X_i ) \right] & (\sup_b (a+b) = a+\sup_b b )\\ &= \frac{a}{n} \sum_{i=1}^n \E_{\sigma_i} \sigma_i + \E_{\g X_n, \g \sigma_n}\sup_{f \in \F } \frac{1}{n} \sum_{i=1}^n \sigma_i f(X_i ) & (\mbox{by the linearity of the expectation}) \\ &= 0 + \mathcal{R}_n(\F) & (\mbox{the } \sigma_i \mbox{ are centered }). \end{align}
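This invariance is easy to observe numerically. Continuing the sketch above, shifting every function in the class by a constant leaves the Monte Carlo estimate unchanged up to sampling noise:

```python
# Translation invariance: R_n(a + F) = R_n(F). The two Monte Carlo
# estimates below should differ only by sampling noise.
a = 3.7
est_shifted = empirical_rademacher(F_values + a)
est_plain = empirical_rademacher(F_values)
print(abs(est_shifted - est_plain))  # small: only Monte Carlo noise remains
```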

Contraction principle

The contraction principle provides an upper bound on the Rademacher average of a function class composed with another function.

If $\phi: \R\rightarrow \R$ is a Lipschitz continuous function with Lipschitz constant $L_{\phi}$, i.e., if $$ \forall (t,s)\in \R^2,\quad |\phi(t) - \phi(s)| \leq L_{\phi} |t-s|, $$ then $$ \E_{\g X_n, \g \sigma_n} \sup_{f \in \F } \frac{1}{n} \sum_{i=1}^n \sigma_i \phi \circ f(X_i ) \ \leq \ L_{\phi}\ \E_{\g X_n, \g \sigma_n} \sup_{f \in \F } \frac{1}{n} \sum_{i=1}^n \sigma_i f(X_i ) $$ and thus $$ \mathcal{R}_n ( \phi \circ \F ) \leq L_{\phi} \mathcal{R}_n ( \F ) , $$ where $\phi \circ \F$ denotes the class of functions $\phi \circ f$ with $f \in\F$.

The version of the contraction principle shown here is actually not restricted to contractions, meaning that the result still holds if $\phi$ has a Lipschitz constant larger than 1.
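As a quick numerical illustration (continuing the sketch above), composing the class with the 1-Lipschitz function $\phi(t) = \max(0, t - 1/2)$ should not increase the estimated Rademacher average:

```python
# Contraction principle with the 1-Lipschitz function
# phi(t) = max(0, t - 1/2), so that R_n(phi o F) <= R_n(F).
phi_F = np.maximum(0.0, F_values - 0.5)
print(empirical_rademacher(phi_F), "<=", empirical_rademacher(F_values))
```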

We will show that, for any $T\subset \R^n$ and any sequence of functions $\phi_i$ with Lipschitz constants $L_{\phi_i}$, $$ \E_{\g \sigma_n} \sup_{t \in T } \sum_{i=1}^n \sigma_i \phi_i( t_i) \leq \E_{\g \sigma_n} \sup_{t \in T } \sum_{i=1}^n L_{\phi_i}\sigma_i t_i , $$ from which the statement follows by letting $T = \{ \g t \in \R^n \ :\ t_i = f(X_i) ,\ f \in\F \}$ and $\phi_i = \phi$ for all $i$, taking the expectation with respect to $\g X_n$, and dividing both sides by $n$.

Step 1: To prove the above, we first establish a two-variable version of the result: for any $T_2\subset \R^2$ and any function $\phi$ that is Lipschitz continuous with constant $L_{\phi}$, $$ \E_{\sigma_2} \sup_{t \in T_2 } ( t_1 + \sigma_2 \phi(t_2) ) \leq \E_{\sigma_2} \sup_{t \in T_2 } ( t_1 + L_{\phi} \sigma_2 t_2). $$ Expanding the expectation over $\sigma_2\in\{-1,+1\}$, the right-hand side can be rewritten as \begin{align} R &= \frac{1}{2} \sup_{t \in T_2 } ( t_1 + L_{\phi} t_2) + \frac{1}{2} \sup_{t \in T_2 } ( t_1 - L_{\phi}t_2) \\ &= \frac{1}{2} \sup_{t \in T_2, s\in T_2 } ( t_1 + L_{\phi}t_2 + s_1 - L_{\phi} s_2) \end{align} while the left-hand side is \begin{align} L &= \frac{1}{2} \sup_{t \in T_2} ( t_1 + \phi(t_2)) + \frac{1}{2} \sup_{t \in T_2 } ( t_1 - \phi(t_2)) \\ &= \frac{1}{2} \sup_{t \in T_2, s\in T_2 } ( t_1 + \phi(t_2) + s_1 - \phi(s_2) )\\ &= \sup_{t\in T_2, s\in T_2} I(t,s) \end{align} with $$ I(t,s) = \frac{1}{2}(t_1 + \phi(t_2) + s_1 - \phi(s_2) ) . $$ These computations show that if $R \geq I(t,s)$ for all $(t,s)$, then $R \geq L$.

Step 1 - case 1: Assume $t_2 \geq s_2$. Then, by contraction (or more precisely, by the Lipschitz condition), we have $$ |\phi(t_2) - \phi(s_2)| \leq L_{\phi} |t_2 - s_2| = L_{\phi} (t_2 - s_2) $$ and thus $$ \phi(t_2) - \phi(s_2) \leq L_{\phi}t_2 - L_{\phi}s_2 $$ and $$ L_{\phi}s_2 - \phi(s_2) \leq L_{\phi}t_2 - \phi(t_2) . $$ Thus, \begin{align} 2I(t,s) &= t_1 + \phi(t_2) + s_1 - \phi(s_2)\\ &= t_1 + \phi(t_2) + s_1 - \phi(s_2) + L_{\phi}s_2 - L_{\phi}s_2 \\ &\leq t_1 + \phi(t_2) + s_1 + L_{\phi}t_2 - \phi(t_2) - L_{\phi}s_2 \\ &= t_1 + L_{\phi} t_2 + s_1 - L_{\phi}s_2 \end{align} and \begin{align} I(t,s) &\leq \frac{1}{2}\sup_{t\in T_2, s\in T_2 } (t_1 + L_{\phi}t_2 + s_1 - L_{\phi} s_2) = R. \end{align}

Step 1 - case 2: Assume $t_2 \leq s_2$. Then, by contraction, we have $|\phi(t_2) - \phi(s_2)| \leq L_{\phi}|t_2 - s_2| = L_{\phi}s_2 - L_{\phi}t_2$ and thus $$ \phi(t_2) - \phi(s_2) \leq L_{\phi}s_2 - L_{\phi}t_2 $$ and $$ L_{\phi}t_2 + \phi(t_2) \leq L_{\phi} s_2 + \phi(s_2) . $$ Thus, \begin{align} 2I(t,s) &= t_1 + \phi(t_2) + s_1 - \phi(s_2) + L_{\phi}t_2 - L_{\phi}t_2 \\ &\leq t_1 - L_{\phi}t_2 + s_1 - \phi(s_2) + L_{\phi} s_2 + \phi(s_2) \\ &= t_1 - L_{\phi}t_2 + s_1 + L_{\phi} s_2 \end{align} and, by exchanging the roles of $t$ and $s$ (both range over $T_2$ in the supremum), \begin{align} I(t,s) &\leq \frac{1}{2}\sup_{t\in T_2, s\in T_2 } (t_1 +L_{\phi} t_2 + s_1 - L_{\phi}s_2) =R. \end{align}

Step 1 - conclusion: For all $(t,s)\in T_2^2$, $I(t,s) \leq R$ and $L= \sup_{t\in T_2, s\in T_2} I(t,s) \leq R$.
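On a finite set $T_2$, the expectation over $\sigma_2$ is an exact two-term average, so the Step 1 inequality can be checked directly. A small self-contained sketch with the 1-Lipschitz choice $\phi(t) = |t|$ (an arbitrary example):

```python
import numpy as np

# Check the Step 1 lemma on a random finite T_2 in R^2 with phi(t) = |t|:
# E_sigma sup (t1 + sigma*phi(t2)) <= E_sigma sup (t1 + sigma*t2), L_phi = 1.
rng = np.random.default_rng(1)
T2 = rng.normal(size=(20, 2))
lhs = 0.5 * ((T2[:, 0] + np.abs(T2[:, 1])).max()
             + (T2[:, 0] - np.abs(T2[:, 1])).max())
rhs = 0.5 * ((T2[:, 0] + T2[:, 1]).max() + (T2[:, 0] - T2[:, 1]).max())
print(lhs <= rhs + 1e-12)  # expected: True
```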

Step 2: As an illustrative example, choose $$ T_2 = \left\{ (u_1, u_2) \in \R^2 : u_1 = \sum_{i\neq 2} \sigma_i t_i,\ u_2 = t_2,\ \g t \in T ,\ \sigma_i \in\{-1,+1\} \right\} $$ and apply the result of Step 1 with $\phi = \phi_2$ while conditioning on all Rademacher variables but $\sigma_2$: $$ \E_{\sigma_2} \left[\left. \sup_{\g t \in T } \left( \sum_{i\neq 2} \sigma_i t_i + \sigma_2 \phi_2(t_2) \right) \right|\ \sigma_{i\neq 2}\right] \leq \E_{\sigma_2} \left[\left. \sup_{\g t \in T } \left( \sum_{i\neq 2} \sigma_i t_i + L_{\phi_2}\sigma_2 t_2 \right) \right|\ \sigma_{i\neq 2} \right] . $$ Thus, by the law of total expectation, \begin{align} \E_{\g \sigma_n} \sup_{\g t \in T } \left( \sum_{i\neq 2} \sigma_i t_i + \sigma_2 \phi_2(t_2) \right) &= \E_{ \sigma_i, i\neq 2} \E_{\sigma_2} \left[\left. \sup_{\g t \in T } \left( \sum_{i\neq 2} \sigma_i t_i + \sigma_2 \phi_2(t_2) \right) \right|\ \sigma_{i\neq 2}\right] \\ & \leq \E_{ \sigma_i, i\neq 2} \E_{\sigma_2} \left[\left. \sup_{\g t \in T } \left( \sum_{i\neq 2} \sigma_i t_i + L_{\phi_2}\sigma_2 t_2 \right) \right|\ \sigma_{i\neq 2}\right] \\ & = \E_{\g \sigma_n}\sup_{\g t \in T } \left( \sum_{i\neq 2} \sigma_i t_i + L_{\phi_2}\sigma_2 t_2 \right) \end{align} and we have shown that, in $\E_{\g \sigma_n} \sup_{\g t \in T } \sum_{i} \sigma_i t_i$, replacing the term $L_{\phi_2}\sigma_2 t_2$ by $\sigma_2 \phi_2(t_2)$ can only decrease the quantity.
In fact, we can iterate this argument over the coordinates. For $j$ from $1$ to $n$, let $$ T_j = \left\{ (u_1, u_2) \in \R^2 : u_1 = \sum_{i < j} L_{\phi_i}\sigma_i t_i + \sum_{i > j} \sigma_i \phi_i(t_i) ,\ u_2 = t_j,\ \g t\in T ,\ \sigma_i \in\{-1,+1\} \right\} $$ and apply the result of Step 1 with $\phi = \phi_j$ while conditioning on all Rademacher variables but $\sigma_j$, which yields $$ \E_{\g \sigma_n} \sup_{\g t \in T } \left( \sum_{i < j} L_{\phi_i}\sigma_i t_i + \sum_{i \geq j} \sigma_i \phi_i(t_i) \right) \leq \E_{\g \sigma_n} \sup_{\g t \in T } \left( \sum_{i \leq j} L_{\phi_i}\sigma_i t_i + \sum_{i > j} \sigma_i \phi_i(t_i) \right) . $$ Each step replaces one term $\sigma_j \phi_j(t_j)$ by $L_{\phi_j}\sigma_j t_j$, so chaining these inequalities from $j=1$ to $j=n$ gives $$ \E_{\g \sigma_n} \sup_{\g t \in T } \sum_{i=1}^n \sigma_i \phi_i(t_i) \leq \E_{\g \sigma_n}\sup_{\g t \in T } \sum_{i=1}^n L_{\phi_i}\sigma_i t_i . $$

A trivial upper bound on the Rademacher average of bounded functions

Given a class $\F$ of bounded functions from $\X$ to $[a,b]$, of diameter $M = b-a$, let $\tilde{\F} = \F - \frac{a+b}{2}$ be the centered version of $\F$, so that every $f\in\tilde{\F}$ takes values in $[-M/2, M/2]$. Then, we have \begin{align} \mathcal{R}_n ( \F ) &= \mathcal{R}_n ( \tilde{\F} ) &\mbox{(by the translation invariance of } \mathcal{R}_n ) \\ &=\E_{\g X_n, \g \sigma_n} \sup_{f \in \tilde{\F} } \frac{1}{n} \sum_{i=1}^n \sigma_i f(X_i ) \\ &= \E_{\g X_n, \g \sigma_n} \sup_{f \in \tilde{\F} } \frac{1}{n} \inner{\g \sigma_n}{\g f_n} & \left(\mbox{with } \g f_n = (f(X_i))_{1\leq i \leq n} \right) \\ &\leq \E_{\g X_n, \g \sigma_n} \sup_{f \in \tilde{\F} } \frac{1}{n} \|\g \sigma_n\| \| \g f_n \| &\mbox{(by the } \href{innerproduct.html#cauchy}{\mbox{Cauchy-Schwarz inequality}}) \\ &= \frac{1}{n} \E_{\g X_n, \g \sigma_n} \|\g \sigma_n\| \sup_{f \in \tilde{\F} } \| \g f_n \| \\ &\leq \frac{1}{n} \sqrt{n} \sqrt{n \frac{M^2}{4}} & \mbox{(since } \|\g \sigma_n\| = \sqrt{n} \mbox{ and } |f(X_i)| \leq M / 2) \\ &= \frac{M}{2} . \end{align}
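Continuing the numerical sketch above: the threshold class takes values in $[0,1]$, so $M = 1$ and the Monte Carlo estimate should stay below $M/2 = 1/2$ (in fact well below it for this class):

```python
# The class above maps into [0, 1], so M = 1 and the trivial bound
# gives R_n(F) <= M/2 = 0.5.
print(empirical_rademacher(F_values), "<= 0.5")
```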

Notes

The contraction principle is shown here in its most useful form for machine learning. Its proof is adapted from the original and more general Theorem 4.12 in (Ledoux and Talagrand, 1991) (though the original version focuses on contractions with $L_{\phi} \leq 1$ and does not include $L_{\phi}$ in the result).

The contraction principle is often stated in the machine learning literature with the additional requirement that $\phi(0)=0$, which is an artifact of the more general proof in (Ledoux and Talagrand, 1991) and is unnecessary for the version stated above. In addition, such a requirement can easily be relaxed by using the translation invariance of Rademacher averages with $a=\phi(0)$ and by applying the contraction principle to $\tilde{\phi} = \phi - a$ (which satisfies $\tilde{\phi}(0) = 0$ and $L_{\tilde{\phi}} = L_{\phi}$): $$ \mathcal{R}_n(\phi\circ \F) = \mathcal{R}_n(a + \tilde{\phi}\circ \F) = \mathcal{R}_n(\tilde{\phi}\circ \F) \leq L_{\tilde{\phi}} \mathcal{R}_n ( \F ) = L_{\phi} \mathcal{R}_n ( \F ). $$
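Continuing the sketch above, one can observe numerically that $\phi(0)=0$ is indeed not needed: $\phi(t) = |t - 0.3| + 5$ is 1-Lipschitz with $\phi(0) \neq 0$, yet the bound still holds.

```python
# phi(t) = |t - 0.3| + 5 is 1-Lipschitz with phi(0) != 0; translation
# invariance removes the offset and the contraction bound still holds.
phi_F = np.abs(F_values - 0.3) + 5.0
print(empirical_rademacher(phi_F), "<=", empirical_rademacher(F_values))
```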

Finally, note that the proof of the contraction principle actually establishes a more general result in which the single function $\phi$ is replaced by a sequence of functions $\phi_i$ that may differ across indices $i$ and may even depend on $\g X_n$, as long as each $\phi_i$ is Lipschitz continuous.