
Rademacher averages

In words...

The Rademacher average of a class of functions is a capacity measure that can be interpreted as the ability of the functions in the class to fit (or correlate with) random noise.

Given a function class, Rademacher averages provide sharp estimates of the maximal deviation, over the class, of the empirical average of a function from its mean, and are used when deriving error bounds.

In pictures...

...

In maths...

A Rademacher random variable \sigma is a random variable taking values in \{-1,+1\} with probabilities P(\sigma=-1)=P(\sigma=+1)=1/2.

Given a function class F and a sequence \g X_n = (X_i)_{1\leq i\leq n} of n independent and identically distributed random variables, the Rademacher average of \F is defined as \mathcal{R}_n(\F) = \E_{\g X_n, \g \sigma_n} \sup_{f \in \F } \frac{1}{n} \sum_{i=1}^n \sigma_i f(X_i ) , where \g \sigma_n = (\sigma_i)_{1\leq i\leq n} is a sequence of independent Rademacher random variables that are also independent of \g X_n.

In the definition above, the expectation is taken over both the Rademacher variables \g \sigma_n and the sequence \g X_n. The empirical (or conditional) Rademacher average of \F, \hat{\mathcal{R}}_n(\F), is defined similarly except that the sequence \g X_n is fixed and the expectation is taken only over the Rademacher variables: \hat{\mathcal{R}}_n(\F) = \E_{\g \sigma_n} \left[ \left. \sup_{f \in \F } \frac{1}{n} \sum_{i=1}^n \sigma_i f(X_i )\ \right|\ \g X_n \right] .

By the law of total expectation, the two quantities defined above are directly related by \mathcal{R}_n(\F) = \E_{\g X_n} \hat{\mathcal{R}}_n(\F) .
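As a concrete illustration (not part of the original definitions), the empirical Rademacher average of a small finite class can be estimated by Monte Carlo: fix a sample \g X_n, draw many Rademacher sign vectors, and average the supremum of the normalized correlations. The class of thresholded functions, the sample size and the number of draws in the sketch below are arbitrary choices made for the example.

```python
# Monte Carlo sketch of the empirical Rademacher average hat{R}_n(F).
# The finite class of thresholded functions and all sizes below are
# illustrative assumptions, not taken from the text.
import numpy as np

rng = np.random.default_rng(0)

n = 200
X = rng.normal(size=n)                       # fixed sample X_n

# F = {f_1, ..., f_N}: sign functions with different thresholds, values in {-1, +1}
thresholds = np.linspace(-2, 2, 25)
F = np.stack([np.where(X <= t, -1.0, 1.0) for t in thresholds])   # shape (N, n)

def empirical_rademacher(F_values, n_draws=5000, rng=rng):
    """Estimate hat{R}_n(F) = E_sigma sup_f (1/n) sum_i sigma_i f(X_i)."""
    n_pts = F_values.shape[1]
    sups = np.empty(n_draws)
    for k in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n_pts)   # Rademacher signs
        sups[k] = (F_values @ sigma).max() / n_pts    # sup over f of the correlation
    return sups.mean()

print("hat R_n(F) ~", empirical_rademacher(F))
```

Averaging such estimates over many independent samples \g X_n would in turn approximate \mathcal{R}_n(\F) itself.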

Properties of Rademacher averages

Rademacher averages of a finite class \F = \{f_1,\dots,f_N\} can be bounded by ...

Rademacher averages are invariant under translation, i.e., for any scalar a\in\R, \mathcal{R}_n(a+\F) = \E_{\g X_n, \g \sigma_n} \sup_{f \in \F } \frac{1}{n} \sum_{i=1}^n \sigma_i (a+f(X_i )) = \mathcal{R}_n(\F).

Proof: We have \begin{align} \mathcal{R}_n(a+\F) &= \E_{\g X_n, \g \sigma_n} \sup_{f \in \F } \frac{1}{n} \sum_{i=1}^n \sigma_i (a+f(X_i )) \\ &= \E_{\g X_n, \g \sigma_n} \sup_{f \in \F } \left( \frac{1}{n} \sum_{i=1}^n \sigma_i a+ \frac{1}{n} \sum_{i=1}^n \sigma_i f(X_i ) \right) \\ &= \E_{\g X_n, \g \sigma_n}\left[ \frac{1}{n} \sum_{i=1}^n \sigma_i a + \sup_{f \in \F } \frac{1}{n} \sum_{i=1}^n \sigma_i f(X_i ) \right] & (\sup_b (a+b) = a+\sup_b b )\\ &= \frac{a}{n} \sum_{i=1}^n \E_{\sigma_i} \sigma_i + \E_{\g X_n, \g \sigma_n}\sup_{f \in \F } \frac{1}{n} \sum_{i=1}^n \sigma_i f(X_i ) & (\mbox{by the linearity of the expectation}) \\ &= 0 + \mathcal{R}_n(\F) & (\mbox{the } \sigma_i \mbox{ are centered }). \end{align}
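Translation invariance can also be observed numerically. The sketch below reuses the class F and the empirical_rademacher helper from the snippet above (both illustrative assumptions); the two estimates should agree up to Monte Carlo error.

```python
# Translation invariance, checked numerically (sketch): reuses F and
# empirical_rademacher from the previous snippet; a = 3.7 is arbitrary.
a = 3.7
print(empirical_rademacher(F))      # estimate of hat R_n(F)
print(empirical_rademacher(a + F))  # estimate of hat R_n(a + F), close to the above
```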

Contraction principle

The contraction principle provides an upper bound on the Rademacher average of a function class composed with another function.

If \phi: \R\rightarrow \R is a Lipschitz continuous function with Lipschitz constant L_{\phi}, i.e., if \forall (t,s)\in \R^2,\quad |\phi(t) - \phi(s)| \leq L_{\phi} |t-s|, then \E_{\g X_n, \g \sigma_n} \sup_{f \in \F } \frac{1}{n} \sum_{i=1}^n \sigma_i \phi \circ f(X_i ) \ \leq \ L_{\phi}\ \E_{\g X_n, \g \sigma_n} \sup_{f \in \F } \frac{1}{n} \sum_{i=1}^n \sigma_i f(X_i ) and thus \mathcal{R}_n ( \phi \circ \F ) \leq L_{\phi} \mathcal{R}_n ( \F ) , where \phi \circ \F denotes the class of functions \phi \circ f with f \in\F.

The version of the contraction principle shown here is actually not restricted to contractions, meaning that the result still holds if \phi has a Lipschitz constant larger than 1.
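The inequality can be illustrated numerically. The sketch below reuses F and empirical_rademacher from the first snippet and compares the estimated Rademacher average of \phi\circ\F with L_{\phi}\hat{\mathcal{R}}_n(\F) for two arbitrary Lipschitz functions, \tanh (with L_{\phi}=1) and t \mapsto 0.5\,t (with L_{\phi}=0.5); up to Monte Carlo error, the first value should not exceed the second.

```python
# Numerical illustration (sketch) of the contraction principle:
# hat R_n(phi o F) <= L_phi * hat R_n(F). Reuses F and empirical_rademacher
# from the first snippet; the choices of phi are arbitrary.
import numpy as np

for phi, L_phi, name in [(np.tanh, 1.0, "tanh"), (lambda v: 0.5 * v, 0.5, "0.5*t")]:
    lhs = empirical_rademacher(phi(F))       # estimate of hat R_n(phi o F)
    rhs = L_phi * empirical_rademacher(F)    # L_phi * estimate of hat R_n(F)
    print(f"{name}: R_n(phi o F) ~ {lhs:.4f},  L_phi * R_n(F) ~ {rhs:.4f}")
```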

We will show that, for any T\subset \R^n and any sequence of functions \phi_i with Lipschitz constants L_{\phi_i}, \E_{\g \sigma_n} \sup_{\g t \in T } \sum_{i=1}^n \sigma_i \phi_i( t_i) \leq \E_{\g \sigma_n} \sup_{\g t \in T } \sum_{i=1}^n L_{\phi_i}\sigma_i t_i , from which the statement follows by letting T = \{ \g t \in \R^n \ :\ t_i = f(X_i) ,\ f \in\F \} for a fixed sample \g X_n, \phi_i = \phi, taking expectations with respect to \g X_n and dividing both sides by n.

Step 1: To prove the above, we first show the result with only two variables: for any T_2\subset \R^2 and any Lipschitz function \phi with constant L_{\phi_2} (playing the role of \phi_2), \E_{\sigma_2} \sup_{t \in T_2 } ( t_1 + \sigma_2 \phi(t_2) ) \leq \E_{\sigma_2} \sup_{t \in T_2 } ( t_1 + L_{\phi_2} \sigma_2 t_2). The right-hand side can be rewritten as \begin{align} R &= \frac{1}{2} \sup_{t \in T_2 } ( t_1 + L_{\phi_2} t_2) + \frac{1}{2} \sup_{t \in T_2 } ( t_1 - L_{\phi_2}t_2) \\ &= \frac{1}{2} \sup_{t \in T_2, s\in T_2 } ( t_1 + L_{\phi_2}t_2 + s_1 - L_{\phi_2} s_2) \end{align} while the left-hand side is \begin{align} L &= \frac{1}{2} \sup_{t \in T_2} ( t_1 + \phi(t_2)) + \frac{1}{2} \sup_{t \in T_2 } ( t_1 - \phi(t_2)) \\ &= \frac{1}{2} \sup_{t \in T_2, s\in T_2 } ( t_1 + \phi(t_2) + s_1 - \phi(s_2) )\\ &= \sup_{t\in T_2, s\in T_2} I(t,s) \end{align} with I(t,s) = \frac{1}{2}(t_1 + \phi(t_2) + s_1 - \phi(s_2) ) . These computations show that if R \geq I(t,s) for all (t,s), then R \geq L.

Step 1 - case 1: Assume t_2 \geq s_2. Then, by contraction (or more precisely by the Lipschitz condition), we have |\phi(t_2) - \phi(s_2)| \leq L_{\phi_2} |t_2 - s_2| = L_{\phi_2} (t_2 - s_2) and thus \phi(t_2) - \phi(s_2) \leq L_{\phi_2}t_2 - L_{\phi_2}s_2 and L_{\phi_2}s_2 - \phi(s_2) \leq L_{\phi_2}t_2 - \phi(t_2) . Thus, \begin{align} 2I(t,s) &= t_1 + \phi(t_2) + s_1 - \phi(s_2)\\ &= t_1 + \phi(t_2) + s_1 - \phi(s_2) + L_{\phi_2}s_2 - L_{\phi_2}s_2 \\ &\leq t_1 + \phi(t_2) + s_1 + L_{\phi_2}t_2 - \phi(t_2) - L_{\phi_2}s_2 \\ &\leq t_1 + L_{\phi_2} t_2 + s_1 - L_{\phi_2}s_2 \end{align} and \begin{align} I(t,s) &\leq \frac{1}{2}\sup_{t\in T_2, s\in T_2 } (t_1 + L_{\phi_2}t_2 + s_1 - L_{\phi_2} s_2) = R. \end{align}

Step 1 - case 2: Assume t_2 \leq s_2. Then, by contraction, we have |\phi(t_2) - \phi(s_2)| \leq L_{\phi_2}|t_2 - s_2| = L_{\phi_2}s_2 - L_{\phi_2}t_2 and thus \phi(t_2) - \phi(s_2) \leq L_{\phi_2}s_2 - L_{\phi_2}t_2 and L_{\phi_2}t_2 + \phi(t_2) \leq L_{\phi_2} s_2 + \phi(s_2) . Thus, \begin{align} 2I(t,s) &= t_1 + \phi(t_2) + s_1 - \phi(s_2) + L_{\phi_2}t_2 - L_{\phi_2}t_2 \\ &\leq t_1 - L_{\phi_2}t_2 + s_1 - \phi(s_2) + L_{\phi_2} s_2 + \phi(s_2) \\ &\leq t_1 - L_{\phi_2}t_2 + s_1 + L_{\phi_2} s_2 \end{align} and, since the supremum defining R is taken over both t and s in T_2 (so that the roles of t and s can be exchanged), \begin{align} I(t,s) &\leq \frac{1}{2}\sup_{t\in T_2, s\in T_2 } (t_1 +L_{\phi_2} t_2 + s_1 - L_{\phi_2}s_2) =R. \end{align}

Step 1 - conclusion: For all (t,s)\in T_2^2, I(t,s) \leq R, and thus L = \sup_{t\in T_2, s\in T_2} I(t,s) \leq R, which proves the claim of Step 1.
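The two-point inequality of Step 1 can also be checked numerically on a randomly drawn finite set T_2 (a sketch with arbitrary choices of T_2 and \phi; the expectation over \sigma_2 is computed exactly as the average of the two sign patterns).

```python
# Numerical check (sketch) of Step 1:
#   E_{sigma_2} sup_{t in T_2} (t_1 + sigma_2 phi(t_2))
#     <= E_{sigma_2} sup_{t in T_2} (t_1 + L_phi sigma_2 t_2).
# T_2 and phi are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(1)
T2 = rng.normal(size=(50, 2))      # a random finite subset of R^2
phi, L_phi = np.tanh, 1.0          # tanh is 1-Lipschitz

# sigma_2 takes the values +1 and -1, each with probability 1/2
lhs = 0.5 * (T2[:, 0] + phi(T2[:, 1])).max() + 0.5 * (T2[:, 0] - phi(T2[:, 1])).max()
rhs = 0.5 * (T2[:, 0] + L_phi * T2[:, 1]).max() + 0.5 * (T2[:, 0] - L_phi * T2[:, 1]).max()
print(lhs, "<=", rhs)
```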

Step 2: As an illustrative example, choose T_2 = \left\{ (u_1, u_2) \in \R^2 : u_1 = \sum_{i\neq 2} \sigma_i t_i,\ u_2 = t_2,\ \g t \in T \right\} (with the \sigma_i, i\neq 2, held fixed) and apply the previous result while conditioning on all Rademacher variables but \sigma_2: \E_{\sigma_2} \left[\left. \sup_{\g t \in T } \sum_{i\neq 2} \sigma_i t_i + \sigma_2 \phi_2(t_2) \ \right| \sigma_{i\neq 2}\right] \leq \E_{\sigma_2} \left[\left. \sup_{\g t \in T } \sum_{i\neq 2} \sigma_i t_i + L_{\phi_2}\sigma_2 t_2\ \right| \sigma_{i\neq 2} \right] . Thus, \begin{align} \E_{\g \sigma_n} \sup_{\g t \in T } \left( \sum_{i\neq 2} \sigma_i t_i + \sigma_2 \phi_2(t_2) \right) &= \E_{ \sigma_i, i\neq 2} \E_{\sigma_2} \left[\left. \sup_{\g t \in T } \sum_{i\neq 2} \sigma_i t_i + \sigma_2 \phi_2(t_2) \ \right| \sigma_{i\neq 2}\right] \\ & \leq\E_{ \sigma_i, i\neq 2} \E_{\sigma_2} \left[\left. \sup_{\g t \in T } \sum_{i\neq 2} \sigma_i t_i + L_{\phi_2}\sigma_2 t_2\ \right| \sigma_{i\neq 2}\right] \\ & \leq \E_{\g \sigma_n}\sup_{\g t \in T } \sum_{i\neq 2} \sigma_i t_i + L_{\phi_2}\sigma_2 t_2 \end{align} and we have shown that, in \E_{\g \sigma_n} \sup_{\g t \in T } \sum_{i} \sigma_i t_i, replacing the term \sigma_2 t_2 by \sigma_2 \phi_2(t_2) yields a quantity no larger than the one obtained with L_{\phi_2}\sigma_2 t_2.
In fact, we can iterate over the coordinates j=1,\dots,n, replacing one \phi_j at a time. Assume that, after step j-1, \E_{\g \sigma_n} \sup_{\g t \in T } \sum_{i\leq n} \sigma_i \phi_i(t_i) \leq \E_{\g \sigma_n} \sup_{\g t \in T } \left( \sum_{i < j} L_{\phi_i}\sigma_i t_i + \sum_{i \geq j} \sigma_i \phi_i(t_i) \right) , which trivially holds (with equality) for j=1. Then, choose T_j = \left\{ (u_1, u_2) \in \R^2 : u_1 = \sum_{i < j} L_{\phi_i}\sigma_i t_i + \sum_{i > j} \sigma_i \phi_i(t_i) ,\ u_2 = t_j,\ \g t\in T \right\} (with the \sigma_i, i\neq j, held fixed) and apply the result of Step 1 (with \phi_j in place of \phi) while conditioning on all Rademacher variables but \sigma_j, as in Step 2, to obtain \begin{align} \E_{\g \sigma_n} \sup_{\g t \in T } \left( \sum_{i < j} L_{\phi_i}\sigma_i t_i + \sigma_j \phi_j(t_j) + \sum_{i > j} \sigma_i \phi_i(t_i) \right) &\leq \E_{\g \sigma_n} \sup_{\g t \in T } \left( \sum_{i < j} L_{\phi_i}\sigma_i t_i + L_{\phi_j}\sigma_j t_j + \sum_{i > j} \sigma_i \phi_i(t_i) \right) \\ &= \E_{\g \sigma_n} \sup_{\g t \in T } \left( \sum_{i \leq j} L_{\phi_i}\sigma_i t_i + \sum_{i > j} \sigma_i \phi_i(t_i) \right) , \end{align} which, combined with the induction hypothesis, yields the same bound with j replaced by j+1. After the n-th step, all the \phi_i have been replaced and we obtain \E_{\g \sigma_n} \sup_{\g t \in T } \sum_{i\leq n} \sigma_i \phi_i(t_i) \leq \E_{\g \sigma_n}\sup_{\g t \in T } \sum_{i\leq n} L_{\phi_i}\sigma_i t_i .

A trivial upper bound on the Rademacher average of bounded functions

Given a set of bounded functions, \F\subset \X \rightarrow [a,b], of diameter M = b-a, let \tilde{\F} = \F - \frac{a+b}{2} be the centered version of \F. Then, we have \begin{align} \mathcal{R}_n ( \F ) &= \mathcal{R}_n ( \tilde{\F} ) &\mbox{(by the translation invariance of } \mathcal{R}_n ) \\ &=\E_{\g X_n, \g \sigma_n} \sup_{f \in \tilde{\F} } \frac{1}{n} \sum_{i=1}^n \sigma_i f(X_i ) \\ &= \E_{\g X_n, \g \sigma_n} \sup_{f \in \tilde{\F} } \frac{1}{n} \inner{\g \sigma_n}{\g f_n} & \left(\mbox{with } \g f_n = (f(X_i))_{1\leq i \leq n} \right) \\ &\leq \E_{\g X_n, \g \sigma_n} \sup_{f \in \tilde{\F} } \frac{1}{n} \|\g \sigma_n\| \| \g f_n \| &\mbox{(by the } \href{innerproduct.html#cauchy}{\mbox{Cauchy-Schwarz inequality}}) \\ &\leq \frac{1}{n} \E_{\g X_n, \g \sigma_n} \|\g \sigma_n\| \sup_{f \in \tilde{\F} } \| \g f_n \| \\ &\leq \frac{1}{n} \E_{\g X_n, \g \sigma_n} \sqrt{n} \sqrt{n \frac{M^2}{4}} & \mbox{(since } |\sigma_i| = 1 \mbox{ and } |f(X_i)| \leq M / 2) \\ &\leq \frac{M}{2} \end{align}
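As a quick numerical check (a sketch reusing F and empirical_rademacher from the first snippet, whose functions take values in \{-1,+1\}\subset[-1,1], so that M=2), the estimated empirical Rademacher average should not exceed M/2 = 1.

```python
# The functions in F take values in {-1, +1}, so M = 2 and the bound above
# gives hat R_n(F) <= M/2 = 1 (sketch, reusing earlier code).
M = 2.0
print(empirical_rademacher(F), "<=", M / 2)
```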

Notes

The contraction principle is shown here in its most useful form for machine learning. Its proof is adapted from the original and more general Theorem 4.12 in [1] (though the original version focuses on contractions with L_{\phi} \leq 1 and does not include L_{\phi} in the result).

The contraction principle is often stated in the machine learning literature with an additional requirement that \phi(0)=0, which is an artifact of the more general proof from [1] that is unnecessary in proving the version stated above. In addition, such a requirement can easily be relaxed by using the translation invariance of Rademacher averages with a=\phi(0) and by applying the contraction principle to \tilde{\phi} = \phi - a: \mathcal{R}_n(\phi\circ \F) = \mathcal{R}_n(a + \tilde{\phi}\circ \F) = \mathcal{R}_n(\tilde{\phi}\circ \F) \leq L_{\tilde{\phi}} \mathcal{R}_n ( \F ) = L_{\phi} \mathcal{R}_n ( \F ).

Finally, note that the proof of the contraction principle actually establishes a more general result in which the function \phi can be replaced by a sequence of functions \phi_i that may differ from one another and may even depend on \g X_n, as long as each of them is Lipschitz continuous.

References

  1. Ledoux, M. and Talagrand, M., Probability in Banach Spaces. Springer, 1991.