Error bounds for binary classification based on Rademacher averages

In words...

The Rademacher average-based error bound can be applied to binary classification to bound the risk of classifiers.

Doing so, the risk is shown to be no more than the sum of the empirical risk, the Rademacher average of the set of classifiers and a term that depends only on the number of data and the confidence with which we want this bound to hold.

Some sets of classifiers have infinite Rademacher averages such as the set of linear classifiers operating in an infinite-dimensional reproducing kernel Hilbert space. For these, the bound above is useless, but there is an alternative if every classifier can be expressed as the sign of a smooth function. In this case, a bound based on the Rademacher average of the set of smooth functions can be derived. Such bounds are also based on the notion of margin.

In pictures...

Margin-based cost function

Below is a plot of the margin-based cost function used to prove an error bound with the Rademacher average of the score functions of large-margin classifiers of infinite Rademacher average.

The margin-based cost is an upper bound on the 0-1 loss such that an upper bound on its expectation also provides an upper bound on the expectation of the 0-1 loss, i.e., the risk.
The margin-based cost also has the advantage of being Lipschitz continuous, a property that we use when we apply the contraction principle.

In maths...

Given a set of binary classifiers $\F\subset \{-1,+1\}^{\X}$ with Rademacher average $\mathcal{R}_n(\F)$, the bound $$ \forall f\in \F,\quad R(f) \leq R_{emp}(f) + \mathcal{R}_n(\F) + \sqrt{\frac{ \log \frac{1}{\delta}}{2N}} . $$ holds with probability at least $1-\delta$.

Proof: In order to apply the Rademacher average-based error bound in binary classification, we need to focus on the loss function $$ \ell_f(X,Y) = \ell(Y, f(X)) = \I{f(X) \neq Y} = \frac{1}{2}(1 - Y f(X) ), $$ where this reformulation assumes that both $Y$ and $f(X)$ take values in $\Y = \{-1,1\}$. Let $\mathcal{L}$ denote the loss class, i.e., the class of loss functions induced by $\F$ as $$ \mathcal{L} = \left\{ \ell_f \in [0,1]^{\X\times \Y} : \ell_f(X,Y) = \frac{1}{2}(1 - Y f(X) ), \ f \in \F \right\}. $$ Applying the previously derived bound to $\mathcal{L}$ with $M=1$ yields that, with probability at least $1-\delta$, $$ \sup_{\ell_f\in\mathcal{L}}\left( \E_{X,Y} \ell_f(X,Y) - \frac{1}{n} \sum_{i=1}^n \ell_f(X_i,Y_i) \right) \leq 2 \mathcal{R}_n(\mathcal{L}) + \sqrt{\frac{\log \frac{1}{\delta}}{2n}} , $$ and thus, given the definition of the risk, $R(f)$, and the empirical risk $R_{emp}(f)$ evaluated on a training sample of size $N$, $$ \forall f\in \F,\quad R(f) \leq R_{emp}(f) + 2 \mathcal{R}_n(\mathcal{L}) + \sqrt{\frac{\log \frac{1}{\delta}}{2N}} . $$ The Rademacher average of the loss class can be related to the one of the function class $\F$ as follows: \begin{align} \mathcal{R}_n(\mathcal{L} ) &= \E_{\g X_n \g Y_n \g \sigma_n} \sup_{\ell_f \in \mathcal{L} } \frac{1}{n} \sum_{i=1}^n \sigma_i \ell_f(X_i, Y_i ) \\ &= \frac{1}{2} \E_{\g X_n \g Y_n \g \sigma_n} \sup_{f \in \F } \frac{1}{n} \sum_{i=1}^n \sigma_i (1 - Y_i f(X_i) )\\ &= \frac{1}{2} \E_{\g X_n \g Y_n \g \sigma_n} \sup_{f \in \F } \left(\frac{1}{n} \sum_{i=1}^n \sigma_i + \frac{1}{n} \sum_{i=1}^n - \sigma_i Y_i f(X_i) \right)\\ &= \frac{1}{2} \E_{\g X_n \g Y_n \g \sigma_n} \left[ \frac{1}{n} \sum_{i=1}^n \sigma_i + \sup_{f \in \F } \frac{1}{n} \sum_{i=1}^n - \sigma_i Y_i f(X_i) \right]\\ &= \frac{1}{2} \E_{\g \sigma_n} \left[ \frac{1}{n} \sum_{i=1}^n \sigma_i \right] + \frac{1}{2} \E_{\g X_n \g Y_n \g \sigma_n} \left[\sup_{f \in \F } \frac{1}{n} \sum_{i=1}^n - \sigma_i Y_i f(X_i) \right] & \mbox{(by the } \href{expectation.html}{\mbox{linearity of } \E} )\\ &= 0 + \frac{1}{2} \E_{\g X_n \g Y_n \g \sigma_n} \sup_{f \in \F } \frac{1}{n} \sum_{i=1}^n - \sigma_i Y_i f(X_i) & (\E_{\sigma_i} \sigma_i = 0 )\\ &= \frac{1}{2} \E_{\g X_n \g \sigma_n} \sup_{f \in \F } \frac{1}{n} \sum_{i=1}^n \sigma_i f(X_i) & ( -\sigma_iY_i \mbox{ is distributed like } \sigma_i )\\ &= \frac{1}{2}\mathcal{R}_n(\F) \end{align} Therefore, we obtain the desired result.

Margin-based bounds

We now specify error bounds for large-margin classifiers, such as support vector machines, of the form $$ f(\g x) = \begin{cases} 1, & \mbox{if } g(\g x) \geq 0 \\ -1, & \mbox{otherwise,} \end{cases} $$ where $g: \X\rightarrow \R$ is a real-valued score function from a class $\mathcal{G}$. For these classifiers, the loss function can be rewritten as $$ \ell_f(X,Y) = \ell(Y, f(X)) = \I{f(X) \neq Y} = \I{ Y g(X ) <\, 0 }. $$ To introduce the margin, we consider a nonnegative cost function $\phi: \R\rightarrow \R^+$ such that $\forall u\in\R$, $\phi(u) \geq \I{ u > 0 }$. This cost function serves as a surrogate for the loss function in a convexified empirical risk minimization implemented by a training algorithm. For instance, support vector machines use the hinge loss $\phi(u) = (1+u)_+$ with $u= - y_i g(\g x_i)$ and minimize a regularized version of $\overline{R}_{emp}(f) = \frac{1}{N} \sum_{i=1}^N \phi(-Y_i g(X_i) )$.

By definition, we have $\phi(-Y g(X) ) \geq \ell_f ( X, Y)$ and thus the upper bounds $$ R(f) = \E \ell_f( X, Y) \leq \E \phi(-Y g(X) ) = \overline{R}(f),\qquad \mbox{ and } \qquad R_{emp}(f) = \frac{1}{N} \sum_{i=1}^N \ell_f( X_i, Y_i) \leq \frac{1}{N} \sum_{i=1}^N \phi(-Y_i g(X_i) ) = \overline{R}_{emp}(f) $$ With these notations, the contraction principle leads to the following bound.

With probability $1-\delta$, $$ \forall f\in \F,\quad R(f) \leq \overline{R}_{emp}(f) + 2 L_{\phi} \mathcal{R}_n(\mathcal{G}) + M \sqrt{\frac{\log \frac{1}{\delta}}{2N}} , $$ where $L_{\phi}$ denotes the Lipschitz constant of $\phi$ and $M$ is a uniform upper bound on $\phi( -y g(x))$.

Proof: Let define $\mathcal{E}$ as the class of functions of $\X\times \Y \rightarrow \R$ written as $u(X,Y) = -Y g(X)$ with $g\in\mathcal{G}$. Then, \begin{align} R(f) &\leq \overline{R}(f) \\ &\leq \overline{R}_{emp}(f) + \sup_{g \in\mathcal{G} }( \overline{R}(f) - \overline{R}_{emp}(f) )\\ &\leq \overline{R}_{emp}(f) + 2 \mathcal{R}_n(\phi \circ \mathcal{E}) + M\sqrt{\frac{\log \frac{1}{\delta}}{2N}} & (\mbox{use } \href{rademacherbound.html}{\mbox{ the general bound}}) \\ &\leq \overline{R}_{emp}(f) + 2 L_{\phi} \mathcal{R}_n(\mathcal{E}) + M\sqrt{\frac{\log \frac{1}{\delta}}{2N}} & \mbox{(by the } \href{rademacheraverages.html#contraction}{\mbox{ contraction principle}}) \end{align} The final step consists in relating the Rademacher average of $\mathcal{E}$ with the one of $\mathcal{G}$. To do this, we use the fact that $-\sigma_i Y_i$ has the same distribution as $\sigma_i$ and thus that \begin{align} \mathcal{R}_n(\mathcal{E}) &= \E_{\g X_n \g Y_n \g \sigma_n} \sup_{u \in \mathcal{E} } \frac{1}{n} \sum_{i=1}^n -\sigma_i u(X_i, Y_i) \\ &= \E_{\g X_n \g Y_n \g \sigma_n} \sup_{g \in \mathcal{G} } \frac{1}{n} \sum_{i=1}^n -\sigma_i Y_i g(X_i) \\ &= \E_{\g X_n \g \sigma_n} \sup_{g \in \mathcal{G} } \frac{1}{n} \sum_{i=1}^n \sigma_i g(X_i) \\ &= \mathcal{R}_n(\mathcal{G}) \end{align}

Using a piecewise linear cost function

Let define a margin-based cost function as a saturated version of the hinge loss with the hinge location parametrized by a margin $\gamma > 0$: $$ \phi(u) = \begin{cases} 0, & \mbox{if } u \leq -\gamma \\ 1, & \mbox{if } u \geq 0\\ 1 + u/\gamma, & \mbox{otherwise.} \end{cases} $$ Also define the margin-based error rate: $$ R_{emp}^{\gamma} ( f) = \frac{1}{N} \sum_{i=1}^N \I{Y_i g(X_i) < \gamma} . $$ This error rate is similar to the empirical risk $R_{emp}$ except that it also counts correct classifications with too small confidence as errors, where the definition of ``too small" depends on the margin $\gamma$. Indeed, seeing $|g(X_i)|$ as the score assigned by the classifier to the class $\sign(g(X_i))$, large values of $|g(X_i)|$ provide a high confidence in the prediction, while small values increase the similarity of the scores for the two classes and thus also increase the ambiguitiy.

With this choice of $\phi$, we have $\overline{R}_{emp}(f) \leq R_{emp}^{\gamma} ( f)$ (see plot), $L_{\phi} = 1/\gamma$ and $M=1$. Thus, as a direct consequence of the previous bound, we obtain the following margin-based bound $$ R(f) \leq R_{emp}^{\gamma} ( f) + \frac{2}{\gamma} \mathcal{R}_n(\mathcal{G}) + \sqrt{\frac{\log \frac{1}{\delta}}{2N}} . $$ As the margin $\gamma$ grows, the margin-error $R_{emp}^{\gamma} ( f)$ increases (the larger the margin, the more points lie within the margin and are counted as errors), but the second term in the bound decreases. This promotes the search for classifiers that can obtain small margin errors with a large margin and can be used to explain the good generalization capacities of support vector machines which minimize (a convex surrogate of) the margin error while maximizing the margin.

Application to kernel methods

Kernel-based classifiers rely on functions $g$ that belong to an RKHS $\H$ of reproducing kernel $K$. In order to apply the bound above, we need to be more specific and assume that $g$ is constrained to a ball of radius $\lambda$ within $\H$: $$ g\in \H_{\lambda} = \left\{ g \in \H\ : \|g\|_{\H} \leq \lambda \right\}. $$ In this case, we can compute the Rademacher average of $\mathcal{\H_{\lambda}}$ as \begin{align} \mathcal{R}_n(\H_{\lambda}) &= \E_{\g X_n \g \sigma_n} \sup_{g \in \H_{\lambda} } \frac{1}{n} \sum_{i=1}^n \sigma_i g(X_i) \\ &= \E_{\g X_n \g \sigma_n} \sup_{g \in \H_{\lambda} } \frac{1}{n} \sum_{i=1}^n \sigma_i \inner{g}{K(X_i, \cdot)}_{\H} & (\mbox{by the }\href{rkhs.html}{\mbox{reproducing property of } K} )\\ &= \E_{\g X_n \g \sigma_n} \sup_{ \|g\|_{\H} \leq \lambda } \frac{1}{n} \sum_{i=1}^n \sigma_i \inner{g}{K(X_i, \cdot)}_{\H} \\ &= \E_{\g X_n \g \sigma_n} \sup_{ \|g\|_{\H} \leq \lambda } \frac{1}{n} \inner{g}{\sum_{i=1}^n \sigma_i K(X_i, \cdot)}_{\H} & (\mbox{by the }\href{innerproduct.html}{\mbox{linearity of the inner product}}) \\ &= \E_{\g X_n \g \sigma_n} \frac{\lambda }{n} \left\|\sum_{i=1}^n \sigma_i K(X_i, \cdot) \right\|_{\H} & (\mbox{by the }\href{innerproduct.html}{\mbox{Cauchy-Schwartz inequality} } ) \end{align} Next we use the Kahane-Khinchine inequality which sates that for any $h_1, \dots, h_n$ in a Hilbert space, $$ \frac{1}{\sqrt{2}} \E_{\g \sigma_n} \left\| \sum_{i=1}^n \sigma_i h_i \right\|_{\H}^2 \leq \left(\E_{\g \sigma_n} \left\|\sum_{i=1}^n \sigma_i h_i \right\|_{\H} \right)^2 \leq \E_{\g \sigma_n} \left\|\sum_{i=1}^n \sigma_i h_i \right\|_{\H}^2 $$ (note that Jensen's inequality is enough for the upper bound).
Given that the $h_i$ belong to a Hilbert space where $\|h\|_{\H}^2 = \inner{h}{h}_{\H}$, we also have \begin{align} \E_{\g \sigma_n} \left\|\sum_{i=1}^n \sigma_i h_i \right\|_{\H}^2 &= \E_{\g \sigma_n} \inner{\sum_{i=1}^n \sigma_i h_i}{ \sum_{j=1}^n \sigma_j h_j}_{\H}\\ &= \E_{\g \sigma_n} \sum_{i=1}^n \sigma_i \inner{ h_i}{ \sum_{j=1}^n \sigma_j h_j}_{\H} & (\mbox{by the }\href{innerproduct.html}{\mbox{linearity of the inner product}}) \\ &= \E_{\g \sigma_n} \sum_{i=1}^n \sum_{j=1}^n \sigma_i\sigma_j \inner{h_i}{ h_j}_{\H} \\ &= \E_{\g \sigma_n} \left[\sum_{i=1}^n \sigma_i^2 \inner{h_i}{ h_i}_{\H} + \sum_{j\neq i} \sigma_i\sigma_j \inner{h_i}{ h_j}_{\H}\right] \\ &= \E_{\g \sigma_n} \sum_{i=1}^n \inner{h_i}{ h_i}_{\H} + \E_{\g \sigma_n} \sum_{j\neq i} \sigma_i\sigma_j \inner{h_i}{ h_j}_{\H} & (\mbox{by the }\href{expectation.html}{\mbox{linearity of the expectation}}) \\ &= \sum_{i=1}^n\|h_i\|_{\H}^2 + \sum_{j\neq i} \E_{\sigma_i\sigma_j} \sigma_i\sigma_j \inner{h_i}{ h_j}_{\H}\\ &= \sum_{i=1}^n\|h_i\|_{\H}^2 + \sum_{j\neq i} \inner{h_i}{ h_j}_{\H} \E_{\sigma_i\sigma_j}\sigma_i\sigma_j & ( \inner{h_i}{ h_j}_{\H} \mbox{ are constants for the expectations)}\\ &= \sum_{i=1}^n\|h_i\|_{\H}^2 + \sum_{j\neq i} \inner{h_i}{ h_j}_{\H} \E_{\sigma_i} \sigma_i\E_{\sigma_j} \sigma_j & (\mbox{by the independence of } \sigma_i \mbox{ and } \sigma_j \mbox{ with } i\neq j )\\ &= \sum_{i=1}^n\|h_i\|_{\H}^2 & (\mbox{since } \E_{\sigma_i} \sigma_i = 0). \end{align} Therefore, gathering the results and using the reproducing property of $K$, we have $$ \mathcal{R}_n(\H_{\lambda}) \leq \frac{\lambda }{n} \E_{\g X_n} \sqrt{ \sum_{i=1}^n \left\| K(X_i, \cdot) \right\|_{\H}^2 } = \frac{\lambda }{n} \E_{\g X_n}\sqrt{ \sum_{i=1}^n \inner{K(X_i, \cdot)}{K(X_i, \cdot)}_{\H} } = \frac{\lambda }{n} \E_{\g X_n}\sqrt{ \sum_{i=1}^n K(X_i, X_i) } $$ and also the lower bound $$ \mathcal{R}_n(\H_{\lambda}) \geq \frac{\lambda }{n\sqrt{2}} \E_{\g X_n}\sqrt{ \sum_{i=1}^n K(X_i, X_i) } . $$

Finally, this leads to the bound (which holds with probability at least $1-\delta$) $$ R(f) \leq R_{emp}^{\gamma} ( f) + \frac{2\lambda }{\gamma N} \E_{\g X_n}\sqrt{ \sum_{i=1}^N K(X_i, X_i) } + \sqrt{\frac{\log \frac{1}{\delta}}{2N}} $$ on the risk of a large-margin kernel-based classifier.

As an example, if $K$ is a Gaussian RBF kernel, then $\forall X\in\X$, $K(X,X) = 1$ and thus $\mathcal{R}_n(\H_{\lambda})\leq \frac{\lambda \sqrt{n}}{n} = \lambda / \sqrt{n}$, leading to $$ R(f) \leq R_{emp}(f) + O\left(\frac{1}{\sqrt{N}}\right). $$