Margin-based error bound

In words...

Large-margin classifiers are those that return a binary label by thresholding a real-valued function. For such classifiers, margin-based error bounds provide a finer analysis than VC bounds by taking their real-valued output into account.

A margin-based error bound can be derived by following a path similar to that of the first VC bound, substituting a covering number for the growth function.

The bound thus obtained can be difficult to use due to the dependence of the covering number on the training sample size. However, the bound can be reformulated in terms of the fat-shattering dimension, which is independent of the sample size.

In pictures...

In maths...

Consider a binary classification problem and a set of large-margin classifiers based on a real-valued function $g$: $$ \F=\{f \in \{-1,1\}^{\X} : \forall x\in\X,\ f( x) = \sign ( g(x)), \ g\in\G\subset \R^{\X}\}. $$ For these classifiers we can define a margin-based error rate as $$ R_{emp}^{\gamma} ( f) = \frac{1}{N} \sum_{i=1}^N \I{Y_i g(X_i) < \gamma} $$ on a training sample $D=\{(X_i,Y_i)\}_{i=1}^N$ of size $N$. Define the squashing function $$ \pi_{\gamma}(u) = \begin{cases} \gamma,&\mbox{if } u\geq \gamma\\ -\gamma,&\mbox{if } u \leq - \gamma\\ u,&\mbox{otherwise } \end{cases} $$ and let $\mathcal{N}_{\infty}(\frac{\gamma}{2}, \pi_{\gamma} (\G), 2N )$ denote the uniform covering number of the real-valued function class $\pi_{\gamma} (\G)$ at scale $\gamma/2$ with respect to the $\ell_{\infty}$-pseudo-metric. Then, we have the following bound on the risk $R(f)$ of any classifier $f\in\F$.
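To make these definitions concrete, here is a minimal Python sketch; the function names `squash` and `empirical_margin_error` are ours, chosen for illustration, not prescribed by the text.

```python
import numpy as np

def squash(u, gamma):
    """Squashing function pi_gamma: clips its argument to [-gamma, gamma]."""
    return np.clip(u, -gamma, gamma)

def empirical_margin_error(g_values, y, gamma):
    """Empirical margin-based error rate R_emp^gamma(f): the fraction of
    training points whose functional margin Y_i * g(X_i) is below gamma."""
    g_values, y = np.asarray(g_values), np.asarray(y)
    return float(np.mean(y * g_values < gamma))

# Example: three points, one of which has margin below gamma = 1.
print(empirical_margin_error([2.3, -0.4, 1.5], [1, -1, 1], gamma=1.0))  # 0.333...
```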

For all $\delta > 0$, the bound $$ \forall f\in\F,\quad R(f) \leq R_{emp}^{\gamma}(f) + 2\sqrt{2\frac{\log \mathcal{N}_{\infty}(\frac{\gamma}{2}, \pi_{\gamma} (\G), 2N ) + \log\frac{2}{\delta}}{N}} $$ holds with probability at least $1-\delta$.
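The bound is straightforward to evaluate once a value for the uniform covering number is available; since it is rarely known in closed form, the Python sketch below (an illustration with made-up inputs, not part of the source) simply takes it as a parameter.

```python
import numpy as np

def margin_risk_bound(emp_margin_error, cov_num, N, delta):
    """Right-hand side of the margin-based bound:
    R_emp^gamma(f) + 2*sqrt(2*(log(cov_num) + log(2/delta)) / N),
    where cov_num stands for N_inf(gamma/2, pi_gamma(G), 2N)."""
    return emp_margin_error + 2.0 * np.sqrt(
        2.0 * (np.log(cov_num) + np.log(2.0 / delta)) / N
    )

# With a hypothetical covering number of 1e6, N = 100000 and delta = 0.05:
print(margin_risk_bound(0.05, 1e6, 100000, 0.05))  # about 0.087
```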

Proof: We can bound the true and empirical risks as $$ R(f) = P(f(X) \neq Y) \leq P\left( |\pi_{\gamma}(g(X)) - \gamma Y | \geq \gamma \right)\quad \mbox{ and } \quad R_{emp}(f) = \frac{1}{N} \sum_{i=1}^N \I{f(X_i) \neq Y_i} \leq \frac{1}{N} \sum_{i=1}^N \I{|\pi_{\gamma}(g(X_i)) - \gamma Y_i | \geq \gamma} $$ by noting that

$$ f(X) \neq Y \quad \Rightarrow\quad \begin{cases} g(X) < 0 , & \mbox{if } Y=+1 \\ g(X) \geq 0,& \mbox{if } Y=-1 \end{cases} \Rightarrow \begin{cases} \pi_{\gamma}(g(X)) < 0 , & \mbox{if } Y=+1 \\ \pi_{\gamma}(g(X)) \geq 0,& \mbox{if } Y=-1 \end{cases} \Rightarrow \begin{cases} \pi_{\gamma}(g(X)) - Y\gamma < -Y\gamma = -\gamma , & \mbox{if } Y=+1 \\ \pi_{\gamma}(g(X)) - Y\gamma \geq -Y\gamma = \gamma,& \mbox{if } Y=-1 \end{cases} \Rightarrow |\pi_{\gamma}(g(X)) - Y\gamma| \geq \gamma. $$

that is, $$ f(X) \neq Y \quad \Rightarrow\quad |\pi_{\gamma}(g(X)) - \gamma Y | \geq \gamma , $$ and thus the probability (or the sample average of indicators) of the left-hand side event is no larger than that of the right-hand side event.
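This implication is easy to check numerically; the snippet below draws synthetic scores and labels and verifies it on every misclassified point (a purely illustrative simulation).

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.5
g = rng.normal(size=10_000)             # surrogate values g(X_i)
y = rng.choice([-1, 1], size=10_000)    # labels Y_i
misclassified = np.sign(g) != y         # f(X_i) != Y_i (g = 0 has probability zero here)
margin_event = np.abs(np.clip(g, -gamma, gamma) - gamma * y) >= gamma
assert np.all(margin_event[misclassified])  # the implication holds on every such point
```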

The empirical margin-based error can also be expressed in terms of $\pi_{\gamma}$ as $$ R_{emp}^{\gamma} ( f) = \frac{1}{N} \sum_{i=1}^N \I{\pi_{\gamma}(g(X_i)) \neq \gamma Y_i} , $$ since $Y_i g(X_i) <\gamma$ if and only if $\pi_{\gamma}(g(X_i)) \neq \gamma Y_i$.
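The same kind of numeric sanity check confirms this equivalence:

```python
import numpy as np

rng = np.random.default_rng(1)
gamma = 0.5
g = rng.uniform(-2, 2, size=10_000)
y = rng.choice([-1, 1], size=10_000)
# The two formulations of the empirical margin-based error coincide pointwise.
assert np.array_equal(y * g < gamma, np.clip(g, -gamma, gamma) != gamma * y)
```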

By symmetrization, we have, for $N\epsilon^2 \geq 2$, $$ P^N \left\{ \sup_{f\in\F} (R(f) - R_{emp}^{\gamma}(f) ) \geq \epsilon \right\} \leq 2 P^{2N} \left\{ \sup_{f\in\F} ( R_{emp}^{\prime}(f) - R_{emp}^{\gamma}(f) )\geq \frac{\epsilon}{2} \right\} , $$ where $R_{emp}^{\prime}(f) $ is the empirical risk of $f$ on an independent ghost sample $D^\prime = \{(X_i^\prime, Y_i^\prime)\}_{i=1}^N$ of size $N$.

Assume now that $\mathcal{C}(\g X_{2N})$ is a minimal $\gamma/2$-net of $\pi_{\gamma}(\G)$ with respect to the $\ell_{\infty}$-pseudo-metric on the concatenated sample $\g X_{2N}=(\g X_N,\g X_N^\prime)$, i.e., that (by letting $X_i = X_{i-N}^\prime$ for all $i>N$): $$ \forall g \in \G,\quad \exists c\in \mathcal{C}(\g X_{2N}),\quad |\pi_{\gamma}(g(X_i)) - c(X_i)| < \frac{\gamma}{2} ,\quad i=1,\dots,2N. $$ This implies that $$ \forall g \in \G,\quad \exists c\in \mathcal{C}(\g X_{2N}),\quad \{i\ :\ |\pi_{\gamma}(g(X_i^\prime)) - \gamma Y_i^\prime | \geq \gamma \} \subseteq \left\{i\ :\ |c(X_i^\prime) - \gamma Y_i^\prime | \geq \frac{\gamma}{2} \right\} $$ and that $$ \forall g \in \G,\quad \exists c\in \mathcal{C}(\g X_{2N}),\quad\left\{i\ :\ |c(X_i) - \gamma Y_i | \geq \frac{\gamma}{2} \right\} \subseteq \{i\ :\ \pi_{\gamma}(g(X_i)) \neq \gamma Y_i \}. $$ Thus, we have $$ \forall f \in \F,\quad \exists c\in \mathcal{C}(\g X_{2N}),\quad R_{emp}^{\prime}(f) \leq \frac{1}{N} \sum_{i=1}^N \I{|\pi_{\gamma}(g(X_i^\prime)) - \gamma Y_i^\prime | \geq \gamma} \leq \frac{1}{N} \sum_{i=1}^N \I{|c(X_i^\prime) - \gamma Y_i^\prime | \geq \frac{\gamma}{2} } $$ and $$ \forall f \in \F,\quad \exists c\in \mathcal{C}(\g X_{2N}),\quad R_{emp}^{\gamma} ( f) = \frac{1}{N} \sum_{i=1}^N \I{\pi_{\gamma}(g(X_i)) \neq \gamma Y_i} \geq \frac{1}{N} \sum_{i=1}^N \I{|c(X_i) - \gamma Y_i | \geq \frac{\gamma}{2}}. $$ Therefore, $$ \forall f \in \F,\quad \exists c\in \mathcal{C}(\g X_{2N}),\quad R_{emp}^{\prime}(f) - R_{emp}^{\gamma}(f) \leq \frac{1}{N} \sum_{i=1}^N ( \I{|c(X_i^\prime) - \gamma Y_i^\prime | \geq \frac{\gamma}{2} } - \I{|c(X_i) - \gamma Y_i | \geq \frac{\gamma}{2} } ) $$ and \begin{align} P^{2N} & \left\{ \sup_{f\in\F} ( R_{emp}^{\prime}(f) - R_{emp}^{\gamma}(f) )\geq \frac{\epsilon}{2} \right\} \\ & \leq P^{2N} \left\{ \exists c\in\mathcal{C}(\g X_{2N}) ,\ \frac{1}{N} \sum_{i=1}^N ( \I{|c(X_i^\prime) - \gamma Y_i^\prime | \geq \frac{\gamma}{2} } - \I{|c(X_i) - \gamma Y_i | \geq \frac{\gamma}{2} } ) \geq \frac{\epsilon}{2} \right\} \\ & \leq P^{2N} \left\{ \bigcup_{c\in\mathcal{C}(\g X_{2N})} \left\{ \frac{1}{N} \sum_{i=1}^N ( \I{|c(X_i^\prime) - \gamma Y_i^\prime | \geq \frac{\gamma}{2} } - \I{|c(X_i) - \gamma Y_i | \geq \frac{\gamma}{2} } ) \geq \frac{\epsilon}{2} \right\} \right\} \\ & \leq \sup_{\g X_{2N}\in \X^{2N}} \sum_{c\in\mathcal{C}(\g X_{2N})} P^{2N} \left\{\frac{1}{N} \sum_{i=1}^N ( \I{|c(X_i^\prime) - \gamma Y_i^\prime | \geq \frac{\gamma}{2} } - \I{|c(X_i) - \gamma Y_i | \geq \frac{\gamma}{2} } ) \geq \frac{\epsilon}{2} \right\} & \mbox{(due to the } \href{unionbound.html}{union\ bound})\\ & \leq \sup_{\g X_{2N}\in \X^{2N}} |\mathcal{C}(\g X_{2N})| \sup_{c\in\mathcal{C}(\g X_{2N})} P^{2N} \left\{\frac{1}{N} \sum_{i=1}^N ( \I{|c(X_i^\prime) - \gamma Y_i^\prime | \geq \frac{\gamma}{2} } - \I{|c(X_i) - \gamma Y_i | \geq \frac{\gamma}{2} } ) \geq \frac{\epsilon}{2} \right\} & \left(\mbox{due to } \sum_{k=1}^K a_k \leq K \sup_k a_k \right)\\ & \leq \mathcal{N}_{\infty}(\frac{\gamma}{2}, \pi_{\gamma} (\G), 2N ) \sup_{c\in\mathcal{C}(\g X_{2N})} P^{2N} \left\{\frac{1}{N} \sum_{i=1}^N ( \I{|c(X_i^\prime) - \gamma Y_i^\prime | \geq \frac{\gamma}{2} } - \I{|c(X_i) - \gamma Y_i | \geq \frac{\gamma}{2} } ) \geq \frac{\epsilon}{2} \right\} \end{align} where the last line is a direct consequence of the definition of the uniform covering number.
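The two set inclusions above rest only on the triangle inequality and can also be verified numerically. In the sketch below, `c` is any function within $\gamma/2$ of $\pi_{\gamma}(g)$ in the $\ell_{\infty}$-pseudo-metric, standing in for an element of the net; the simulation is ours, not from the source.

```python
import numpy as np

rng = np.random.default_rng(2)
gamma, n = 0.5, 10_000
g = rng.uniform(-2, 2, size=n)
y = rng.choice([-1, 1], size=n)
pg = np.clip(g, -gamma, gamma)                          # pi_gamma(g(X_i))
c = pg + 0.999 * rng.uniform(-gamma / 2, gamma / 2, n)  # strictly within gamma/2 of pi_gamma(g)

large_dev = np.abs(pg - gamma * y) >= gamma        # {i : |pi_gamma(g(X_i)) - gamma Y_i| >= gamma}
net_dev = np.abs(c - gamma * y) >= gamma / 2       # {i : |c(X_i) - gamma Y_i| >= gamma/2}
not_equal = pg != gamma * y                        # {i : pi_gamma(g(X_i)) != gamma Y_i}

assert np.all(net_dev[large_dev])      # first inclusion
assert np.all(not_equal[net_dev])      # second inclusion
```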
As for the standard VC bound, we can now introduce the bounded random variables $L_i(\g X_{2N}, c) = \I{|c(X_i^\prime) - \gamma Y_i^\prime | \geq \frac{\gamma}{2} } - \I{|c(X_i) - \gamma Y_i |\geq \frac{\gamma}{2} } \in[-1,1]$, with $\E L_i (\g X_{2N}, c) = 0$, and apply Hoeffding's inequality for all $c$ to obtain $$ \forall c\in\mathcal{C}(\g X_{2N}),\quad P^{2N} \left\{\frac{1}{N} \sum_{i=1}^N ( \I{|c(X_i^\prime) - \gamma Y_i^\prime | \geq \frac{\gamma}{2} } - \I{|c(X_i) - \gamma Y_i | \geq \frac{\gamma}{2} } ) \geq \frac{\epsilon}{2} \right\} \leq \exp\left( \frac{-N \epsilon^2 }{8}\right) , $$ which yields $$ P^{2N} \left\{ \sup_{f\in\F} ( R_{emp}^{\prime}(f) - R_{emp}^{\gamma}(f) )\geq \frac{\epsilon}{2} \right\} \leq \mathcal{N}_{\infty}(\frac{\gamma}{2}, \pi_{\gamma} (\G), 2N ) \exp\left( \frac{-N \epsilon^2 }{8}\right) . $$ Then, setting $\delta = 2 \mathcal{N}_{\infty}(\frac{\gamma}{2}, \pi_{\gamma} (\G), 2N ) \exp\left( \frac{-N \epsilon^2 }{8}\right)$, the symmetrization argument above leads to $$ P^N \left\{ \sup_{f\in\F} (R(f) - R_{emp}^{\gamma}(f) ) \geq \epsilon \right\} \leq \delta $$ with $$ \epsilon = 2 \sqrt{2\frac{\log\mathcal{N}_{\infty}(\frac{\gamma}{2}, \pi_{\gamma} (\G), 2N ) + \log \frac{2}{\delta}}{N}}. $$ We can conclude the proof by using the probability inversion principle.
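For completeness, the value of $\epsilon$ above follows by inverting the definition of $\delta$: $$ \delta = 2 \mathcal{N}_{\infty}\left(\frac{\gamma}{2}, \pi_{\gamma} (\G), 2N \right) \exp\left( \frac{-N \epsilon^2 }{8}\right) \quad\Longleftrightarrow\quad \frac{N\epsilon^2}{8} = \log \mathcal{N}_{\infty}\left(\frac{\gamma}{2}, \pi_{\gamma} (\G), 2N \right) + \log\frac{2}{\delta} \quad\Longleftrightarrow\quad \epsilon = 2 \sqrt{2\frac{\log\mathcal{N}_{\infty}(\frac{\gamma}{2}, \pi_{\gamma} (\G), 2N ) + \log \frac{2}{\delta}}{N}}. $$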

Notes

The results described on this page were published in .