Margin-based error bound uniform in the margin

In words...

Margin-based error bounds such as the one based on covering numbers or the one based on the fat-shattering dimension assume that the value of the margin is fixed in advance, before seeing the data. But most margin-based classifiers, such as the support vector machines maximize the margin during training and thus yield a data-dependent margin.

However, it is possible to reformulate the error bounds in such a way as to make them uniform over all values of the margin and thus valid for margin-maximization training algorithms.

In pictures...

In maths...

Consider a binary classification problem and a set of large-margin classifiers based on a real-valued function $g$: $$ \F=\{f \in \{-1,1\}^{\X} : \forall x\in\X,\ f( x) = \sign ( g(x)), \ g\in\G\subset \R^{\X}\}. $$ For these classifiers we can define a margin-based error rate as $$ R_{emp}^{\gamma} ( f) = \frac{1}{N} \sum_{i=1}^N \I{Y_i g(X_i) < \gamma} $$ on a training sample $D=\{(X_i,Y_i)\}_{i=1}^N$ of size $N$. Then, the following bound based on the fat-shattering dimension of the real-valued function class $\G$ can be derived.

For all $\delta > 0$, the bound $$ \forall f\in\F,\quad R(f) \leq R_{emp}^{\gamma}(f) + 2\sqrt{2\frac{\fat_{\gamma/32}(\G)\log \frac{34 eN}{\fat_{\gamma/32}(\G)} \log_2 (578N) + \log \frac{8}{\gamma\delta}}{N}} $$ holds with probability at least $1-\delta$ and every $\gamma\in(0,1]$.

Proof: Define the set of events $$ E(\gamma_1, \gamma_2, \delta) = \left\{ D \in (\X\times\Y)^N\ :\ \exists f\in\F,\ R(f) \geq R_{emp}^{\gamma_2}(f) + 2\sqrt{2\frac{\fat_{\gamma_1/16}(\G)\log \frac{34 eN}{\fat_{\gamma_1/16}(\G)} \log_2 (578N) + \log\frac{4}{\delta}}{N}} \right\} $$ Using the error bound with a fixed margin, we have, for any fixed margin $\gamma\in(0,1]$ and $\delta_2\in(0,1]$: \begin{align} P( E(\gamma, \gamma, \delta_2) ) % &= P^N \left\{ \exists f\in\F,\ R(f) \geq R_{emp}^{\gamma }(f) + 2\sqrt{2\frac{\fat_{\gamma/32}(\G)\log \frac{34 eN}{\fat_{\gamma/32}(\G)} \log_2 (578N) + \log\frac{4}{\delta_2}}{N}} \right\} \\ &= P^N \left\{ \exists f\in\F,\ R(f) \geq R_{emp}^{\gamma }(f) + 2\sqrt{2\frac{\fat_{\gamma/16}(\G)\log \frac{34 eN}{\fat_{\gamma/16}(\G)} \log_2 (578N) + \log\frac{2}{\delta_2}}{N}} \right\}\\ &\leq \delta_2 . \end{align} We also have that for all $0<\gamma_1\leq \gamma \leq \gamma_2 \leq 1$, $$ R_{emp}^{\gamma_2} \geq R_{emp}^{\gamma}\quad \mbox{ and } \quad \fat_{\gamma_1/32}(\F) \geq \fat_{\gamma/32}(\F) $$ and thus, by the fact that the error bound is an increasing function of the fat-shattering dimension, $$ E(\gamma_1,\gamma_2,\delta) \subseteq E(\gamma, \gamma, \delta), $$ which also yields, for all $0 < \delta_1 \leq \delta_2 \leq 1$, $$ E(\gamma_1,\gamma_2,\delta_1) \subseteq E(\gamma, \gamma, \delta_2). $$ Then, the general result below applied with $a=1/2$ and $\alpha=\gamma$ leads to $$ P\left(\bigcup_{\gamma\in (0,1]} E(\gamma /2,\gamma,\delta \gamma/2 ) \right) \leq \delta, $$ which can be rewritten as $$ P^N \left\{ \exists f\in\F,\ R(f) \geq R_{emp}^{\gamma }(f) + 2\sqrt{2\frac{\fat_{\gamma/32}(\G)\log \frac{34 eN}{\fat_{\gamma/32}(\G)} \log_2 (578N) + \log\frac{4}{\delta_2}}{N}} \right\} \leq \delta $$ and lead to the conclusion by the probability inversion principle.

A useful general result

Let $(\X,\Sigma,P)$ be a probability space let $$ \{ E(\alpha_1, \alpha_2, \delta)\ : (\alpha_1, \alpha_2, \delta \in (0,1]^3 \} $$ be a set of events satisfying

for all $(\beta,\delta_2) \in(0,1]^2$, $$P( E(\beta,\beta,\delta_2)) \leq \delta_2;$$
for all $a \in(0,1)$ and $\delta \in (0,1]$, $\bigcup_{\alpha\in (0,1]} E(\alpha a,\alpha,\delta \alpha(1-a) )$ is measurable;
for all $0<\alpha_1\leq \beta \leq \alpha_2 \leq 1$ and $0 < \delta_1 \leq \delta_2 \leq 1$, $$E(\alpha_1,\alpha_2,\delta_1) \subseteq E(\beta, \beta, \delta_2).$$

Then, for all $(a,\delta)\in(0,1)^2$, $$ P\left(\bigcup_{\alpha\in (0,1]} E(\alpha a,\alpha,\delta \alpha(1-a) ) \right) \leq \delta. $$

Proof: Using $a<1$, we can can divide the interval $(0,1]$ into an infinite number of pieces as $$ (0,1] = \bigcup_{i=0}^{\infty} (a^{i+1}, a^i] $$ and the event similarly as $$ \bigcup_{\alpha\in (0,1]} E(\alpha a,\alpha,\delta \alpha(1-a) ) = \bigcup_{i=0}^\infty \{ E(\alpha a,\alpha,\delta \alpha(1-a) ) : \alpha \in (a^{i+1}, a^i] \} . $$ Within each piece of event, for all $\alpha\in (a^{i+1}, a^i]$, $\alpha_1 = \alpha a \leq a^{i+1} = \beta$, $\beta < \alpha = \alpha_2$ and $\delta_1 = \delta\alpha(1-a) < \delta a^i (1-a) = \delta_2$. Thus, assumption 3 yields $$ \forall \alpha\in (a^{i+1}, a^i],\quad E(\alpha a,\alpha,\delta \alpha(1-a) ) \subseteq E(a^{i+1},a^{i+1},\delta a^i (1-a) ) $$ and $$ \{ E(\alpha a,\alpha,\delta \alpha(1-a) ) : \alpha \in (a^{i+1}, a^i] \} \subseteq E(a^{i+1},a^{i+1},\delta a^i (1-a) ) . $$ Therefore, \begin{align} P\left(\bigcup_{\alpha\in (0,1]} E(\alpha a,\alpha,\delta \alpha(1-a) ) \right) &= P\left( \bigcup_{i=0}^\infty \{ E(\alpha a,\alpha,\delta \alpha(1-a) ) : \alpha \in (a^{i+1}, a^i] \} \right) \\ &\leq P\left( \bigcup_{i=0}^\infty E(a^{i+1},a^{i+1},\delta a^i (1-a) ) \right) \\ &\leq \sum_{i=0}^\infty P\left( E(a^{i+1},a^{i+1},\delta a^i (1-a) ) \right) & \mbox{(due to the }\href{unionbound.html}{union\ bound})\\ &\leq \sum_{i=0}^\infty \delta a^i (1-a) & \mbox{(by assumption 1) } \\ &\leq \delta \end{align} where the last line is due to the geometric series formula: $\sum_{i=0}^\infty a^i = \frac{1}{1- a}$ for $|a|<1$.