My first error bound (and why it is useless)

In words...

A simple error bound can be obtained by a straightforward application of Hoeffding's inequality.

However, such a bound is useless since it requires the knowledge of the classifier computed by the algorithm before drawing the training data. Nonetheless, useful bounds are often based on this one.

In pictures...

In maths...

We know that the 0-1 loss function for classification is bounded as $$ \ell(Y,f(X)) = \I{f(X)\neq Y} \in \{0,1\} \subset [0,1]. $$ Thus, we can apply (the second) Hoeffding's inequality to the random variables $L_i = \ell(Y_i,f(X_i))$ and obtain $$ P^N \left\{ R(f) - R_{emp}(f) > \epsilon \right\} \leq \exp\left(-2N\epsilon^2\right), $$ with $R(f) = \E [L_i]$ and $R_{emp}(f) = \frac{1}{N}\sum_{i=1}^N L_i$.

In order to rewrite this as an explicit error bound, let denote the right-hand side by $$ \delta = \exp\left(-2N\epsilon^2\right). $$ Then, we can express $\epsilon$ as a function of $\delta$: $$ \epsilon = \sqrt{\frac{\log\frac{1}{\delta}}{2N}}, $$ which leads to $$ P^N \left\{ R(f) - R_{emp}(f) > \sqrt{\frac{\log\frac{1}{\delta}}{2N}} \right\} \leq \delta. $$ Finally, applying the probability inversion principle, we obtain that

For all $f$ and all $\delta>0$, the bound $$ R(f) \leq R_{emp}(f) + \sqrt{\frac{\log\frac{1}{\delta}}{2N}} $$ holds with probability at least $1-\delta$.

Though maybe attractive at first sight, this bound is only true for a function $f$ fixed a priori. In particular, the bound states that for a given classifier $f$, there exists some training set for which $R(f) \leq R_{emp}(f) + \sqrt{\frac{\log\frac{1}{\delta}}{2N}}$ and that the probability of drawing this data set is at least $1-\delta$.

However, the particular data set leading to this bound can be different for each classifier $f$. This means that in a practical learning situation where one has a fixed data set, the bound is only valid for some functions $f$ without any guarantee for the one $f$ returned by the algorithm. Therefore, the bound above cannot be used to quantify the quality of an algorithm.