Empirical Risk Minimization (ERM)

In words...

The Empirical Risk Minimization (ERM) principle is a learning paradigm that consists in selecting the model with the minimal average error over the training set. This so-called training error can be seen as an estimate of the risk (due to the law of large numbers), hence the alternative name of empirical risk.

By minimizing the empirical risk, we hope to obtain a model that also achieves a low risk. The larger the training set, the closer the empirical risk is to the true risk.
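
To make this convergence concrete, here is a small simulation (the setup is hypothetical: inputs uniform on $[0,1]$, a fixed model $f(x) = 2x$, Gaussian noise, squared loss). The empirical risk computed on ever larger samples approaches the true risk, which here equals the noise variance.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: 2.0 * x  # a fixed model (hypothetical choice)

def estimate_risk(n):
    """Empirical risk of f under the squared loss, on a fresh sample of size n."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = 2.0 * x + rng.normal(0.0, 0.5, size=n)  # Y = 2X + Gaussian noise
    return np.mean((f(x) - y) ** 2)

# The true risk of f is the noise variance: 0.5**2 = 0.25.
for n in [10, 100, 10_000, 1_000_000]:
    print(n, estimate_risk(n))  # approaches 0.25 as n grows
```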

If we were to apply the ERM principle without more care, we would end up learning by heart, which we know is bad. This issue is an instance of the more general overfitting phenomenon, which can be avoided by restricting the space of possible models when searching for the one with minimal error. The most severe, and yet most common, restriction is encountered in linear classification and linear regression, where only linear models are considered. Another approach consists in controlling the complexity of the model by regularization, as sketched below.
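
As a sketch of the regularization route, assuming a linear model, the squared loss, and an $\ell_2$ penalty (i.e., ridge regression, one possible choice among many), the penalized training error admits a closed-form minimizer:

```python
import numpy as np

def ridge_erm(X, y, lam):
    """Regularized ERM for a linear model with squared loss:
    minimize (1/N) * ||X @ w - y||^2 + lam * ||w||^2.
    Setting the gradient to zero gives (X^T X + N*lam*I) w = X^T y."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

# Illustrative data: 15 points, 5 features, a sparse true weight vector.
rng = np.random.default_rng(1)
X = rng.normal(size=(15, 5))
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + 0.1 * rng.normal(size=15)
print(ridge_erm(X, y, lam=0.1))  # weights shrunk towards zero
```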

In pictures...

The ERM is a nice idea, if used with care

The plot below shows a regression problem with a training set of 15 points.

[Figure: regression with 15 training points]

In maths...

The ERM principle is an inference principle that consists in finding the model $\hat{f}$ by minimizing the empirical risk: $$ \hat{f} = \arg\min_{f: \X\rightarrow \Y} \ R_{emp}(f) $$ where the empirical risk is an estimate of the risk computed as the average of the loss function $\ell$ over the training sample $D=\{ (X_i,Y_i) \}_{i=1}^N$: $$ R_{emp}(f) = \frac{1}{N}\sum_{i=1}^N \ell(f(X_i), Y_i). $$
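
In code, the empirical risk is simply an average of pointwise losses. The sketch below mirrors the formula, with the squared loss as an illustrative choice for $\ell$.

```python
import numpy as np

def squared_loss(y_pred, y):
    """An illustrative choice for the loss function ell."""
    return (y_pred - y) ** 2

def empirical_risk(f, X, Y, loss=squared_loss):
    """R_emp(f) = (1/N) * sum_i loss(f(X_i), Y_i)."""
    return np.mean([loss(f(x), y) for x, y in zip(X, Y)])

print(empirical_risk(lambda x: 2.0 * x, [0.0, 1.0, 2.0], [0.1, 2.2, 3.9]))  # ~0.02
```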

Given a training set $\{ (x_i,y_i) \}_{i=1}^N$, assumed to be a realization of the training sample, the ERM principle becomes a practical method that learns the model by solving $$ \hat{f} = \arg\min_{f: \X\rightarrow \Y} \ \frac{1}{N}\sum_{i=1}^N \ell(f(x_i), y_i), $$ i.e., by minimizing the training error, defined as the average loss over the training set.
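
In the simplest setting, this is literally an argmin over candidate models: compute the training error of each and keep the best one. A minimal sketch, assuming a (hypothetical) finite set of candidates and the squared loss:

```python
import numpy as np

def training_error(f, x, y):
    """Average squared loss of f over the training set."""
    return np.mean((f(x) - y) ** 2)

# Hypothetical finite set of candidate models.
candidates = [lambda x: 0 * x, lambda x: x, lambda x: 2 * x, lambda x: x ** 2]

x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = np.array([0.1, 1.1, 1.9, 3.2, 3.9])  # roughly y = 2x

# ERM: pick the candidate with minimal training error.
f_hat = min(candidates, key=lambda f: training_error(f, x, y))
print(training_error(f_hat, x, y))  # f_hat is the candidate x -> 2x
```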

Without additional constraints, the optimization problem above admits many trivial solutions of the form $$ \hat{f}(x) = \begin{cases} y_i, & \mbox{if } x = x_i \mbox{ for some } x_i \mbox{ in the training set,} \\ \mbox{any value} \in \Y, & \mbox{otherwise,} \end{cases} $$ which all correspond to learning by heart, an extreme form of overfitting.
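
Such a trivial solution is easy to implement: a lookup table that memorizes the training pairs and returns an arbitrary default elsewhere. Its training error is exactly zero, yet it says nothing about unseen inputs (a minimal sketch, with the default value 0 as an arbitrary choice):

```python
def memorizer(training_set, default=0.0):
    """Trivial ERM solution: zero loss on the training set,
    arbitrary predictions everywhere else."""
    table = dict(training_set)
    return lambda x: table.get(x, default)

f_hat = memorizer([(0.0, 0.1), (0.5, 1.1), (1.0, 1.9)])
print(f_hat(0.5))  # 1.1  (memorized)
print(f_hat(0.7))  # 0.0  (arbitrary: this point was never seen)
```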

In order to avoid such issues, we should constrain $f$ to some suitable model space $\F$, i.e., replace the minimization over the set of all functions $f: \X\rightarrow \Y$ by a minimization over $f\in\F$.
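
For instance, if $\F$ is the set of linear functions $f(x) = w^\top x$ and the loss is the squared loss, constrained ERM reduces to ordinary least squares, which has a closed-form solution. A minimal sketch:

```python
import numpy as np

def linear_erm(X, y):
    """ERM over the linear model space F = { x -> x @ w }
    with squared loss, i.e., ordinary least squares."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(15, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=15)
print(linear_erm(X, y))  # close to [1, -2, 0.5]
```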

Notes

The training procedure suggested by the ERM principle amounts to solving an optimization problem, as is often the case with learning algorithms.
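
When the model space is parametrized, say by a weight vector $w$, this optimization is typically carried out numerically. A minimal sketch, assuming a linear model, the squared loss, and plain gradient descent with a hand-picked step size and iteration count:

```python
import numpy as np

def erm_gradient_descent(X, y, lr=0.1, n_steps=500):
    """Minimize the training error (1/N) * ||X @ w - y||^2 by gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_steps):
        grad = (2.0 / n) * X.T @ (X @ w - y)  # gradient of the empirical risk
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5])
print(erm_gradient_descent(X, y))  # approaches [1, -2, 0.5]
```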