Given a supervised learning problem with a fixed distribution of the data and a chosen loss function, there is an optimal model minimizing the risk. This model is often called the **target model** and its risk is the **Bayes risk**.

Many learning algorithms search for the model within a predefined function class. In this case, either the optimal model belongs to this class or not. But even if the optimal model lies within the reach of the algorithm, this does not necessarily means that the algorithm will find it. In fact, in many cases, the resulting model will be worse than what could be obtained by using a more restricted function class without the optimal model.

This phenomenon can be explained in terms of the estimation and approximation errors.

The **estimation error** is the error implied by the fact that the algorithm works with a finite training set that only partially reflects the true distribution of the data. By considering that the training set is obtained by randomly sampling the true distribution, this also incurs a certain variability in the resulting model.

The **approximation error** is the error implied by the choice of function class and is defined as the difference in risk obtained by the best model within the function class and the optimal model.

If we let the function class be large enough to contain the optimal model, then the approximation error can be zero, but on the other hand, the estimation error increases as the larger the function class is, the less likely the algorithm is to find the best model in the class.

Conversely, we can decrease the estimation error by reducing the size of the function class (think of the extreme case where the class contains a single function), but doing so, we probably increase the approximation error.

Therefore, a major concern in machine learning is to find the optimal trade-off between the estimation and approximation errors, which is the aim of model selection.

Since most function classes of interest contain an infinite number of functions, the notion of "size" of a class must be used with care. In learning theory, we usually talk about the *capacity* of function classes, which can be measured by different means, such as the VC-dimension or the Rademacher average.

Given a supervised learning problem with a fixed distribution of the random pair of inputs $X\in\X$ and labels $Y\in\Y$, one can define the optimal target model $t$ as the one with minimal risk $$ R^* = R(t) = \inf_{f : \X\rightarrow \Y} \ R(f) . $$ In this definition, the minimum is considered over all possible functions from the input space $\X$ to the label space $\Y$.

Another important model when considering algorithms working with a restricted function class $\F$ is the best model within the class: $$ f^* = \arg\min_{f\in \F} \ R(f) . $$

With these definitions, the excess in error of a model $f$ can be decomposed as
$$
R(f) - R^* = \underbrace{\left[ R(f) - R(f^*)\right]}_{estimation\ error} + \underbrace{\left[ R(f^*) - R^* \right]}_{approximation\ error},
$$
where the **estimation error** reflects the inability of the learning algorithm to find the best model $f^*$ in the class $\F$, while the **approximation error** is induced by the limited capacity of the chosen function class $\F$.

Given a training set, the learning algorithm returns a model $f$ in the restricted function class $\F$.

The excess in error of the learned model $f$ decomposes into the sum of the estimation error (between $f$ and the best model $f^*$ in the class $\F$) and the approximation error (between $f^*$ and the target model $t$).