Model selection and the tuning of hyperparameters

In words...

Model selection refers to one of the most delicate tasks in machine learning: choosing the best hyperparameters for the given learning algorithm, for the restricted model space, or for both.

The difficulty of this task stems directly from our inability to distinguish with certainty between a good and a bad predictive model. Indeed, this distinction depends on the generalization ability of the models, which, according to the main assumptions of supervised learning, cannot be measured exactly.

Thus, in practice, model selection typically relies on an estimation of the risk, for instance with an additional data sample or a cross-validation procedure. Note that when using an additional sample for model selection, this sample is called a validation sample in order to distinguish it from the test sample that is used to estimate the risk of the model at the end of the learning procedure.
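This three-way use of the data can be sketched as a single shuffle-and-split step. The function name, fractions, and seed below are illustrative choices, not part of any standard procedure:

```python
import random

def train_val_test_split(data, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle a dataset and split it into three disjoint parts:
    a training sample (to fit models), a validation sample (for
    model selection) and a test sample (for the final risk estimate)."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test
```

The test sample must be set aside before model selection begins, so that the final risk estimate is not biased by the selection step.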

A simple abstract model selection procedure can be described as follows. For each candidate value of the hyperparameter to be tuned, apply the learning algorithm to train a model and estimate the risk of this model with one of the two methods above. Then, retain the value of the hyperparameter that led to the smallest estimated risk.
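The abstract procedure above can be written generically. Here, `train_fn` (the learning algorithm) and `risk_fn` (the risk estimator, e.g. the validation error) are placeholders to be supplied by the user, not fixed names:

```python
def select_hyperparameter(candidates, train_fn, risk_fn, train_set, val_set):
    """For each candidate hyperparameter value, train a model and
    estimate its risk; return the value with the smallest estimate."""
    best_value, best_model, best_risk = None, None, float("inf")
    for value in candidates:
        model = train_fn(train_set, value)   # apply the learning algorithm
        risk = risk_fn(model, val_set)       # estimate the risk of the model
        if risk < best_risk:
            best_value, best_model, best_risk = value, model, risk
    return best_value, best_model, best_risk
```

Note that the loop only compares risk estimates; any estimator (held-out validation error, cross-validation score) can be plugged in as `risk_fn`.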

Such a procedure becomes impractical when the hyperparameter can take a large or infinite number of values. In such cases, a small grid of candidate values must be chosen before applying model selection, with the obvious drawback of possibly missing good values for the hyperparameter. The procedure is also quite demanding in situations with more than one hyperparameter, in which all combinations of candidate values must be evaluated.
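With several hyperparameters, the procedure amounts to a search over the cartesian product of the candidate grids, whose size grows multiplicatively with each new hyperparameter. A minimal sketch, with the same placeholder `train_fn` and `risk_fn` conventions as before:

```python
from itertools import product

def grid_search(grids, train_fn, risk_fn, train_set, val_set):
    """Evaluate every combination of hyperparameter values (the
    cartesian product of the candidate grids) and return the
    combination with the smallest estimated risk."""
    best_params, best_risk = None, float("inf")
    for combo in product(*grids.values()):
        params = dict(zip(grids.keys(), combo))
        model = train_fn(train_set, params)
        risk = risk_fn(model, val_set)
        if risk < best_risk:
            best_params, best_risk = params, risk
    return best_params, best_risk
```

With, say, 10 candidate values per hyperparameter, two hyperparameters already require 100 trainings, which explains why the cost quickly becomes prohibitive.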

In maths...

Given a function class $\F_{\rho}$ parametrized by $\rho$ and a learning algorithm $\mathcal{A} : ( \F_{\rho} , \{(\g x_i, y_i)\}_{i=1}^N, \gamma ) \mapsto f \in \F_{\rho}$ parametrized by $\gamma$, model selection amounts to finding the values of the hyperparameters $\rho$ and $\gamma$ that minimize the risk of the resulting model $f$.

This task cannot be solved exactly, since the risk cannot be computed without access to the true data distribution, which is assumed to be unknown in supervised learning.
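In practice, the risk is therefore replaced by an empirical estimate. For instance, with a held-out validation sample $\{(\g x_i, y_i)\}_{i=1}^{N_{val}}$ and a loss function $\ell$ (notation introduced here for illustration), one minimizes
$$
\hat{R}_{val}(f) = \frac{1}{N_{val}} \sum_{i=1}^{N_{val}} \ell\left( y_i, f(\g x_i) \right)
$$
over the models $f$ produced by the learning algorithm for the different candidate values of $\rho$ and $\gamma$.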