Parametrized model spaces and hyperparameters

In words...

In a machine learning problem characterized by a given input space and a given label space, a predictive model is a function that assigns a label to every possible input.

Instead of searching for the model in the set of all such functions, most learning algorithms consider a restricted model space or a function class.

In machine learning, we often encounter parameterized function classes for which the class of functions itself is defined up to the value of some parameters. A typical example is the class of polynomials parametrized by the degree of the polynomials.

In order to distinguish between the parameters of the model and the parameters of the model class, we call the latter hyperparameters. This emphasizes the high-level nature of these parameters that define the structure of the model, in comparison with the model parameters that are often numerical values without much meaning.

The distinction between parameters and hyperparameters is of primary importance in order to understand the overall learning procedure. Indeed, while the values of the parameters are automatically determined by the learning algorithm, the hyperparameters must be tuned by the user. It is therefore critical to understand their meaning and influence on the resulting model.

In practice, finding the optimal value for a hyperparameter is a difficult task, known as model selection and which typically relies on an estimation of the risk, for instance with an additional data sample or a cross-validation procedure. Note that when using an additional sample to tune hyperparameters, this sample is called a validation sample in order to distinguish it from the test sample that is used to estimate the risk of the model at the end of the learning procedure.

In pictures...

In maths...

Given an input space $\X$ and a label space $\Y$, a predictive model is a function $$ f : \X\rightarrow \Y $$ that maps any input vector $\g x$ to a label $f(\g x)$.

Instead of searching for a model in the set of all such function $\Y^{\X}$, most learning algorithms consider a restricted model space $$ \F \subset \Y^\X, $$ also called a function class.

Parametrized function classes are defined up to the value of some parameter. A popular example of such a function class is the class of polynomial functions, which for an input space $\X\subseteq\R$ and a label space $\Y\subseteq\R$, can be written as $$ \F^n = \left\{ f\in\Y^\X \ :\ f(x) = w_0 + w_1 x + \dots + w_n x^n,\ (w_0,\dots,w_n)\in\R^{n+1} \right\} . $$ Here, the parameters of the model that have to be learned from data are the $w_k$'s while $n$ is the hyperparameter that defines the maximal degree of all the polynomials in $\F^n$. When learning a model from this class, one has to fix $n$ to a suitable value depending on the problem at hand.