The maximum likelihood inference principle

In words...

The maximum likelihood inference principle is a general learning strategy that relies on the assumption that the conditional probability of the label given the input takes a particular form involving the predictive model. With this assumption, we define the likelihood of a predictive model with respect to a data set as the probability of the data set being generated by this model.

With these definitions, learning amounts to searching for the predictive model with maximum likelihood with respect to the training set.

Examples of classification algorithms based on this principle are logistic regression and linear discriminant analysis. In the context of regression, the least squares method can also be recovered as an application of this principle.

In pictures...

In maths...

Given a predictive model $f:\X\rightarrow \Y$, the definition of its likelihood relies on an assumption on the form of the conditional probability of the label given the input, i.e., that it can be expressed as $$ P( Y = y \ |\ X = \g x ) = h( y, f(\g x) ) , $$ for discrete random variables $X$ and $Y$ (the whole procedure extends to the continuous case by replacing probabilities by probability density functions).
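For instance, for binary classification with $\Y = \{0,1\}$ and a model whose output $f(\g x)\in(0,1)$ estimates the probability of the positive class, logistic regression corresponds to the Bernoulli form $$ h( y, f(\g x) ) = f(\g x)^{y} \left( 1 - f(\g x) \right)^{1-y} , $$ which equals $f(\g x)$ for $y=1$ and $1-f(\g x)$ for $y=0$.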

Given this assumption, formalized by the function $h$, we define the likelihood of a predictive model $f:\X\rightarrow \Y$ with respect to a data set $\{(\g x_i, y_i)\}_{i=1}^N$ as \begin{align*} P^N ( & \{(X_i,Y_i) = (\g x_i, y_i)\}_{i=1}^N ) & \mbox{(definition of the likelihood)}\\ &= \prod_{i=1}^N P ((X_i,Y_i) = (\g x_i, y_i)) & (\mbox{due to the independence of the random pairs } (X_i,Y_i) )\\ & =\prod_{i=1}^N P( Y_i = y_i \ |\ X_i = \g x_i )\, P ( X_i = \g x_i) & \mbox{(due to the definition of the conditional probability)}\\ & = \prod_{i=1}^N h( y_i, f(\g x_i) )\, P ( X_i = \g x_i) & \mbox{(due to the assumption on the model)} \end{align*}
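As a minimal numerical sketch (not part of the derivation above), the code below computes this likelihood and its logarithm for a fixed logistic model on a toy data set, using the Bernoulli form of $h$ given earlier; the weights and data are hypothetical, and the $P(X_i = \g x_i)$ factors are omitted since they do not depend on the model.

import numpy as np

def h(y, y_hat):
    # Assumed Bernoulli form: h(y, y_hat) = y_hat^y * (1 - y_hat)^(1 - y), for y in {0, 1}
    return y_hat**y * (1.0 - y_hat)**(1 - y)

def likelihood(f, X, y):
    # Product over the data set of h(y_i, f(x_i)); the P(X_i = x_i) factors
    # are left out since they do not depend on the model f
    return np.prod([h(yi, f(xi)) for xi, yi in zip(X, y)])

def log_likelihood(f, X, y):
    # Same quantity on the log scale, which avoids numerical underflow
    return np.sum([np.log(h(yi, f(xi))) for xi, yi in zip(X, y)])

# A fixed logistic model f(x) = sigmoid(w . x) on a toy data set (hypothetical)
w = np.array([1.0, -2.0])
f = lambda x: 1.0 / (1.0 + np.exp(-w @ x))

X = np.array([[0.5, 0.1], [-1.0, 0.3], [2.0, -0.5]])
y = np.array([1, 0, 1])

print(likelihood(f, X, y))      # product of the h(y_i, f(x_i))
print(log_likelihood(f, X, y))  # sum of the log h(y_i, f(x_i))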

The maximum likelihood principle amounts to searching for the model $f \in\F \subset \Y^\X$ that maximizes this likelihood, which can be expressed as $$ \hat{f} =\arg\max_{f\in\F} \ \prod_{i=1}^N h( y_i, f(\g x_i) )\, P ( X_i = \g x_i)= \arg\max_{f\in\F} \ \sum_{i=1}^N \left( \log h( y_i, f(\g x_i) ) + \log P ( X_i = \g x_i) \right) , $$ since taking the log does not influence the argument of the maximum but only the value of the maximum itself. Similarly, since the terms $\log P ( X_i = \g x_i)$ do not depend on $f$, we can remove them to obtain the simplified formulation $$ \hat{f} = \arg\max_{f\in\F} \ \sum_{i=1}^N \log h( y_i, f(\g x_i) ) . $$
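As an illustrative sketch of this maximization (one possible optimization scheme among many), the code below fits a linear logistic model by plain gradient ascent on the simplified log-likelihood; the data, step size and number of iterations are arbitrary choices for the example.

import numpy as np

def log_likelihood(w, X, y):
    # Simplified log-likelihood sum_i log h(y_i, f(x_i)) for the Bernoulli h
    # and the logistic model f(x) = sigmoid(w . x)
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient(w, X, y):
    # Gradient of the log-likelihood with respect to the weights w
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (y - p)

# Toy data generated from a noisy linear rule (hypothetical)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] - X[:, 1] + 0.3 * rng.normal(size=100) > 0).astype(float)

# Gradient ascent on the log-likelihood
w = np.zeros(2)
step = 0.01
for _ in range(1000):
    w = w + step * gradient(w, X, y)

print(w, log_likelihood(w, X, y))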

Relationship with the ERM principle

We obtain an equivalence between the maximum likelihood and the ERM principles by choosing $$ h( y, \hat{y} ) =\frac{1 }{\gamma} \exp\left( - \ell(y,\hat{y})\right) , $$ where $\gamma = \int_\Y \exp\left( - \ell(y,\hat{y})\right) dy$ is a normalization constant introduced to guarantee that $h(\cdot, \hat{y})$ is a probability distribution, i.e., that it sums (or integrates) to one over $\Y$; for the simplification below, $\gamma$ is assumed not to depend on $\hat{y}$, as is the case, e.g., for the squared loss. With this choice of $h$, the maximum likelihood principle leads to $$ \hat{f} = \arg\max_{f\in\F} \ \sum_{i=1}^N \left( -\log \gamma - \ell(y_i,f(\g x_i)) \right) $$ or, after removing the constant terms, $$ \hat{f} = \arg\max_{f\in\F} \ -\sum_{i=1}^N \ell(y_i,f(\g x_i)) . $$ Since maximizing a functional is equivalent to minimizing the same functional with the opposite sign, we see that this exactly corresponds to the ERM principle.
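For example, with the squared loss $\ell(y,\hat{y}) = (y - \hat{y})^2$ in a regression setting where $\Y$ is the real line, we get $$ \gamma = \int_\Y \exp\left( -(y-\hat{y})^2 \right) dy = \sqrt{\pi} , $$ which indeed does not depend on $\hat{y}$, and $$ h( y, \hat{y} ) = \frac{1}{\sqrt{\pi}} \exp\left( -(y-\hat{y})^2 \right) , $$ i.e., the density of a Gaussian distribution with mean $\hat{y}$ and variance $1/2$. Thus, the ERM principle with the squared loss (the least squares method) coincides with maximum likelihood estimation under the assumption of Gaussian noise around the model output.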

Conversely, we can choose the loss function $$ \ell(y,\hat{y}) = -\log\left( \gamma\, h(y, \hat{y} ) \right) $$ in the ERM approach to recover the maximum likelihood formulation.
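For instance, starting from the Bernoulli form $h(y,\hat{y}) = \hat{y}^{y}(1-\hat{y})^{1-y}$ used in logistic regression, for which $\gamma = 1$ since $h(\cdot,\hat{y})$ already sums to one over $\Y = \{0,1\}$, the corresponding loss is the cross-entropy loss $$ \ell(y, \hat{y}) = -\log h(y,\hat{y}) = -y \log \hat{y} - (1-y)\log(1-\hat{y}) , $$ so that training a logistic regression model by maximum likelihood is equivalent to minimizing the empirical cross-entropy.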