In supervised learning, the computer is first given a training set of labeled data, i.e., of pairs of input patterns and corresponding labels.
Then, the goal is to **generalize**, i.e., predict the correct label of previsouly unseen patterns. To achieve this goal, the computer applies a learning algorithm on the training set to obtain a predictive model. An expected behavior of these learning algorithms is that the larger the training set is, the better the predictive model should be.

In the sentence above, "better" needs to be defined. In supervised learning, the quality of a model is measured in terms of the risk, also known as the **generalization error**.

The theory that allows us to study learning machines is known as statistical learning theory and formalizes the subject in probabilistic terms. The introduction of a probabilstic framework is rather natural given that we want to make statements about a problem dealing with "previously unseen" (and thus unknown) objects.

The first step of this formalization is to consider a data pair (a pattern with its label) as a realization of a pair of random variables with a joint probability distribution (a random pair). Then, the following assumptions are usually made.

- The process generating the data pairs does not evolve: all data pairs (the ones available for training as well as the unseen ones) are
**identically distributed**. - All data pairs are drawn at random without influence from the previously observed data: all data pairs are
**independent**. - The probability distribution of the data pairs is
**unknown**. If it was known, then we would have access to all the required information to compute the best predictions and there would be nothing left to learn. - We have access to a sample of data pairs: the training sample is the only information available for learning.

There is also an additional assumption rarely explicited but always needed: the smoothness assumption stating that the label of a pattern very close to another cannot behave very differently compared with the label of the other pattern.

Given that the probability distribution of the data is unknown, the risk of a model cannot be computed and thus we cannot select the best model with minimal risk: the model must be learned from data only.

Learning the data by heart is a very easy task for a computer. But with respect to the goal that is to generalize, learning by heart is a bad idea. As a simple illustration, consider you take a course and learn by heart the solutions to all exercises given in that course, but without understanding the course itself. Then it is most likely that you would fail the exam, for which you would have to generalize what you learned during the course to solve new problems; and learning by heart does not help you do that.

If learning by heart is not the right way to go, learning a model that outputs correct predictions for the training data remains a well-founded idea, which is behind the Empirical Risk Minimization (ERM) principle at the basis of many learning algorithms.

Two categories of supervised learning problems can be distinguished based on the nature of the labels. If the labels can be any real number, then the problem is a regression one. Alternatively, if the labels can only take a finite number of values, then this is a classification problem.

You can also have a look at the drawing recognition demo to play with these effects on your own.

Given a training set of $N$ labeled examples, i.e., of $N$ pairs of input vectors and labels, $$ \{ (\g x_1,y_1), \dots, (\g x_N,y_N) \} \in (\X\times \Y)^N, $$ the goal of supervised learning is to estimate a model $f : \X \rightarrow \Y$ that can output the prediction $f(\g x)$ of the label $y$ with minimum error.

In statistical learning theory, this problem is formalized in probabilistic terms as follows.

Let $(X,Y)$ be a random pair (a pair of random variables) taking values in $\X\times \Y$ with fixed but **unknown** joint probability distribution $P_{X,Y}(X,Y)$.

The **training sample**
$$
D = \{(X_i,Y_i)\}_{i=1}^N
$$
is a sample of $N$ independent and identically distributed copies of $(X,Y)$.

Given a realization of $D$, called the **training set**, the goal is to find the function $f : \X \rightarrow \Y$ that minimizes the expected error, called the risk.

Given that $P_{X,Y}(X,Y)$ is unknown, the risk cannot be computed nor minimized and we have to find a surrogate pinciple to learn $f$. The most straightforward of such principles is the Empirical Risk Minimization (ERM) principle, at the basis of many learning algorithms. Another popular principle is the maximum likelihood inference principle.