Estimating the risk with a test sample

In words...

After applying a learning algorithm, one obtains a predictive model that can be used to make predictions. However, in order for this model to be usable one also needs to know how good its predictions are, i.e., one needs to estimate the risk of the model.

We used the word estimate in the last sentence because the risk is not a quantity that can be computed exactly (as it measures the quality of predictions of unknown phenomena). The simplest estimate of the risk is the test error.

The test error is the average loss computed on an independent and identically distributed data sample, called the test sample. Here, independence serves as a guarantee. If the test sample is not independent of the training sample, then the guarantee is broken and the test error cannot be considered a valid estimate of the risk. In other words, to obtain a valid estimate of the risk, one should use additional data that was left aside during the whole training procedure, including the tuning of hyperparameters.

In practice, we are typically given a single data set that has to be randomly divided into a training set and a test set.
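As a minimal sketch of such a random division, assuming the data set is stored as a list of `(x, y)` pairs: the function name `split_train_test`, the 30% test fraction and the toy data below are illustrative assumptions, not choices made in the text.

```python
import random

def split_train_test(data, test_fraction=0.3, seed=0):
    """Randomly divide `data` into a training set and a test set."""
    rng = random.Random(seed)           # fixed seed so the split is reproducible
    indices = list(range(len(data)))
    rng.shuffle(indices)                # random permutation of the example indices
    n_test = int(round(test_fraction * len(data)))
    test = [data[i] for i in indices[:n_test]]
    train = [data[i] for i in indices[n_test:]]
    return train, test

# Toy data set of 10 labeled examples (hypothetical values).
data = [(x, 2 * x) for x in range(10)]
train, test = split_train_test(data, test_fraction=0.3, seed=0)
print(len(train), len(test))
```

The test set returned here should then be left untouched until the model, including its hyperparameters, is fully trained.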

When the amount of available data is too small, we cannot afford to leave aside a representative sample for the test. In this case, one typically resorts to a cross-validation procedure.
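The text does not say which cross-validation procedure is meant. The sketch below shows one common variant, K-fold cross-validation, in which each example is used exactly once for testing and the K resulting errors are averaged. The learner `fit_mean`, the squared loss and the toy data are hypothetical placeholders chosen only to keep the example self-contained.

```python
import random

def k_fold_error(data, fit, loss, k=5, seed=0):
    """Average held-out error over k folds (K-fold cross-validation)."""
    rng = random.Random(seed)
    indices = list(range(len(data)))
    rng.shuffle(indices)
    folds = [indices[j::k] for j in range(k)]   # k disjoint groups of indices
    fold_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [data[i] for i in indices if i not in held]
        valid = [data[i] for i in held_out]
        f = fit(train)                          # train without the held-out fold
        fold_errors.append(sum(loss(y, f(x)) for x, y in valid) / len(valid))
    return sum(fold_errors) / k

def fit_mean(train):
    """Hypothetical learner: always predict the mean training label."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def squared_loss(y, y_pred):
    return (y - y_pred) ** 2

# Toy data with constant labels, so the mean predictor makes no error.
data = [(x, 3.0) for x in range(20)]
cv_error = k_fold_error(data, fit_mean, squared_loss, k=5)
print(cv_error)
```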

In pictures...

The test error in practice

Here, we illustrate the complete learning procedure on a classification problem.

In maths...

The risk is defined as the expectation of the loss function and thus cannot be computed without knowledge of the true data distribution. The test error is a simple estimate of the risk that can be computed from data.

Assume that we have access to both a training sample $$ D = \{(X_i,Y_i)\}_{i=1}^N $$ and a test sample $$ D_{test} = \{(X_i,Y_i)\}_{i=1}^{N_t} $$ consisting, respectively, of $N$ and $N_t$ independent and identically distributed copies of $(X,Y)$. Note that this also implies the independence of $D$ and $D_{test}$.

The procedure is as follows: we train a model $f$ on the training sample $D$ and then estimate its risk from the test sample $D_{test}$ by computing the test error $$ R_{test}(f) = \frac{1}{N_t} \sum_{i=1}^{N_t} \ell(Y_i, f(X_i) ) $$ where the sum runs over the examples of $D_{test}$ only.
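The formula for $R_{test}(f)$ translates directly into code. The sketch below assumes a binary classification problem with the 0-1 loss. The classifier `f`, which stands in for an already trained model, and the five test examples are hypothetical.

```python
def zero_one_loss(y, y_pred):
    """0-1 loss: 1 for a misclassification, 0 otherwise."""
    return 0.0 if y == y_pred else 1.0

def compute_test_error(f, sample, loss):
    """Average loss of the model f over the given sample."""
    return sum(loss(y, f(x)) for x, y in sample) / len(sample)

def f(x):
    """Hypothetical trained classifier: predicts the sign of x."""
    return 1 if x >= 0 else -1

# Hypothetical test sample of N_t = 5 labeled examples (x, y).
sample = [(-2.0, -1), (-0.5, -1), (0.3, 1), (1.5, -1), (2.0, 1)]
r_test = compute_test_error(f, sample, zero_one_loss)
print(r_test)  # one mistake out of five examples, i.e. about 0.2
```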

Since $f$ is trained independently of the test sample, the law of large numbers tells us that, as $N_t$ tends towards infinity, this estimate converges to the expectation of the loss function, $\mathbb{E}[\ell(Y, f(X))]$, i.e., to the risk of $f$.
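This convergence can be observed on a simulated problem where the risk is known. The sketch below assumes a synthetic distribution that is not from the text: $X$ is uniform on $[-1, 1]$ and $Y$ is the sign of $X$, flipped with probability 0.1. For the fixed classifier `f` that predicts the sign of $x$, the risk under the 0-1 loss is then exactly 0.1.

```python
import random

NOISE = 0.1  # label-flip probability, which is also the true risk of f below

def draw_sample(n, rng):
    """Draw n i.i.d. copies of (X, Y) from the assumed synthetic distribution."""
    sample = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = 1 if x >= 0 else -1
        if rng.random() < NOISE:
            y = -y                      # flip the label with probability NOISE
        sample.append((x, y))
    return sample

def f(x):
    """Fixed classifier, chosen independently of the test samples."""
    return 1 if x >= 0 else -1

def zero_one_test_error(f, sample):
    """Fraction of examples in the sample misclassified by f."""
    return sum(1.0 for x, y in sample if f(x) != y) / len(sample)

rng = random.Random(0)
estimates = {n_t: zero_one_test_error(f, draw_sample(n_t, rng))
             for n_t in (100, 10_000, 100_000)}
for n_t, r in estimates.items():
    print(f"N_t = {n_t:>7}: test error = {r:.4f} (risk = {NOISE})")
```

With small $N_t$ the test error can be noticeably off, while with larger test samples it should settle close to 0.1.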