Cross-validation procedures

In words...

After applying a learning algorithm, one obtains a predictive model that can be used to make predictions. However, in order for this model to be usable one also needs to know how good its predictions are, i.e., one needs to estimate the risk of the model.

We introduced the word estimate in the last sentence above since the risk is not a quantity that can be computed exactly (as it measures the quality of predictions of unknown phenomena). The most simple estimate of the risk is the test error. However, it is applicable only when we have access to a sufficiently large amount of data.

When the number of data is too small, dividing them into two sets results in one (if not both) of the training and test sets that becomes unrepresentative of the true data distribution. In this case, one typically resorts to one of the two cross-validation procedures presented below.

The $K$-fold cross-validation procedure starts by dividing the original training set into $K$ subsets (typically 5, 7 or 10). Then, for each one of these subsets left aside, we apply the learning algorithm to train a model on the other $(K-1)$ subsets and test it on the subset left aside. After each test, we memorize the corresponding test error and we compute the $K$-fold estimate of the risk at the end of procedure as the average of these test errors.

The Leave-One-Out (LOO) cross-validation procedure simply corresponds to the limit case of the $K$-fold method when the number of subsets, $K$, coincides with the number of data. The LOO procedure can be more accurate than the $K$-fold one, but implies much more computations. However, in cases where the number of data is very small, the $K$-fold method can fail due to training subsets of insufficient size.

In pictures...

The leave-one-out method at work

Here, we illustrate the leave-one-out procedure on a classification problem where the number of data (30) is too small for other methods.

The $K$-fold method at work

Now, we illustrate the $K$-fold method on a regression problem with $K=5$.


In maths...

The $K$-fold method

To estimate the risk of the model returned by a learning algorithm $\mathcal{A}$ on a given a data set $D = \{(\g x_i, y_i)\}_{i=1}^N$, the $K$-fold method starts by partitioning the data set into $K$ subsets $S_1,\dots,S_K$ of the same size (approximately, if $N$ is not a multiple of $K$). Then, for each subset $S_j$, it applies the learning algorithm $\mathcal{A}$ to the training set made of all but this subset:

$$ f_j = \mathcal{A}(D \setminus S_j ) ,\quad j=1,\dots,K, $$

and test the resulting model against $S_j$ to compute $$ R_j = \frac{1}{|S_j|} \sum_{(\g x_i, y_i) \in S_j} \ell ( y_i, f_j(\g x_i) ) . $$ Then, the $K$-fold cross validation estimate of the risk is the average error over the $K$ subsets: $$ R_{K-fold}(f) = \frac{1}{K} \sum_{j=1}^K R_j . $$

The Leave-One-Out (LOO) method

The LOO method can be seen as the limit case of $K$-fold corresponding to $K=N$, i.e., when each subset contains a single data instance. Thus, the LOO estimate of the risk of $f = \mathcal{A}( D)$ can be compactly expressed as $$ R_{LOO}(f) = \frac{1}{N} \sum_{i=1}^N \ell \left( y_i, \mathcal{A}( D \setminus \{(\g x_i, y_i)\} ) ( \g x_i ) \right) . $$