Applying a learning algorithm yields a predictive model. However, for this model to be usable, one also needs to know how good its predictions are, i.e., one needs to estimate the risk of the model.
We used the word estimate in the previous sentence because the risk is not a quantity that can be computed exactly (it measures the quality of predictions of unknown phenomena). The simplest estimate of the risk is the test error. However, it is applicable only when a sufficiently large amount of data is available.
When too few data are available, dividing them into two sets makes the training set, the test set, or both unrepresentative of the true data distribution. In this case, one typically resorts to one of the two cross-validation procedures presented below.
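As an illustration of the test-error estimate itself, here is a minimal sketch of the hold-out (train/test split) approach. It assumes scikit-learn's `LinearRegression` as a stand-in learning algorithm and the squared loss; the data set is synthetic and purely illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical data set: 200 points, 3 features (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Hold out part of the data as a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Train on the training set only.
model = LinearRegression().fit(X_train, y_train)

# The test error (here, the mean squared error on the held-out
# set) is an estimate of the risk of the model.
test_error = np.mean((y_test - model.predict(X_test)) ** 2)
print(f"Test-error estimate of the risk: {test_error:.4f}")
```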
The $K$-fold cross-validation procedure starts by dividing the original training set into $K$ subsets (typically $K = 5$, $7$, or $10$). Each subset is then left aside in turn: the learning algorithm is applied to train a model on the other $(K-1)$ subsets, and this model is tested on the subset left aside. The corresponding test error is recorded after each test, and the $K$-fold estimate of the risk is computed at the end of the procedure as the average of these $K$ test errors.
The Leave-One-Out (LOO) cross-validation procedure simply corresponds to the limit case of the $K$-fold method in which the number of subsets, $K$, coincides with the number of data points. The LOO procedure can be more accurate than the $K$-fold one, but it requires far more computation. However, when very few data points are available, the $K$-fold method can fail because the training subsets become too small.
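The following is a minimal LOO sketch, again assuming scikit-learn and the squared loss on a hypothetical small data set; `LeaveOneOut` generates the train/test splits, one per data point:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut

# Deliberately small hypothetical data set (illustration only).
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=30)

# Leave-One-Out: K equals the number of data points, so each model
# is trained on all points but one and tested on the point left out.
errors = []
for train_idx, test_idx in LeaveOneOut().split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    errors.append((y[test_idx] - model.predict(X[test_idx])) ** 2)

print(f"LOO estimate of the risk: {np.mean(errors):.4f}")
```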
For instance, with $K = 5$, the procedure yields one test error per iteration, which can be recorded in a table such as:

| Iteration | 1 | 2 | 3 | 4 | 5 |
|-----------|---|---|---|---|---|
| Error     |   |   |   |   |   |
Formally, let $D$ denote the training set, divided into $K$ subsets $S_1, \dots, S_K$. For each $j$, we apply the learning algorithm $\mathcal{A}$ to all the data except those in $S_j$ to train the model
$$ f_j = \mathcal{A}(D \setminus S_j), \quad j = 1, \dots, K, $$
and test the resulting model against $S_j$ to compute
$$ R_j = \frac{1}{|S_j|} \sum_{(\g x_i, y_i) \in S_j} \ell\left( y_i, f_j(\g x_i) \right). $$
Then, the $K$-fold cross-validation estimate of the risk is the average error over the $K$ subsets:
$$ R_{K\text{-fold}} = \frac{1}{K} \sum_{j=1}^K R_j. $$
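These formulas translate directly into code. The sketch below is one possible implementation, assuming NumPy, a generic learning algorithm `A` passed as a callable, and a point-wise loss function; all names are illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def kfold_risk(A, loss, X, y, K=5, seed=0):
    """K-fold cross-validation estimate of the risk.

    A    : learning algorithm, A(X_train, y_train) -> model with .predict
    loss : loss function ell(y_true, y_pred), applied point-wise
    """
    n = len(y)
    # Shuffle the indices, then split D into the subsets S_1, ..., S_K.
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, K)
    R = []
    for j in range(K):
        test = folds[j]                                    # S_j
        train = np.concatenate(
            [folds[m] for m in range(K) if m != j]         # D \ S_j
        )
        f_j = A(X[train], y[train])                        # f_j = A(D \ S_j)
        R.append(np.mean(loss(y[test], f_j.predict(X[test]))))  # R_j
    return np.mean(R)                                      # R_{K-fold}

# Usage with hypothetical data, linear regression and the squared loss:
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 0.0, -1.0]) + rng.normal(scale=0.1, size=100)
risk = kfold_risk(lambda Xt, yt: LinearRegression().fit(Xt, yt),
                  lambda yt, yp: (yt - yp) ** 2, X, y, K=5)
print(f"5-fold estimate of the risk: {risk:.4f}")
```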