# Consistency

## In words...

Consistency is one of the first and least requirement one may ask from a learning algorithm.

Consistency is an asymptotic property of a learning algorithm, which basically ensures that the algorithm returns better models when given more training data.

## In pictures...

## In maths...

### Weak and strong consistency

A learning algorithm returning a model $f_N$ when applied to a training sample $D = \{(X_i, Y_i)\}_{i=1}^N \subset (\X\times \Y)^N$ of size $N$ is said to be **consistent** with respect to a certain distribution of $(X,Y)$ if the risk of the returned model converges in probability to the Bayes risk, $R^* = \inf_{f : \X \rightarrow \Y} R(f)$, when $N$ goes to infinity:
$$
\forall \epsilon>0, \quad \lim_{N\rightarrow +\infty} P\left\{ D_N\ : \ R(f_N) - R^* > \epsilon \right\} = 0 .
$$
If the convergence is almost sure, i.e., if
$$
P\left\{ \lim_{N\rightarrow +\infty} R(f_N) = R^* \right\} = 1 ,
$$
then the learning algorithm is **strongly consistent**.

Note that strong consistency implies consistency. The difference between the two is as follows. On the one hand, consistency implies that the probability of drawing a training sample such that the risk of the returned model is close to the best achievable risk $R^*$ tends to 1 when the size $N$ of the training sample increases. On the other hand strong consistency implies that when the size of the training sample increases, the risks of the returned model tends towards the best possible for all possible training samples of this size (except a set of training samples with zero measure).

### Universal consistency

Consistency (weak or strong) of an algorithm as defined above only holds with respect to a particular distribution of $(X,Y)$ and thus with respect to a particular problem. But since one of the assumptions at the core of machine learning is that we do not have prior knowledge of this distribution, the consistency of algorithms does not offer much help to choose an algorithm for a particular task. Under such an agnostic assumption, what we need is to characterize algorithms according to whether they are consistent over a large set of possible distributions.

A learning algorithm is said to be **universally (strongly) consistent** if it is (strongly) consistent for any distribution of $(X,Y)$.