No free lunch theorems

In words...

No free lunch theorems state that, given a finite training sample, for any learning algorithm there is some data distribution for which it performs poorly. Note that this also holds for consistent learning algorithms, which asymptotically yield the optimal model.

In pictures...

In maths...

No-free-lunch theorems consider the average risk $\E_{D_N} R(f_N)$ of the model $f_N$ returned by a learning algorithm, where the average is taken over all possible training samples $D_N$ of a given size $N$.
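To make this average concrete, here is a minimal Monte Carlo sketch in Python (the `learn` and `sample` functions are hypothetical placeholders standing for a learning algorithm and a data-generating distribution, not objects defined in this tutorial): it repeatedly draws a training sample of size $N$, runs the learning algorithm on it, estimates the risk of the resulting classifier on fresh data, and averages over the repetitions.

```python
import numpy as np

def average_risk(learn, sample, N, n_repeats=200, n_test=10000, seed=0):
    """Monte Carlo estimate of the average risk E_{D_N} R(f_N).

    learn  : (X_train, Y_train) -> classifier f, with f(X) returning predicted labels
    sample : (n, rng) -> (X, Y), drawing n i.i.d. pairs from the data distribution
    """
    rng = np.random.default_rng(seed)
    risks = []
    for _ in range(n_repeats):
        X_train, Y_train = sample(N, rng)     # draw a training sample D_N of size N
        f = learn(X_train, Y_train)           # run the learning algorithm on D_N
        X_test, Y_test = sample(n_test, rng)  # fresh data to estimate the risk R(f_N)
        risks.append(np.mean(f(X_test) != Y_test))
    return float(np.mean(risks))
```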

No free lunch for binary classification

For binary classification, we have the following no-free-lunch theorem.

Let $\epsilon > 0$ be an arbitrarily small number. For any integer $N$ and any learning algorithm $\mathcal{A} : (\X\times\Y)^N \rightarrow \Y^{\X}$, there exists a distribution of $(X,Y)$ with Bayes risk $R^* = 0$ such that $$ \E_{D_N} R(\mathcal{A} ( D_N ) ) \geq \frac{1}{2} - \epsilon $$ where $D_N = ( X_i, Y_i )_{1\leq i \leq N}$ denotes a training sample of $N$ independent copies of $(X,Y)$.
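To see why this does not contradict the consistency mentioned above, note that the order of the quantifiers is what matters (this is a standard reading of the theorem rather than part of its statement): the distribution yielding the large error is allowed to depend on both $N$ and the algorithm, $$ \forall N, \ \forall \mathcal{A}, \quad \sup_{P_{X,Y}:\, R^*=0} \E_{D_N} R(\mathcal{A}(D_N)) \geq \frac{1}{2} - \epsilon, $$ whereas consistency fixes the distribution first and only describes the limit in $N$: $$ \forall P_{X,Y}, \quad \lim_{N\rightarrow+\infty} \E_{D_N} R(\mathcal{A}(D_N)) = R^* . $$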

The proof works by constructing a family of distributions such that any learning algorithm has a large average error on at least one distribution of the family.
  1. Step 1: constructing the family of distributions of $(X,Y)$ with $R^*=0$. Let the distribution of $X$ be discrete with $P(X = i) = 1/K$ for $i\in\{1,\dots, K\}$ and 0 otherwise, for some integer $K$. The distribution of $Y$ is parametrized by a number $b\in[0,1)$ whose binary expansion is $0.b_1b_2\dots$ with $b_i \in \{0,1\}$. Given that $X$ takes only integer values, we can define $Y = 2 b_X -1$, i.e., the label $Y$ depends on the bit of index $X$ in the binary expansion of $b$.
    With $b$ fixed, $Y$ is a deterministic function of $X$, and the problem characterized above has Bayes risk $R^*=0$, attained by the classifier $t(X)= 2b_X -1$.
  2. Step 2: randomize the parameter $b$. Consider now that $b$ is a realization of a random variable $B$ uniformly distributed in $[0,1)$ and independent of both $X$ and the training sample $D_N$. Let $f_N = \mathcal{A}(D_N)$ denote the classifier returned by the learning algorithm and let $$ R_b = \E_{D_N} R(f_N) = \E_{D_N} P_{X,Y} (f_N(X) \neq Y \ |\ D_N, B=b) $$ denote its average risk for the distribution parametrized by $b$. The existence of a distribution leading to a large error then depends on the supremum $\sup_{b\in[0,1)} R_b$, which can be lower bounded by the expectation: $$ \sup_{b\in[0,1)} R_b \geq \E_B R_B . $$
    We have \begin{align} \E_B R_B &= \E_{B,D_N} P_{ X, Y}\left\{ f_N ( X) \neq Y\ |\ D_N , B\right\} \\ &= P_{B, D_N, X, Y}\left\{ f_N ( X) \neq Y \right\} \\ &= P_{B, D_N, X}\left\{ f_N ( X) \neq 2 B_X -1 \right\} \\ &= \E_{X, D_N} P_{B | X, D_N}\left\{ f_N ( X) \neq 2 B_X -1 \ |\ X, D_N \right\} \end{align} Since $X$ is discrete, $D_N$ is also discrete and the expectation $\E_{X,D_N}$ is a weighted sum over all possible combinations of their values. Since all the terms in this sum are nonnegative, we can lower bound the expectation by summing over a subset of them. For any subset $A \subset \X \times (\X\times\Y)^N$ of pairs $(x, d_n)$, we have \begin{align} \E_B R_B &= \sum_{x \in \X,\ d_n \in (\X\times\Y)^N } P_{B | X, D_N}\left\{ f_N ( x) \neq 2 B_x -1 \ |\ X=x, D_N=d_n \right\} P\{X=x, D_N = d_n\} \\ &\geq \sum_{ (x, d_n) \in A } P_{B | X, D_N}\left\{ f_N ( x) \neq 2 B_x -1 \ |\ X=x, D_N=d_n \right\} P\{ X=x, D_N=d_n \} \end{align}
    In particular, consider the set $A$ of all cases where $X\neq X_i$ for all $i=1,\dots,N$. Since $B$ is uniformly distributed in $[0,1)$, the coefficients of its binary expansion form a sequence of independent random variables with $P(B_i = 0) = P(B_i = 1) = 1/2$. The training sample only reveals the bits $B_{X_1},\dots,B_{X_N}$, so for $(x, d_n) \in A$ the bit $B_x$ remains uniformly distributed on $\{0,1\}$ given $X=x$ and $D_N=d_n$. In such cases, $f_N(x)$ is a deterministic value in $\{-1,+1\}$ and we have $$ P_{B | X, D_N}\left\{ f_N ( x) \neq 2 B_x -1 \ |\ X=x, D_N=d_n \right\} = \frac{1}{2} \I{f_N(x) \neq 1} + \frac{1}{2} \I{f_N(x) \neq -1} = \frac{1}{2}, $$ resulting in \begin{align} \E_B R_B &\geq \frac{1}{2} \sum_{ (x, d_n) \in A } P\{ X=x, D_N=d_n \} \\ &= \frac{1}{2} P\{ (X, D_N) \in A \} \\ &= \frac{1}{2} P\{ X \neq X_1, X\neq X_2, \dots, X\neq X_N \} \\ &= \frac{1}{2} \E_X P\{ X \neq X_1, X\neq X_2, \dots, X\neq X_N \ |\ X \} \\ &= \frac{1}{2} \E_X \left[ \left( P( X \neq X_1\ |\ X) \right)^N \right]\\ &= \frac{1}{2} \E_X \left[ \left( 1 - P( X_1 = X \ |\ X) \right)^N\right] \\ &= \frac{1}{2} \E_X \left[ \left( 1 - \frac{1}{K} \right)^N \right]\\ &= \frac{1}{2} \left( 1 - \frac{1}{K} \right)^N \end{align} where we used the fact that the $X_i$ are independent copies of $X$, uniformly distributed on $\{1,\dots,K\}$, so that $P( X_1 = X \ |\ X) = 1/K$.
    Finally, this yields the lower bound $$ \sup_{b\in[0,1)} R_b \geq \frac{1}{2} \left( 1 - \frac{1}{K} \right)^N $$ which tends to $1/2$ when $K\rightarrow +\infty$. Thus, for any $\epsilon>0$, we can choose $K$ large enough that $\frac{1}{2} \left( 1 - \frac{1}{K} \right)^N > \frac{1}{2} - \epsilon$; there is then a value of $b$, and hence a distribution of $(X,Y)$ with $R^*=0$, for which $\E_{D_N} R(\mathcal{A}(D_N)) \geq \frac{1}{2} - \epsilon$, which proves the theorem.
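As a sanity check on this construction (not part of the original argument), here is a small Python simulation: it draws the relevant bits of $B$, generates a training sample, runs a deliberately simple learning algorithm (a classifier that memorizes the training pairs and predicts $-1$ everywhere else, chosen only for illustration), and compares a Monte Carlo estimate of $\E_B R_B$ with the lower bound $\frac{1}{2}(1-1/K)^N$.

```python
import numpy as np

rng = np.random.default_rng(0)

K, N = 1000, 100     # support size of X and training sample size
n_runs = 2000        # Monte Carlo repetitions over (B, D_N)
n_test = 1000        # test points per run to estimate R(f_N)

errors = []
for _ in range(n_runs):
    # Draw the bits B_1, ..., B_K of B (only these bits matter since X <= K).
    bits = rng.integers(0, 2, size=K + 1)        # bits[i] = B_i, index 0 unused
    # Draw the training sample D_N: X_i uniform on {1,...,K}, Y_i = 2*B_{X_i} - 1.
    X_train = rng.integers(1, K + 1, size=N)
    Y_train = 2 * bits[X_train] - 1
    # A simple learning algorithm: memorize the training pairs, predict -1 elsewhere.
    memory = dict(zip(X_train.tolist(), Y_train.tolist()))
    # Estimate the risk R(f_N) on fresh test data from the same distribution.
    X_test = rng.integers(1, K + 1, size=n_test)
    Y_test = 2 * bits[X_test] - 1
    predictions = np.array([memory.get(int(x), -1) for x in X_test])
    errors.append(np.mean(predictions != Y_test))

print("estimated E_B R_B :", np.mean(errors))
print("lower bound       :", 0.5 * (1 - 1 / K) ** N)
```

For this particular algorithm the estimate should be close to the bound itself, since it errs only on inputs outside the training sample, where the label is an unbiased fresh bit; the proof shows that the same lower bound on $\E_B R_B$ holds for any learning algorithm.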