The Bayes classifier is the theoretically optimal classifier for a given classification problem. This is why it is also called the *target* classifier: it is the classifier we aim for when using learning algorithms.

For a given input pattern, the Bayes classifier outputs the most likely label, and thus provides the prediction that is the least likely to be an error among all possible choices of label. Since the Bayes classifier applies this scheme to all possible inputs, it achieves the smallest probability of error, a quantity also known as the **Bayes' risk**, i.e., the risk of the Bayes classifier, which is by definition the smallest risk one can obtain for a given problem.

Note that the Bayes classifier requires knowledge of the class-membership probabilities, which are typically unknown. This is why the Bayes classifier cannot be applied in practice. However, it plays an important role in the analysis of other learning algorithms.

The Bayes classifier is the best classifier among all possible classifiers. Another theoretically important classifier is the best classifier among a given set of classifiers.

For a given $x$, the Bayes classifier prediction is the most likely one, but it can still be wrong, with probability $1 - \max_{y\in\Y} P(Y=y\ |\ X=x)$. Averaging this probability of misclassification over all possible $x$ gives the Bayes' risk. In the case of a uniform distribution for $X$, the risk is proportional to the area between the constant 1 and the curve of the pointwise maximal probability.

The Bayes classifier, denoted $t$, is defined by $$ \forall x\in\X , \quad t(x) = \arg\max_{y\in\Y} P(Y=y\ |\ X=x ) . $$ In the binary case where $\Y = \{-1, +1\}$, this simplifies to $$ t(x) = \begin{cases} +1, & \mbox{if }\ P(Y= +1\ |\ X=x ) \geq P(Y= -1\ |\ X=x ) \\ -1, & \mbox{otherwise}. \end{cases} $$
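This decision rule can be sketched in a few lines of code. The following is a minimal illustration with made-up posterior probabilities (the function names and the numbers are our own, not part of the course material):

```python
# Sketch of the Bayes decision rule t(x) = argmax_y P(Y = y | X = x),
# for a toy problem where the posterior is given as a dict.

def bayes_classifier(posterior):
    """Return the most likely label, given a dict mapping each
    label y to P(Y = y | X = x)."""
    return max(posterior, key=posterior.get)

def bayes_binary(p_plus):
    """Binary case with labels {-1, +1}: predict +1 iff
    P(Y = +1 | X = x) >= 1/2 (ties go to +1, as in the definition)."""
    return +1 if p_plus >= 0.5 else -1

# Hypothetical posterior at some input x, for three labels:
posterior = {1: 0.2, 2: 0.5, 3: 0.3}
print(bayes_classifier(posterior))  # prints 2, the most likely label
print(bayes_binary(0.3))            # prints -1
```

Note that the binary rule is just the argmax specialized to two labels, using $P(Y=-1\ |\ X=x) = 1 - P(Y=+1\ |\ X=x)$.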

The risk of this classifier, known as the Bayes' risk, is given by $$ R(t) = P(Y \neq t(X) ) = \E_X [ P(Y \neq t(x)\ |\ X=x ) ] $$ where the probability of misclassifying a given $x$ can be rewritten, using the complement rule, as $$ P(Y \neq t(x)\ |\ X=x ) = 1 - P(Y = t(x)\ |\ X=x ) . $$ Since the Bayes classifier assigns the label with maximal probability, we also have $$ P(Y = t(x)\ |\ X=x ) = \max_{y\in\Y} P(Y=y\ |\ X=x ) $$ and thus $$ R(t) = \E_{X} [1 - \max_{y\in\Y} P(Y=y\ |\ X=x ) ] . $$

Note that in the deterministic case, for all $y$, $P(Y=y\ |\ X=x )\in\{0,1\}$, which implies that the Bayes' risk is zero (the Bayes classifier perfectly classifies all inputs).

We will now prove the optimality of the Bayes classifier in binary classification, i.e., when $\Y = \{-1, +1\}$, as a series of exercises.

- Give a simple expression of the indicator function $\I{f(x) = -1}$ as a function of $\I{f(x) = 1}$ that is valid for any binary classifier $f : \X \rightarrow \Y$ and any $x\in\X$.

$$\I{f(x) = -1} = 1 - \I{f(x) = 1}$$

- Show that, for any binary classifier $f$ and any input $x$, the probability of misclassifying $x$ can be computed as $P(f(X) \neq Y\ |\ X = x) = [1 - 2 P(Y= 1\ |\ X=x)] \I{f(x) = 1} + P(Y= 1\ |\ X=x)$.

\begin{align} P(f(X) &\neq Y\ |\ X = x) \\ &= P( \{f(X) = 1, Y = -1 \} \cup \{ f(X) = -1, Y = 1\} \ |\ X=x) & (\{f(X) \neq Y\} \mbox{ is the union of two events}) \\ &= P(f(X) = 1, Y = -1 \ |\ X=x) + P(f(X) = -1, Y = 1\ |\ X=x) & ( \mbox{ these events are disjoint}) \\ &= P(f(X) = 1\ |\ X=x) P(Y = -1 \ |\ X=x) + P(f(X) = -1\ |\ X=x) P( Y = 1\ |\ X=x) & (f(X) = 1 \mbox{ and } Y= -1 \mbox{ are conditionally independent}) \\ &= \I{f(x) = 1} P(Y = -1 \ |\ X=x) + \I{f(x) = -1} P(Y= 1 \ |\ X=x) & ( \mbox{given } X=x,\ f(X) = 1 \mbox{ is deterministic}) \\ &= \I{f(x) = 1} [1 - P(Y = 1 \ |\ X=x)] + [ 1 - \I{f(x) = 1}]P(Y= 1 \ |\ X=x) & ( \mbox{by the law of total probability and question 1})\\ &= [1 - 2 P(Y= 1 \ |\ X=x)] \I{f(x) = 1} + P(Y= 1 \ |\ X=x) \end{align}
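The identity from question 2 is easy to check numerically. The sketch below (our own, with hypothetical values of $\eta = P(Y=1\ |\ X=x)$) compares the direct computation of the conditional error probability with the closed form just derived:

```python
# Numerical check of the identity from question 2:
#   P(f(X) != Y | X = x) = [1 - 2*eta] * 1{f(x) = 1} + eta,
# where eta = P(Y = 1 | X = x) and f(x) is in {-1, +1}.

def error_prob(pred, eta):
    """Direct computation: we misclassify x iff the true label
    differs from the (deterministic) prediction f(x) = pred."""
    return (1 - eta) if pred == 1 else eta

def error_prob_identity(pred, eta):
    """The closed form from question 2."""
    ind = 1 if pred == 1 else 0   # indicator 1{f(x) = 1}
    return (1 - 2 * eta) * ind + eta

# Check agreement for several values of eta and both predictions.
for eta in [0.0, 0.25, 0.5, 0.9, 1.0]:
    for pred in (-1, +1):
        assert abs(error_prob(pred, eta) - error_prob_identity(pred, eta)) < 1e-12
print("identity verified")
```

Both branches agree: for $f(x)=+1$ the two expressions equal $1-\eta$, and for $f(x)=-1$ they both equal $\eta$.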

- Express the decision function $t(x)$ of the Bayes classifier as a function of $P(Y=1\ |\ X=x)$.

$$ t(x) = \begin{cases} +1, & \mbox{if } P(Y= +1\ |\ X=x ) \geq P(Y= -1\ |\ X=x ) \\ -1, & \mbox{otherwise}. \end{cases} $$ By the law of total probability, we have $P(Y= -1\ |\ X=x ) = 1 - P(Y= 1\ |\ X=x )$ and thus $$ t(x) = \begin{cases} +1, & \mbox{if } P(Y= +1\ |\ X=x ) \geq \frac{1}{2} \\ -1, & \mbox{otherwise}. \end{cases} $$

- Show that for all $f$ and all $x\in\X$, we have
$$
P(f(X) \neq Y\ |\ X=x) - P(t(X) \neq Y\ |\ X=x) \geq 0 .
$$

From question 2, we have \begin{align} P(f(X) \neq Y\ |\ X=x) - P(t(X) \neq Y\ |\ X=x) = & [1 - 2 P(Y= 1\ |\ X=x)] \I{f(x) = 1} + P(Y= 1\ |\ X=x) \\ & - [1 - 2 P(Y= 1\ |\ X=x)] \I{t(x) = 1} - P(Y= 1\ |\ X=x ) \\ = &[1 - 2 P(Y= 1\ |\ X=x)] \ [ \I{f(x) = 1} - \I{t(x) = 1} ] . \end{align} We have written this quantity as a product of two factors, so its sign depends on the signs of both factors.

- If $1 - 2 P(Y= 1\ |\ X=x) > 0$, then $P(Y= 1\ |\ X=x) < \frac{1}{2}$ and, by question 3, $t(x) = -1$. Thus, $\I{t(x) = 1} = 0$ and in this case the second factor, $\I{f(x) = 1} - \I{t(x) = 1} = \I{f(x) = 1} \in\{0,1\}$, is nonnegative by definition of the indicator function, so the product is nonnegative.
- If $1 - 2 P(Y= 1\ |\ X=x) < 0$, then $P(Y= 1\ |\ X=x) > \frac{1}{2}$ and, by question 3, $t(x) = +1$. Thus, $\I{t(x) = 1} = 1$ and in this case the second factor, $\I{f(x) = 1} - \I{t(x) = 1} = \I{f(x) = 1} - 1 \in\{-1,0\}$, is nonpositive, so the product is again nonnegative.
- If $1 - 2 P(Y= 1\ |\ X=x) = 0$, the product is zero.

In all cases, $P(f(X) \neq Y\ |\ X=x) - P(t(X) \neq Y\ |\ X=x) \geq 0$.

- Conclude by showing that for all $f$, $$R(f)\geq R(t) .$$ Hint: $\E_{X,Y}[ f(X,Y)] = \E_X \E_{Y\ |\ X=x}[ f(X,Y)\ |\ X=x] $.
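A possible solution, following the hint and taking expectations of the pointwise inequality from question 4: \begin{align} R(f) &= P(f(X) \neq Y) = \E_X [ P(f(X) \neq Y\ |\ X=x) ] \\ &\geq \E_X [ P(t(X) \neq Y\ |\ X=x) ] = P(t(X) \neq Y) = R(t) , \end{align} where the inequality holds because, by question 4, the integrand on the first line dominates the one on the second line for every $x$, and expectations preserve pointwise inequalities. This proves that no binary classifier $f$ has a smaller risk than the Bayes classifier $t$.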