The Bayes classifier is the theoretically optimal classifier for a given classification problem. This is why it is also called the target classifier: it is the classifier we aim for when using learning algorithms.
For a given input pattern, the Bayes classifier outputs the most likely label, and thus provides the prediction that is the least likely to be an error among all possible choices of label. Since the Bayes classifier applies this scheme to all possible inputs, it achieves the smallest probability of error, a quantity known as the Bayes' risk: the risk of the Bayes classifier, which is by definition the smallest risk one can obtain for a given problem.
Note that the Bayes classifier requires knowledge of the class-membership probabilities, which are precisely what is unknown in a learning problem. This is why the Bayes classifier cannot be implemented in practice. However, it plays an important role in the analysis of other learning algorithms.
The Bayes classifier is the best classifier among all possible classifiers. Another theoretically important classifier is the best classifier among a given set of classifiers.
[Interactive figure: the class-membership probabilities $P(Y = 1\ |\ X = x)$, $P(Y = 2\ |\ X = x)$ and $P(Y = 3\ |\ X = x)$ plotted as functions of $x$; clicking on the picture picks another $x$ and displays the corresponding values.]
For a given value of $x$, the Bayes classifier prediction is the most likely label, but it can still be wrong, with probability $1 - \max_{y\in\Y} P(Y=y\ |\ X=x)$. Averaging this probability of misclassification over all possible $x$ gives the Bayes' risk. In the case of a uniform distribution for $X$, the risk is proportional to the area between 1 and the pointwise maximal probability.
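To make this area intuition concrete, here is a minimal Python sketch that estimates the Bayes' risk of a 3-class problem with $X$ uniform on $[0, 1]$ by averaging the pointwise error $1 - \max_{y} P(Y = y\ |\ X = x)$ over a grid. The three posterior curves below are made-up assumptions for illustration, not the ones from the figure:

```python
import numpy as np

# Hypothetical 3-class problem on [0, 1]; the posterior shapes are
# illustrative assumptions only.
x = np.linspace(0.0, 1.0, 10_001)

# Unnormalized class scores; normalizing turns them into valid
# class-membership probabilities P(Y = y | X = x) summing to 1.
scores = np.stack([
    np.exp(-((x - 0.2) ** 2) / 0.02),   # class 1 concentrated near x = 0.2
    np.exp(-((x - 0.5) ** 2) / 0.02),   # class 2 concentrated near x = 0.5
    np.exp(-((x - 0.8) ** 2) / 0.02),   # class 3 concentrated near x = 0.8
])
posteriors = scores / scores.sum(axis=0)

# Pointwise error of the Bayes classifier: 1 - max_y P(Y = y | X = x).
pointwise_error = 1.0 - posteriors.max(axis=0)

# For X uniform on [0, 1], the Bayes' risk is the average pointwise error,
# i.e., the area between 1 and the pointwise maximal probability.
bayes_risk = pointwise_error.mean()
print(f"Estimated Bayes' risk: {bayes_risk:.4f}")
```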
The Bayes classifier, denoted $t$, is defined by $$ \forall x\in\X , \quad t(x) = \arg\max_{y\in\Y} P(Y=y\ |\ X=x ) . $$ In the binary case where $\Y = \{-1, +1\}$, this simplifies to $$ t(x) = \begin{cases} +1, & \mbox{if }\ P(Y= +1\ |\ X=x ) \geq P(Y= -1\ |\ X=x ) \\ -1, & \mbox{otherwise}. \end{cases} $$
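As an illustration, here is a small Python sketch of this definition, assuming (hypothetically) that the class-membership probabilities are available through a function `posterior(x, y)`. In practice they are unknown, which is why this classifier cannot be implemented:

```python
import numpy as np

# A sketch of the Bayes classifier, assuming access to the class-membership
# probabilities through posterior(x, y) = P(Y = y | X = x). This is a
# hypothetical interface: in practice these probabilities are unknown.
def bayes_classifier(posterior, labels):
    def t(x):
        probs = [posterior(x, y) for y in labels]
        return labels[int(np.argmax(probs))]
    return t

# Binary special case: predict +1 iff P(Y = +1 | X = x) >= 1/2,
# matching the thresholded form of the definition above.
def binary_bayes_classifier(eta):
    # eta(x) = P(Y = +1 | X = x)
    return lambda x: +1 if eta(x) >= 0.5 else -1

# Example with a made-up posterior: the label +1 is always more likely.
t = binary_bayes_classifier(lambda x: 0.7)
print(t(0.0))  # prints 1
```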
The risk of this classifier, known as the Bayes' risk, is given by $$ R(t) = P(Y \neq t(X) ) = \E_X [ P(Y \neq t(x)\ |\ X=x ) ] $$ where the probability of misclassifying a given $x$ can be rewritten, by considering the complementary event, as $$ P(Y \neq t(x)\ |\ X=x ) = 1 - P(Y = t(x)\ |\ X=x ) . $$ Since the Bayes classifier assigns the label with maximal probability, we also have $$ P(Y = t(x)\ |\ X=x ) = \max_{y\in\Y} P(Y=y\ |\ X=x ) $$ and thus $$ R(t) = \E_{X} [1 - \max_{y\in\Y} P(Y=y\ |\ X=x ) ] . $$
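For instance, in a binary problem where $P(Y = +1\ |\ X = x) = 0.7$ for every $x$, the Bayes classifier always predicts $+1$ and its risk is $R(t) = \E_X[1 - 0.7] = 0.3$: even the optimal classifier is wrong 30% of the time, because the labels themselves are noisy.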
Note that in the deterministic case, where $P(Y=y\ |\ X=x )\in\{0,1\}$ for all $x$ and $y$, the Bayes' risk is zero: the Bayes classifier perfectly classifies all inputs.
As an exercise, we will now prove the optimality of the Bayes classifier in binary classification, i.e., when $\Y = \{-1, +1\}$.
Exercise

Question 1: Show that, for any binary classifier $f$ and any input $x$, $$\I{f(x) = -1} = 1 - \I{f(x) = 1} .$$ This holds since $f(x)$ takes exactly one of the two values $-1$ and $+1$, so exactly one of the two indicators equals one.
Question 2: For an arbitrary classifier $f$, express the probability of error at a point $x$ as a function of $\I{f(x) = 1}$ and $P(Y = 1\ |\ X = x)$.
\begin{align} P(f(X) &\neq Y\ |\ X = x) \\ &= P( \{f(X) = 1, Y = -1 \} \cup \{ f(X) = -1, Y = 1\} \ |\ X=x) & (\{f(X) \neq Y\} \mbox{ is the union of two events}) \\ &= P(f(X) = 1, Y = -1 \ |\ X=x) + P(f(X) = -1, Y = 1\ |\ X=x) & ( \mbox{ these events are disjoint}) \\ &= P(f(X) = 1\ |\ X=x) P(Y = -1 \ |\ X=x) + P(f(X) = -1\ |\ X=x) P( Y = 1\ |\ X=x) & (f(X) = 1 \mbox{ and } Y= -1 \mbox{ are conditionally independent}) \\ &= \I{f(x) = 1} P(Y = -1 \ |\ X=x) + \I{f(x) = -1} P(Y= 1 \ |\ X=x) & ( \mbox{given } X=x,\ f(X) = 1 \mbox{ is deterministic}) \\ &= \I{f(x) = 1} [1 - P(Y = 1 \ |\ X=x)] + [ 1 - \I{f(x) = 1}]P(Y= 1 \ |\ X=x) & ( \mbox{by the law of total probability and question 1})\\ &= [1 - 2 P(Y= 1 \ |\ X=x)] \I{f(x) = 1} + P(Y= 1 \ |\ X=x)
\end{align}
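The identity just derived can be checked numerically. The short Python sketch below (not part of the exercise) compares it against the direct computation of the error probability, for a few random values of $\eta = P(Y = 1\ |\ X = x)$ and both choices of $f(x)$:

```python
import numpy as np

# Check: P(f(X) != Y | X = x) == (1 - 2*eta) * 1{f(x) = 1} + eta,
# where eta = P(Y = 1 | X = x).
rng = np.random.default_rng(0)
for eta in rng.uniform(0.0, 1.0, size=5):
    for fx in (+1, -1):
        direct = (1.0 - eta) if fx == 1 else eta        # P(Y != f(x) | X = x)
        formula = (1.0 - 2.0 * eta) * (fx == 1) + eta   # derived expression
        assert np.isclose(direct, formula)
print("Identity verified for random values of eta.")
```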
Question 3: The expression above is an affine function of $\I{f(x) = 1}$, so the probability of error at $x$ is minimized by setting $\I{f(x) = 1} = 1$ exactly when the factor $1 - 2 P(Y= 1 \ |\ X=x)$ is nonpositive, i.e., when $P(Y= 1 \ |\ X=x) \geq \frac{1}{2}$. This is precisely the Bayes classifier $$ t(x) = \begin{cases} +1, & \mbox{if } P(Y= +1\ |\ X=x ) \geq P(Y= -1\ |\ X=x ) \\ -1, & \mbox{otherwise}, \end{cases} $$ since, by the law of total probability, $P(Y= -1\ |\ X=x ) = 1 - P(Y= 1\ |\ X=x )$ and thus $$ t(x) = \begin{cases} +1, & \mbox{if } P(Y= +1\ |\ X=x ) \geq \frac{1}{2} \\ -1, & \mbox{otherwise}. \end{cases} $$