In words...

In classification, the learning algorithm is used to discriminate between a finite number of categories of patterns. In this setting, the model is a classifier which predicts labels corresponding to the categories (often identified with $Q$ integers for a problem with $Q$ categories).

For example, we could build a classifier that given the picture of a fruit tells us if it is an apple or a banana. Such a classifier can be trained to recognize the fruits correctly from a set of labeled pictures.

When the number of categories is 2, then we have a binary classification problem, which we distinguish from multi-class problems that have more than 2 categories.

The typical loss function used in classification is the 0-1 loss, which merely assigns a cost of 1 to misclassifications and 0 to correct predictions. Given that loss function, the risk of a classifier corresponds to its probabilty of misclassifying a random pattern.

The optimal classifier minimizing this probability is the Bayes classifier, which assigns a pattern to the most likely category. But in order to do so, the classifier needs access to the probability distribution of the data, which is unknown by assumption. Therefore, we have to use learning algorithms that aim at this optimal classifier but will not be able to obtain it (unless we are given an infinite number of training data).

Perhaps the most straightforward algorithm for classification is the $K$-nearest-beighbors algorithm. Another simple and very famous algorithm at the origin of the family of neural networks is the perceptron. More recently, the Support Vector Machine became very popular.

Classification algorithms can be distinguished according to whether they build linear classifiers or nonlinear classifiers. Depending on how these classifiers are built, the algorithms are said to be discriminative (as the ones mentioned above) or generative.

In pictures...

A small pattern recognition demo

This is a simple demo of binary classification, in which the machine tries to recognize patterns drawn by hand. It shows how one can build a training set by labeling examples and then make predictions for pictures of unknown patterns.

This demo implements the following classifiers: $K$-nearest neighbors, the perceptron and the support vector machine.

Build a training set

Start by drawing a pattern and .
Then, draw a different pattern and .
Do this multiple times to collect a sample of drawings (of these 2 different patterns).

Make predictions

When you have enough data, draw a new example and ask the machine to or or .

At any time, you can or

Training set:

More examples

This other demo shows how a generative classifier known as the naive Bayes classifier can be used to filter spams in mailboxes.

In maths...

In classification, the labels belong to a finite set of categories, which can be identified with the set of integers $\Y = \{1,\dots,Q\}$. In this case, the most common choice is to use the 0-1 loss function, which is defined as the indicator function of the event corresponding to a misclassification: $$ \ell(y, \hat{y}) = \I{\hat{y} \neq y} . $$ Using this loss function, the risk of a classifier $f$ becomes the probability of misclassification by this classifier: $$ R(f) = \E_{X,Y} [ \I{f(X) \neq Y} ] = P_{X,Y} (f(X) \neq Y) $$ since it computes the expectation of the indicator function.

Binary classification

When $Q=2$ we have a binary classification problem and it is more convenient to use $Y\in\Y = \{-1,+1\}$. With this convention, any binary classifier $f : \X \rightarrow \{-1,+1\}$ can be defined via a real-valued function $g : \X \rightarrow \R$ as $$ f(\g x) = \sign( g (\g x) ) . $$ Many classification algorithms focus on learning $g$ rather than $f$ directly.