Linear classification refers to the case where the classifiers are based on linear (or affine) functions of the input. For instance, in binary classification, linear classifiers can be obtained by taking the sign of a linear function of the input.
Linear classifiers have an obvious algorithmic advantage over nonlinear ones due to their simplicity. In addition, they also offer a (rather extreme, though only partial) answer to the issue of overfitting.
A data set is said to be linearly separable if there exists a linear classifier that correctly classifies all the data in the set. Note that a problem need not be linearly separable for linear classifiers to yield satisfactory performance. However, more complex problems might call for nonlinear classification methods.
For binary problems, each linear classifier can be identified with a separating hyperplane: a flat surface dividing the input space into two halves. This hyperplane corresponds to the boundary between the two regions in which the classifier predicts one class or the other. In this case, checking whether a data set is linearly separable can be cast as a linear programming problem.
Perhaps the simplest algorithm for linear classification in the binary case is the perceptron. The Support Vector Machine is also based on a linear classifier, known as the optimal separating hyperplane.
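For concreteness, here is a minimal perceptron sketch in Python with NumPy. The function name, the learning rate, and the stopping rule are choices made for this illustration, not part of the text.

```python
import numpy as np

def perceptron(X, y, max_epochs=100, lr=1.0):
    """Minimal perceptron sketch: X is an (N, d) array, y an (N,) array of +/-1 labels.

    Returns (w, b) such that sign(w @ x + b) predicts y when the data are
    linearly separable; otherwise the loop simply stops after max_epochs passes."""
    N, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            # A point is misclassified when y_i (<w, x_i> + b) <= 0
            if yi * (w @ xi + b) <= 0:
                # Perceptron update: move the hyperplane toward the mistake
                w += lr * yi * xi
                b += lr * yi
                errors += 1
        if errors == 0:  # every point correctly classified: stop
            break
    return w, b

# Example on a small separable toy set (our own data, not from the text):
X = np.array([[0., 0.], [1., 2.], [2., 1.], [3., 3.]])
y = np.array([-1, 1, 1, 1])
w, b = perceptron(X, y)
print(w, b, np.sign(X @ w + b))  # the recovered signs match y
```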
A famous example of non-linearly separable data in dimension 2 is the data set made of binary features and labeled by the XOR (exclusive OR) function.
| $x_1$ | $x_2$ | $y = 2(x_1 \oplus x_2) - 1$ |
|-------|-------|-----------------------------|
| 0     | 0     | $-1$                        |
| 0     | 1     | $+1$                        |
| 1     | 0     | $+1$                        |
| 1     | 1     | $-1$                        |
Here is a linearly separable data set in dimension 3. In this case, the separating hyperplane is an ordinary plane.
We restrict the discussion to binary classification, leaving multi-class problems aside. A binary classifier with output in $\Y=\{-1,+1\}$ can be obtained by taking the sign of a real-valued function: $$ f(\g x) = \sign( g(\g x) ) \qquad \mbox{with }\ g : \X\rightarrow \R $$ where $$ \sign(u) = \begin{cases} +1 , & \mbox{if } u \geq 0 \\ -1,&\mbox{otherwise.} \end{cases} $$
In a linear classifier, the associated real-valued function $g$ is linear (or affine) in $\g x$: $$ g(\g x) = \inner{\g w}{\g x} + b $$ with parameters $\g w \in \X$ and $b\in\R$ ($\inner{\cdot}{\cdot}$ denotes the inner product in $\X$). This function $g$ also defines a separating surface, corresponding to the boundary between the two predicted classes, with equation $$ g(\g x) = 0 . $$ Given the linear form of $g$, this surface is a separating hyperplane $$ H = \{ \g x \in \X \ :\ \inner{\g w}{\g x} + b = 0\} $$
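As a sketch of how such a classifier is evaluated (in NumPy, with names of our own choosing), prediction amounts to computing $g(\g x) = \inner{\g w}{\g x} + b$ and taking its sign, with the convention $\sign(0)=+1$ used above.

```python
import numpy as np

def predict(X, w, b):
    """Linear classifier f(x) = sign(<w, x> + b), applied row-wise to X.

    Follows the convention sign(0) = +1 from the text."""
    g = X @ w + b                    # g(x) = <w, x> + b for each row of X
    return np.where(g >= 0, 1, -1)   # sign with sign(0) = +1

# Example: w = (1, 1), b = -1 defines the separating line x1 + x2 = 1
w = np.array([1.0, 1.0])
b = -1.0
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.2, 0.3]])
print(predict(X, w, b))   # -> [-1  1 -1]
```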
For one-dimensional problems ($\X \subseteq \R$), $H$ reduces to the single point $x = -b/w$. In dimension $d=2$, $H$ is a straight line, and in dimension $d=3$ it is a plane.
Exercise: Give a formal proof of the fact that the XOR data set is not linearly separable.
Exercise: Show that, for binary classification, a data set $\{(\g x_i,y_i)\}_{i=1}^N$ is linearly separable if the set of linear inequalities $$ y_i (\inner{\g w}{\g x_i} + b ) > 0,\ i=1,\dots,N $$ is feasible.
This exercise shows that linear separability can be checked by linear programming: since the inequalities are invariant under positive rescaling of $(\g w, b)$, the strict system above is feasible if and only if the system $y_i (\inner{\g w}{\g x_i} + b ) \geq 1,\ i=1,\dots,N$, is, and the feasibility of the latter is a standard linear programming problem.
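A possible sketch of this check, assuming SciPy's `linprog` is available (the rescaling to $\geq 1$ constraints and the function name are our own):

```python
import numpy as np
from scipy.optimize import linprog

def is_linearly_separable(X, y):
    """Check linear separability by linear programming (a sketch).

    The strict inequalities y_i (<w, x_i> + b) > 0 are feasible iff the
    rescaled system y_i (<w, x_i> + b) >= 1 is, so we test feasibility of
    the latter with a zero objective."""
    N, d = X.shape
    # Variables: z = (w_1, ..., w_d, b). In linprog form A_ub @ z <= b_ub:
    #   -y_i * (<w, x_i> + b) <= -1
    A_ub = -y[:, None] * np.hstack([X, np.ones((N, 1))])
    b_ub = -np.ones(N)
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (d + 1))
    return res.status == 0  # status 0: a feasible (w, b) was found

# XOR data set from the table above: reported as not separable.
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_xor = np.array([-1, 1, 1, -1])
print(is_linearly_separable(X_xor, y_xor))  # -> False
```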