Linear classification refers to the case where the classifiers are based on linear (or affine) functions of the input. For instance, in binary classification, linear classifiers can be obtained by taking the sign of a linear function of the input.
Linear classifiers have an obvious algorithmic advantage over nonlinear ones due to their simplicity. In addition, they also offer a (rather extreme, though only partial) answer to the issue of overfitting.
A data set is said to be linearly separable if there exists a linear classifier that correctly classifies all the data in the set. Note that a problem need not be linearly separable for linear classifiers to yield satisfactory performance. However, more complex problems might call for nonlinear classification methods.
For binary problems, each linear classifier can be identified with a separating hyperplane: a flat surface dividing the input space into two halves. This hyperplane corresponds to the boundary between the two regions in which the classifier predicts one class or the other. In this case, checking whether a data set is linearly separable can be cast as a linear programming problem.
Perhaps the simplest algorithm for linear classification in the binary case is the perceptron. The Support Vector Machine is also based on a linear classifier, known as the optimal separating hyperplane.
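For concreteness, here is a minimal perceptron sketch in Python with NumPy. The function name, the learning rate, and the stopping rule are choices made for this illustration, not part of the text.

```python
import numpy as np

def perceptron(X, y, max_epochs=100, lr=1.0):
    """Minimal perceptron sketch: X is an (N, d) array, y an (N,) array of +/-1 labels.

    Returns (w, b) such that sign(w @ x + b) predicts y when the data are
    linearly separable; otherwise the loop simply stops after max_epochs passes."""
    N, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            # A point is misclassified when y_i (<w, x_i> + b) <= 0
            if yi * (w @ xi + b) <= 0:
                # Perceptron update: move the hyperplane toward the mistake
                w += lr * yi * xi
                b += lr * yi
                errors += 1
        if errors == 0:  # every point correctly classified: stop
            break
    return w, b

# Example on a small separable toy set (our own data, not from the text):
X = np.array([[0., 0.], [1., 2.], [2., 1.], [3., 3.]])
y = np.array([-1, 1, 1, 1])
w, b = perceptron(X, y)
print(w, b, np.sign(X @ w + b))  # the recovered signs match y
```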
A famous example of non-linearly separable data in dimension 2 is the data set made of binary features and labeled by the XOR (exclusive OR) function.
| $x_1$ | $x_2$ | $y = 2(x_1 \oplus x_2) - 1$ |
|-------|-------|-----------------------------|
| 0     | 0     | $-1$                        |
| 0     | 1     | $+1$                        |
| 1     | 0     | $+1$                        |
| 1     | 1     | $-1$                        |
Here is a linearly separable data set in dimension 3. In this case, the separating hyperplane is an ordinary plane.
We restrict the discussion to binary classification, leaving multi-class problems aside. A binary classifier with output in $\Y=\{-1,+1\}$ can be obtained by taking the sign of a real-valued function: $$ f(\g x) = \sign( g(\g x) ) \qquad \mbox{with }\ g : \X\rightarrow \R $$ where $$ \sign(u) = \begin{cases} +1 , & \mbox{if } u \geq 0 \\ -1,&\mbox{otherwise.} \end{cases} $$
In a linear classifier, the associated real-valued function $g$ is linear (or affine) in $\g x$: $$ g(\g x) = \inner{\g w}{\g x} + b $$ with parameters $\g w \in \X$ and $b\in\R$ ($\inner{\cdot}{\cdot}$ denotes the inner product in $\X$). This function $g$ also defines a separating surface, corresponding to the boundary between the two predicted classes, with equation $$ g(\g x) = 0 . $$ Given the linear form of $g$, this surface is a separating hyperplane $$ H = \{ \g x \in \X \ :\ \inner{\g w}{\g x} + b = 0\} $$
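As a sketch of how such a classifier is evaluated (in NumPy, with names of our own choosing), prediction amounts to computing $g(\g x) = \inner{\g w}{\g x} + b$ and taking its sign, with the convention $\sign(0)=+1$ used above.

```python
import numpy as np

def predict(X, w, b):
    """Linear classifier f(x) = sign(<w, x> + b), applied row-wise to X.

    Follows the convention sign(0) = +1 from the text."""
    g = X @ w + b                    # g(x) = <w, x> + b for each row of X
    return np.where(g >= 0, 1, -1)   # sign with sign(0) = +1

# Example: w = (1, 1), b = -1 defines the separating line x1 + x2 = 1
w = np.array([1.0, 1.0])
b = -1.0
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.2, 0.3]])
print(predict(X, w, b))   # -> [-1  1 -1]
```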
For one-dimensional problems ($\X \subseteq \R$), $H$ reduces to the single point $x = -b/w$. In dimension $d=2$, $H$ is a straight line, and in dimension $d=3$ it is a plane.
Exercise: Give a formal proof of the fact that the XOR data set is not linearly separable.
Exercise: Show that, for binary classification, a data set $\{(\g x_i,y_i)\}_{i=1}^N$ is linearly separable if the set of linear inequalities $$ y_i (\inner{\g w}{\g x_i} + b ) > 0,\ i=1,\dots,N $$ is feasible.
This exercise shows that linear separability can be checked by linear programming: since the inequalities are invariant under positive rescaling of $(\g w, b)$, the strict system above is feasible if and only if the system $y_i (\inner{\g w}{\g x_i} + b ) \geq 1,\ i=1,\dots,N$, is, and the feasibility of the latter is a standard linear programming problem.
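A possible sketch of this check, assuming SciPy's `linprog` is available (the rescaling to $\geq 1$ constraints and the function name are our own):

```python
import numpy as np
from scipy.optimize import linprog

def is_linearly_separable(X, y):
    """Check linear separability by linear programming (a sketch).

    The strict inequalities y_i (<w, x_i> + b) > 0 are feasible iff the
    rescaled system y_i (<w, x_i> + b) >= 1 is, so we test feasibility of
    the latter with a zero objective."""
    N, d = X.shape
    # Variables: z = (w_1, ..., w_d, b). In linprog form A_ub @ z <= b_ub:
    #   -y_i * (<w, x_i> + b) <= -1
    A_ub = -y[:, None] * np.hstack([X, np.ones((N, 1))])
    b_ub = -np.ones(N)
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (d + 1))
    return res.status == 0  # status 0: a feasible (w, b) was found

# XOR data set from the table above: reported as not separable.
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_xor = np.array([-1, 1, 1, -1])
print(is_linearly_separable(X_xor, y_xor))  # -> False
```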