The perceptron

In words...

The perceptron was one of the first learning algorithm for binary classification. It is a simple algorithm in the family of linear classifiers.

To classify an input pattern, i.e., assign a label or the other to it, the perceptron computes a weighted sum of the inputs and compares this sum to a threshold. The parameters that are learned from the training set are the weights of the sum and the threshold, which are related to the normal vector and the offset of the corresponding separating hyperplane (see also the linear classification page).

The training algorithm of the perceptron works online: the model is updated by taking into account a single data instance. Applying this update for all data in the training set results in a so-called epoch of the algorithm. Training a perceptron typically requires multiple epochs.

The small updates mentioned above are made so that the classifier gets closer to a correct classification on the given data instance. This also means that no updates are performed for examples that are already well-classified.

Therefore, the algorithm terminates when all data are correctly classified and no more updates occur. At this point, we say that the algorithm converged. Note that this also implies that the algorithm does not converge when applied to a non-linearly separable data set. This is why in practice we have to set a maximum number of epochs for training; this ensures that the algorithm terminates.

In pictures...

Here is an example data set in two dimensions. You can see the step-by-step training of the perceptron by clicking the buttons below.

You can also add data points to see how the algorithm reacts by clicking in the plot and using the mouse wheel to change the label (color) of the added point.

Classifier parameters :
$\g w = $
$b = $

Exercise: Experimentally observe the speed of convergence of the perceptron with respect to the margin of the data set. For instance, you can add points within the margin and run the algorithm again from scratch (after a reset) after each point addition. The number of required epochs should increase.

Demo application

You can see the perceptron in action in the pattern recognition demo.

In maths...

Classification function

For input vectors $\g x\in\R^d$, the decision rule implemented by a perceptron is $$ f(\g x) = \begin{cases} +1, & \mbox{if } \g w^T \g x \geq b \\ -1, & \mbox{otherwise} \end{cases} $$ with the trainable parameters $\g w\in \R^d$ (called the weights) and $b\in\R$ (note the slight difference with the definition of a linear classifier given in the linear classification page: the bias term $b$ should be $-b$ to recover a similar formula).

Training algorithm

The training algorithm loops over all the data $(\g x_i,y_i)$ in the training set to which it applies the following update rule: $$ \g w \leftarrow \begin{cases}\g w,\ &\mbox{if }f(\g x_i) = y_i\\ \g w + y_i\g x_i,\ &\mbox{otherwise } \end{cases} $$ in the case of a bias term fixed to $b=0$.

The rationale behind this update rule is that if the considered example $(\g x_i,y_i)$ is correctly classified, then no update is required, while in the case of a misclassification, the update makes the prediction closer to a correct one.
To see this, note that if the example is misclassified, then the updated classifier computes a prediction based on the positivity of $$ (\g w + y_i \g x_i)^T \g x_i = \g w^T \g x_i + y_i \g x_i^T\g x_i = \g w^T \g x_i + y_i \| \g x_i \|_2^2 . $$ If $f(\g x_i) \neq y_i = +1$, then $(\g w + y_i \g x_i)^T x_i \geq \g w^T \g x_i$, which is desired since the dot product $\g w^T \g x_i$ is negative and needs to increase in order to produce a correct classification.
Conversely, if $f(\g x_i) \neq y_i = -1$, then $(\g w + y_i \g x_i)^T x_i \leq \g w^T \g x_i$, which is desired since the dot product $\g w^T \g x_i$ is positive and needs to decrease in order to produce a correct classification.

Fixing the bias term to $b=0$ leads to a separating hyperplane that goes through the origin. In order to learn the bias term and relax this constraint, we can apply the same algorithm to extended input vectors $\tilde{\g x}_i=[\g x_i^T, -1]^T$ and parameter vector $\tilde{\g w} = [\g w^T, b]^T$, since $$ \tilde{\g w}^T\tilde{\g x}_i \geq 0\ \Leftrightarrow\ \g w^T \g x_i - b \geq 0 \ \Leftrightarrow\ \g w^T \g x_i \geq b . $$ This yields the update rules $$ \g w \leftarrow \begin{cases}\g w,\ &\mbox{if }f(\g x_i) = y_i\\ \g w + y_i \g x_i,\ &\mbox{otherwise } \end{cases} \quad \mbox{and } \quad b \leftarrow \begin{cases}b, \ &\mbox{if }f(\g x_i) = y_i\\ b - y_i,\ &\mbox{otherwise }. \end{cases} $$

Convergence of the training algorithm

The training procedure of the perceptron stops when no more updates occur over an epoch, which corresponds to the obtention of a model classifying correctly all the training data. As such, the algorithm cannot converge on non-linearly separable data sets. For such cases, the implementation should include a maximum number of epochs.

For linearly separable training sets, the perceptron can be shown to converge with a number of iterations (more precisely, of effective updates) bounded by the ratio $$ \frac{R^2}{\gamma^2} $$ where $R$ is the radius of the enclosing $\ell_2$-ball of the data set and $\gamma$ its margin.

Exercise: Show that, if there exists a perceptron $f^*$ of parameters $(\g w^*, b^*)$ such that $\|\g w^*\| = 1$, $b^*=0$, and $f(\g x_i ) = y_i$ with $|{\g w^*}^T \g x_i| \geq \gamma$, $i=1,\dots,N$, then the perceptron algorithm initialized with $\g w^0=0$, $b^0=0$, converges after no more than $R^2 / \gamma^2$ updates of the weights $\g w$ ($b$ is left unchanged).

This can be shown with the following steps.
  1. Show that for all effective update, we have $$ {\g w^*}^T \g w^{k+1} \geq {\g w^*}^T \g w^{k} + \gamma , $$ where $\g w^k$ is the weight vector after $k$ updates.

  2. Use the result above to derive a lower bound on ${\g w^*}^T \g w^{k}$ that involves only $k$ and $\gamma$.

  3. Show that for all effective update, we have $$ \|\g w^{k+1} \|^2 \leq \|\g w^{k} \|^2 + R^2 $$

  4. Use the inequality above to derive an upper bound on $\|\g w^{k}\|^2$ that involves only $k$ and $R$.

  5. Conclude that the number of iterations cannot exceed $R^2/\gamma^2$.

Tricks and practical implementation

In order to improve the convergence, a learning rate $\mu\in(0,1)$ is typically included in the update rule, which becomes: $$ \g w \leftarrow \begin{cases}\g w,\ &\mbox{if }f(\g x_i) = y_i\\ \g w + \mu y_i \g x_i,\ &\mbox{otherwise } \end{cases} \quad \mbox{and } \quad b \leftarrow \begin{cases}b, \ &\mbox{if }f(\g x_i) = y_i\\ b - \mu y_i,\ &\mbox{otherwise }. \end{cases} $$ To see how this helps, consider the updates of the bias term $b$, which are always $+1$ or $-1$ in the original algorithm. Therefore, the possible values for $b$ are very limited. Though this does not constrain the set of possible classifiers (since there are an infinite set of parameters $(\g w,b)$ leading to the same separating hyperplane as explained here), introducing the learning rate can increase the speed of convergence towards a satisfactory solution.