The perceptron was one of the first learning algorithms for binary classification. It is a simple algorithm in the family of linear classifiers.
To classify an input pattern, i.e., assign one label or the other to it, the perceptron computes a weighted sum of the inputs and compares this sum to a threshold. The parameters learned from the training set are the weights of the sum and the threshold, which are related to the normal vector and the offset of the corresponding separating hyperplane (see also the linear classification page).
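For illustration, this decision rule can be sketched in a few lines of Python (a minimal sketch, not the code behind the demo below; the names `predict`, `w`, `b`, and `x` are placeholders):

```python
import numpy as np

# Minimal sketch of the perceptron decision rule: compare the weighted sum
# of the inputs to the threshold b (names are illustrative).
def predict(w, b, x):
    return 1 if np.dot(w, x) >= b else -1

# Example: a 2D input classified with weights (1, 2) and threshold 0.5.
print(predict(np.array([1.0, 2.0]), 0.5, np.array([0.2, 0.3])))   # prints 1
```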
The training algorithm of the perceptron works online: the model is updated by taking into account a single data instance at a time. Applying this update to all data in the training set constitutes a so-called epoch of the algorithm. Training a perceptron typically requires multiple epochs.
These updates are made so that the classifier gets closer to correctly classifying the given data instance. This also means that no update is performed for an example that is already correctly classified.
Therefore, the algorithm terminates when all data are correctly classified and no more updates occur. At this point, we say that the algorithm has converged. Note that this also implies that the algorithm does not converge when applied to a non-linearly separable data set. This is why, in practice, we have to set a maximum number of epochs for training; this ensures that the algorithm terminates.
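A minimal sketch of this procedure in Python, assuming the inputs are stored in an array X of shape (N, d), the labels y take values in {-1, +1}, and the bias is fixed to $b=0$ (the update inside the loop anticipates the rule detailed further below):

```python
import numpy as np

# Minimal sketch of perceptron training with b = 0: an epoch loops over all
# instances, and training stops after an epoch without updates or after
# max_epochs epochs.
def train_perceptron(X, y, max_epochs=100):
    w = np.zeros(X.shape[1])
    for epoch in range(max_epochs):
        updates = 0
        for x_i, y_i in zip(X, y):           # online: one instance at a time
            if np.sign(w @ x_i) != y_i:      # misclassified (sign(0) counts as an error)
                w += y_i * x_i               # update rule detailed below
                updates += 1
        if updates == 0:                     # converged: a full epoch without updates
            return w, epoch
    return w, max_epochs                     # stopped by the epoch cap
```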
Here is an example data set in two dimensions. You can see the step-by-step training of the perceptron by clicking the buttons below.
You can also add data points by clicking in the plot (use the mouse wheel to change the label, i.e., the color, of the added point) and see how the algorithm reacts.
Exercise: Experimentally observe how the speed of convergence of the perceptron depends on the margin of the data set. For instance, add points within the margin and, after each addition, reset and run the algorithm again from scratch. The number of required epochs should increase.
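If you prefer to script this experiment rather than use the plot, here is a hedged sketch: it builds linearly separable data sets with a controllable margin, trains the perceptron from scratch on each, and reports the number of epochs until convergence (the data-generation scheme and all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def train_epochs(X, y, max_epochs=1000):
    # Returns the number of epochs needed for convergence (b fixed to 0).
    w = np.zeros(X.shape[1])
    for epoch in range(1, max_epochs + 1):
        updates = 0
        for x_i, y_i in zip(X, y):
            if np.sign(w @ x_i) != y_i:
                w += y_i * x_i
                updates += 1
        if updates == 0:
            return epoch
    return max_epochs

def make_data(margin, n=100):
    # Points drawn in a square, labeled by a fixed separator through the origin;
    # points closer than 'margin' to the separating hyperplane are discarded.
    w_star = np.array([1.0, 1.0]) / np.sqrt(2)
    X = rng.uniform(-1, 1, size=(n, 2))
    X = X[np.abs(X @ w_star) >= margin]
    y = np.sign(X @ w_star)
    return X, y

for margin in [0.5, 0.2, 0.05]:
    X, y = make_data(margin)
    print(f"margin >= {margin}: {train_epochs(X, y)} epoch(s)")
```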
The training algorithm loops over all the data $(\g x_i,y_i)$ in the training set and applies to each of them the following update rule: $$ \g w \leftarrow \begin{cases}\g w,\ &\mbox{if }f(\g x_i) = y_i\\ \g w + y_i\g x_i,\ &\mbox{otherwise } \end{cases} $$ in the case of a bias term fixed to $b=0$.
The rationale behind this update rule is that if the considered example $(\g x_i,y_i)$ is correctly classified, then no update is required, while in the case of a misclassification, the update makes the prediction closer to a correct one.
To see this, note that if the example is misclassified, then the updated classifier computes a prediction based on the positivity of
$$
(\g w + y_i \g x_i)^T \g x_i = \g w^T \g x_i + y_i \g x_i^T\g x_i = \g w^T \g x_i + y_i \| \g x_i \|_2^2 .
$$
If $f(\g x_i) \neq y_i = +1$, then $(\g w + y_i \g x_i)^T \g x_i \geq \g w^T \g x_i$, which is desired since the dot product $\g w^T \g x_i$ is negative and needs to increase in order to produce a correct classification.
Conversely, if $f(\g x_i) \neq y_i = -1$, then $(\g w + y_i \g x_i)^T \g x_i \leq \g w^T \g x_i$, which is desired since the dot product $\g w^T \g x_i$ is positive and needs to decrease in order to produce a correct classification.
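A quick numerical illustration of this argument, with arbitrarily chosen values:

```python
import numpy as np

# Arbitrary example: x_i is misclassified as negative although y_i = +1.
w = np.array([1.0, -2.0])
x_i = np.array([1.0, 1.0])
y_i = +1

print(w @ x_i)           # -1.0 < 0, so f(x_i) = -1 != y_i
w_new = w + y_i * x_i    # perceptron update
print(w_new @ x_i)       # 1.0: increased by ||x_i||^2 = 2
```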
Fixing the bias term to $b=0$ leads to a separating hyperplane that goes through the origin. In order to learn the bias term and relax this constraint, we can apply the same algorithm to extended input vectors $\tilde{\g x}_i=[\g x_i^T, -1]^T$ and parameter vector $\tilde{\g w} = [\g w^T, b]^T$, since $$ \tilde{\g w}^T\tilde{\g x}_i \geq 0\ \Leftrightarrow\ \g w^T \g x_i - b \geq 0 \ \Leftrightarrow\ \g w^T \g x_i \geq b . $$ This yields the update rules $$ \g w \leftarrow \begin{cases}\g w,\ &\mbox{if }f(\g x_i) = y_i\\ \g w + y_i \g x_i,\ &\mbox{otherwise } \end{cases} \quad \mbox{and } \quad b \leftarrow \begin{cases}b, \ &\mbox{if }f(\g x_i) = y_i\\ b - y_i,\ &\mbox{otherwise }. \end{cases} $$
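This reformulation is straightforward to implement; here is a hedged Python sketch in which the inputs are augmented with a constant $-1$ component so that the last weight plays the role of $b$ (names are illustrative):

```python
import numpy as np

# Append -1 to every input so that the last component of the weight vector
# plays the role of the bias b, as in the equivalence above.
def augment(X):
    return np.hstack([X, -np.ones((X.shape[0], 1))])

def train_with_bias(X, y, max_epochs=100):
    Xt = augment(X)                        # extended inputs [x_i^T, -1]^T
    wt = np.zeros(Xt.shape[1])             # extended parameters [w^T, b]^T
    for _ in range(max_epochs):
        updates = 0
        for x_i, y_i in zip(Xt, y):
            if np.sign(wt @ x_i) != y_i:
                wt += y_i * x_i            # updates w by y_i x_i and b by -y_i
                updates += 1
        if updates == 0:
            break
    w, b = wt[:-1], wt[-1]
    return w, b
```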
The training procedure of the perceptron stops when no more updates occur over an epoch, which corresponds to obtaining a model that correctly classifies all the training data. As such, the algorithm cannot converge on non-linearly separable data sets. For such cases, the implementation should include a maximum number of epochs.
For linearly separable training sets, the perceptron can be shown to converge with a number of iterations (more precisely, of effective updates) bounded by the ratio $$ \frac{R^2}{\gamma^2} $$ where $R$ is the radius of the enclosing $\ell_2$-ball of the data set and $\gamma$ its margin.
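The bound is easy to evaluate once a unit-norm separator is known; the following sketch computes it on a small, hypothetical data set (both the points and $\g w^*$ are chosen for illustration only):

```python
import numpy as np

# Hypothetical linearly separable data set with a known unit-norm separator.
w_star = np.array([0.6, 0.8])                      # ||w_star|| = 1
X = np.array([[2.0, 1.0], [-1.0, -2.0], [1.5, 2.0], [-2.0, -0.5]])
y = np.sign(X @ w_star)

R = np.max(np.linalg.norm(X, axis=1))              # radius of the enclosing l2-ball
gamma = np.min(np.abs(X @ w_star))                 # margin with respect to w_star
print("bound on the number of updates:", R**2 / gamma**2)
```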
Exercise: Show that, if there exists a perceptron $f^*$ of parameters $(\g w^*, b^*)$ such that $\|\g w^*\| = 1$, $b^*=0$, and $f^*(\g x_i ) = y_i$ with $|{\g w^*}^T \g x_i| \geq \gamma$, $i=1,\dots,N$, then the perceptron algorithm initialized with $\g w^0=0$, $b^0=0$, converges after no more than $R^2 / \gamma^2$ updates of the weights $\g w$ ($b$ is left unchanged).
Step 1: If an update occurs on the example $(\g x_i, y_i)$, we have $\g w^{k+1} = \g w^k + y_i \g x_i$ and thus $$ {\g w^*}^T \g w^{k+1} = {\g w^*}^T \g w^k + y_i {\g w^*}^T \g x_i . $$ On the other hand, $f^*$ correctly classifies $\g x_i$, such that if $y_i = +1$, then ${\g w^*}^T \g x_i \geq \gamma$, while if $y_i=-1$, then ${\g w^*}^T \g x_i \leq -\gamma$. Hence, in both cases, we have $y_i {\g w^*}^T \g x_i \geq \gamma$, and therefore $$ {\g w^*}^T \g w^{k+1} \geq {\g w^*}^T \g w^k + \gamma . $$
Step 2: Using the previous inequality, we can obtain the bound by writing $$ {\g w^*}^T \g w^{k} \geq {\g w^*}^T\g w^{k-1} + \gamma \geq {\g w^*}^T\g w^{k-2} + 2\gamma \geq \dots \geq {\g w^*}^T\g w^{0} + k\gamma . $$ With the initialization $\g w^0=0$, we obtain the lower bound $$ {\g w^*}^T\g w^{k} \geq k \gamma . $$ A formal proof works by induction. The bound obviously holds at $k=0$ due to the initialization $\g w^0= 0$. Now, assume it holds at an arbitrary iteration $k$, i.e., that ${\g w^*}^T \g w^{k} \geq k \gamma$. Applying the inequality of Step 1 yields $$ {\g w^*}^T \g w^{k+1} \geq {\g w^*}^T \g w^{k} + \gamma \geq k\gamma + \gamma = (k+1)\gamma , $$ and the bound also holds at $k+1$. Thus, the bound holds for any $k\geq 0$.
Step 3: We now bound the growth of the norm of the weight vector and show that $\|\g w^{k+1}\|^2 \leq \|\g w^{k}\|^2 + R^2$. The left-hand side of this inequality can be computed as \begin{align} \|\g w^{k+1} \|^2 &= \|\g w^{k} + y_i\g x_i \|^2 = (\g w^{k} + y_i\g x_i)^T (\g w^{k} + y_i\g x_i) = {\g w^{k}}^T\g w^{k} + (y_i\g x_i)^T(y_i\g x_i) + 2{\g w^{k}}^T y_i\g x_i \\ &= \|\g w^{k}\|^2 + \|y_i \g x_i \|^2 + 2y_i {\g w^{k}}^T \g x_i . \end{align} If an update occurs, this means that $\g x_i$ is misclassified by $\g w^k$ and that $y_i$ and ${\g w^{k}}^T \g x_i$ have opposite signs (or that ${\g w^{k}}^T \g x_i = 0$). Thus, $y_i {\g w^{k}}^T \g x_i \leq 0$ and $$ \|\g w^{k+1} \|^2 \leq \|\g w^{k}\|^2 + \|y_i\g x_i \|^2 . $$ Since $y_i\in\{+1,-1\}$, we have $\|y_i\g x_i \|^2 = \|\g x_i \|^2 \leq R^2$. Therefore, $$ \|\g w^{k+1} \|^2 \leq \|\g w^{k}\|^2 + R^2 . $$
Step 4: Using the previous inequality, we can obtain the bound by writing $$\|\g w^k\|^2 \leq \|\g w^{k-1}\|^2 + R^2 \leq \|\g w^{k-2}\|^2 + 2R^2 \leq \dots \leq \|\g w^0\|^2 + k R^2 . $$ With the initialization $\g w^0=0$, this yields $$ \|\g w^k\|^2 \leq k R^2 , $$ which can be proved formally by induction as in Step 2.
Step 5: The lower bound of Step 2 yields $$ k\gamma \leq {\g w^*}^T \g w^{k} \leq \| {\g w^*} \| \| \g w^{k} \| , $$ where we used the Cauchy-Schwarz inequality, $\g u^T \g v \leq \|\g u\| \|\g v\|$. Using the fact that $\| {\g w^*} \|=1$, this gives $k\gamma \leq \| \g w^{k} \|$ and $$ k^2\gamma^2 \leq \| \g w^{k} \|^2 . $$ Plugging in the inequality of Step 4 then leads to $$ k^2\gamma^2 \leq k R^2 $$ and thus $$ k \leq \frac{R^2}{\gamma^2} . $$
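To make the proof concrete, the following sketch checks the two intermediate inequalities, ${\g w^*}^T \g w^{k} \geq k\gamma$ and $\|\g w^{k}\|^2 \leq kR^2$, after every update during a run on a small illustrative data set (all names and values are assumptions made for the example):

```python
import numpy as np

w_star = np.array([0.6, 0.8])                      # known unit-norm separator
X = np.array([[2.0, 1.0], [-1.0, -2.0], [1.5, 2.0], [-2.0, -0.5]])
y = np.sign(X @ w_star)

R = np.max(np.linalg.norm(X, axis=1))
gamma = np.min(np.abs(X @ w_star))

w, k = np.zeros(X.shape[1]), 0                     # w^0 = 0, k counts updates
done = False
while not done:
    done = True
    for x_i, y_i in zip(X, y):
        if np.sign(w @ x_i) != y_i:
            w, k = w + y_i * x_i, k + 1
            done = False
            # Both inequalities from the proof hold after every update:
            assert w_star @ w >= k * gamma - 1e-12
            assert w @ w <= k * R**2 + 1e-12

print(f"{k} update(s), bound R^2/gamma^2 = {R**2 / gamma**2:.2f}")
```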