Multi-Layer Perceptron (MLP)

In words...

The multi-layer perceptron (MLP) is a versatile learning machine from the family of neural networks which can be applied to binary classification, multi-class classification and regression.

An MLP is a model based on a layered network architecture with

Since backward connections from a layer to its predecessors are not allowed, an MLP is said to be a feed-forward neural network.

For binary classification, there is a single output unit returning a value between zero and one. The predicted label is then obtained by thresholding this value at one half. For multi-class classification, there is one output unit per category and the predicted label corresponds the category with maximal output. Finally, for regression, there is a single outptu unit directly providing the prediction.

Therefore, the number of units (or neurons) assigned to each layer is fixed by the problem dimensions for the input and output layers, but the number of neurons in the hidden layers is one of the main tunable hyperparameters of the MLP. While the number of hidden layers can also be freely chosen, we here restrict our attention to networks with a single layer. Architectures with more hidden layers are called deep networks and are typically trained in a different manner.

An MLP is trained is trained by adjusting the weights of the connections between the layers such that it minimizes the error on the training set. The procedure that allows to perform this minimization efficiently is called back-propagation. It relies on propagating the error from the output layer to the input layer in order to efficiently compute the derivatives of the error with respect to all the weights of the network.

In pictures...

Graphical representation of an MLP

Step-by-step training procedure

...

Regression example

...

In maths...

Model

Multi-Layer Perceptrons (MLP) handle multi-class classification problems with labels $\Y=\{1,\dots,Q\}$ via a one-of-$Q$ coding of these labels: a label $y\in\Y$ is encoded by a vector $$ \g y = \g e_{y} = \begin{bmatrix}0\\\vdots\\0\\1\\ 0\\ \vdots\\0\end{bmatrix} $$ of zeros with a single 1 at the $y$th component ($\g e_k$ is the $k$th unit vector of the canonical basis for $\R^Q$).

Then, an MLP is trained on a data set $\{(\g x_i, y_i) \}_{i=1}^N$ to output vectors $\g g(\g x_i) = [g_1(\g x_i), \dots, g_Q(\g x_i)]^T$ close to the label vectors $\g y_i = \g e_{y_i}$ and the decision rule is $$ f(\g x) = \arg\max_{y\in\Y} \g g_y(\g x) . $$

An MLP implements functions $\g g$ based on a layered network architecture with

Since backward connections from a layer to its predecessors are not allowed, an MLP is said to be a feed-forward neural network.

The number of units (or neurons) assigned to each layer is fixed by the problem dimensions ($d$ and $Q$) for the input and output layers. The number of neurons in the hidden layers is one of the main tunable hyperparameters of the MLP. While the number of hidden layers can also be freely chosen, we here restrict our attention to networks with a single layer. So-called deep architectures with more hidden layers are typically trained in a different manner.

The trainable parameters of the model are the weights of the connections between the different layers. Given a network with $n$ hidden neurons in the hidden layer, the output of this layer is computed as $\g h(\g x) = [h_1(\g x),\dots,h_n(\g x)]^T$, where $$ h_j(\g x) = \sigma_h( \g w_j^T \g x + w_{0j}),\quad j=1,\dots,n , $$ with parameters $\g W = [\g w_1, \g w_2, \dots, \g w_n]$ and $\g w_0 =[w_{01},\dots,w_{0n}]^T$. Similarly, the output layer computes $\g g(\g x)$ with $$ g_k(\g x) = \sigma_o( \g v_k^T \g h(\g x) + v_{0k}),\quad k=1,\dots, Q. $$

The functions $\sigma_h$ and $\sigma_o$ are called activation functions and are typically chosen as $$ \sigma_h( u ) = \tanh(u) $$ and $$ \sigma_o ( u ) = \frac{1}{1 + e^{-u}}\qquad \mbox{(a sigmoid function)} $$

Model for binary classification

For binary classification with $\Y=\{0,1\}$, we only need a single neuron in the output layer and the decision rule can be simplified to $$ f(\g x) = \I{g(\g x) \geq 0.5},\quad \mbox{with } \quad g(\g x) = \sigma_o(\g v^T\g h(\g x) + v_0) . $$

Model for regression

For regression, the output layer also contains a single neuron (unless we want to model a vector-valued quantity) and $\sigma_o$ is replaced by the identity, leading to an output layer simply performing a linear combination of the outputs of the hidden layer: $$ f(\g x) = g(\g x) = \g v^T\g h(\g x) + v_0 . $$

Training algorithm

An MLP is trained by minimizing the error over with respect to $\g g$. The error can be the sum of squared errors $$ J = \sum_{i=1}^N \| \g y_i - \g g(\g x_i)\|_2^2 , $$ which reduces to $$ J = \sum_{i=1}^N ( y_i - g(\g x_i))^2 $$ for binary classification and regression, or the cross-entropy $$ J = -\sum_{i=1}^N \sum_{k=1}^Q y_{ik} \log g_{k} ( \g x_i) = -\sum_{i=1}^N \log g_{y_i} ( \g x_i) , $$ which is not applicable to regression problems.

The minimization is typically performed by gradient descent over the weights of the network. The procedure that allows for an efficient gradient descent in an MLP is called back-propagation. It relies on the chain rule to efficiently compute the gradient of the error and all the required quantities.

First, decompose the error as a sum of pointwise error terms: $$ J = \sum_{i=1}^N J_i . $$ and notice that the gradient similarly decomposes as $$ \nabla J = \sum_{i=1}^N \nabla J_i . $$ Then, we apply a stochastic gradient descent, meaning that a single $J_i$ is considered at each step, as for the perceptron. To fix the notations, let $K = Q$ for multi-class classification and $K=1$ for binary classification and regression. We update the model by $$ \begin{cases} \g w_j \leftarrow \g w_j - \eta \drond{J_i}{\g w_j} & j=0,\dots, n,\\ \g v_k \leftarrow \g v_k - \eta \drond{J_i}{\g v_k} & k=1,\dots, K \end{cases} $$ For the output layer, the derivatives $\drond{J_i}{\g v_k}$ are readily computed. For instance, with squared loss, we have $$ \drond{J_i}{\g v_k} = \drond{J_i}{g_k(\g x_i)} \drond{g_k(\g x_i)}{\g v_k} = \underbrace{-2 (y_{ik} - g_k(\g x_i) ) \sigma_o^\prime( \g v_k^T \g h(\g x_i) + v_{0k})}_{\delta_{ik}}\ \g h(\g x_i), $$ where, for a sigmoid output fucntion $\sigma_o$, $\sigma_o^\prime( \g v_k^T \g h(\g x_i) + v_{0k}) = (1-g_k(\g x_i))g_k(\g x_i)$.
For the hidden layer, we apply the chain rule and obtain \begin{align} \drond{J_i}{\g w_j} &= \drond{J_i}{\g g(\g x_i)} \drond{\g g(\g x_i)}{h_j(\g x_i)} \drond{h_j(\g x_i)}{\g w_j} \\ &= \begin{bmatrix} -2 (y_{i1} - g_1(\g x_i) ) \\ \dots \\ -2 (y_{iK} - g_K(\g x_i) ) \end{bmatrix}^T \begin{bmatrix} \sigma_o^\prime( \g v_1^T \g h(\g x_i) + v_{01}) v_{1j}\\\dots\\\sigma_o^\prime( \g v_K^T \g h(\g x_i) + v_{0K}) v_{Kj}\end{bmatrix} \drond{h_j(\g x_i)}{\g w_j} \\ &= \left(\sum_{k=1}^K \delta_{ik} v_{kj} \right)\ \drond{h_j(\g x_i)}{\g w_j} \\ &= \left(\sum_{k=1}^K \delta_{ik} v_{kj} \right)\ \sigma_h^\prime( \g w_j^T \g x_i + w_{0j})\ \g x_i . \end{align} Therefore, the derivatives with respect to the weights of the hidden layer can be efficiently computed from the quantities $\delta_{ik}$ obtained during the computation of the derivatives for the output layer. In other words, the $\delta_{ik}$'s are by back-propagated from the output layer to the hidden layer. Since these contain the error terms $(y_{ik} - g_k(\g x_i))$ involving the network outputs $g_k(\g x_i)$, we obtain the following forward-backward algorithm:

As for the perceptron, the stochastic gradient descent loops over the entire training set with each iteration called an epoch. At each epoch $t$, the learning rate $\eta_t$ should be decreased.

Training with softmax output units and the cross-entropy loss

With the cross-entropy loss, the standard choice for the activation function of the output layer is the softmax function that returns normalized outputs in $[0,1]$ and that sum to one as $$ g_k(\g x) = \frac{e^{-(\g v_k^T\g h(\g x) + v_{0k})}}{\sum_{l=1}^K e^{-(\g v_l^T \g h(\g x) + v_{0l}) }} = \frac{e^{-z_k}}{\sum_{l=1}^K e^{-z_l}} , $$ with $z_k = \g v_k^T\g h(\g x) + v_{0k}$. We see that the $k$th output also involves the weights $\g v_l$ of the other outputs, but on the other hand, the pointwise error $J_i = \log g_{y_i}(\g x_i)$ only depends on the $y_i$th output. Hence, we have $$ \drond{J_i}{\g v_k} = \drond{J_i}{\g g(\g x_i)} \drond{\g g(\g x_i)}{z_k}\drond{z_k}{\g v_k} = \drond{J_i}{g_{y_i}(\g x_i)} \drond{g_{y_i}(\g x_i)}{z_k}\drond{z_k}{\g v_k} = \frac{-1}{g_{y_i}(\g x_i)} \drond{g_{y_i}(\g x_i)}{z_k}\g h(\g x_i), $$ where $$ \drond{g_{y_i}(\g x_i)}{z_k} = \begin{cases}  \frac{-e^{-z_{y_i}} \left(\sum_{l=1}^K e^{-z_l}\right) - e^{-z_{y_i}} (- e^{-z_k} )}{\left(\sum_{l=1}^K e^{-z_l}\right)^2} = g_{y_i}( g_k(\g x_i) - 1), & \mbox{if } k=y_i,\\ \frac{- e^{-y_i} (- e^{-z_k} )}{\left(\sum_{l=1}^K e^{-z_l}\right)^2} = g_{y_i}(\g x_i)g_k(\g x_i), & \mbox{if } k\neq y_i. \end{cases} $$ This yields $$ \drond{J_i}{\g v_k} = \begin{cases}  (1 - g_k(\g x_i) \g h(\g x_i), & \mbox{if } k=y_i,\\ -g_k(\g x_i)\g h(\g x_i), & \mbox{otherwise} \end{cases} = \underbrace{(y_{ik} - g_k(\g x_i) )}_{\delta_{ik}} \g h(\g x_i). $$

The back-propagation procedure then leads to \begin{align} \drond{J_i}{\g w_j}&= \drond{J_i}{\g g(\g x_i)} \drond{\g g(\g x_i)}{h_j(\g x_i)} \drond{h_j(\g x_i)}{\g w_j} \\ &= \drond{J_i}{g_{y_i}(\g x_i)} \drond{g_{y_i}(\g x_i)}{h_j(\g x_i)} \drond{h_j(\g x_i)}{\g w_j} \\ &= \drond{J_i}{g_{y_i}(\g x_i)} \left( \sum_{k=1}^K\drond{g_{y_i}(\g x_i)}{z_k} \drond{z_k}{h_j(\g x_i)} \right) \drond{h_j(\g x_i)}{\g w_j} \\ &= \left(\sum_{k=1}^K \delta_{ik} v_{kj} \right) \ \sigma_h^\prime( \g w_j^T \g x_i + w_{0j})\ \g x_i . \end{align}

Binary classification with the cross-entropy loss

In the binary case with $\Y=\{0,1\}$, the cross-entropy becomes $$ J =-\sum_{i=1}^N y_i \log g(\g x_i) = -\sum_{y_i = 1} \log g(\g x_i) $$ and we have (with a sigmoidal output unit), for $y_i=1$, \begin{align} \drond{J_i}{\g v} &= \drond{J_i}{g(\g x_i)} \drond{g (\g x_i)}{z} \drond{z}{\g v} \\ &= \frac{-1}{g(\g x_i)} \sigma_o^\prime(z) \g h(\g x_i) \\ &= \frac{-1}{g(\g x_i)} \frac{e^{-z}}{(1+e^{-z})^2} \g h(\g x_i) \\ &= \frac{-1}{g(\g x_i)} (1-g(\g x_i))g(\g x_i) \g h(\g x_i) \\ &= \underbrace{(g(\g x_i) - 1)}_{\delta_i} \g h(\g x_i) \end{align} and \begin{align} \drond{J_i}{\g w_j}&= \drond{J_i}{g(\g x_i)} \drond{g (\g x_i)}{z} \drond{z}{h_j(\g x_i)} \drond{h_j(\g x_i)}{\g w_j} \\ &= \delta_i v_j \ \sigma_h^\prime( \g w_j^T \g x_i + w_{0j})\ \g x_i . \end{align}

Training for regression

In a regression problem with $\Y\subseteq \R$, the squared loss is used. Since the output unit is just a linear combination with $\sigma_o(z) = z$ and $\sigma_o^\prime(z)=1$, the only change is for the output layer: \begin{align} \drond{J_i}{\g v} &= \drond{J_i}{g(\g x_i)} \drond{g (\g x_i)}{z} \drond{z}{\g v} \\ &=\underbrace{-2(y_i - g(\g x_i))}_{\delta_i} \g h(\g x_i) , \end{align} while, as before, \begin{align} \drond{J_i}{\g w_j}&= \drond{J_i}{g(\g x_i)} \drond{g (\g x_i)}{z} \drond{z}{h_j(\g x_i)} \drond{h_j(\g x_i)}{\g w_j} \\ &= \delta_i v_j \ \sigma_h^\prime( \g w_j^T \g x_i + w_{0j})\ \g x_i . \end{align}