The multi-layer perceptron (MLP) is a versatile learning machine from the family of neural networks which can be applied to binary classification, multi-class classification and regression.
An MLP is a model based on a layered network architecture with an input layer, one or more hidden layers, and an output layer.
For binary classification, there is a single output unit returning a value between zero and one. The predicted label is then obtained by thresholding this value at one half. For multi-class classification, there is one output unit per category and the predicted label corresponds to the category with maximal output. Finally, for regression, there is a single output unit directly providing the prediction.
Therefore, the number of units (or neurons) assigned to each layer is fixed by the problem dimensions for the input and output layers, but the number of neurons in the hidden layers is one of the main tunable hyperparameters of the MLP. While the number of hidden layers can also be freely chosen, we here restrict our attention to networks with a single hidden layer. Architectures with more hidden layers are called deep networks and are typically trained in a different manner.
An MLP is trained by adjusting the weights of the connections between the layers such that it minimizes the error on the training set. The procedure that allows one to perform this minimization efficiently is called back-propagation. It relies on propagating the error from the output layer back to the input layer in order to efficiently compute the derivatives of the error with respect to all the weights of the network.
Multi-Layer Perceptrons (MLP) handle multi-class classification problems with labels $\Y=\{1,\dots,Q\}$ via a one-of-$Q$ coding of these labels: a label $y\in\Y$ is encoded by a vector $$ \g y = \g e_{y} = \begin{bmatrix}0\\\vdots\\0\\1\\ 0\\ \vdots\\0\end{bmatrix} $$ of zeros with a single 1 at the $y$th component ($\g e_k$ is the $k$th unit vector of the canonical basis for $\R^Q$).
Then, an MLP is trained on a data set $\{(\g x_i, y_i) \}_{i=1}^N$ to output vectors $\g g(\g x_i) = [g_1(\g x_i), \dots, g_Q(\g x_i)]^T$ close to the label vectors $\g y_i = \g e_{y_i}$ and the decision rule is $$ f(\g x) = \arg\max_{y\in\Y} g_y(\g x) . $$
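As an illustration, here is a minimal NumPy sketch of this one-of-$Q$ coding and of the resulting decision rule (the helper names `one_hot` and `predict` are ours, not part of the text):

```python
import numpy as np

def one_hot(y, Q):
    """Encode a label y in {1, ..., Q} as the y-th canonical basis vector of R^Q."""
    e = np.zeros(Q)
    e[y - 1] = 1.0          # labels start at 1 in the text, array indices at 0 in Python
    return e

def predict(g_x):
    """Decision rule: return the label whose output g_y(x) is maximal."""
    return int(np.argmax(g_x)) + 1

# Example with Q = 3 categories:
print(one_hot(2, 3))                          # [0. 1. 0.]
print(predict(np.array([0.1, 0.7, 0.2])))     # 2
```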
An MLP implements functions $\g g$ based on a layered network architecture with an input layer, one or more hidden layers, and an output layer.
The number of units (or neurons) assigned to each layer is fixed by the problem dimensions ($d$ and $Q$) for the input and output layers. The number of neurons in the hidden layers is one of the main tunable hyperparameters of the MLP. While the number of hidden layers can also be freely chosen, we here restrict our attention to networks with a single hidden layer. So-called deep architectures with more hidden layers are typically trained in a different manner.
The trainable parameters of the model are the weights of the connections between the different layers. Given a network with $n$ neurons in the hidden layer, the output of this layer is computed as $\g h(\g x) = [h_1(\g x),\dots,h_n(\g x)]^T$, where $$ h_j(\g x) = \sigma_h( \g w_j^T \g x + w_{0j}),\quad j=1,\dots,n , $$ with parameters $\g W = [\g w_1, \g w_2, \dots, \g w_n]$ and $\g w_0 =[w_{01},\dots,w_{0n}]^T$. Similarly, the output layer computes $\g g(\g x)$ with $$ g_k(\g x) = \sigma_o( \g v_k^T \g h(\g x) + v_{0k}),\quad k=1,\dots, Q. $$
The functions $\sigma_h$ and $\sigma_o$ are called activation functions and are typically chosen as $$ \sigma_h( u ) = \tanh(u) $$ and $$ \sigma_o ( u ) = \frac{1}{1 + e^{-u}}\qquad \mbox{(a sigmoid function)} $$
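The forward pass of such a network can be sketched in a few lines of NumPy. The shape conventions below are our own: `W` is the $d\times n$ matrix with the $\g w_j$ as columns, `w0` the vector of hidden biases, `V` stacks the $\g v_k$ as rows, and `v0` holds the output biases.

```python
import numpy as np

def forward(x, W, w0, V, v0):
    """Forward pass of a single-hidden-layer MLP.

    x  : input vector of size d
    W  : d x n matrix whose j-th column is w_j
    w0 : vector of the n hidden biases w_{0j}
    V  : Q x n matrix whose k-th row is v_k
    v0 : vector of the Q output biases v_{0k}
    """
    h = np.tanh(W.T @ x + w0)                  # hidden layer, sigma_h = tanh
    g = 1.0 / (1.0 + np.exp(-(V @ h + v0)))    # output layer, sigma_o = sigmoid
    return h, g

# Example with d = 3 inputs, n = 5 hidden units and Q = 2 outputs:
rng = np.random.default_rng(0)
h, g = forward(rng.normal(size=3), rng.normal(size=(3, 5)), np.zeros(5),
               rng.normal(size=(2, 5)), np.zeros(2))
```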
An MLP is trained by minimizing the error of $\g g$ over the training set. The error can be the sum of squared errors $$ J = \sum_{i=1}^N \| \g y_i - \g g(\g x_i)\|_2^2 , $$ which reduces to $$ J = \sum_{i=1}^N ( y_i - g(\g x_i))^2 $$ for binary classification and regression, or the cross-entropy $$ J = -\sum_{i=1}^N \sum_{k=1}^Q y_{ik} \log g_{k} ( \g x_i) = -\sum_{i=1}^N \log g_{y_i} ( \g x_i) , $$ which is not applicable to regression problems.
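For concreteness, both criteria can be evaluated as follows, assuming `Y` stacks the label vectors $\g y_i$ and `G` the outputs $\g g(\g x_i)$ row-wise (these array names are illustrative):

```python
import numpy as np

def squared_error(Y, G):
    """Sum of squared errors between the label vectors Y and the outputs G (N x Q arrays)."""
    return np.sum((Y - G) ** 2)

def cross_entropy(Y, G):
    """Cross-entropy: -sum_i sum_k y_ik log g_k(x_i), i.e. -sum_i log g_{y_i}(x_i)."""
    return -np.sum(Y * np.log(G))
```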
The minimization is typically performed by gradient descent over the weights of the network. The procedure that allows for an efficient gradient descent in an MLP is called back-propagation. It relies on the chain rule to efficiently compute the gradient of the error and all the required quantities.
First, decompose the error as a sum of pointwise error terms:
$$
J = \sum_{i=1}^N J_i ,
$$
and notice that the gradient similarly decomposes as
$$
\nabla J = \sum_{i=1}^N \nabla J_i .
$$
Then, we apply stochastic gradient descent, meaning that a single $J_i$ is considered at each step, as for the perceptron. To fix notation, let $K = Q$ for multi-class classification and $K=1$ for binary classification and regression. We update the model by
$$
\begin{cases}
\g w_j \leftarrow \g w_j - \eta \drond{J_i}{\g w_j} & j=0,\dots, n,\\
\g v_k \leftarrow \g v_k - \eta \drond{J_i}{\g v_k} & k=1,\dots, K
\end{cases}
$$
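Schematically, one pass of this stochastic gradient descent over the training set could look as follows; `pointwise_gradients` is a placeholder for the function returning the derivatives of $J_i$ with respect to the weights, which is derived by back-propagation below (the function and variable names are ours, not part of the text):

```python
import numpy as np

def sgd_epoch(X, Y, W, w0, V, v0, eta, pointwise_gradients):
    """One pass of stochastic gradient descent over the training set.

    X and Y store the x_i and the label vectors y_i row-wise; pointwise_gradients
    is assumed to return the derivatives of J_i with respect to (W, w0, V, v0).
    """
    for i in np.random.permutation(len(X)):       # visit the examples in random order
        dW, dw0, dV, dv0 = pointwise_gradients(X[i], Y[i], W, w0, V, v0)
        W, w0 = W - eta * dW, w0 - eta * dw0      # hidden-layer update
        V, v0 = V - eta * dV, v0 - eta * dv0      # output-layer update
    return W, w0, V, v0
```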
For the output layer, the derivatives $\drond{J_i}{\g v_k}$ are readily computed. For instance, with squared loss, we have
$$
\drond{J_i}{\g v_k} = \drond{J_i}{g_k(\g x_i)} \drond{g_k(\g x_i)}{\g v_k} = \underbrace{-2 (y_{ik} - g_k(\g x_i) ) \sigma_o^\prime( \g v_k^T \g h(\g x_i) + v_{0k})}_{\delta_{ik}}\ \g h(\g x_i),
$$
where, for a sigmoid output function $\sigma_o$, $\sigma_o^\prime( \g v_k^T \g h(\g x_i) + v_{0k}) = (1-g_k(\g x_i))g_k(\g x_i)$.
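One can check this formula numerically by comparing it with a finite-difference approximation of $\drond{J_i}{\g v_k}$ on a small random network; the following sketch uses our earlier shape conventions and illustrative variable names:

```python
import numpy as np

# Small random single-example setup: d = 3 inputs, n = 4 hidden units, K = 2 outputs.
rng = np.random.default_rng(0)
x, y = rng.normal(size=3), np.array([1.0, 0.0])
W, w0 = rng.normal(size=(3, 4)), rng.normal(size=4)
V, v0 = rng.normal(size=(2, 4)), rng.normal(size=2)

def outputs(V):
    h = np.tanh(W.T @ x + w0)
    return h, 1.0 / (1.0 + np.exp(-(V @ h + v0)))

# Analytic gradient: dJ_i/dv_k = delta_ik h, with delta_ik = -2 (y_ik - g_k) g_k (1 - g_k).
h, g = outputs(V)
delta = -2.0 * (y - g) * g * (1.0 - g)
grad_analytic = np.outer(delta, h)            # row k holds dJ_i/dv_k

# Finite-difference approximation of the derivative with respect to one weight v_kj.
eps, k, j = 1e-6, 0, 1
Vp = V.copy(); Vp[k, j] += eps
J0, Jp = np.sum((y - outputs(V)[1]) ** 2), np.sum((y - outputs(Vp)[1]) ** 2)
print(grad_analytic[k, j], (Jp - J0) / eps)   # the two printed values should agree
```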
For the hidden layer, we apply the chain rule and obtain
\begin{align}
\drond{J_i}{\g w_j} &= \drond{J_i}{\g g(\g x_i)} \drond{\g g(\g x_i)}{h_j(\g x_i)} \drond{h_j(\g x_i)}{\g w_j} \\
&= \begin{bmatrix} -2 (y_{i1} - g_1(\g x_i) ) \\ \vdots \\ -2 (y_{iK} - g_K(\g x_i) ) \end{bmatrix}^T \begin{bmatrix} \sigma_o^\prime( \g v_1^T \g h(\g x_i) + v_{01}) v_{1j}\\ \vdots \\\sigma_o^\prime( \g v_K^T \g h(\g x_i) + v_{0K}) v_{Kj}\end{bmatrix} \drond{h_j(\g x_i)}{\g w_j} \\
&= \left(\sum_{k=1}^K \delta_{ik} v_{kj} \right)\ \drond{h_j(\g x_i)}{\g w_j} \\
&= \left(\sum_{k=1}^K \delta_{ik} v_{kj} \right)\ \sigma_h^\prime( \g w_j^T \g x_i + w_{0j})\ \g x_i .
\end{align}
Therefore, the derivatives with respect to the weights of the hidden layer can be efficiently computed from the quantities $\delta_{ik}$ obtained during the computation of the derivatives for the output layer. In other words, the $\delta_{ik}$'s are back-propagated from the output layer to the hidden layer. Since these contain the error terms $(y_{ik} - g_k(\g x_i))$ involving the network outputs $g_k(\g x_i)$, we obtain a forward-backward algorithm: a forward pass first computes the hidden activations $\g h(\g x_i)$ and the outputs $\g g(\g x_i)$, then a backward pass propagates the $\delta_{ik}$'s to obtain all the derivatives and update the weights.
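A minimal sketch of this pointwise gradient computation for the squared loss, which can be plugged into the `sgd_epoch` skeleton given earlier, might read as follows (same shape conventions as before; the function name is ours):

```python
import numpy as np

def pointwise_gradients(x, y, W, w0, V, v0):
    """Derivatives of the squared pointwise error J_i = ||y - g(x)||^2.

    W : d x n (columns w_j),  w0 : n,  V : K x n (rows v_k),  v0 : K.
    """
    # Forward pass: hidden activations and outputs.
    h = np.tanh(W.T @ x + w0)
    g = 1.0 / (1.0 + np.exp(-(V @ h + v0)))

    # Output layer: delta_ik = -2 (y_ik - g_k(x_i)) sigma_o'(.), with sigma_o'(.) = g_k (1 - g_k).
    delta = -2.0 * (y - g) * g * (1.0 - g)
    dV = np.outer(delta, h)                  # row k is dJ_i/dv_k = delta_ik h(x_i)
    dv0 = delta                              # the pre-activation has derivative 1 w.r.t. the bias v_0k

    # Hidden layer: back-propagate the deltas, with sigma_h'(.) = 1 - tanh^2(.) = 1 - h_j^2.
    dhidden = (V.T @ delta) * (1.0 - h ** 2)
    dW = np.outer(x, dhidden)                # column j is dJ_i/dw_j = (sum_k delta_ik v_kj) sigma_h'(.) x_i
    dw0 = dhidden
    return dW, dw0, dV, dv0
```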
With the cross-entropy loss, the standard choice for the activation function of the output layer is the softmax function, which returns normalized outputs in $[0,1]$ that sum to one: $$ g_k(\g x) = \frac{e^{-(\g v_k^T\g h(\g x) + v_{0k})}}{\sum_{l=1}^K e^{-(\g v_l^T \g h(\g x) + v_{0l}) }} = \frac{e^{-z_k}}{\sum_{l=1}^K e^{-z_l}} , $$ with $z_k = \g v_k^T\g h(\g x) + v_{0k}$. We see that the $k$th output also involves the weights $\g v_l$ of the other outputs, but on the other hand, the pointwise error $J_i = -\log g_{y_i}(\g x_i)$ only depends on the $y_i$th output. Hence, we have $$ \drond{J_i}{\g v_k} = \drond{J_i}{\g g(\g x_i)} \drond{\g g(\g x_i)}{z_k}\drond{z_k}{\g v_k} = \drond{J_i}{g_{y_i}(\g x_i)} \drond{g_{y_i}(\g x_i)}{z_k}\drond{z_k}{\g v_k} = \frac{-1}{g_{y_i}(\g x_i)} \drond{g_{y_i}(\g x_i)}{z_k}\g h(\g x_i), $$ where $$ \drond{g_{y_i}(\g x_i)}{z_k} = \begin{cases} \frac{-e^{-z_{y_i}} \left(\sum_{l=1}^K e^{-z_l}\right) - e^{-z_{y_i}} (- e^{-z_k} )}{\left(\sum_{l=1}^K e^{-z_l}\right)^2} = g_{y_i}(\g x_i)( g_k(\g x_i) - 1), & \mbox{if } k=y_i,\\ \frac{- e^{-z_{y_i}} (- e^{-z_k} )}{\left(\sum_{l=1}^K e^{-z_l}\right)^2} = g_{y_i}(\g x_i)g_k(\g x_i), & \mbox{if } k\neq y_i. \end{cases} $$ This yields $$ \drond{J_i}{\g v_k} = \begin{cases} (1 - g_k(\g x_i)) \g h(\g x_i), & \mbox{if } k=y_i,\\ -g_k(\g x_i)\g h(\g x_i), & \mbox{otherwise} \end{cases} = \underbrace{(y_{ik} - g_k(\g x_i) )}_{\delta_{ik}} \g h(\g x_i). $$
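This compact expression can again be checked against a finite-difference approximation; the sketch below follows the text's convention $g_k(\g x) \propto e^{-z_k}$ and uses illustrative variable names:

```python
import numpy as np

rng = np.random.default_rng(1)
n, K, yi = 4, 3, 2                      # hidden size, number of classes, true label (1-based)
h = rng.normal(size=n)                  # hidden activations h(x_i), treated as fixed here
V, v0 = rng.normal(size=(K, n)), rng.normal(size=K)
y = np.zeros(K); y[yi - 1] = 1.0        # one-of-Q coding of y_i

def softmax_outputs(V):
    z = V @ h + v0
    e = np.exp(-z)                      # text's convention: g_k proportional to exp(-z_k)
    return e / e.sum()

# Analytic gradient: dJ_i/dv_k = (y_ik - g_k(x_i)) h(x_i).
g = softmax_outputs(V)
grad_analytic = np.outer(y - g, h)

# Finite-difference check of J_i = -log g_{y_i}(x_i) on one weight v_kj.
eps, k, j = 1e-6, 0, 2
Vp = V.copy(); Vp[k, j] += eps
J0, Jp = -np.log(softmax_outputs(V)[yi - 1]), -np.log(softmax_outputs(Vp)[yi - 1])
print(grad_analytic[k, j], (Jp - J0) / eps)   # the two printed values should agree
```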
The back-propagation procedure then leads to \begin{align} \drond{J_i}{\g w_j}&= \drond{J_i}{\g g(\g x_i)} \drond{\g g(\g x_i)}{h_j(\g x_i)} \drond{h_j(\g x_i)}{\g w_j} \\ &= \drond{J_i}{g_{y_i}(\g x_i)} \drond{g_{y_i}(\g x_i)}{h_j(\g x_i)} \drond{h_j(\g x_i)}{\g w_j} \\ &= \drond{J_i}{g_{y_i}(\g x_i)} \left( \sum_{k=1}^K\drond{g_{y_i}(\g x_i)}{z_k} \drond{z_k}{h_j(\g x_i)} \right) \drond{h_j(\g x_i)}{\g w_j} \\ &= \left(\sum_{k=1}^K \delta_{ik} v_{kj} \right) \ \sigma_h^\prime( \g w_j^T \g x_i + w_{0j})\ \g x_i . \end{align}
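For completeness, the pointwise gradients in the cross-entropy/softmax case differ from the squared-loss sketch above only in the output non-linearity and in the simpler deltas $\delta_{ik} = y_{ik} - g_k(\g x_i)$; a hedged sketch under the same conventions (and the text's $e^{-z_k}$ convention for the softmax):

```python
import numpy as np

def pointwise_gradients_xent(x, y, W, w0, V, v0):
    """Derivatives of the cross-entropy pointwise error J_i = -log g_{y_i}(x_i)."""
    # Forward pass with softmax outputs.
    h = np.tanh(W.T @ x + w0)
    e = np.exp(-(V @ h + v0))
    g = e / e.sum()

    # Output layer: delta_ik = y_ik - g_k(x_i), with no sigma_o' factor here.
    delta = y - g
    dV, dv0 = np.outer(delta, h), delta

    # Hidden layer: same back-propagation rule as for the squared loss.
    dhidden = (V.T @ delta) * (1.0 - h ** 2)
    dW, dw0 = np.outer(x, dhidden), dhidden
    return dW, dw0, dV, dv0
```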