Support Vector Machines (SVMs)

In words...

Support vector machines (SVMs) are binary classifiers that can be introduced with different viewpoints. In the linear case, they can be seen as an extension of the optimal separating hyperplane or as a regularized learning algorithm. In the nonlinear case discussed here, they can be seen as an extension of the optimal separating hyperplane in feature space based on the kernel trick or as a regularized learning algorithm in Reproducing Kernel Hilbert Space (RKHS).

Training an SVM amounts to solving a convex quadratic programming problem, while predictions are computed via a linear combination of kernel functions.

In pictures...

The plot below shows the classification obtained by an SVM trained with the data set plotted as dots.
The support vectors are the highlighted points either lying on the margin boundary, in the margin (the grey-shaded area), or misclassified.

You can change the SVM parameters or click inside the plot to add points (label chosen with the mousewheel) and see how the classifier changes.

Choose a kernel :

and a kernel parameter:

Set the regularization parameter $C$ =

The SVM classifier is

$\displaystyle{f(\g x) = \sign\left( \sum_{i=1}^N \alpha_i y_i \right. }$ $\displaystyle{\left. \exp\left(\frac{-(\g x-\g x_i)^2}{2\sigma^2}\right) \right)}$ $\displaystyle{\ \g x_i^T\g x\Big)}$ with $\sigma = $ with $\gamma = $ with $\gamma = $

In maths...

Support vector machines can be introduced with different viewpoints. The two views presented below focus on the real-valued output $g(\g x)$ of the binary classifier $$ f(\g x) = \sign( g ( \g x ) ) . $$

The soft-margin SVM in feature space

Recall the linear soft-margin SVM based on the real-valued function $$ g(\g x) = \inner{\g w}{\g x} + b $$ trained by solving the convex quadratic program \begin{align*} \min_{\g w\in\R^d, b\in\R, \g \xi\in\R^N} \ & \frac{1}{2} \|\g w\|^2 + C\sum_{i=1}^N \xi_i \\ \mbox{s.t. }\ & y_i(\inner{\g w}{\g x_i} + b) \geq 1 - \xi_i , \quad i=1,\dots,N \\ & \xi_i \geq 0 \end{align*} via its dual form \begin{align*} \max_{\g \alpha \in\R^N }\ & -\frac{1}{2} \sum_{i=1}^N\sum_{j=1}^N \alpha_i \alpha_j y_i y_j \inner{ \g x_i}{\g x_j} + \sum_{i=1}^N \alpha_i \\ \mbox{s.t. } & \sum_{i=1}^N \alpha_i y_i = 0 \\ & 0\leq \alpha_i \leq C, \ i=1,\dots,N, \end{align*} the solution of which yields the parameter vector $\g w\in\X\subseteq \R^d$ as $$ \g w = \sum_{i=1}^N \alpha_i y_i \g x_i . $$

Following the standard procedure for nonlinear classification, we introduce a nonlinear mapping $\phi : \X\rightarrow \X^\phi$ which projects the data in feature space $\X^\phi$. This means that we are now searching for a function $$ g(\g x) = \inner{\g w}{\phi(\g x)}_{\X^\phi} + b, $$ with a parameter vector $\g w$ in $\X^\phi$ instead of $\X$, by solving \begin{align*} \min_{\g w\in\X^\phi, b\in\R, \g \xi\in\R^N} \ & \frac{1}{2} \|\g w\|^2 + C\sum_{i=1}^N \xi_i \\ \mbox{s.t. }\ & y_i(\inner{\g w}{\phi(\g x_i)}_{\X^\phi} + b) \geq 1 - \xi_i , \quad i=1,\dots,N .\\ & \xi_i \geq 0 \end{align*} The good news is that the derivation of the dual form remains unchanged, except for the substitution of $\phi(\g x_i)$ for $\g x_i$ and the change of inner product. In particular, this yields the quadratic programming dual problem \begin{align*} \max_{\g \alpha \in\R^N }\ & -\frac{1}{2} \sum_{i=1}^N\sum_{j=1}^N \alpha_i \alpha_j y_i y_j \inner{ \phi(\g x_i)}{\phi(\g x_j)}_{\X^\phi} + \sum_{i=1}^N \alpha_i \\ \mbox{s.t. } & \sum_{i=1}^N \alpha_i y_i = 0 \\ & 0\leq \alpha_i \leq C, \ i=1,\dots,N, \end{align*} and $$ \g w = \sum_{i=1}^N \alpha_i y_i \phi(\g x_i) . $$ At this point, we notice that both the dual form of the training problem and the classification function $$ f(\g x) = \sign\left( \inner{\sum_{i=1}^N \alpha_i y_i \phi(\g x_i) }{\phi(\g x)}_{\X^\phi} + b \right) = \sign\left(\sum_{i=1}^N \alpha_i y_i \inner{\phi(\g x_i) }{\phi(\g x)}_{\X^\phi} + b \right) $$ involve the data points only via inner products. This means that we can apply the kernel trick and replace these inner products by a kernel function $K : \X\times \X \rightarrow \R$ such that $$ \forall (\g x, \g x^\prime) \in \X^2,\quad K(\g x, \g x^\prime) = \inner{ \phi(\g x)}{\phi(\g x^\prime)}_{\X^\phi} . $$ This leads to the nonlinear SVM classifier $$ f(\g x) = \sign\left(\sum_{i=1}^N \alpha_i y_i K(\g x_i , \g x) + b \right) $$ trained by solving the (still convex and quadratic) dual problem \begin{align*} \max_{\g \alpha \in\R^N }\ & -\frac{1}{2} \sum_{i=1}^N\sum_{j=1}^N \alpha_i \alpha_j y_i y_j K(\g x_i , \g x_j) + \sum_{i=1}^N \alpha_i \\ \mbox{s.t. } & \sum_{i=1}^N \alpha_i y_i = 0 \\ & 0\leq \alpha_i \leq C, \ i=1,\dots,N. \end{align*} Note that the nonlinear mapping $\phi$ and the feature space $\X^\phi$ are only implicitly involved in this final training problem. In particular, the dimension of the feature space does not influence the number of variables in the optimization problem.

Though we formulated the SVM training problem as a quadratic program with a reasonable number of variables (equal to the size of the training set), solving this problem is still difficult for general purpose quadratic programming solvers and dedicated SVM solvers are developed. The reason is that the $N\times N$ matrix defining the quadratic term of the objective function with entries $Q_{ij} = y_iy_j K(\g x_i , \g x_j)$ is dense (with very few or no zeros). In addition, typical machine learning applications can involve large training sets with hundreds of thousands of data, in which case this matrix simply does not fit in memory and cannot be passed to the generic solver.

Minimizing a regularized and convexified empirical risk in RKHS

In the context of regularized learning, a classifier is trained by minimizing the sum of a data-fitting term and a regularization term.
In SVMs, the data term is a convexified version of the empirical risk: $$ \sum_{i=1}^N \tilde{\ell} (y_i, \sign ( g( \g x_i) ) ) $$ with $$ \tilde{\ell} (y_i,\sign ( g( \g x_i) )) = \max \{0, 1- y_i g(\g x_i) \} = \begin{cases} 0 , &\mbox{if } y_i g( \g x_i) )\geq 1 \quad (\g x_i \mbox{ is outside of the margin}), \\ 1- y_i g(\g x_i), &\mbox{otherwise}. \end{cases} $$

In the nonlinear case, the real-valued function $g$ belongs to a Reproducing Kernel Hilbert Space (RKHS) $\H$ defined as $$ \H = \left\{ g\in\R^{\X} \ :\ g = \sum_{i=1}^\infty \beta_i K(\g x_i, \cdot) ,\ \beta_i \in\R,\ \g x_i\in\X,\ \|g\|_{\H} <+\infty \right\} . $$ The RKHS norm $\|\cdot\|_{\H}$ appearing in this definition is the one induced as $$ \|g\|_{\H} = \sqrt{\inner{g}{g}} $$ by the inner product defined for two functions $g = \sum_{i=1}^\infty \beta_i K(\g x_i, \cdot)$ and $g^\prime = \sum_{i=1}^\infty \beta_i^\prime K(\g x_i^\prime, \cdot)$ as $$ \inner{g}{g^\prime}_{\H} = \sum_{i=1}^\infty \sum_{j=1}^{\infty} \beta_i \beta_j^\prime K(\g x_i, \g x_j^\prime) . $$ In this case, it is natural to measure the complexity of a function $g$ by its norm and thus to use $\|g\|_{\H}^2$ for the regularization term instead of $\|\g w\|_2^2$ in the linear case.

Putting all this together, we can state the training of a nonlinear SVM as the optimization program \begin{align*} \min_{g\in\H } \ & \frac{1}{2} \|g\|_{\H}^2 + C\sum_{i=1}^N \max \{0, 1- y_i g(\g x_i) \} . \end{align*}

At first, it might seem that this version is far from the previous derivation, in particular regarding the final form of the classifier. In the previous derivation, the classifier is based on a linear combination of kernel functions computed at the $N$ training points, while here the linear combination involves an infinite number of terms. However, the representer theorem states that the solution to the functional learning problem above is of the form $$ g = \sum_{i=1}^N \beta_i K(\g x_i, \cdot) , $$ where the $\g x_i$'s are precisely the points from the training set. Setting $\beta_i = \alpha_iy_i$ then yields the connection with the dual form of the model derived previously. In addition, a clear connection between the primal form of the linear model in feature space $g = \inner{\g w}{\phi(\g x)}_{\X^\phi}$ and the model above is obtained by choosing the nonlinear mapping $\phi(\g x) = K(\g x,\cdot)$ and $\X^\phi = \H$.

The final step in drawing these links is to consider an RKHS extended with a constant function, $$ \tilde{\H} = \H \oplus \{1\}, $$ with which we obtain $\tilde{g}\in\tilde{\H}$ in the form $\tilde{g} = g + b$ with $g\in\H$ and $b\in\R$.