Support Vector Machines (SVM)

In words...

Support vector machines (SVMs) are binary classifiers that can be introduced from different viewpoints: as an extension of the optimal separating hyperplane to the non-linearly separable case, or as a regularized learning algorithm. Here, we only discuss linear SVMs, but note that they can be extended to the nonlinear case.

The optimal separating hyperplane maximizes the margin, which is seen as a no man's land: no data point may fall within its boundaries or be misclassified. As a consequence, the optimal separating hyperplane is undefined for non-linearly separable data sets. In addition, even if the training set is linearly separable, the best classifier is not necessarily one that classifies all the training data correctly. In other words, there are a number of situations where it is well-advised to allow for errors during training.

The soft-margin SVM relaxes the definition of the margin to allow for such errors. The precise amount of error is determined by a newly introduced hyperparameter that tunes the trade-off between the maximization of the margin and the minimization of the error (that is, the softness of the margin). In this context, the hard-margin SVM is another name for the optimal separating hyperplane, corresponding to the limit case where this hyperparameter tends to infinity.

We can also introduce the SVM in the context of regularized learning, where a classifier is trained by minimizing the sum of a data-fitting term and a regularization term (at which point we recover the idea of a trade-off between two goals). In SVMs, the data term is a convexified version of the empirical risk and the regularization term controls the complexity of the linear classifier via the norm of its parameter vector.

In pictures...

The plot below shows an SVM and its soft margin for a data set in dimension 2. The support vectors are the highlighted points, which lie either on the margin boundary, inside the margin, or on the wrong side of the hyperplane.

For the chosen value of the hyperparameter $C$, the plot also reports the resulting parameters $\g w$ and $b$ of the SVM.


You can click inside the plot to add points and see how the classifier changes (use the mouse wheel to change the label).

In maths...

Support vector machines can be introduced from different viewpoints. The two views presented below focus on the real-valued output $g(\g x)$ of the binary classifier $$ f(\g x) = \sign( g ( \g x ) ) , $$ where $g(\g x) = \inner{\g w}{\g x} + b$ for linear classification.
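For concreteness, here is a minimal sketch in Python (assuming numpy; the parameter values and test point are arbitrary placeholders) of the real-valued output $g$ and the resulting classifier $f$:

```python
import numpy as np

# Arbitrary placeholder parameters of a linear classifier
w = np.array([2.0, -1.0])   # weight vector
b = 0.5                     # offset

def g(x):
    """Real-valued output g(x) = <w, x> + b."""
    return np.dot(w, x) + b

def f(x):
    """Binary prediction f(x) = sign(g(x)), in {-1, +1}."""
    return 1 if g(x) >= 0 else -1

x = np.array([1.0, 3.0])
print(g(x), f(x))   # here: -0.5 and -1
```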

Softening the margin of the optimal hyperplane

Recall the optimal separating hyperplane for binary classification, which is based on the maximization of the margin, formally stated as \begin{align*} (\g w^*, b^*) = \arg\min_{\g w\in\R^d, b\in\R} \ & \frac{1}{2} \|\g w\|^2 \\ \mbox{s.t. }\ & y_i(\inner{\g w}{\g x_i} + b) \geq 1 , \ i=1,\dots,N . \end{align*} In this formulation, the margin is seen as a no man's land with hard constraints preventing any data point from falling within its boundaries or being misclassified. As a consequence, the optimal separating hyperplane is undefined for non-linearly separable data sets. In addition, even if the training set is linearly separable, the best classifier is not necessarily one that classifies all the training data correctly. In other words, there are a number of situations where it is well-advised to allow for errors during training.

The soft-margin SVM relaxes the definition of the margin to allow for such errors. In the optimization problem above, this amounts to relaxing the constraints, which can be done by introducing positive slack variables $\xi_i$ measuring the amount of error: \begin{align*} (\g w^*, b^*, \g \xi^*) = \arg\min_{\g w\in\R^d, b\in\R, \g \xi\in\R^N} \ & \frac{1}{2} \|\g w\|^2 + C\sum_{i=1}^N \xi_i \\ \mbox{s.t. }\ & y_i(\inner{\g w}{\g x_i} + b) \geq 1 - \xi_i , \quad i=1,\dots,N ,\\ & \xi_i \geq 0 , \quad i=1,\dots,N . \end{align*} In this formulation, we also minimize the sum of the slack variables in order to keep the amount of error small. The precise amount is determined by the value of the hyperparameter $C$: it tunes the trade-off between the maximization of the margin (via the minimization of $\|\g w\|$) and the minimization of the error.
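Since this is a convex quadratic program, it can be handed to a generic solver. The sketch below is a minimal illustration assuming the cvxpy package is available; the toy data set and the value of $C$ are arbitrary:

```python
import numpy as np
import cvxpy as cp

# Arbitrary toy data set in 2D with labels in {-1, +1}
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
              [2.0, 0.5], [3.0, 1.0], [4.0, 1.5]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
N, d = X.shape
C = 1.0  # trade-off hyperparameter

# Primal variables: w, b and the slack variables xi
w = cp.Variable(d)
b = cp.Variable()
xi = cp.Variable(N)

objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi,  # relaxed margin constraints
               xi >= 0]                              # positivity of the slacks
cp.Problem(objective, constraints).solve()

print("w =", w.value, " b =", b.value, " xi =", xi.value)
```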

Minimizing a regularized and convexified empirical risk

In the context of regularized learning, a classifier is trained by minimizing the sum of a data-fitting term and a regularization term.
In SVMs, the data term is a convexified version of the empirical risk: $$ \sum_{i=1}^N \tilde{\ell} (y_i, g( \g x_i) ) $$ with the hinge loss $$ \tilde{\ell} (y_i, g( \g x_i) ) = \max \{0, 1- y_i g(\g x_i) \} = \begin{cases} 0 , &\mbox{if } y_i g( \g x_i) \geq 1 \quad (\g x_i \mbox{ is outside of the margin}), \\ 1- y_i g(\g x_i), &\mbox{otherwise}. \end{cases} $$ On the other hand, the regularization term controlling the complexity of the linear classifier with $g(\g x) = \inner{\g w}{\g x} + b$ is $\|\g w\|^2$.
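As a small illustration (assuming numpy; the real-valued outputs and labels below are placeholders), the hinge loss and the corresponding data-fitting term can be computed elementwise:

```python
import numpy as np

def hinge_loss(y, g_x):
    """Convexified loss max{0, 1 - y * g(x)}, computed elementwise."""
    return np.maximum(0.0, 1.0 - y * g_x)

# Placeholder real-valued outputs g(x_i) and labels y_i
g_x = np.array([1.5, 0.3, -2.0, -0.4])
y   = np.array([1.0, 1.0, -1.0, -1.0])

print(hinge_loss(y, g_x))        # [0.  0.7  0.  0.6]
print(hinge_loss(y, g_x).sum())  # data-fitting term: 1.3 (up to rounding)
```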

Putting all this together, we can state the training of an SVM as the optimization program \begin{align*} (\g w^*, b^* ) = \arg\min_{\g w\in\R^d, b\in\R } \ & \frac{1}{2} \|\g w\|^2 + C\sum_{i=1}^N \max \{0, 1- y_i g(\g x_i) \} \end{align*} which is equivalent to the formulation previously obtained from a different viewpoint. To see this, note that in the previous formulation, the minimization of the slack variables $\xi_i$ implies that, at the optimum, each $\xi_i^*$ takes the smallest value allowed by the constraints $\xi_i \geq 0$ and $\xi_i \geq 1 - y_i g(\g x_i)$, and thus that $$ \xi_i^* = \max \{0, 1- y_i g(\g x_i) \} = \max \{0, 1- y_i(\inner{\g w^*}{\g x_i} + b^*) \} . $$
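This equivalence can be checked numerically: solving the slack formulation (here sketched with cvxpy on an arbitrary, not linearly separable toy data set) and comparing the optimal slacks with the hinge losses of the resulting classifier should give matching values:

```python
import numpy as np
import cvxpy as cp

# Arbitrary toy data set with labels in {-1, +1} (not linearly separable)
X = np.array([[1.0, 2.0], [2.0, 3.0], [1.5, 1.0],
              [2.0, 0.5], [3.0, 1.0], [2.5, 2.5]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
N, d = X.shape
C = 1.0

w, b, xi = cp.Variable(d), cp.Variable(), cp.Variable(N)
problem = cp.Problem(
    cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi)),
    [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0])
problem.solve()

# The optimal slacks coincide with the hinge losses of the trained classifier
hinge = np.maximum(0.0, 1.0 - y * (X @ w.value + b.value))
print(np.round(xi.value, 4))
print(np.round(hinge, 4))
```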

Solving the SVM training problem (and finding the support vectors)

Here, we consider training an SVM by solving the convex quadratic program \begin{align*} (\g w^*, b^*, \g \xi^*) = \arg\min_{\g w\in\R^d, b\in\R, \g \xi\in\R^N} \ & \frac{1}{2} \|\g w\|^2 + C\sum_{i=1}^N \xi_i \\ \mbox{s.t. }\ & y_i(\inner{\g w}{\g x_i} + b) \geq 1 - \xi_i , \quad i=1,\dots,N ,\\ & \xi_i \geq 0 , \quad i=1,\dots,N . \end{align*} Being convex and quadratic, this problem can be solved with standard optimization methods. However, instead of directly applying quadratic programming algorithms, we will apply Lagrangian duality to obtain a dual formulation of the problem. The aim is to make the method work even for very high-dimensional input spaces $\X\subseteq\R^d$, for which the number of variables of this primal problem ($d + 1 + N$) might become prohibitive, whereas the dual problem involves only $N$ variables.

The dual form of the training problem of a linear SVM is the convex quadratic program \begin{align*} \max_{\g \alpha \in\R^N }\ & -\frac{1}{2} \sum_{i=1}^N\sum_{j=1}^N \alpha_i \alpha_j y_i y_j \inner{ \g x_i}{\g x_j} + \sum_{i=1}^N \alpha_i \\ \mbox{s.t. } & \sum_{i=1}^N \alpha_i y_i = 0 \\ & 0\leq \alpha_i \leq C, \ i=1,\dots,N, \end{align*} the solution of which gives the parameters as $$ \g w = \sum_{i=1}^N \alpha_i y_i \g x_i $$ and $$ b = \frac{1}{\sum_{0<\alpha_i < C } 1} \sum_{0 < \alpha_i < C} \left( y_i - \inner{\g w}{\g x_i} \right) . $$
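For illustration, here is a minimal sketch (again assuming cvxpy, with an arbitrary toy data set) that solves this dual and recovers $\g w$ and $b$; the quadratic term is written as $\|\sum_{i} \alpha_i y_i \g x_i\|^2$, which is the same quantity as the double sum above:

```python
import numpy as np
import cvxpy as cp

# Arbitrary toy data set with labels in {-1, +1}
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
              [2.0, 0.5], [3.0, 1.0], [4.0, 1.5]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
N, d = X.shape
C = 1.0

alpha = cp.Variable(N)
Xy = y[:, None] * X    # rows y_i * x_i, so that Xy.T @ alpha = sum_i alpha_i y_i x_i

dual = cp.Problem(
    cp.Maximize(-0.5 * cp.sum_squares(Xy.T @ alpha) + cp.sum(alpha)),
    [y @ alpha == 0, alpha >= 0, alpha <= C])
dual.solve()

a = alpha.value
w = Xy.T @ a           # w = sum_i alpha_i y_i x_i

# b averaged over the margin support vectors (0 < alpha_i < C),
# assuming at least one such point exists
margin_sv = (a > 1e-6) & (a < C - 1e-6)
b = np.mean(y[margin_sv] - X[margin_sv] @ w)
print("w =", w, " b =", b)
```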

Proof: We start by writing the Lagrangian of the problem: $$ L(\g w, b, \g \xi, \g \alpha, \g \mu) = \frac{1}{2} \|\g w\|^2 + C\sum_{i=1}^N \xi_i - \sum_{i=1}^N \alpha_i [y_i(\inner{\g w}{\g x_i} + b) -1 + \xi_i ] - \sum_{i=1}^N \mu_i \xi_i , $$ where the $\alpha_i \geq 0$ are the Lagrange multipliers associated with the relaxed classification constraints and the $\mu_i\geq 0$ are those associated with the positivity constraints on the slack variables $\xi_i$.
Lagrangian duality and the convexity of the problem imply that the solution coincides with the saddle point of the Lagrangian. This saddle point is found by minimizing $L$ with respect to the primal variables $(\g w, b, \g \xi)$ and maximizing it with respect to the dual variables $(\g \alpha, \g \mu)$.
Thus, at the saddle point, the derivatives with respect to the primal variables vanish and we have \begin{align*} \drond{L(\g w, b, \g \xi, \g \alpha, \g \mu) }{\g w} = \g w - \sum_{i=1}^N \alpha_i y_i \g x_i = \g 0 & \quad \Rightarrow \quad \g w = \sum_{i=1}^N \alpha_i y_i \g x_i \\ \drond{L(\g w, b, \g \xi, \g \alpha, \g \mu) }{b} = - \sum_{i=1}^N \alpha_i y_i = 0 & \quad \Rightarrow \quad \sum_{i=1}^N \alpha_i y_i = 0 \\ \drond{L(\g w, b, \g \xi, \g \alpha, \g \mu) }{\xi_i} = C - \alpha_i - \mu_i = 0 & \quad \Rightarrow \quad \mu_i = C - \alpha_i \end{align*} Using these, we can write the dual Lagrangian as \begin{align*} L_D( \g \alpha, \g \mu) &= \inf_{\g w, b, \g \xi}\ L(\g w, b, \g \xi, \g \alpha, \g \mu) \\ &= \frac{1}{2} \left\|\sum_{i=1}^N \alpha_i y_i \g x_i \right\|^2 + C \sum_{i=1}^N \xi_i - \sum_{i=1}^N \alpha_i \left[y_i\left(\inner{\sum_{j=1}^N \alpha_j y_j \g x_j\ }{\ \g x_i} + b\right) -1 + \xi_i \right] - \sum_{i=1}^N (C-\alpha_i) \xi_i . \end{align*} Note that this dual Lagrangian only depends on the $\alpha_i$'s due to the implicit constraints $\mu_i = C-\alpha_i$, and that all the terms involving the slack variables $\xi_i$ cancel out. Using the linearity of the inner product and the definition of the $\ell_2$-norm, we also have $$ \sum_{i=1}^N \alpha_i y_i \inner{\sum_{j=1}^N \alpha_j y_j \g x_j\ }{\ \g x_i} = \inner{\sum_{j=1}^N \alpha_j y_j \g x_j\ }{\ \sum_{i=1}^N \alpha_i y_i \g x_i} = \left\|\sum_{i=1}^N \alpha_i y_i \g x_i \right\|^2 , $$ which allows us to simplify the expression of the dual Lagrangian to \begin{align*} L_D( \g \alpha) &= -\frac{1}{2} \left\|\sum_{i=1}^N \alpha_i y_i \g x_i \right\|^2 - b\sum_{i=1}^N \alpha_i y_i + \sum_{i=1}^N \alpha_i \\ & = -\frac{1}{2} \sum_{i=1}^N\sum_{j=1}^N \alpha_i \alpha_j y_i y_j \inner{ \g x_i}{\g x_j} + \sum_{i=1}^N \alpha_i & \left( \mbox{due to } \sum_{i=1}^N \alpha_i y_i = 0\right) \end{align*} In a final step, we can combine the constraints $\mu_i = C-\alpha_i$ and $\mu_i\geq 0$ to derive upper bounds on the $\alpha_i$'s as $\alpha_i \leq C$. Then, the dual form of the problem, taking into account all the constraints, is the one depicted in the grey box above.


Computing $b$: The Karush-Kuhn-Tucker complementary slackness conditions state that at the optimum \begin{align*} &\alpha_i [y_i(\inner{\g w}{\g x_i} + b) -1 + \xi_i ] = 0, \quad i=1,\dots,N\\ & \mu_i \xi_i = 0 , \quad i=1,\dots,N . \end{align*} On the one hand, the first line implies that $$ \alpha_i \neq 0 \quad \Rightarrow\quad y_i(\inner{\g w}{\g x_i} + b) = 1 - \xi_i . $$ On the other hand, the second line implies that whenever $\xi_i \neq 0$, we have $\mu_i = 0$ and $\alpha_i = C - \mu_i = C$. Thus, we cannot have $\alpha_i < C$ if $\xi_i\neq 0$, or, by contraposition, $$ \alpha_i < C\quad \Rightarrow\quad \xi_i = 0 . $$ By combining these two implications we get \begin{align*} 0 < \alpha_i < C\quad &\Rightarrow\quad y_i(\inner{\g w}{\g x_i} + b) = 1 & \left(\mbox{the point } \g x_i \mbox{ lies exactly on the margin boundary}\right)\\ & \Rightarrow\quad b = \frac{1}{y_i} (1 - y_i \inner{\g w}{\g x_i} ) \\ & \Rightarrow\quad b = y_i - \inner{\g w}{\g x_i} & \left(\mbox{due to } y_i = \sign(y_i) = \frac{1}{\sign(y_i)} = \frac{1}{y_i}\right). \end{align*}
Though this formula could be directly applied to estimate $b$ with any $i$ such that $0 < \alpha_i < C$, in practice, better numerical stability is obtained by averaging over multiple points that lie exactly on the margin boundary (as proposed in the grey box above).
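In practice, one would typically rely on an off-the-shelf implementation. The sketch below (assuming scikit-learn is available and using an arbitrary toy data set) fits a linear SVM and checks that averaging $y_i - \inner{\g w}{\g x_i}$ over the margin support vectors approximately recovers the offset reported by the library:

```python
import numpy as np
from sklearn.svm import SVC

# Arbitrary toy data set with labels in {-1, +1}
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
              [2.0, 0.5], [3.0, 1.0], [4.0, 1.5]])
y = np.array([1, 1, 1, -1, -1, -1])
C = 1.0

clf = SVC(kernel="linear", C=C).fit(X, y)
w = clf.coef_.ravel()              # w = sum_i alpha_i y_i x_i
alpha_y = clf.dual_coef_.ravel()   # values alpha_i * y_i, one per support vector
sv = clf.support_                  # indices of the support vectors

# Margin support vectors have 0 < alpha_i < C
margin = np.abs(alpha_y) < C - 1e-6
b = np.mean(y[sv][margin] - X[sv][margin] @ w)

print(b, clf.intercept_[0])        # the two values should (approximately) match
```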

Support vectors

We have already seen while discussing the optimal separating hyperplane that, in the case of a linearly separable data set, the solution only depends on a subset of the data points, which we call support vectors.

A similar property holds in the soft-margin case. For an SVM, the set of support vectors is the set of training points with index $i$ such that $\alpha_i>0$. The fact that these support the solution is clearly seen from the dual expression of $\g w$, $$ \g w = \sum_{i=1}^N \alpha_i y_i \g x_i = \sum_{\alpha_i > 0} \alpha_i y_i \g x_i , $$ in which only the pairs $(\g x_i, y_i)$ with $\alpha_i >0$ play a role (all other points have Lagrange multipliers $\alpha_i = 0$ and thus do not influence the solution).
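As a last sketch (same assumptions as above: scikit-learn and an arbitrary toy data set), one can check that reconstructing $\g w$ from the support vectors alone recovers the full solution:

```python
import numpy as np
from sklearn.svm import SVC

# Arbitrary toy data set with labels in {-1, +1}
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
              [2.0, 0.5], [3.0, 1.0], [4.0, 1.5]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

sv = clf.support_                   # indices i with alpha_i > 0
alpha_y = clf.dual_coef_.ravel()    # alpha_i * y_i for those indices

w_from_sv = alpha_y @ X[sv]         # sum over the support vectors only
print(sv)
print(w_from_sv, clf.coef_.ravel()) # the two vectors coincide
```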