Support Vector Regression

In words...

Support Vector Regression is a regularized learning method for linear regression. Its main difference with ridge regression is that it is based on the $\varepsilon$-insensitive loss function instead of the more standard squared loss. This loss function creates a tube of insensitivity around the model. The width of this tube can be seen as a threshold on the pointwise error below which the data points do not incur a loss during training.

Support vector regression is closely related to support vector machines for classification and benefit from similar features such as relying on a subset of the data points, called the support vectors. It can also be similarly extended to the nonlinear case via the kernel trick.

Support vector regression involves two hyperparameters: the regularization constant and the threshold used in the loss function. The latter can be tuned automatically by a variant known as $\nu$-support vector regression.

In pictures...

The $\varepsilon$-insensitive loss function

Here is a plot of the $\varepsilon$-insensitive loss function shown as a function of the error $y-\hat{y}$ between the target label $y$ and the predicted value $\hat{y}$.

The loss is zero for all errors of magnitude less than $\varepsilon$.

The tube of insensitivity

Here is an example of support vector regression in dimension 1, i.e., with a single input variable $x$.

The target function is $$t(x) = \frac{x}{2} + 2$$ With $N = $ data points in the training set,
$C = $ and $\varepsilon =$ ,
support vector regression yields the model

$f(x) = $ $x + $

You can click in the plot to add data points and see how the model reacts.

Exercise: experiment with the insensitivity of the $\varepsilon$-insensitive loss function by adding points inside the tube; the resulting model should not change.

Exercise: observe the robustness to outliers provided by the $\varepsilon$-insensitive loss function by adding points far from the model; the model should be only slightly affected by such points (compared with, e.g., ridge regression). This is due to the close relationship between the $\varepsilon$-insensitive loss function and the absolute loss: the cost induced by large errors depends only linearly on the magnitudes of the errors.

In maths...

Primal formulation and the tube of insensitivity

Support vector regression deals with linear regression with the estimation of the parameter vector $\g w \in\R^d$ and the bias $b\in\R$ of linear (affine) models $$ f(\g x) = \g w^T \g x + b $$ based on the minimization of a trade-off between the sum of errors over the training set $\{(\g x_i, y_i)\}_{i=1}^N \in (\R^d \times \R)^N$ and a regularization term. A particularlity of this approach is the use of the $\varepsilon$-insensitive loss function $$ \ell(y, \hat{y}) = \begin{cases} | y - \hat{y}| - \varepsilon,& \mbox{if } | y - \hat{y}| > \varepsilon \\ 0, & \mbox{otherwise } , \end{cases} $$ which can also be written as $$ \ell(y, \hat{y} ) = \max \{ 0,\ | y - \hat{y}| - \varepsilon \}. $$ The loss function can be interpreted as building a tube of insensitivity aroung the model $f$: all data points that lie inside this tube of width $2\varepsilon$ are ignored, meaning that they do not induce an additional loss.

Formally, a support vector regression trains a linear model by solving $$ (\hat{\g w}, \hat{b}) = \arg\min_{\g w\in\R^d, b\in\R} \ \frac{1}{2} \|\g w\|_2^2 + C \sum_{i=1}^N \max \{ 0,\ | y_i - \g w^T \g x_i - b | - \varepsilon \} , $$ where $C>0$ is the hyperparameter that tunes the trade-off between the fit to the data and the control of the model complexity via the norm of $\g w$.

The training problem above can be reformulated as a quadratic program by adding positive slack variables $\xi_i$ and $\xi_i^\prime$: \begin{align} \min_{\g w\in\R^d, b\in\R, \g \xi\in\R^N , \g \xi^\prime \in\R^N} \ &\frac{1}{2} \|\g w\|_2^2 + C \sum_{i=1}^N (\xi_i + \xi_i^\prime)\\ \mbox{s.t. } & y_i - \g w^T \g x_i - b \leq \varepsilon + \xi_i \\ &\g w^T \g x_i + b - y_i \leq \varepsilon + \xi_i^\prime ,\quad i=1,\dots, N\\ & \xi_i \geq 0\\ & \xi_i^\prime \geq 0 \end{align}

Solving the training problem via the dual form

For large-scale problems it is more convenient to work with the dual formulation obtained by Lagrangian duality. In particular, this dual formulation will be easily extended to the nonlinear case.

The dual form of the training problem of a linear support vector regression is the convex quadratic program \begin{align} \max_{\g \alpha\in\R^N,\g \alpha_i^\prime \in\R^N} \ & -\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N(\alpha_i - \alpha_i^\prime)(\alpha_j - \alpha_j^\prime)\inner{\g{x}_i}{ \g{x}_j} - \varepsilon \sum_{i=1}^N(\alpha_i + \alpha_i^\prime) + \sum_{i=1}^N y_i(\alpha_i - \alpha_i^\prime)\\ \mbox{s.t. } & \sum_{i=1}^N(\alpha_i - \alpha_i^\prime) = 0 \\ & 0\leq\alpha_i\leq C,\quad i=1,\dots,N\\ & 0\leq\alpha_i^\prime\leq C,\quad i=1,\dots,N, \end{align} the solution of which gives the parameters as $$ \g w = \sum_{i=1}^N (\alpha_i - \alpha_i^\prime) \g x_i $$ and $$ b = \frac{0.5}{\sum_{0<\alpha_i < C } 1} \sum_{0 < \alpha_i < C} y_i - \varepsilon - \inner{\g w}{\g x_i} +\frac{0.5}{\sum_{0<\alpha_i^\prime < C } 1} \sum_{0 < \alpha_i^\prime < C} y_i + \varepsilon - \inner{\g w}{\g x_i} . $$

Proof: We start by writing the Lagrangian of the original quadratic problem by introducing the Lagrange multipliers $\eta_i$ and $\eta_i^\prime \geq 0$ associated to the postivity constraints on $\xi_i$ and $\xi_i^\prime$ and the multipliers $\alpha_i$, $\alpha_i^\prime \geq 0$ to deal with the constraints on the data points. This gives: \begin{align} L(\g w, b, \g \xi, \g \xi^\prime, \g \eta, \g \eta^\prime, \g \alpha, \g \alpha^\prime) = & \frac{1}{2}\|\g{w}\|^2 + C\sum_{i=1}^N(\xi_i + \xi_i^\prime) - \sum_{i=1}^N(\eta_i\xi_i + \eta_i^\prime\xi_i^\prime) \\ &- \sum_{i=1}^N \alpha_i(\varepsilon + \xi_i - y_i +\inner{\g{w}}{\g{x}_i} + b) - \sum_{i=1}^N \alpha_i^\prime(\varepsilon + \xi_i^\prime + y_i - \inner{\g{w}}{\g x_i} - b), \end{align} whose saddle-point gives the solution, i.e., we have to minimize $L$ with respect to the primal variables $(\g w, b, \g \xi, \g \xi^\prime)$ while maximizing it with respect to the dual variables $(\g \eta, \g \eta^\prime, \g \alpha, \g \alpha^\prime)$.
Computing the derivatives of $L$ with respect to the primal variables yields \begin{align} \frac{\partial L}{\partial \g{w}} &= \g{w} - \sum_{i=1}^N(\alpha_i^\prime - \alpha_i)\g x_i = 0 & \Rightarrow & \g{w} = \sum_{i=1}^N(\alpha_i^\prime - \alpha_i)\g x_i\\ \frac{\partial L}{\partial b} &= \sum_{i=1}^N(\alpha_i^\prime - \alpha_i) = 0\\ \frac{\partial L}{\partial \xi_i} &= C - \alpha_i - \eta_i = 0 & \Rightarrow & \eta_i = C - \alpha_i \\ \frac{\partial L}{\partial \xi_i^\prime} &= C - \alpha_i^\prime - \eta_i^\prime = 0 & \Rightarrow & \eta_i^\prime = C - \alpha_i^\prime. \end{align} Introducing these results in the Lagrangian $L$ gives the dual Lagrangian $L_D$ which only involves the dual variables and which must be maximized with respect to these: \begin{align} L_D = &\sum_{i=1}^N\sum_{j=1}^N(\alpha_i - \alpha_i^\prime)(\alpha_j - \alpha_j^\prime)\inner{\g{x}_i}{ \g{x}_j} + \sum_{i=1}^N (\alpha_i \xi_i + \alpha_i^\prime \xi_i^\prime) \\ & - \sum_{i=1}^N \alpha_i\left(\varepsilon + \xi_i - y_i +\inner{\sum_{j=1}^N(\alpha_j^\prime - \alpha_j)\g x_j}{\g{x}_i} + b\right) - \sum_{i=1}^N \alpha_i^\prime \left(\varepsilon + \xi_i^\prime + y_i - \inner{\sum_{j=1}^N(\alpha_j^\prime - \alpha_j)\g x_j}{\g x_i} - b\right) \\ = &-\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N (\alpha_i - \alpha_i^\prime)(\alpha_j - \alpha_j^\prime)\inner{\g{x}_i}{ \g{x}_j} - \varepsilon \sum_{i=1}^N(\alpha_i + \alpha_i^\prime) + \sum_{i=1}^N y_i(\alpha_i - \alpha_i^\prime) + b\sum_{i=1}^N ( \alpha_i^\prime - \alpha_i) % & \left(\mbox{due to the linearity and symmetry of the inner product}\right) \\ = &-\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N( \alpha_i - \alpha_i^\prime)(\alpha_j - \alpha_j^\prime)\inner{\g{x}_i}{ \g{x}_j} - \varepsilon \sum_{i=1}^N(\alpha_i + \alpha_i^\prime) + \sum_{i=1}^N y_i(\alpha_i - \alpha_i^\prime) & \left(\mbox{due to } \sum_{i=1}^N(\alpha_i^\prime - \alpha_i) = 0 \right) \end{align} In addition, since the Lagrange multipliers $\eta_i$ and $\eta_i^\prime$ are positive, we have $$ C - \alpha_i \geq 0 \quad \Rightarrow \quad \alpha_i \leq C $$ and $$ C - \alpha_i \geq 0 \quad \Rightarrow \quad \alpha_i \leq C $$ which give the box constraints on the dual variables $\alpha_i$ and $\alpha_i^\prime$ in the dual problem depicted in the grey box above.

Computing $b$: The Karush-Kuhn-Tucker complementary slackness conditions state that at the optimum \begin{align} \alpha_i(\varepsilon + \xi_i - y_i + \inner{\g w}{\g x_i} + b) &= 0 \\ \alpha_i^\prime(\varepsilon + \xi_i^\prime + y_i - \inner{\g w}{\g x_i} - b) &= 0\\ (C-\alpha_i)\xi_i &= 0\\ (C-\alpha_i^\prime)\xi_i^\prime &= 0. \end{align} On the one hand, the first two lines imply that $$ \alpha_i \neq 0 \quad \Rightarrow\quad \inner{\g w}{\g x_i} + b = y_i -\varepsilon - \xi_i $$ and $$ \alpha_i^\prime \neq 0 \quad \Rightarrow\quad \inner{\g w}{\g x_i} + b = y_i +\varepsilon + \xi_i^\prime . $$ On the other hand, the two last lines imply that whenever $\xi_i \neq 0$, we have $\alpha_i = C$ and similarly $\xi_i^\prime\neq 0\ \Rightarrow \alpha_i^\prime = C$. Thus, we cannot have $\alpha_i$ or $\alpha_i^\prime$ strictly lower than $C$ if $\xi_i$ or $\xi_i^\prime$ are nonzero. Conversly, we have $$ \alpha_i < C\quad \Rightarrow\quad \xi_i = 0 $$ and $$ \alpha_i^\prime < C\quad \Rightarrow\quad \xi_i^\prime = 0 . $$ By combining these implications we get \begin{align*} 0 < \alpha_i < C\quad &\Rightarrow\quad \inner{\g w}{\g x_i} + b = y_i - \varepsilon& \left(\mbox{the point } (\g x_i, y_i) \mbox{ lies exactly on the lower tube boundary}\right)\\ & \Rightarrow\quad b = y_i - \varepsilon - \inner{\g w}{\g x_i} \end{align*} and \begin{align*} 0 < \alpha_i^\prime < C\quad &\Rightarrow\quad \inner{\g w}{\g x_i} + b = y_i +\varepsilon & \left(\mbox{the point } (\g x_i, y_i) \mbox{ lies exactly on the upper tube boundary}\right)\\ & \Rightarrow\quad b = y_i + \varepsilon - \inner{\g w}{\g x_i} . \end{align*} This also shows that we cannot have $\alpha_i \neq 0$ while $\alpha_i^\prime \neq 0$ and vice versa. In other words, the point cannot lie at the same time both on the upper and lower boundaries of the tube (unless $\varepsilon = 0$).
Though these formulas could be directly applied to estimate $b$ with any $i$ such that, e.g., $0 < \alpha_i < C$, in practice, better numerical stability is obtained by averaging over multiple points that lie exactly on the tube boundary (as proposed in the grey box above).

Support vectors

The set of support vectors is the set of training points with index $i$ such that $\alpha_i+\alpha_i^\prime > 0$. The fact that these support the model is clearly seen from the dual expression of $\g w$, $$ \g w = \sum_{i=1}^N (\alpha_i - \alpha_i^\prime) \g x_i = \sum_{\alpha_i>0} \alpha_i \g x_i - \sum_{\alpha_i^\prime > 0}\alpha_i^\prime \g x_i = \sum_{\alpha_i+\alpha_i^\prime > 0} (\alpha_i - \alpha_i^\prime) \g x_i, $$ in which only the pairs $(\g x_i, y_i)$ with $\alpha_i + \alpha_i^\prime >0$ play a role (the others have Lagrange multipliers $\alpha_i =\alpha_i^\prime= 0$ due to the constraints $\alpha_i\geq 0$, $\alpha_i^\prime \geq 0$).