Ridge regression

In words...

Ridge regression is a regularized version of the least squares method for linear regression. It is based on the search for the linear model that minimizes a trade-off between the sum of squared errors over the training set and the norm of the parameter vector.

Ridge regression is particularly useful for high-dimensional problems, in which the number of data points is too small for the least squares method to yield reliable estimates.

In pictures...

Influence of the ridge regularization

Here is an example of affine regression in dimension 1, i.e., with a single input variable $x$ but two parameters (the slope of the line and the intercept).

The target function is $$t(x) = \frac{x}{2} + 2$$ With $N = $ data points in the training set,
the least squares method yields the linear model

$f(x) = $ $x + $

while ridge regression with $\lambda = $ yields

$f(x) = $ $x + $

You can click in the plot to add data points and see how the two models react.
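For readers without access to the interactive plot, a minimal script in the same spirit is sketched below; the number of points, the noise level and the value of $\lambda$ are arbitrary choices for the illustration, and the intercept is handled by appending a constant feature to $x$. Note that this plain formulation also penalizes the intercept.

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data sampled from the target function t(x) = x/2 + 2, plus noise.
N = 10                                   # arbitrary choice for the illustration
x = rng.uniform(-5, 5, size=N)
y = x / 2 + 2 + rng.normal(scale=1.0, size=N)

# Append a constant feature so that f(x) = w1 * x + w0 includes the intercept.
X = np.column_stack([x, np.ones(N)])

# Least squares fit.
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Ridge fit with an arbitrary regularization level.
lam = 10.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

print("least squares: f(x) = %.2f x + %.2f" % tuple(w_ls))
print("ridge:         f(x) = %.2f x + %.2f" % tuple(w_ridge))
```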

In maths...

Ridge regression extends the least squares method for linear regression by introducing a regularization term in the objective function. Recall that the least squares method estimates the parameter vector $\g w \in\R^d$ of linear models $$ f(\g x) = \g w^T \g x $$ by minimizing the sum of squared errors over the training set $\{(\g x_i, y_i)\}_{i=1}^N \in (\R^d \times \R)^N$. Adding the regularization term leads to $$ \hat{\g w} = \arg\min_{\g w\in\R^d} \ J(\g w) , \quad\mbox{with}\quad J(\g w) = \sum_{i=1}^N (y_i - \g w^T\g x_i)^2 + \lambda \|\g w\|_2^2 , $$ where $\lambda>0$ is the hyperparameter that tunes the trade-off between the fit to the data and the control of the model complexity, here measured via the norm of $\g w$.
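This objective is straightforward to transcribe in code; here is a minimal NumPy sketch, where `X`, `y` and the value `lam` (standing for $\lambda$) are placeholder data chosen only to evaluate the function.

```python
import numpy as np

def ridge_objective(w, X, y, lam):
    """Ridge objective J(w) = ||y - X w||_2^2 + lam * ||w||_2^2."""
    r = y - X @ w
    return r @ r + lam * w @ w

# Placeholder data just to evaluate the objective.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))             # N = 20 points in dimension d = 3
y = rng.normal(size=20)
print(ridge_objective(np.zeros(3), X, y, lam=1.0))
```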

Ridge regression enjoys a computational efficiency similar to that of the least squares method, since the convex optimization problem above also has a closed-form solution. Introducing the target vector $\g y$ and the observation matrix $\g X$ as $$ \g y = \begin{bmatrix}y_1\\\vdots\\y_N\end{bmatrix} \in\R^N,\quad \mbox{ and }\quad \g X = \begin{bmatrix}\g x_1^T\\\vdots\\\g x_N^T\end{bmatrix} \in\R^{N\times d} , $$ we can rewrite the objective function as \begin{align*} J(\g w) &= \|\g y - \g X\g w\|_2^2 + \lambda \|\g w\|_2^2 \\ &= (\g y - \g X\g w)^T (\g y - \g X\g w) + \lambda \|\g w\|_2^2 \\ &= \g w^T\g X^T\g X \g w - 2\g w^T\g X^T\g y + \g y^T\g y + \lambda \|\g w\|_2^2 \\ &= \g w^T( \g X^T\g X + \lambda\g I) \g w - 2\g w^T\g X^T\g y + \g y^T\g y \end{align*} where $\g I$ is the identity matrix. Since the objective function $J$ is convex, its gradient vanishes at the optimum: $$ \nabla J (\hat{\g w}) = 2( \g X^T\g X + \lambda\g I) \hat{\g w} - 2 \g X^T\g y = \g 0 , $$ which implies that $$ ( \g X^T\g X + \lambda\g I) \hat{\g w} = \g X^T\g y. $$ Thus, the ridge regression estimate $\hat{\g w}$ can be found by solving a linear system, which can be done very efficiently. In addition, the solution admits the following closed form.
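In practice, this linear system can be assembled and solved directly, for instance with NumPy; the sketch below is one possible implementation, with illustrative variable names.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Return w solving (X^T X + lam * I) w = X^T y."""
    d = X.shape[1]
    A = X.T @ X + lam * np.eye(d)
    b = X.T @ y
    return np.linalg.solve(A, b)         # solve the system; no explicit inverse
```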

If the matrix $( \g X^T\g X + \lambda\g I)$ is invertible, then the ridge regression estimate is given by $$ \hat{\g w} = ( \g X^T\g X + \lambda\g I)^{-1} \g X^T\g y. $$
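As a quick sanity check on placeholder random data, the explicit inverse and the linear-system route give the same estimate; the commented lines also show how one could cross-check against scikit-learn's `Ridge`, assuming `fit_intercept=False` so that no intercept is added to the model.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))             # placeholder data for the check
y = rng.normal(size=20)
lam = 0.5

A = X.T @ X + lam * np.eye(X.shape[1])
w_solve = np.linalg.solve(A, X.T @ y)    # linear-system route
w_inv = np.linalg.inv(A) @ (X.T @ y)     # closed form with an explicit inverse

print(np.allclose(w_solve, w_inv))       # True

# Optional cross-check against scikit-learn, which minimizes the same
# objective with alpha playing the role of lambda:
# from sklearn.linear_model import Ridge
# print(Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_)
```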

Compared with the closed-form solution of the least squares method, this one is numerically more attractive: for any $\lambda > 0$, the matrix $\g X^T\g X + \lambda\g I$ is positive definite and hence invertible, and the regularization improves the conditioning of the matrix to be inverted.
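This effect on the conditioning is easy to observe numerically; the sketch below builds an arbitrary, nearly collinear design matrix and compares the condition numbers of $\g X^T\g X$ with and without the regularization term.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
X[:, 4] = X[:, 3] + 1e-6 * rng.normal(size=50)   # two nearly collinear columns

G = X.T @ X
lam = 1.0
print(np.linalg.cond(G))                    # huge: G is nearly singular
print(np.linalg.cond(G + lam * np.eye(5)))  # orders of magnitude smaller
```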