Ridge regression is a regularized version of the least squares method for linear regression. It is based on the search for the linear model that minimizes a trade-off between the sum of squared errors over the training set and the norm of the parameter vector.

Ridge regression is particularly useful for high-dimensional problems, where there are too few data points for the least squares method to produce a reliable estimate.

The target function is $$t(x) = \frac{x}{2} + 2$$
*An interactive plot (not reproduced here) shows a training set of $N$ data points together with the linear model obtained by the least squares method and the one obtained by ridge regression for a chosen value of $\lambda$. Clicking in the plot adds data points and updates both fits.*

Ridge regression extends the least squares method for linear regression by introducing a regularization term in the objective function. The least squares method estimates the parameter vector $\g w \in\R^d$ of linear models $$ f(\g x) = \g w^T \g x $$ by minimizing the sum of squared errors over the training set $\{(\g x_i, y_i)\}_{i=1}^N \in (\R^d \times \R)^N$. Ridge regression instead computes $$ \hat{\g w} = \arg\min_{\g w\in\R^d} \ J(\g w) = \sum_{i=1}^N (y_i - \g w^T\g x_i)^2 + \lambda \|\g w\|_2^2 , $$ where $\lambda>0$ is the hyperparameter that tunes the trade-off between the fit to the data and the control of the model complexity, here measured via the norm of $\g w$.
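The objective $J(\g w)$ above is straightforward to evaluate numerically. As a minimal sketch (the data values here are made up for illustration), with rows of `X` playing the role of the $\g x_i^T$:

```python
import numpy as np

def ridge_objective(w, X, y, lam):
    """Ridge objective J(w): sum of squared errors plus lam * ||w||^2."""
    residuals = y - X @ w
    return residuals @ residuals + lam * (w @ w)

# Toy data: N = 3 points in d = 2 dimensions (illustrative values only).
X = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [1.0, 0.0]])
y = np.array([1.0, 0.5, 0.2])
w = np.array([0.1, 0.3])

J = ridge_objective(w, X, y, lam=0.5)  # squared-error term 0.14 + penalty 0.05
```

Note that larger $\lambda$ penalizes the same $\g w$ more heavily, which is exactly the complexity control described above.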

Ridge regression matches the computational efficiency of the least squares method, since the convex optimization problem above also has a closed-form solution. Introducing the target vector $\g y$ and the observation matrix $\g X$ as $$ \g y = \begin{bmatrix}y_1\\\vdots\\y_N\end{bmatrix} \in\R^N,\quad \mbox{ and }\quad \g X = \begin{bmatrix}\g x_1^T\\\vdots\\\g x_N^T\end{bmatrix} \in\R^{N\times d} , $$ we can rewrite the objective function as \begin{align*} J(\g w) &= \|\g y - \g X\g w\|_2^2 + \lambda \|\g w\|_2^2 \\ &= (\g y - \g X\g w)^T (\g y - \g X\g w) + \lambda \|\g w\|_2^2 \\ &= \g w^T\g X^T\g X \g w - 2\g w^T\g X^T\g y + \g y^T\g y + \lambda \|\g w\|_2^2 \\ &= \g w^T( \g X^T\g X + \lambda\g I) \g w - 2\g w^T\g X^T\g y + \g y^T\g y \end{align*} where $\g I$ is the identity matrix. Since the objective function $J$ is convex, we know that its gradient cancels at the optimum: $$ \nabla J (\hat{\g w}) = 2( \g X^T\g X + \lambda\g I) \hat{\g w} - 2 \g X^T\g y = \g 0 , $$ which implies that $$ ( \g X^T\g X + \lambda\g I) \hat{\g w} = \g X^T\g y. $$ Thus, the ridge regression estimate $\hat{\g w}$ can be found by solving a linear system, which can be performed very efficiently. In addition, we also have the following closed form for the solution.
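Solving the linear system directly is how one would typically implement this in practice. A sketch in NumPy, using training data generated from the article's target $t(x) = x/2 + 2$ (the feature map $[x, 1]$, the noise level, and the value of $\lambda$ are assumptions for this illustration; including the constant feature puts the intercept inside $\g w$):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve (X^T X + lam I) w = X^T y for w, without forming an explicit inverse."""
    d = X.shape[1]
    A = X.T @ X + lam * np.eye(d)
    return np.linalg.solve(A, X.T @ y)

# Data from t(x) = x/2 + 2 with small additive noise (assumed setup).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=20)
X = np.column_stack([x, np.ones_like(x)])   # features [x, 1] -> w = [slope, intercept]
y = x / 2 + 2 + rng.normal(scale=0.1, size=20)

w_hat = ridge_fit(X, y, lam=0.1)  # should land near [0.5, 2.0]
```

By construction, `w_hat` satisfies the optimality condition $(\g X^T\g X + \lambda\g I)\hat{\g w} = \g X^T\g y$ up to floating-point precision.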

If the matrix $( \g X^T\g X + \lambda\g I)$ is invertible, then the ridge regression estimate is given by $$ \hat{\g w} = ( \g X^T\g X + \lambda\g I)^{-1} \g X^T\g y. $$

Compared with the closed-form solution of the least squares method, this one is numerically more attractive, as the regularization improves the conditioning of the matrix to be inverted.
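This conditioning effect is easy to observe numerically: adding $\lambda\g I$ shifts every eigenvalue of $\g X^T\g X$ up by $\lambda$, which shrinks the ratio of the largest to the smallest eigenvalue. A small sketch with a nearly collinear design matrix (the specific values are chosen only to make $\g X^T\g X$ close to singular):

```python
import numpy as np

# Two almost-identical columns make X^T X nearly singular.
X = np.array([[1.0, 1.0001],
              [1.0, 0.9999],
              [1.0, 1.0000]])
A = X.T @ X

lam = 0.1
cond_plain = np.linalg.cond(A)                  # huge: A is nearly singular
cond_ridge = np.linalg.cond(A + lam * np.eye(2))  # moderate: eigenvalues shifted by lam
```

Here `cond_plain` is on the order of $10^8$, while `cond_ridge` drops to roughly $10^2$, so the regularized system can be solved accurately where the plain normal equations cannot.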