# Ridge regression

## In words...

Ridge regression is a regularized version of the least squares method for linear regression. It is based on the search for the linear model that minimizes a trade-off between the sum of squared errors over the training set and the norm of the parameter vector.

Ridge regression is particularly useful for high-dimensional problems, where the number of data points is insufficient for the least squares method to yield a reliable estimate.

## In pictures...

### Influence of the ridge regularization

Here is an example of affine regression in dimension 1, i.e., with a single input variable $x$ but two parameters (the slope of the line and the offset at the origin).

The target function is $$t(x) = \frac{x}{2} + 2 .$$

[Interactive plot: for a training set of $N$ data points, the demo showed the affine model $f(x) = w_1 x + w_0$ estimated by the least squares method alongside the one estimated by ridge regression for a chosen value of $\lambda$. Clicking in the plot added data points, showing how the least squares model reacts.]
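The 1D affine example above can be reproduced numerically. The sketch below (assuming NumPy; the noise level, sample size, and $\lambda$ are illustrative choices, not taken from the demo) generates data from $t(x) = x/2 + 2$ and fits both models:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from the target t(x) = x/2 + 2 with Gaussian noise
N = 10
x = rng.uniform(-5, 5, size=N)
y = x / 2 + 2 + rng.normal(scale=0.5, size=N)

# Design matrix with a constant column for the offset at the origin
X = np.column_stack([x, np.ones(N)])

# Least squares: solve (X^T X) w = X^T y
w_ls = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge regression: solve (X^T X + lambda I) w = X^T y
lam = 1.0  # illustrative value of the hyperparameter
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

print("least squares [slope, offset]:", w_ls)
print("ridge         [slope, offset]:", w_ridge)
```

Note that the ridge estimate always has a norm no larger than the least squares one, since the penalty shrinks the parameters toward zero.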

## In maths...

Ridge regression extends the least squares method for linear regression, which estimates the parameter vector $\g w \in\R^d$ of linear models $$f(\g x) = \g w^T \g x$$ by minimizing the sum of squared errors over the training set $\{(\g x_i, y_i)\}_{i=1}^N \in (\R^d \times \R)^N$, by introducing a regularization term in the objective function. This leads to $$\hat{\g w} = \arg\min_{\g w\in\R^d} \ J(\g w) = \sum_{i=1}^N (y_i - \g w^T\g x_i)^2 + \lambda \|\g w\|_2^2 ,$$ where $\lambda>0$ is the hyperparameter that tunes the trade-off between the fit to the data and the control of the model complexity, here measured via the norm of $\g w$.
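The objective $J(\g w)$ above translates directly into code. A minimal sketch (assuming NumPy; the function name and the toy data are illustrative):

```python
import numpy as np

def ridge_objective(w, X, y, lam):
    """J(w): sum of squared errors plus the squared-norm penalty lam * ||w||^2."""
    residuals = y - X @ w
    return residuals @ residuals + lam * w @ w

# Tiny example: two points in R^2, identity design matrix
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 2.0])
w = np.array([1.0, 1.0])
print(ridge_objective(w, X, y, lam=0.5))  # (1-1)^2 + (2-1)^2 + 0.5*(1^2+1^2) = 2.0
```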

Ridge regression enjoys a computational efficiency similar to that of the least squares method, since the convex optimization problem above also has a closed-form solution. Introducing the target vector $\g y$ and the observation matrix $\g X$ as $$\g y = \begin{bmatrix}y_1\\\vdots\\y_N\end{bmatrix} \in\R^N,\quad \mbox{ and }\quad \g X = \begin{bmatrix}\g x_1^T\\\vdots\\\g x_N^T\end{bmatrix} \in\R^{N\times d} ,$$ we can rewrite the objective function as \begin{align*} J(\g w) &= \|\g y - \g X\g w\|_2^2 + \lambda \|\g w\|_2^2 \\ &= (\g y - \g X\g w)^T (\g y - \g X\g w) + \lambda \|\g w\|_2^2 \\ &= \g w^T\g X^T\g X \g w - 2\g w^T\g X^T\g y + \g y^T\g y + \lambda \|\g w\|_2^2 \\ &= \g w^T( \g X^T\g X + \lambda\g I) \g w - 2\g w^T\g X^T\g y + \g y^T\g y \end{align*} where $\g I$ is the identity matrix. Since the objective function $J$ is convex, we know that its gradient cancels at the optimum: $$\nabla J (\hat{\g w}) = 2( \g X^T\g X + \lambda\g I) \hat{\g w} - 2 \g X^T\g y = \g 0 ,$$ which implies that $$( \g X^T\g X + \lambda\g I) \hat{\g w} = \g X^T\g y.$$ Thus, the ridge regression estimate $\hat{\g w}$ can be found by solving a linear system, which can be done very efficiently. In addition, we also have the following closed form for the solution.
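The optimality condition can be checked numerically: solving the linear system $(\g X^T\g X + \lambda\g I)\hat{\g w} = \g X^T\g y$ and evaluating the gradient at $\hat{\g w}$ should give (numerically) zero. A sketch assuming NumPy, with arbitrary random data:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 50, 5
X = rng.normal(size=(N, d))
y = rng.normal(size=N)
lam = 0.1

# Solve (X^T X + lambda I) w_hat = X^T y
A = X.T @ X + lam * np.eye(d)
w_hat = np.linalg.solve(A, X.T @ y)

# Gradient of J at w_hat: should vanish at the optimum
grad = 2 * A @ w_hat - 2 * X.T @ y
print(np.linalg.norm(grad))  # numerically zero
```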

If the matrix $( \g X^T\g X + \lambda\g I)$ is invertible, then the ridge regression estimate is given by $$\hat{\g w} = ( \g X^T\g X + \lambda\g I)^{-1} \g X^T\g y.$$ Note that this is always the case for $\lambda > 0$: since $\g X^T\g X$ is positive semi-definite, $\g X^T\g X + \lambda\g I$ is positive definite and hence invertible.

Compared with the closed-form solution of the least squares method, this one is numerically more attractive, as the regularization improves the conditioning of the matrix to be inverted.
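This conditioning effect is easy to observe when $d > N$, in which case $\g X^T\g X$ is singular while $\g X^T\g X + \lambda\g I$ is well conditioned. A sketch assuming NumPy (the dimensions and $\lambda$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
# More features than samples: X^T X is rank-deficient
N, d = 5, 10
X = rng.normal(size=(N, d))

lam = 1.0
G = X.T @ X                    # singular when d > N
G_ridge = G + lam * np.eye(d)  # positive definite for lam > 0

print(np.linalg.cond(G))       # huge: the matrix is (numerically) singular
print(np.linalg.cond(G_ridge)) # moderate: regularization bounds the condition number
```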