The LASSO

In words...

The LASSO (least absolute shrinkage and selection operator) is a regularized linear regression method that, like ridge regression, is based on minimizing the sum of squared errors. However, while ridge regression penalizes the squared $\ell_2$-norm of the parameters, the LASSO penalizes their $\ell_1$-norm, i.e., the sum of their absolute values.

Due to this particular penalization, the LASSO tends to favor sparse solutions, i.e., linear models with few nonzero parameters. Since each parameter of a linear model is associated with a feature of the input vector, the LASSO can be seen as performing feature selection while training the model. It is therefore particularly well suited to high-dimensional data or to cases where many features are irrelevant.
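
To make this concrete, here is a minimal sketch (scikit-learn on synthetic data, with arbitrarily chosen regularization strengths) comparing ridge regression and the LASSO when only a few features are relevant: ridge shrinks all coefficients but typically keeps them nonzero, whereas the LASSO drives most of the irrelevant ones exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
N, d = 100, 20

# Only 3 of the 20 features actually influence the target.
w_true = np.zeros(d)
w_true[[0, 5, 12]] = [3.0, -2.0, 1.5]

X = rng.standard_normal((N, d))
y = X @ w_true + 0.5 + 0.1 * rng.standard_normal(N)

ridge = Ridge(alpha=1.0).fit(X, y)  # squared l2 penalty: shrinks coefficients
lasso = Lasso(alpha=0.1).fit(X, y)  # l1 penalty: zeros out irrelevant ones

print("nonzero ridge coefficients:", np.sum(ridge.coef_ != 0))  # typically all 20
print("nonzero LASSO coefficients:", np.sum(lasso.coef_ != 0))  # typically close to 3
print("features selected by the LASSO:", np.flatnonzero(lasso.coef_))
```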

In pictures...

[Figure: Influence of the $\ell_1$ regularization]



In maths...

The LASSO is a variant of the least squares method for linear regression in which the regularization takes the form of a penalization of the $\ell_1$-norm of the model parameters. More precisely, given a training set $\{(\g x_i, y_i)\}_{i=1}^N \in (\R^d \times \R)^N$, the LASSO estimates the parameters $\g w \in\R^d$ and $b\in\R$ of the affine model
$$ f(\g x) = \g w^T \g x + b $$
by minimizing the sum of squared errors while constraining the $\ell_1$-norm of $\g w$:
\begin{align} (\hat{\g w}, \hat{b}) = \arg\min_{\g w\in\R^d, b\in \R} \ & \sum_{i=1}^N (y_i - \g w^T\g x_i - b)^2 \\ \mbox{s.t. } & \|\g w\|_1 \leq t. \end{align}
This can be reformulated in Lagrangian form as
$$ (\hat{\g w}, \hat{b}) = \arg\min_{\g w\in\R^d, b\in \R} \ \sum_{i=1}^N (y_i - \g w^T\g x_i - b)^2 + \lambda \|\g w\|_1 , $$
where $\lambda>0$ (or $t>0$) is the hyperparameter that tunes the trade-off between the fit to the data and the penalization of the parameter magnitudes. Note that the offset $b$ is not penalized.
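
For concreteness, the Lagrangian form can be transcribed almost literally with a generic convex optimization library; the sketch below (assuming CVXPY is available, with synthetic data and an arbitrary $\lambda$) is meant to mirror the formula rather than to be an efficient solver.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(1)
N, d = 50, 10
X = rng.standard_normal((N, d))
y = X @ rng.standard_normal(d) + 0.2 * rng.standard_normal(N)

lam = 5.0           # lambda: trade-off hyperparameter

w = cp.Variable(d)  # parameter vector (penalized)
b = cp.Variable()   # offset (not penalized)

# Sum of squared errors plus lambda times the l1-norm of w, as in the Lagrangian form.
objective = cp.Minimize(cp.sum_squares(X @ w + b - y) + lam * cp.norm(w, 1))
cp.Problem(objective).solve()

print("w_hat =", np.round(w.value, 3))
print("b_hat =", round(float(b.value), 3))
```

In practice, dedicated algorithms such as coordinate descent are preferred; for instance, scikit-learn's `Lasso` minimizes $\frac{1}{2N}\sum_{i=1}^N(y_i - \g w^T\g x_i - b)^2 + \alpha\|\g w\|_1$, so its `alpha` corresponds to $\lambda/(2N)$ in the notation used here.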

In general, the LASSO is more computationally demanding than the least squares method or ridge regression, since it is formulated as a quadratic programming problem without a closed-form solution. However, for observation matrices $\g X = [\g x_1^T, \dots, \g x_N^T]^T$ with centered, orthonormal columns, there is a closed-form solution based on the soft-thresholding operator (see below).
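
As an illustration of such an iterative scheme, here is a minimal NumPy sketch of proximal gradient descent (ISTA) applied to the Lagrangian objective; it assumes the offset is handled by centering the data, and the data and $\lambda$ in the usage example are arbitrary.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (component-wise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=5000):
    """Minimize ||y - X w - b 1||^2 + lam * ||w||_1 by proximal gradient descent."""
    # For a fixed w, the optimal (unpenalized) offset is b = mean(y) - mean(X) @ w,
    # so the problem reduces to the centered data.
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean

    # Step size 1/L, where L = 2 * largest eigenvalue of Xc^T Xc is the Lipschitz
    # constant of the gradient of the smooth (squared error) part.
    L = 2 * np.linalg.norm(Xc, 2) ** 2
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2 * Xc.T @ (Xc @ w - yc)            # gradient of the squared errors
        w = soft_threshold(w - grad / L, lam / L)  # proximal step on lam * ||.||_1
    b = y_mean - x_mean @ w
    return w, b

# Tiny usage example on synthetic data.
rng = np.random.default_rng(2)
X = rng.standard_normal((100, 8))
y = X[:, 0] - 2 * X[:, 3] + 0.1 * rng.standard_normal(100)
w_hat, b_hat = lasso_ista(X, y, lam=10.0)
print(np.round(w_hat, 3), round(float(b_hat), 3))
```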

The major benefit of the $\ell_1$ regularization is its feature selection property. Indeed, penalizing the $\ell_1$-norm of a vector is known to yield sparse solutions, i.e., solutions with few nonzero entries. Thus, the model can be rewritten as
$$ f(\g x) = \sum_{j : w_j\neq 0} w_j x_j + b $$
with only a subset of the input variables $x_j$ influencing its output.
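
As a small sanity check of this rewriting, the snippet below (plain NumPy, with an arbitrary sparse parameter vector) verifies that the prediction depends only on the selected subset of input variables:

```python
import numpy as np

w = np.array([0.0, 2.5, 0.0, -1.2, 0.0])  # a sparse LASSO-like estimate
b = 0.7
x = np.array([0.3, -1.0, 4.2, 0.5, -2.0])

support = np.flatnonzero(w)             # {j : w_j != 0}
full = w @ x + b                        # f(x) using all input variables
reduced = w[support] @ x[support] + b   # f(x) using only the selected ones
print(support, np.isclose(full, reduced))
```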

Soft-thresholding

If the observation matrix $\g X = [\g x_1^T, \dots, \g x_N^T]^T$ has centered and orthonormal columns, i.e., if $\g X^T \g 1 = \g 0$ and $\g X^T \g X = \g I$, the LASSO estimate $\hat{\g w}$ can be readily computed by soft-thresholding the least squares estimate
$$ \begin{bmatrix}\hat{\g w}_{LS} \\ \hat{b}_{LS}\end{bmatrix} = ([\g X,\ \g 1]^T [\g X,\ \g 1])^{-1} [\g X,\ \g 1]^T \g y $$
as
$$ \hat{\g w} = \sign( \hat{\g w}_{LS} ) \left( \left|\hat{\g w}_{LS}\right| - \frac{\lambda}{2}\right)_+ ,\qquad \hat{b} = \hat{b}_{LS} $$
or, component-wise:
$$ \hat{w}_j = \begin{cases} \sign(\hat{w}_{LS,j}) \left( \left|\hat{w}_{LS,j}\right| - \frac{\lambda}{2}\right), & \mbox{if } |\hat{w}_{LS,j}| > \frac{\lambda}{2} \\ 0, &\mbox{otherwise,} \end{cases} $$
where the threshold $\lambda/2$ stems from the Lagrangian objective above, which uses the plain (unhalved) sum of squared errors.
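
As a numerical check of this special case, the sketch below (NumPy plus scikit-learn, with arbitrarily chosen data and $\lambda$) constructs an observation matrix with centered orthonormal columns, soft-thresholds the least squares estimate at $\lambda/2$, and compares the result to a generic LASSO solver; the two should agree up to solver tolerance.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
N, d = 200, 5

# Observation matrix with centered, orthonormal columns (QR of a centered matrix).
A = rng.standard_normal((N, d))
Q, _ = np.linalg.qr(A - A.mean(axis=0))  # Q^T Q = I and Q^T 1 = 0

w_true = np.array([2.0, -1.5, 0.0, 0.0, 0.5])
y = Q @ w_true + 0.3 + 0.05 * rng.standard_normal(N)

lam = 0.4  # lambda in the Lagrangian objective

# Least squares estimate: with Q^T Q = I and Q^T 1 = 0, it decouples.
w_ls = Q.T @ y
b_ls = y.mean()

# Soft-thresholding at lambda/2 (the objective uses the plain sum of squared errors).
w_soft = np.sign(w_ls) * np.maximum(np.abs(w_ls) - lam / 2, 0.0)

# Cross-check with coordinate descent: scikit-learn minimizes
# (1/(2N)) * SSE + alpha * ||w||_1, so alpha = lambda / (2N).
model = Lasso(alpha=lam / (2 * N), fit_intercept=True, tol=1e-10, max_iter=100000)
model.fit(Q, y)

print("soft-thresholded LS :", np.round(w_soft, 4))
print("coordinate descent  :", np.round(model.coef_, 4))
print("intercepts          :", round(float(b_ls), 4), round(float(model.intercept_), 4))
```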