The least squares method

In words...

The least squares method is perhaps the most popular method for linear regression. It is based on the search for the linear model that minimizes the sum of squared errors over the training set. As such, it can be interpreted as a direct application of the ERM principle.

Under the assumption of a linear model with additive Gaussian noise, the least squares method can be equivalently recovered by application of the maximum likelihood inference principle.

A major advantage of this method is its computational efficiency, since it requires only the solution to a linear system of equations.
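As an illustration, here is a minimal NumPy sketch (with arbitrary, made-up data) in which the least squares model is obtained from a single call to a standard solver; the data values have no particular meaning and only serve as an example.

```python
import numpy as np

# Hypothetical toy data: N = 5 points in dimension d = 2 (arbitrary values).
X = np.array([[1.0, 0.0],
              [2.0, 1.0],
              [3.0, 1.0],
              [4.0, 2.0],
              [5.0, 3.0]])              # observation matrix, one input vector per row
y = np.array([0.5, 1.2, 1.4, 2.1, 2.4])  # target values

# np.linalg.lstsq minimizes ||y - X w||_2^2 over w,
# i.e., it solves the least squares problem directly.
w_hat, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(w_hat)
```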

If one wants to learn an affine model rather than a linear one, it suffices to apply the method to extended input vectors, obtained by appending a constant component equal to 1 to each input; the weight associated with this constant component then plays the role of the offset. A sketch of this extension is given below.
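For instance, the following NumPy sketch (with arbitrary toy data) fits an affine model to one-dimensional inputs by appending a constant component equal to 1 to each input:

```python
import numpy as np

# Hypothetical 1D data roughly following t(x) = x/2 + 2 (arbitrary values).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 2.4, 3.1, 3.4, 4.0])

# Extend each input with a constant component equal to 1,
# so that w = [slope, offset] parametrizes an affine model.
X_ext = np.column_stack([x, np.ones_like(x)])

slope, offset = np.linalg.lstsq(X_ext, y, rcond=None)[0]
print(slope, offset)   # approximately 0.5 and 2
```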

In pictures...

One-dimensional least squares regression

Here is an example of linear regression in dimension 1, i.e., with a single input variable $x$.

The target function is $$t(x) = \frac{x}{2}.$$ With $N$ data points in the training set, the least squares method yields a linear model of the form

$f(x) = \hat{w}\, x$

where the estimated slope $\hat{w}$ is computed from the data points shown in the plot.

You can click in the plot to add data points and see how the least squares model reacts.




Affine regression with least squares

Here is an example of affine regression in dimension 1, i.e., with a single input variable $x$ but two parameters (the slope of the line and the offset at the origin).

The target function is $$t(x) = \frac{x}{2} + 2.$$ With $N$ data points in the training set, the least squares method yields an affine model of the form

$f(x) = \hat{w}_1\, x + \hat{w}_2$

where the estimated slope $\hat{w}_1$ and offset $\hat{w}_2$ are computed from the data points shown in the plot.

You can click in the plot to add data points and see how the least squares model reacts.




In maths...

The least squares method for linear regression estimates the parameter vector $\g w \in\R^d$ of linear models $$ f(\g x) = \g w^T \g x $$ by minimizing the sum of squared errors over the training set $\{(\g x_i, y_i)\}_{i=1}^N \in (\R^d \times \R)^N$: $$ \hat{\g w} = \arg\min_{\g w\in\R^d} \ \sum_{i=1}^N (y_i - \g w^T\g x_i)^2 . $$

The computational efficiency of the least squares method comes from the fact that this convex optimization problem has a closed form solution. Rewrite its objective function as $$ J(\g w) = \sum_{i=1}^N (y_i - \g w^T\g x_i)^2 = \|\g y - \g X\g w\|_2^2 = (\g y - \g X\g w)^T (\g y - \g X\g w) = \g w^T\g X^T\g X \g w - 2\g w^T\g X^T\g y + \g y^T\g y $$ with the target vector and the observation matrix defined as $$ \g y = \begin{bmatrix}y_1\\\vdots\\y_N\end{bmatrix} \in\R^N,\quad \mbox{ and }\quad \g X = \begin{bmatrix}\g x_1^T\\\vdots\\\g x_N^T\end{bmatrix} \in\R^{N\times d} . $$ Since the objective function $J$ is convex, we know that its gradient cancels at the optimum: $$ \nabla J (\hat{\g w}) = 2\g X^T\g X \hat{\g w} - 2 \g X^T\g y = \g 0 $$ which implies that $$ \g X^T\g X \hat{\g w} = \g X^T\g y. $$ Thus, the least squares solution $ \hat{\g w}$ can be found by solving a linear system, which can be performed very efficiently. In addition, we also have the following closed form for the solution.
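The following NumPy sketch (with randomly generated toy data) forms the normal equations $\g X^T\g X \hat{\g w} = \g X^T\g y$ and solves them with a standard linear solver; the gradient condition above can then be checked numerically:

```python
import numpy as np

# Hypothetical data: N = 6 points in dimension d = 2 (random toy example).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))              # observation matrix (rows are x_i^T)
w_true = np.array([0.5, -1.0])
y = X @ w_true + 0.1 * rng.normal(size=6)

# Form and solve the normal equations X^T X w = X^T y.
A = X.T @ X
b = X.T @ y
w_hat = np.linalg.solve(A, b)

# The gradient of J cancels at the solution: 2 X^T X w_hat - 2 X^T y ≈ 0.
grad = 2 * A @ w_hat - 2 * b
print(w_hat, np.linalg.norm(grad))
```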

If the number of data points $N$ is greater than the dimension $d$ and the matrix $\g X$ has full rank ($\mbox{rank}(\g X) = d$), then the least squares solution can be expressed in terms of the matrix inverse of $\g X^T\g X$ as $$ \hat{\g w} = (\g X^T\g X )^{-1} \g X^T\g y. $$
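As a sketch (again with arbitrary random data), the closed form can be evaluated directly; in numerical practice, solving the linear system is usually preferred to forming the explicit inverse, although both coincide here:

```python
import numpy as np

# Random toy data with N = 10 > d = 3 (full rank with probability 1).
rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))
y = X @ np.array([1.0, 0.0, -2.0]) + 0.05 * rng.normal(size=10)

# Closed-form solution (X^T X)^{-1} X^T y, written exactly as in the formula above.
w_closed = np.linalg.inv(X.T @ X) @ X.T @ y

# Solving the linear system avoids the explicit inverse and is numerically safer.
w_solve = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(w_closed, w_solve))  # True
```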

Maximum likelihood inference with additive Gaussian noise

The least squares method can be recovered by application of the maximum likelihood inference principle under the assumption of an additive Gaussian noise model $$ Y = \g w^T X + V $$ with a zero-mean Gaussian noise term $V$ of probability density function $$ p_V(v) = \frac{1}{\sigma_V\sqrt{2\pi}}\exp\left( \frac{-v^2}{2\sigma_V^2} \right) . $$

In the continuous case of regression, the likelihood is $\prod_{i=1}^N p_{Y|X}(y_i\ |\ \g x_i ) p_X(\g x_i)$, where the additive Gaussian noise model leads to $$ p_{Y|X}(y\ |\ \g x ) = \frac{1}{\sigma_V\sqrt{2\pi}}\exp\left( \frac{-(y-\g w^T \g x)^2}{2\sigma_V^2} \right) . $$ Thus, the maximization of the log-likelihood can be expressed as $$ \max_{\g w\in\R^d} \ \sum_{i=1}^N \frac{-(y_i-\g w^T \g x_i)^2}{2\sigma_V^2} + \log \frac{1}{\sigma_V\sqrt{2\pi}} + \log p_X(\g x_i) . $$ Removing the terms and positive factors that do not depend on $\g w$ from the objective function, we get $$ \hat{\g w} = \arg\max_{\g w\in\R^d} \ - \sum_{i=1}^N (y_i - \g w^T\g x_i)^2 , $$ which is equivalent to the minimization of the sum of squared errors presented above.
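To illustrate this equivalence numerically, here is a small sketch (with synthetic data and an arbitrary noise level) that maximizes the Gaussian log-likelihood with a generic optimizer and compares the result to the least squares solution obtained from the normal equations:

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic data generated from the additive Gaussian noise model (toy values).
rng = np.random.default_rng(2)
N, d, sigma_V = 50, 2, 0.3
X = rng.normal(size=(N, d))
w_true = np.array([0.5, -1.5])
y = X @ w_true + sigma_V * rng.normal(size=N)

# Negative log-likelihood of w under the Gaussian noise model
# (terms that do not depend on w are dropped, as they do not affect the argmax).
def neg_log_likelihood(w):
    return np.sum((y - X @ w) ** 2) / (2 * sigma_V ** 2)

w_mle = minimize(neg_log_likelihood, x0=np.zeros(d)).x
w_ls = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(w_mle, w_ls, atol=1e-4))  # the two estimates coincide
```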