Linear regression

In words...

Linear regression refers to the case where the model to be learned is a linear or affine function of the input vector. This ambiguity in terminology comes from the fact that an affine model can always be learned as a linear model in an extended input space.

Linear models have an obvious algorithmic advantage over nonlinear ones due to their simplicity. In addition, they offer a (rather extreme, though only partial) answer to the issue of overfitting.

Many data sets can be approximated sufficiently well by a linear model, and this has been done in many fields for many years. However, more complex problems call for nonlinear regression methods.

Perhaps the simplest (and certainly the most popular) algorithm for linear regression is the least squares method, based on the squared loss function. This method also forms the basis of more advanced approaches such as ridge regression or the LASSO.
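
As a quick illustration (not part of the original text), here is a minimal least squares sketch in Python/NumPy; the toy data set, the noise level, and the variable names are assumptions made for the example.

```python
import numpy as np

# Illustrative sketch: ordinary least squares on a synthetic data set.
# X is an N x d matrix of inputs, y a vector of N targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)

# Least squares: minimize the sum of squared errors ||X w - y||^2.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_hat)  # close to w_true
```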

Other popular methods typically rely on a different loss function, such as the absolute loss used in the least absolute deviation method or the $\epsilon$-insensitive loss used in support vector regression.
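
To make the comparison concrete, here is a small sketch of these loss functions written in terms of the residual $r = y - f(\g x)$; the value of $\epsilon$ below is an arbitrary choice for illustration.

```python
import numpy as np

def squared_loss(r):
    # Loss used by the least squares method.
    return r ** 2

def absolute_loss(r):
    # Loss used by the least absolute deviation method.
    return np.abs(r)

def eps_insensitive_loss(r, eps=0.1):
    # Loss used in support vector regression:
    # zero inside the tube |r| <= eps, linear outside of it.
    return np.maximum(np.abs(r) - eps, 0.0)
```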

In pictures...

One-dimensional linear regression

Here is an example of linear regression in dimension 1, i.e., with a single input variable $x$.
In this case, the linear model is just a straight line passing through the cloud of data points.

Since it does not pass through the origin, this linear model is actually an affine model with two parameters: the slope $w = 0.5$ and the intercept (or bias) $b = 2$.
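
The figure itself is not reproduced here, but a minimal sketch along these lines (with synthetic noisy data, an assumption made for illustration) recovers parameters close to those of the plotted line.

```python
import numpy as np

# Hypothetical reconstruction of the 1-D example: points scattered around
# the affine model f(x) = 0.5 * x + 2 (slope w = 0.5, bias b = 2).
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 0.5 * x + 2 + 0.3 * rng.normal(size=x.size)

# Fit a straight line (degree-1 polynomial) by least squares.
w, b = np.polyfit(x, y, deg=1)
print(w, b)  # approximately 0.5 and 2
```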

In maths...

Linear regression in $\X\subseteq\R^d$ uses linear models of the form $$ f(\g x) = \g w^T \g x $$ with the parameter vector $\g w \in\R^d$.

But linear regression also often refers to regression based on affine models of the form $$ f(\g x) = \g w^T \g x + b $$ with an additional parameter $b\in\R$.

Affine models seen as linear models

If a particular learning algorithm works for linear regression, it can also be applied to learn affine models in a straightforward manner.

The technique consists in applying the algorithm to extended input vectors $$ \tilde{\g x} = \begin{bmatrix}\g x\\1\end{bmatrix}, $$ in which we append a constant component equal to 1.
Then, affine regression in $\R^d$ from a data set $\{(\g x_i, y_i)\}_{i=1}^N \in (\R^d \times \R)^N$ can be posed as a purely linear regression problem in $\R^{d+1}$ with data set $$ \{(\tilde{\g x}_i, y_i)\}_{i=1}^N \in (\R^{d+1} \times \R)^N . $$ A linear algorithm applied to this data set returns a linear model $\tilde{f}$ with a parameter vector $\tilde{\g w}\in\R^{d+1}$, from which the parameters, $\g w \in\R^d$ and $b\in\R$, of the affine model $f$ can be recovered as $$ \tilde{\g w} = \begin{bmatrix}\g w\\b\end{bmatrix} $$ due to $$ \tilde{f}(\tilde{\g x}) = \tilde{\g w}^T \tilde{\g x} = \g w^T \g x + b = f(\g x) . $$
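
As a quick illustration of this reduction, here is a sketch under the assumption of a synthetic data set and a least squares solver (neither is prescribed by the text); the technique itself only requires appending the constant component and splitting the resulting parameter vector.

```python
import numpy as np

# Sketch of the reduction: affine regression in R^d as linear regression in R^(d+1).
# The data set below is synthetic and only serves as an illustration.
rng = np.random.default_rng(2)
N, d = 200, 4
X = rng.normal(size=(N, d))
w_true, b_true = rng.normal(size=d), 1.5
y = X @ w_true + b_true + 0.05 * rng.normal(size=N)

# Extended input vectors: append a constant component equal to 1 to each x_i.
X_tilde = np.hstack([X, np.ones((N, 1))])

# Solve the purely linear problem in R^(d+1), here by least squares.
w_tilde, *_ = np.linalg.lstsq(X_tilde, y, rcond=None)

# Recover the affine parameters: w is the first d components, b the last one.
w_hat, b_hat = w_tilde[:d], w_tilde[-1]
print(w_hat, b_hat)  # close to w_true and b_true
```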