# Polynomial regression

## In words...

Polynomial regression is a special case of nonlinear regression in which the model takes the form of a polynomial function. Its estimation is typically performed by applying the least squares method in a feature space to which the data are mapped as vectors of monomials.

Depending on the chosen degree of the polynomial, the model is more or less flexible and approximates nonlinear target functions with more or less accuracy. However, keep in mind that the larger the degree, the larger the training set must be to allow for an accurate estimation of the parameters, since the number of parameters of a polynomial model grows quickly with the degree.

The degree of the polynomial is the main hyperparameter of the method; in practice, it is typically tuned by cross-validation.
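The degree selection described above can be sketched with NumPy; the data, the K-fold scheme, and the candidate degrees below are illustrative choices, not fixed by the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D training data from a noisy nonlinear target (assumed example).
x = rng.uniform(-3, 3, 40)
y = np.sin(x) + 0.1 * rng.standard_normal(40)

def cv_error(x, y, degree, n_folds=5):
    """Mean squared validation error of a degree-`degree` polynomial fit,
    estimated by simple K-fold cross-validation."""
    idx = np.arange(len(x))
    folds = np.array_split(rng.permutation(idx), n_folds)
    errors = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)
        coeffs = np.polyfit(x[train], y[train], degree)
        pred = np.polyval(coeffs, x[fold])
        errors.append(np.mean((pred - y[fold]) ** 2))
    return np.mean(errors)

# Pick the degree with the smallest cross-validation error.
degrees = range(1, 10)
best_degree = min(degrees, key=lambda n: cv_error(x, y, n))
```

For this target, low-degree fits underfit and very high degrees start tracking the noise, which is exactly the trade-off the cross-validation error captures.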

## In pictures...

### Polynomial regression via least squares

Here is an example of polynomial regression in dimension 1, i.e., with a single input variable $x$.

We approximate the target function $t(x)=\sin(x)$ from $N$ data points in the training set. The least squares method then yields a polynomial model $f(x)$ of the chosen degree.

*(Interactive plot: click in the plot to add data points and see how the model reacts.)*
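A static version of this example can be sketched with NumPy; the sample size, noise level, and degree below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)

# N noisy samples of the target t(x) = sin(x) (illustrative choices).
N = 30
x = np.sort(rng.uniform(0, 2 * np.pi, N))
y = np.sin(x) + 0.05 * rng.standard_normal(N)

# Least squares fit of a degree-5 polynomial model f(x).
coeffs = np.polyfit(x, y, deg=5)
f = np.poly1d(coeffs)

# The fitted model closely tracks sin on the sampled interval.
grid = np.linspace(0, 2 * np.pi, 200)
max_err = np.max(np.abs(f(grid) - np.sin(grid)))
```

Adding or removing data points (as in the interactive plot) simply changes `x` and `y` before the `polyfit` call.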

## In maths...

In polynomial regression, the nonlinear target function is approximated by a polynomial model of degree $n$: $$f (\g x) = \sum_{k=1}^m w_k \g x^{\alpha_k} + w_0$$ where $\{\alpha_k\}_{k=1}^m$ is the set of $m = \sum_{q=1}^n\begin{pmatrix}d + q-1\\q\end{pmatrix}$ multi-indices with $1\leq |\alpha_k| = \sum_{j=1}^d (\alpha_k)_j \leq n$ and $$\g x^{\alpha_k} = \prod_{j=1}^d x_j^{(\alpha_k)_j}$$ are the corresponding monomials.
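As a quick sanity check of the count $m$ of monomials, the sum of binomial coefficients above can be evaluated directly (the helper name is ours, not from the text):

```python
from math import comb

def n_monomials(d, n):
    """Number m of monomials x^alpha with 1 <= |alpha| <= n in d variables,
    i.e. m = sum_{q=1}^{n} C(d + q - 1, q)."""
    return sum(comb(d + q - 1, q) for q in range(1, n + 1))
```

In dimension $d = 1$ this reduces to $m = n$ (the monomials $x, x^2, \dots, x^n$), while for larger $d$ the count, and hence the number of parameters, grows quickly with $n$.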

Here, we consider the parametric case where the degree $n$ is fixed and the $m+1$ parameters, $\g w = [w_0, w_1,\dots,w_m]^T\in\R^{m+1}$, of the model are estimated from a training set $\{(\g x_i, y_i)\}_{i=1}^N$. Applying the ERM principle with the squared loss function, as is standard in regression, this estimation is performed by solving $$\min_{f\in\F^n} \ \frac{1}{N}\sum_{i=1}^N \ell(f(\g x_i), y_i)$$ where the model $f$ is restricted to the class of polynomials of degree $n$: $$\F^n = \left\{ f\in \R^\X\ :\ f (\g x) = \sum_{k=1}^m w_k \g x^{\alpha_k} + w_0, \ \g w \in \R^{m+1} \right\}.$$

This nonlinear regression problem can be posed as a linear one. Consider the explicit nonlinear map \begin{align*} \phi : & \X \rightarrow \X^\phi \\ & \g x \mapsto \phi(\g x) = [1, \g x^{\alpha_1}, \dots, \g x^{\alpha_m}]^T \end{align*} that projects the data into the feature space $\X^\phi$ of monomials $\g x^{\alpha_k}$. In this feature space, we can build a linear model corresponding to a polynomial model in the original input space as $$f(\g x) = \inner{\g w}{\phi(\g x)}_{\X^\phi} = \sum_{k=1}^m w_k \g x^{\alpha_k} + w_0 .$$
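The map $\phi$ can be written out explicitly by enumerating the multi-indices; here is a minimal sketch consistent with the definitions above (the function name and column ordering are our choices):

```python
import numpy as np
from itertools import combinations_with_replacement

def phi(X, n):
    """Map the rows of X (an N x d array) to the monomial features
    [1, x^alpha_1, ..., x^alpha_m] for all multi-indices 1 <= |alpha| <= n."""
    N, d = X.shape
    cols = [np.ones(N)]  # constant feature, paired with the intercept w_0
    for q in range(1, n + 1):
        # Each multiset of variable indices of size q is one monomial of degree q.
        for idx in combinations_with_replacement(range(d), q):
            cols.append(np.prod(X[:, idx], axis=1))
    return np.column_stack(cols)

X = np.array([[1.0, 2.0], [3.0, 4.0]])
Phi = phi(X, 2)  # columns: 1, x1, x2, x1^2, x1*x2, x2^2
```

A linear model $\inner{\g w}{\phi(\g x)}$ on these columns is then exactly the polynomial model in the original input space.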

For instance, using the least squares method for the estimation, we obtain $$\hat{\g w} = (\g \Phi ^T \g \Phi )^{-1} \g \Phi^T \g y ,$$ where $\g y = [y_1,\dots, y_N]^T$ and $\g \Phi = [\phi(\g x_1),\ \dots,\ \phi(\g x_N)]^T$. However, note that the matrix $\g\Phi$ is typically ill-conditioned and that its columns should be normalized in practice (column normalization also implies a rescaling of the coefficients in $\hat{\g w}$ after the estimation).
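A sketch of this estimation with column normalization, for $d = 1$ and degree $4$ (illustrative choices); note that we use a least squares solver instead of explicitly inverting $\g\Phi^T\g\Phi$, which is the numerically safer route:

```python
import numpy as np

rng = np.random.default_rng(2)

# Degree-4 monomial features of 1-D inputs: Phi = [1, x, x^2, x^3, x^4].
x = rng.uniform(0, 10, 50)
y = np.sin(x)
Phi = np.vander(x, 5, increasing=True)

# Normalize each column to unit norm to improve the conditioning of Phi.
norms = np.linalg.norm(Phi, axis=0)
w_scaled, *_ = np.linalg.lstsq(Phi / norms, y, rcond=None)

# Rescale the coefficients to recover the weights of the unnormalized model.
w_hat = w_scaled / norms

# Sanity check: same solution as solving on the raw (worse-conditioned) matrix.
w_raw, *_ = np.linalg.lstsq(Phi, y, rcond=None)
```

The rescaling step `w_scaled / norms` is exactly the coefficient adjustment mentioned above: dividing a column of $\g\Phi$ by a constant multiplies the corresponding weight by it.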