A straightforward technique to extend any linear regression method to the nonlinear case is to apply a preprocessing step mapping the data to a feature space where the problem becomes linear.

For regression, the feature space can be explicitly constructed in a straightforward manner if the structure of the desired nonlinear model is linear with respect to its parameters. Indeed, in this case, the data vectors are mapped to feature vectors with components corresponding to the nonlinear terms of the model.

If the nonlinear structure does not fit this criterion or if it is unknown, then one resorts to function approximation. This means that we have to fix a priori a different but suitable structure for the model, keeping in mind that this structure serves two purposes: it should provide sufficient flexibility to accuractely approximate the true inherent nonlinearities and it should be linear in its parameters to make their estimation easy. A typical example of such an approach is polynomial regression.

However, this simple approach has a number of limitations. The nonlinear mapping must be well-chosen and suitable for the particular problem at hand. In the absence of prior knowledge on the nonlinearities, a polynomial model with a large degree can be sufficiently accurate, but may also lead to both estimation and computational difficulties related to the curse of dimensionality.

Methods that circumvent these issues are often nonparametric methods that estimate both the parameters and the structure of the model from the data. Examples of such approaches are kernel ridge regression and support vector regression.

Consider a nonlinear map
\begin{align*}
\phi : & \X \rightarrow \X^\phi \\
& \g x \mapsto \phi(\g x)
\end{align*}
that projects the data into the

If the nonlinear map $\phi$ can be explicitly computed, then we can apply any linear method to estimate $f$. In particular, by introducing the matrix $$ \g \Phi = \begin{bmatrix} \phi(\g x_1) & \dots &\phi(\g x_N) \end{bmatrix}^T \in \R^{N\times dim(\X^\phi)} $$ the least squares method yields $$ \hat{\g w} = (\g \Phi ^T \g \Phi )^{-1} \g \Phi^T \g y . $$ where $\g y = [y_1,\dots, y_N]^T$.

However, this approach is only suitable for low-dimensional feature spaces $X^\phi$ for two reasons. The first one is that the least squares method requires the inversion of a matrix of size $dim(\X^\phi)\times dim(\X^\phi)$, which becomes computationally intractable for high dimensions. The second one is that the number of parameters to estimates (the dimension of $\g w$) is also equal to $dim(\X^\phi)$ and that we might need far more than $N$ data to estimate these accurately. Note that these two points are also related since with $N < dim(\X^\phi)$ the matrix $\g \Phi ^T \g \Phi$ is rank defficient and not invertible due to $rank(\g \Phi ^T \g \Phi) \leq rank(\g \Phi )\leq \min\{N, dim(\X^\phi)\}$.