A straightforward technique to extend any linear method, such as those used for linear classification or linear regression, is to apply a nonlinear preprocessing to the data.
The linear method can then build a linear model of the data in feature space, i.e., the space of the images of the data points under the nonlinear mapping that implements this preprocessing.
However, this simple approach has a number of limitations. The nonlinear mapping must be well chosen and suited to the particular problem at hand. It should also remain computationally simple, so that training and predictions stay efficient. Finally, it should not give rise to an excessively high-dimensional feature space, in which the linear method might have difficulties fitting the model due to the curse of dimensionality, here meaning a lack of data relative to the dimension.
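To get a feel for this last limitation, here is a minimal sketch (in Python, used purely for illustration; the helper `poly_feature_dim` is ours, not from any library) that counts the features produced by an explicit polynomial preprocessing of degree $d$ on inputs of dimension $n$, which is $\binom{n+d}{d}$:

```python
from math import comb

# Dimension of the feature space of an explicit polynomial map:
# the number of monomials of degree at most d in n variables,
# i.e., C(n + d, d) (constant term included).
def poly_feature_dim(n, d):
    return comb(n + d, d)

for n in (2, 10, 100):
    for d in (2, 3, 5):
        print(f"n = {n:3d}, d = {d}: {poly_feature_dim(n, d):>12,} features")
```

Already for $n = 100$ and $d = 5$, this yields nearly $10^8$ monomial features.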
A convenient manner of circumventing these issues is provided by the kernel trick.
Consider a generic linear method implementing a linear (or affine) function of the input vectors:
$$
f(\g x) = \g w^T \g x + b.
$$
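As a concrete (hypothetical) instance of such a linear method, the following minimal sketch fits $f(\g x) = \g w^T \g x + b$ by ordinary least squares on toy one-dimensional data; the data values and variable names are only for illustration:

```python
import numpy as np

# Toy 1-D data (hypothetical values, for illustration only).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.5, 1.9, 3.1, 4.2])

# Fit f(x) = w^T x + b by ordinary least squares, absorbing the
# intercept b through an appended column of ones.
X_aff = np.hstack([X, np.ones((X.shape[0], 1))])
coef, *_ = np.linalg.lstsq(X_aff, y, rcond=None)
w, b = coef[:-1], coef[-1]

def f(x):
    return w @ x + b

print(f(np.array([1.5])))  # prediction of the affine model at x = 1.5
```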
With a nonlinear map
\begin{align*}
\phi : & \X \rightarrow \X^\phi \\
& \g x \mapsto \phi(\g x) ,
\end{align*}
the data are projected into the feature space $\X^\phi$, in which the linear method now computes
$$
f(\g x) = \inner{\g w}{\phi(\g x)}_{\X^\phi} + b .
$$
Note that, for the sake of generality, we dropped the matrix notation of the dot product and now use $\inner{\cdot}{\cdot}_{\X^\phi}$ to denote the standard inner product in $\X^\phi$. In particular, this notation paves the way for infinite-dimensional feature spaces.
Summing up, by choosing a nonlinear map $\phi$, we can train a nonlinear model $$ f(\g x) = \inner{\g w}{\phi(\g x)}_{\X^\phi} + b $$ with a linear method.
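The following sketch makes this explicit for a hypothetical cubic feature map $\phi(x) = (x, x^2, x^3)$ on one-dimensional inputs, reusing the same least-squares procedure as above; the resulting model is linear in $\g w$ but nonlinear in $x$. This only illustrates the explicit-feature-map route and does not yet use the kernel trick.

```python
import numpy as np

# Hypothetical feature map phi: R -> R^3, phi(x) = (x, x^2, x^3),
# chosen only for illustration.
def phi(x):
    return np.array([x, x**2, x**3])

# Toy 1-D data from a nonlinear target (a sine, for illustration).
x = np.linspace(-2.0, 2.0, 30)
y = np.sin(2.0 * x)

# The same least-squares procedure as before, applied to the images
# phi(x_i): fit f(x) = <w, phi(x)> + b.
Phi = np.array([phi(xi) for xi in x])
Phi_aff = np.hstack([Phi, np.ones((len(x), 1))])
coef, *_ = np.linalg.lstsq(Phi_aff, y, rcond=None)
w, b = coef[:-1], coef[-1]

def f(x_new):
    return w @ phi(x_new) + b

# Linear in w, nonlinear in x: compare the prediction to the target.
print(f(0.5), np.sin(2.0 * 0.5))
```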