A straightforward technique to extend any linear method, such as those used for linear classification or linear regression, is to apply a nonlinear preprocessing to the data.
The linear method can then build a linear model of the data in feature space, i.e., the space of the images of the data points under the nonlinear mapping that implements this preprocessing.
However, this simple approach has a number of limitations. The nonlinear mapping must be well chosen and suited to the particular problem at hand. It should also remain computationally simple, so that training and predictions stay efficient. Finally, it should not give rise to an excessively high-dimensional feature space, in which the linear method might have difficulties fitting the model due to the curse of dimensionality, here meaning a lack of data relative to the dimension.
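To get a feel for this last limitation, here is a minimal sketch (in Python, used purely for illustration; the helper `poly_feature_dim` is ours, not from any library) that counts the features produced by an explicit polynomial preprocessing of degree $d$ on inputs of dimension $n$, which is $\binom{n+d}{d}$:

```python
from math import comb

# Dimension of the feature space of an explicit polynomial map:
# the number of monomials of degree at most d in n variables,
# i.e., C(n + d, d) (constant term included).
def poly_feature_dim(n, d):
    return comb(n + d, d)

for n in (2, 10, 100):
    for d in (2, 3, 5):
        print(f"n = {n:3d}, d = {d}: {poly_feature_dim(n, d):>12,} features")
```

Already for $n = 100$ and $d = 5$, this yields nearly $10^8$ monomial features.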
A convenient manner of circumventing these issues is provided by the kernel trick.
Consider a generic linear method implementing a linear (or affine) function of the input vectors:
$$
f(\g x) = \g w^T \g x + b.
$$
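As a concrete (hypothetical) instance of such a linear method, the following minimal sketch fits $f(\g x) = \g w^T \g x + b$ by ordinary least squares on toy one-dimensional data; the data values and variable names are only for illustration:

```python
import numpy as np

# Toy 1-D data (hypothetical values, for illustration only).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.5, 1.9, 3.1, 4.2])

# Fit f(x) = w^T x + b by ordinary least squares, absorbing the
# intercept b through an appended column of ones.
X_aff = np.hstack([X, np.ones((X.shape[0], 1))])
coef, *_ = np.linalg.lstsq(X_aff, y, rcond=None)
w, b = coef[:-1], coef[-1]

def f(x):
    return w @ x + b

print(f(np.array([1.5])))  # prediction of the affine model at x = 1.5
```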
With a nonlinear map
\begin{align*}
\phi : & \X \rightarrow \X^\phi \\
& \g x \mapsto \phi(\g x) ,
\end{align*}
the data are projected into the feature space $\X^\phi$, in which the linear method now computes
$$
f(\g x) = \inner{\g w}{\phi(\g x)}_{\X^\phi} + b .
$$
Note that, for the sake of generality, we dropped the matrix notation of the dot product and now use $\inner{\cdot}{\cdot}_{\X^\phi}$ to denote the standard inner product in $\X^\phi$. In particular, this notation paves the way for infinite-dimensional feature spaces.
Summing up, by choosing a nonlinear map $\phi$, we can train a nonlinear model $$ f(\g x) = \inner{\g w}{\phi(\g x)}_{\X^\phi} + b $$ with a linear method.
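The following sketch makes this explicit for a hypothetical cubic feature map $\phi(x) = (x, x^2, x^3)$ on one-dimensional inputs, reusing the same least-squares procedure as above; the resulting model is linear in $\g w$ but nonlinear in $x$. This only illustrates the explicit-feature-map route and does not yet use the kernel trick.

```python
import numpy as np

# Hypothetical feature map phi: R -> R^3, phi(x) = (x, x^2, x^3),
# chosen only for illustration.
def phi(x):
    return np.array([x, x**2, x**3])

# Toy 1-D data from a nonlinear target (a sine, for illustration).
x = np.linspace(-2.0, 2.0, 30)
y = np.sin(2.0 * x)

# The same least-squares procedure as before, applied to the images
# phi(x_i): fit f(x) = <w, phi(x)> + b.
Phi = np.array([phi(xi) for xi in x])
Phi_aff = np.hstack([Phi, np.ones((len(x), 1))])
coef, *_ = np.linalg.lstsq(Phi_aff, y, rcond=None)
w, b = coef[:-1], coef[-1]

def f(x_new):
    return w @ phi(x_new) + b

# Linear in w, nonlinear in x: compare the prediction to the target.
print(f(0.5), np.sin(2.0 * 0.5))
```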