The kernel trick

In words...

The kernel trick is a convenient technique for extending linear methods to the nonlinear case. Compared with explicitly applying a nonlinear mapping to a feature space, it avoids both the computational burden of working in high-dimensional feature spaces and the curse of dimensionality, i.e., the lack of data required to learn a function in such spaces.

The trick applies to any linear method that involves the input vectors only through dot products between data points. In that case, the method can be extended to the nonlinear setting by substituting a kernel function for these dot products. The most famous such method, and the one that popularized the kernel trick, is the support vector machine.

In pictures...

Kernel functions allow for faster computation of dot products in feature space

Here, we illustrate the computational benefits of kernel functions for building polynomial models of a given degree $D$ when compared to linear models taking a vector of monomials as input.

The plot below shows the computational complexity (number of floating-point operations) for computing dot products in the feature space induced by the nonlinear mapping of the input vectors to the concatenation of all monomials of maximal degree $D$, either explicitly as $\inner{\phi(\g x)}{\phi(\g x^\prime)}$ or implicitly via the polynomial kernel function $K(\g x, \g x^\prime) = (\g x^T\g x^\prime + 1)^{D}$.

For a $d$-dimensional input space and a degree $D$, the explicit computation of a dot product in feature space requires a number of operations growing with the number of monomials of degree at most $D$, which is $\binom{d+D}{D}$, while the kernel function can be evaluated with only about $2d + D$ operations (a dot product in the input space followed by an exponentiation), which can be several orders of magnitude faster.

Hover over the curves to see how this difference increases with the input space dimension; then try different values for the degree $D$.
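The same comparison can be reproduced with a short computation. The Python sketch below counts operations under a rough cost model (one multiplication and one addition per monomial for the explicit dot product; a dot product, one addition, and an exponentiation for the kernel); the function names and the flop model are ours, chosen for illustration.

```python
from math import comb

def explicit_dot_flops(d, D):
    """Rough flop count for <phi(x), phi(x')> with phi(x) containing all
    monomials of degree at most D in d variables (one multiply and one add
    per monomial), ignoring the cost of building phi(x) itself."""
    return 2 * comb(d + D, D)

def kernel_flops(d, D):
    """Rough flop count for K(x, x') = (x^T x' + 1)^D: a dot product in the
    input space (about 2d flops), one addition, and D - 1 multiplications."""
    return 2 * d + 1 + (D - 1)

for D in (2, 3, 5):
    for d in (2, 10, 100, 1000):
        e, k = explicit_dot_flops(d, D), kernel_flops(d, D)
        print(f"D={D}  d={d:5d}  explicit={e:14d}  kernel={k:5d}  ratio={e / k:12.1f}")
```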

In maths...

Consider a nonlinear model based on a nonlinear mapping in feature space: $$ f(\g x) = \inner{\g w}{\phi(\g x)}_{\X^\phi} + b . $$

If the parameter vector $\g w$ of the model $f$ can be expressed as a linear combination of the data mapped in feature space, $$ \g w = \sum_{i=1}^N \beta_i \phi(\g x_i) , $$ then the model $f$ involves only inner products as $$ f(\g x) = \sum_{i=1}^N \beta_i\inner{ \phi(\g x_i)}{\phi(\g x)}_{\X^\phi} + b . $$

Assume for now that there is a function $K : \X\times \X \rightarrow \R$ that can be evaluated efficiently and satisfies $$ \forall (\g x, \g x^\prime) \in \X^2,\quad K(\g x, \g x^\prime) = \inner{ \phi(\g x)}{\phi(\g x^\prime)}_{\X^\phi} , $$ i.e., that computes inner products in feature space without explicitly working in that space. Then, we can reformulate the model as $$ f(\g x) = \sum_{i=1}^N \beta_i K(\g x_i, \g x) + b . $$ Observe that in this formulation both the nonlinear map $\phi$ and the induced feature space $\X^\phi$ become implicit, so that no computation needs to be performed in feature space.
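As a quick illustration of this reformulation, here is a minimal Python sketch that evaluates $f(\g x) = \sum_{i=1}^N \beta_i K(\g x_i, \g x) + b$ with the polynomial kernel from the picture above; the names and values are ours, and the coefficients $\beta_i$ and offset $b$ are placeholders standing in for the output of some training algorithm.

```python
import numpy as np

def poly_kernel(x, x_prime, degree=2):
    """Inhomogeneous polynomial kernel K(x, x') = (x^T x' + 1)^D."""
    return (np.dot(x, x_prime) + 1.0) ** degree

def f(x, X_train, beta, b, degree=2):
    """Kernelized model f(x) = sum_i beta_i * K(x_i, x) + b.
    Neither phi nor the feature space appears anywhere."""
    return sum(beta_i * poly_kernel(x_i, x, degree)
               for x_i, beta_i in zip(X_train, beta)) + b

# Toy data: N = 3 points in a 2-dimensional input space.
X_train = np.array([[1.0, 2.0], [0.5, -1.0], [-2.0, 0.0]])
beta = np.array([0.3, -0.7, 0.1])   # placeholder coefficients
b = 0.05                            # placeholder offset

print(f(np.array([1.0, 1.0]), X_train, beta, b))
```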
In addition, if a linear learning algorithm only involves the data points via dot products between points, $\g x_i^T\g x_j$, it can be extended to the nonlinear case by replacing these dot products by $K(\g x_i, \g x_j)$.
This is the basis of the kernel trick. The rest of the trick is that:

  1. we know that such functions $K$ exist; they are called kernel functions;
  2. we know that, for a class of learning machines (such as SVMs), the solution to the training problem can be expressed as a linear combination just like $\g w$ above, as illustrated below.
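To make point 2 concrete without reproducing the machinery of SVMs, here is a minimal sketch based on kernel ridge regression, a simpler kernelized linear method (not discussed above) whose solution also has the form $\g w = \sum_{i=1}^N \beta_i \phi(\g x_i)$: under this assumption, training only needs the matrix of kernel values $K(\g x_i, \g x_j)$, and prediction follows the kernelized expression of $f$ (without an offset $b$ here).

```python
import numpy as np

def poly_kernel_matrix(X, Z, degree=2):
    """Gram matrix of the polynomial kernel: K[i, j] = (x_i^T z_j + 1)^D."""
    return (X @ Z.T + 1.0) ** degree

# Toy 1-D regression data with a nonlinear target.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 1))
y = X[:, 0] ** 2 - 0.5 * X[:, 0] + 0.1 * rng.standard_normal(50)

# Kernel ridge regression in dual form: beta = (K + lambda * I)^{-1} y.
# Only kernel evaluations are needed; phi never appears.
lam = 1e-2
K = poly_kernel_matrix(X, X, degree=2)
beta = np.linalg.solve(K + lam * np.eye(len(X)), y)

# Prediction at new points: f(x) = sum_i beta_i K(x_i, x).
X_test = np.linspace(-1, 1, 5).reshape(-1, 1)
y_pred = poly_kernel_matrix(X_test, X, degree=2) @ beta
print(np.round(y_pred, 3))
```

Note that the same code runs unchanged with any other kernel function, since the algorithm only ever accesses the data through kernel evaluations.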

Example: the polynomial kernel

Recall the nonlinear mapping used as an example to introduce feature spaces, $$ \phi: \begin{bmatrix}x_1\\x_2 \end{bmatrix} \mapsto \begin{bmatrix} x_1^2\\\sqrt{2}x_1x_2\\x_2^2 \end{bmatrix} , $$ and let us compute a dot product in the induced feature space (here, $\R^3$): \begin{align} \inner{\phi(\g x)}{\phi(\g x^\prime)} &= \inner{ \begin{bmatrix} x_1^2\\\sqrt{2}x_1x_2\\x_2^2 \end{bmatrix}}{ \begin{bmatrix} {x_1^\prime}^2\\\sqrt{2}x_1^\prime x_2^\prime\\{x_2^\prime}^2 \end{bmatrix}} = x_1^2 {x_1^\prime}^2 + 2 x_1x_2 x_1^\prime x_2^\prime + x_2^2 {x_2^\prime}^2 \\ &= (x_1 {x_1^\prime} + x_2 {x_2^\prime} )^2 \\ &= \inner{ \g x}{\g x^\prime} ^2 \\ &= K( \g x, \g x^\prime ) . \end{align} Here, we used the definition of the homogeneous polynomial kernel function of degree 2 in the last line. Thus, inner products in feature space can be computed efficiently from dot products in the original input space $\X$ only. In fact, this trick also applies to higher-degree polynomials, the computational benefit of which is illustrated in the picture above.
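This identity is also easy to check numerically: the short Python sketch below compares the explicit inner product $\inner{\phi(\g x)}{\phi(\g x^\prime)}$ in $\R^3$ with the kernel value $\inner{\g x}{\g x^\prime}^2$ computed directly in $\R^2$, on randomly drawn inputs.

```python
import numpy as np

def phi(x):
    """Explicit feature map from R^2 to R^3: (x1^2, sqrt(2) x1 x2, x2^2)."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def K(x, x_prime):
    """Homogeneous polynomial kernel of degree 2: <x, x'>^2."""
    return np.dot(x, x_prime) ** 2

rng = np.random.default_rng(42)
x, x_prime = rng.standard_normal(2), rng.standard_normal(2)

lhs = np.dot(phi(x), phi(x_prime))   # inner product in feature space
rhs = K(x, x_prime)                  # kernel evaluated in input space only
print(lhs, rhs, np.isclose(lhs, rhs))  # the two values coincide
```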