Kernel ridge regression

In words...

Kernel ridge regression extends ridge regression to the nonlinear case by learning a function in a reproducing kernel Hilbert space (RKHS) instead of a linear model. It is perhaps the most straightforward instance of regularized learning in an RKHS, made computationally tractable by the representer theorem.

In pictures...

Learning the sinc function

The plot below illustrates the approximation of the sinc function by kernel ridge regression.
You can play with all the hyperparameters to observe their influence on the model.

The target function is $\displaystyle{ \mbox{sinc}(x) = \frac{\sin ( \pi x) }{\pi x}}$
The hyperparameters are the number of data points $N$, the kernel and its parameter, and the regularization constant $\lambda$.

Kernel ridge regression yields the model

$f(x) = \sum_{i=1}^N \alpha_i K(\g x_i, x)$

You can click in the plot to add data points and see how the model reacts.
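Here is a minimal sketch of this experiment in Python (using NumPy), assuming a Gaussian kernel with bandwidth $\sigma = 0.5$, $N = 20$ noisy samples of the sinc function, and $\lambda = 0.1$ (all arbitrary choices); the coefficients $\alpha_i$ are computed with the closed-form solution derived in the maths section below.

```python
import numpy as np

# A sketch of the sinc experiment (assumed settings: Gaussian kernel with
# bandwidth sigma = 0.5, N = 20 noisy samples, lambda = 0.1).

def gaussian_kernel(a, b, sigma=0.5):
    # K(a, b) = exp(-(a - b)^2 / (2 sigma^2)), vectorized for 1-D inputs.
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
N, lam = 20, 0.1
x_train = rng.uniform(-3, 3, size=N)                        # training inputs
y_train = np.sinc(x_train) + 0.1 * rng.standard_normal(N)   # noisy sinc targets
# (np.sinc(x) is the normalized sinc: sin(pi x) / (pi x).)

# Closed-form solution derived in the "In maths..." section:
# alpha_hat = (K + lambda I)^{-1} y
K = gaussian_kernel(x_train, x_train)
alpha = np.linalg.solve(K + lam * np.eye(N), y_train)

# Model f(x) = sum_i alpha_i K(x_i, x), evaluated on a test grid.
x_test = np.linspace(-3, 3, 200)
f_test = gaussian_kernel(x_test, x_train) @ alpha
print("max |f - sinc| on the grid:", np.max(np.abs(f_test - np.sinc(x_test))))
```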



In maths...

Kernel ridge regression corresponds to the regularized learning problem in an RKHS $\H$ written as $$ \hat{f} = \arg\min_{f\in\H} \ \sum_{i=1}^N (y_i - f(\g x_i))^2 + \lambda \|f\|_{\H}^2 , $$ where $\lambda>0$ is the hyperparameter that tunes the trade-off between the fit to the data, measured by the squared loss function $\ell(y,\hat{y}) = (y-\hat{y})^2$, and the control of the model complexity, measured via the RKHS norm of $f$.

The representer theorem states that the solution to this problem is of the form $$ f = \sum_{i=1}^N \alpha_i K(\g x_i, \cdot) $$ with the $\g x_i$'s from the training set. Using this and the definition of the RKHS norm, we can reformulate the functional optimization problem above as the finite-dimensional quadratic program in the parameter vector $\g \alpha = [\alpha_1, \dots, \alpha_N]^T$: $$ \hat{\g \alpha} = \arg\min_{\g \alpha\in\R^N} \ \sum_{i=1}^N \left(y_i - \sum_{j=1}^N \alpha_j K(\g x_j,\g x_i) \right)^2 + \lambda \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j K(\g x_i, \g x_j) . $$ Introducing the target vector $\g y=[y_1,\dots, y_N]^T$ and the kernel matrix $$ \g K = \begin{bmatrix} K(\g x_1, \g x_1) & \dots & K(\g x_1,\g x_N) \\ \vdots & \ddots & \vdots \\ K(\g x_N, \g x_1) & \dots & K(\g x_N,\g x_N) \end{bmatrix}, $$ we rewrite this as $$ \hat{\g \alpha} = \arg\min_{\g \alpha\in\R^N} \ \| \g y - \g K \g \alpha \|_2^2 + \lambda \g \alpha^T \g K \g \alpha. $$
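As an illustration, here is a short sketch that builds the kernel matrix $\g K$ and evaluates this finite-dimensional objective for a given $\g \alpha$ (the Gaussian kernel and the random data are arbitrary assumptions).

```python
import numpy as np

def kernel_matrix(X, kernel):
    # K[i, j] = K(x_i, x_j) for the N training points in X.
    return np.array([[kernel(xi, xj) for xj in X] for xi in X])

def objective(alpha, K, y, lam):
    # ||y - K alpha||^2 + lambda * alpha^T K alpha
    r = y - K @ alpha
    return r @ r + lam * alpha @ K @ alpha

# Illustration with an assumed Gaussian kernel and random data.
rbf = lambda a, b, sigma=0.5: np.exp(-((a - b) ** 2) / (2 * sigma ** 2))
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=10)
y = rng.standard_normal(10)
K = kernel_matrix(X, rbf)
print(objective(np.zeros(10), K, y, lam=0.1))  # equals ||y||^2 when alpha = 0
```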

Following the derivation for the linear case, we can write the objective function as $$ \g \alpha^T ( \g K \g K + \lambda\g K) \g \alpha - 2 \g \alpha^T \g K\g y + \g y^T\g y , $$ where we used the symmetry of the kernel matrix ($\g K^T = \g K$). At the optimum, the gradient of the objective function is zero and we have $$ 2 ( \g K \g K + \lambda\g K) \hat{\g \alpha} - 2 \g K\g y = \g 0 . $$ Therefore, we obtain $$ \hat{\g \alpha} = ( \g K \g K + \lambda\g K)^{-1} \g K\g y = ( \g K + \lambda\g I)^{-1} \g y , $$ where the last equality follows from factoring $\g K \g K + \lambda \g K = \g K ( \g K + \lambda \g I )$, with $\g I$ the $N\times N$ identity matrix, and assuming that $\g K$ is invertible. Note that $\g K + \lambda \g I$ is always invertible, since $\g K$ is positive semi-definite and $\lambda > 0$.
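Below is a quick numerical sanity check of this last equality with an assumed Gaussian kernel on an evenly spaced grid; in practice, one directly solves the well-conditioned system $(\g K + \lambda \g I)\hat{\g \alpha} = \g y$ rather than forming $\g K \g K + \lambda \g K$.

```python
import numpy as np

# Numerical check that (K K + lambda K)^{-1} K y == (K + lambda I)^{-1} y
# (assumed Gaussian kernel on an evenly spaced 1-D grid).
N, lam, sigma = 10, 0.1, 0.3
X = np.linspace(-3, 3, N)
y = np.random.default_rng(0).standard_normal(N)
K = np.exp(-((X[:, None] - X[None, :]) ** 2) / (2 * sigma ** 2))

alpha_long = np.linalg.solve(K @ K + lam * K, K @ y)    # before simplification
alpha_short = np.linalg.solve(K + lam * np.eye(N), y)   # simplified form
print(np.allclose(alpha_long, alpha_short))             # True (up to round-off)
```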