# Fast tuning of the Gaussian kernel parameter

## In words...

Model selection is a computationally demanding task that typically requires training the model on the same data multiple times with different hyperparameter values. For kernel methods, such as support vector machines or kernel ridge regression, a significant fraction of the training time is spent evaluating the kernel function.

However, if one stores these kernel function values, then much of this computation can be avoided in subsequent trainings by updating the stored values instead of recomputing them from scratch.

## In pictures...

## In maths...

### Efficient kernel matrix updates

Consider the Gaussian kernel function of bandwidth $\sigma>0$ defined by
$$
\forall (\g x, \g x^\prime)\in\X^2,\quad K_{\sigma}(\g x, \g x^\prime) = \exp\left(- \frac{\|\g x - \g x^\prime\|_2^2}{2\sigma^2}\right)
$$
and assume that the kernel matrix
$$
\g K_{\sigma_1} = \begin{bmatrix}
K_{\sigma_1}(\g x_1, \g x_1) & \dots & K_{\sigma_1}(\g x_1,\g x_N) \\
\vdots & \ddots & \vdots \\
K_{\sigma_1}(\g x_N, \g x_1) & \dots & K_{\sigma_1}(\g x_N,\g x_N) \\
\end{bmatrix}
$$
has already been computed with kernel parameter $\sigma_1$.

Then, observe that
$$
K_{\sigma_2}(\g x_i, \g x_j) = \left(K_{\sigma_1}(\g x_i, \g x_j)\right)^{(\sigma_1 / \sigma_2)^2},
$$
which means that the kernel matrix $\g K_{\sigma_2}$ can be computed by taking an entrywise power of $\g K_{\sigma_1}$, at a cost that is independent of the input space dimension $\dim(\g x_i)$.
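The update above can be sketched as follows with NumPy. The function names `gaussian_kernel_matrix` and `update_kernel_matrix` are illustrative, not from a particular library; the final check confirms that the entrywise power reproduces the matrix computed from scratch.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma):
    """Compute the full Gaussian kernel matrix from scratch: O(N^2 d) operations."""
    # Pairwise squared Euclidean distances ||x_i - x_j||_2^2
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

def update_kernel_matrix(K1, sigma1, sigma2):
    """Obtain K_{sigma2} from K_{sigma1} by an entrywise power: O(N^2) operations,
    independent of the input dimension."""
    return K1 ** ((sigma1 / sigma2) ** 2)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))          # N = 50 points in dimension d = 10

K1 = gaussian_kernel_matrix(X, sigma=2.0)  # computed once, then stored
K2_fast = update_kernel_matrix(K1, sigma1=2.0, sigma2=1.0)
K2_full = gaussian_kernel_matrix(X, sigma=1.0)

assert np.allclose(K2_fast, K2_full)
```

Note that the update only touches the $N \times N$ kernel matrix, so its cost does not grow with the dimension of the data, unlike the from-scratch computation.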

### Choice of the grid of values for $\sigma$

When tuning $\sigma$, the discrete set of values $\{\sigma_1, \dots, \sigma_n\}$ at which we want to train the model can also be chosen so as to allow for faster computations.

For instance, consider the set made of values
$$
\sigma_k = 2^{-k/2} \beta,\quad k=0,1,\dots
$$
for some scaling constant $\beta > 0$.
Then, processing the values in increasing order of $k$ (i.e., decreasing $\sigma_k$), we have $(\sigma_{k}/\sigma_{k+1})^2 = (2^{-k/2} 2^{(k+1)/2})^2 = 2$, so the next kernel matrix is obtained by a simple entrywise product:
$$
\g K_{\sigma_{k+1}} = \g K_{\sigma_{k}} \odot \g K_{\sigma_{k}} ,
$$
which can be efficiently implemented with $N(N-1)/2$ in-place multiplications `K_ij *= K_ij`, $1\leq i < j \leq N$, by using the symmetry of the kernel matrix and the fact that $K_{ii} = 1$ for the Gaussian kernel.
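A minimal sketch of this grid strategy, assuming NumPy and a hypothetical generator `kernel_matrices_on_grid`: only $\g K_{\sigma_0}$ is computed from the data, and each subsequent matrix follows by entrywise squaring.

```python
import numpy as np

def kernel_matrices_on_grid(X, beta, n):
    """Yield (sigma_k, K_{sigma_k}) for sigma_k = 2^{-k/2} * beta, k = 0, ..., n-1,
    computing only the first kernel matrix from scratch."""
    # Pairwise squared Euclidean distances, computed once
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists / (2 * beta ** 2))  # K_{sigma_0}
    yield beta, K.copy()
    for k in range(1, n):
        K *= K  # entrywise squaring, since (sigma_{k-1} / sigma_k)^2 = 2
        yield 2 ** (-k / 2) * beta, K.copy()

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 5))
for sigma, K in kernel_matrices_on_grid(X, beta=2.0, n=4):
    pass  # train the model with kernel matrix K at bandwidth sigma
```

For simplicity, the vectorized `K *= K` squares all $N^2$ entries; a lower-level implementation can halve this to the $N(N-1)/2$ multiplications mentioned above by updating only the strict upper triangle and mirroring it.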