Regularized learning in RKHS and the representer theorem

In words...

The structure of an RKHS is particularly convenient for learning, especially for regularized learning with a regularization term based on the RKHS norm. Indeed, in this case, the representer theorem states that the function minimizing the regularized empirical risk is a linear combination of the kernel functions evaluated at the training points, irrespective of the choice of loss function.

In other words, when the RKHS is taken as the feature space, the solution lies in the span of the images of the training points in that feature space.

In pictures...

In maths...

Regularized learning in RKHS can be posed as the functional optimization problem $$ \min_{f\in\H} \ R_{emp}(f) + \lambda \Omega(\|f\|_{\H}) $$ where $\H$ is an RKHS, $R_{emp}(f)$ is the empirical risk of $f$, $$ R_{emp}(f) = \sum_{i=1}^N \ell(y_i, f(\g x_i)) $$ using a loss function $\ell$, $\Omega$ is a regularizer which penalizes functions $f$ with large norms, and $\lambda >0$ is a hyperparameter.
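To make this objective concrete, the following sketch evaluates $R_{emp}(f) + \lambda \Omega(\|f\|_{\H})$ for a candidate function of the form announced above, $f = \sum_{i=1}^N \alpha_i K(\g x_i, \cdot)$, assuming, purely for illustration, a Gaussian kernel, the squared loss $\ell(y_i, f(\g x_i)) = (y_i - f(\g x_i))^2$ and $\Omega(t) = t^2$. It relies on the identities $f(\g x_i) = \sum_{j=1}^N \alpha_j K(\g x_j, \g x_i)$ and $\|f\|_{\H}^2 = \sum_{i,j} \alpha_i \alpha_j K(\g x_i, \g x_j)$, which follow from the reproducing property.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    # Gram matrix with entries exp(-||x - x'||^2 / (2 sigma^2))
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

def regularized_objective(alpha, X, y, lam=0.1, sigma=1.0):
    # R_emp(f) + lam * Omega(||f||_H) for f = sum_i alpha_i K(x_i, .)
    K = gaussian_kernel(X, X, sigma)
    preds = K @ alpha                    # f(x_i) = sum_j alpha_j K(x_j, x_i)
    emp_risk = np.sum((y - preds) ** 2)  # squared loss summed over the training set
    norm_sq = alpha @ K @ alpha          # ||f||_H^2 = alpha^T K alpha
    return emp_risk + lam * norm_sq      # illustrative choice Omega(t) = t^2
```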

The representer theorem states that, if $\H$ is an RKHS with reproducing kernel $K$ and $\Omega : \R^+\rightarrow \R^+$ is a strictly increasing function, then the solution to the problem above is of the form $$ f = \sum_{i=1}^N \alpha_i K( \g x_i, \cdot) $$ with $\alpha_i\in\R$ and where the $\g x_i$'s are the training points (those appearing in $R_{emp}$).
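For instance, with the squared loss and $\Omega(t) = t^2$, plugging this form into the objective turns the functional optimization problem into a finite-dimensional one over the coefficients $\alpha_i$, solved by a linear system; this special case is known as kernel ridge regression. Here is a minimal sketch, again assuming a Gaussian kernel:

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

def fit_kernel_ridge(X, y, lam=0.1, sigma=1.0):
    # With squared loss and Omega(t) = t^2, substituting f = sum_i alpha_i K(x_i, .)
    # yields min_alpha ||y - K alpha||^2 + lam * alpha^T K alpha, whose optimality
    # condition reduces to the linear system (K + lam I) alpha = y.
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict(alpha, X_train, X_new, sigma=1.0):
    # Evaluate f(x) = sum_i alpha_i K(x_i, x) at new points
    return gaussian_kernel(X_new, X_train, sigma) @ alpha
```

With other loss functions (e.g., the hinge or $\varepsilon$-insensitive loss), the coefficients are computed by other solvers, but the representer theorem guarantees that the solution keeps the same form.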

Proof: Let us define the subspace spanned by the images of the training points in $\H$, $$ \mathcal{S} = Span\left( \{K(\g x_i,\cdot)\}_{i=1}^N \right) , $$ and its orthogonal complement $$ \mathcal{S}^{\perp} = \left\{v \in \H\ :\ \inner{u}{v}_{\H} = 0,\ \forall u \in \mathcal{S} \right\} . $$ Since $\H$ is a Hilbert space, any function $f\in \H$ can be decomposed into a part in $\mathcal{S}$ and a part orthogonal to it, i.e., $$ f=u+v, \quad \mbox{ with } u\in\mathcal{S} \ \mbox{ and }\ v\in\mathcal{S}^\perp. $$ For any $\g x_i$ in the training set, we can compute \begin{align*} f(\g x_i) &= \inner{K(\g x_i, \cdot)}{f}_{\H} & (\mbox{due to the reproducing property of } K ) \\ &= \inner{K(\g x_i, \cdot)}{ u + v}_{\H} \\ &= \inner{K(\g x_i, \cdot)}{ u }_{\H} + \inner{K(\g x_i, \cdot)}{ v}_{\H} & (\mbox{due to the linearity of the inner product} ) \\ &= \inner{K(\g x_i, \cdot)}{ u}_{\H} & \left(\mbox{due to } K(\g x_i, \cdot)\in\mathcal{S} \mbox{ and } v\in\mathcal{S}^\perp \ \Rightarrow\ \inner{K(\g x_i, \cdot)}{ v}_{\H} =0 \right)\\ & = u(\g x_i) & (\mbox{due to the reproducing property of } K) . \end{align*} As a consequence, the value of $f$ at any training point does not depend on $v$, and neither does the empirical risk: $$ R_{emp}(f) = R_{emp}(u). $$
On the other hand, we have \begin{align*} \|f\|_{\H} &= \|u+v\|_{\H} \\ &= \sqrt{\inner{u+v}{u+v}_{\H} } \\ &= \sqrt{\inner{u}{u}_{\H} + \inner{v}{v}_{\H} + 2\inner{u}{v}_{\H} } \\ &= \sqrt{\|u\|_{\H}^2 + \|v\|_{\H}^2 + 2\inner{u}{v}_{\H}} \\ &= \sqrt{\|u\|_{\H}^2 + \|v\|_{\H}^2 } & \left(\mbox{due to } u\in\mathcal{S} \mbox{ and } v\in\mathcal{S}^\perp \ \Rightarrow\ \inner{u}{ v}_{\H} =0 \right). \end{align*} Thus, we obtain that \begin{align*} v\neq 0 \quad \Rightarrow\quad \|v\|_{\H} > 0 &\quad \Rightarrow\quad \|f\|_{\H} > \|u\|_{\H} \\ &\quad \Rightarrow\quad \Omega(\|f\|_{\H}) > \Omega(\|u\|_{\H})& (\Omega \mbox{ is strictly increasing} ). \end{align*} Therefore, for any $f = u+v$ with $v\neq 0$, there is a function $f^* = u\in \mathcal{S}$ with the same empirical risk but a smaller norm, i.e., a function leading to a lower value of the objective function. Consequently, $f$ cannot be a minimizer unless its part in $\mathcal{S}^\perp$ is null, i.e., unless it belongs to $\mathcal{S}$, in which case it can be written as $f = \sum_{i=1}^N \alpha_i K(\g x_i, \cdot)$, as stated by the theorem.
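The argument can also be checked numerically: build a function $f$ with a component outside of $\mathcal{S}$ (below, a kernel expansion over both the training points and a few extra points, with an assumed Gaussian kernel and randomly drawn data), compute its orthogonal projection $u$ onto $\mathcal{S}$, and verify that $u$ agrees with $f$ at the training points while having a norm that is no larger. This is only an illustrative sketch, not part of the proof.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))   # training points x_i
Z = rng.normal(size=(3, 2))   # extra points giving f a component outside S
a = rng.normal(size=5)        # coefficients on K(x_i, .)
b = rng.normal(size=3)        # coefficients on K(z_j, .)

Kxx, Kxz, Kzz = gaussian_kernel(X, X), gaussian_kernel(X, Z), gaussian_kernel(Z, Z)

# u = sum_i c_i K(x_i, .) is the orthogonal projection of f onto S, i.e., it
# satisfies <K(x_i, .), f - u> = 0 for all i, which reads Kxx c = Kxx a + Kxz b.
c = np.linalg.solve(Kxx, Kxx @ a + Kxz @ b)

f_at_X = Kxx @ a + Kxz @ b          # f(x_i)
u_at_X = Kxx @ c                    # u(x_i)
print(np.allclose(f_at_X, u_at_X))  # True: the orthogonal part v does not change f(x_i)

norm_f_sq = a @ Kxx @ a + 2 * a @ Kxz @ b + b @ Kzz @ b  # ||f||_H^2
norm_u_sq = c @ Kxx @ c                                  # ||u||_H^2
print(norm_u_sq <= norm_f_sq + 1e-12)                    # True: ||u|| <= ||f||
```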