The $K$-nearest neighbors algorithm implements a very natural idea for classification: given a point, assign it the label of the closest point in the training set (for which the label is known). Since this procedure can be very sensitive to the labels of individual training points, it is rather error-prone with small training sets (remember that, unless we are in the deterministic case, the label of a point is only correct with a certain probability). To improve the robustness of the method, we predict the label of a point as the majority class among a given number of its nearest neighbors.

The $K$-nearest neighbors algorithm also enjoys a more formal justification: it can be interpreted as an empirical approximation of the Bayes classifier. More precisely, the algorithm estimates the probability of the input point belonging to a class by computing the fraction of training points of that class in the vicinity of the point. Then, the same decision rule as the Bayes classifier is applied: the class with maximal estimated probability is retained for the prediction, and this class is precisely the majority class among the neighbors.

This algorithm has one major hyperparameter, which is the number of neighbors to look at. When this number is close to one, the classification is very sensitive to the training data. By increasing the number of neighbors, the classification corresponds to a compromise among the neighbors and the more neighbors we have, the more "average" the compromise is. Technically, this results in smoother boundaries between the classes.

The accuracy of the $K$-nearest neighbors is related to the presence in the training set of data points close to the input vector for which we want to predict the label. Though it seems that large training sets should satisfy this requirement, this is not necessarily the case in high dimension due to the so-called curse of dimensionality.

Regarding computational efficiency, the training phase of the $K$-nearest neighbors algorithm is almost instantaneous, as it only amounts to storing the training data in memory, but the memory requirement is linear in the number of data points. In the test phase, when the algorithm predicts the label of a test point, the computational time can be high (linear in the number of training points and their dimension) since the algorithm computes all the distances between the test point and the training points. However, several techniques exist to reduce the number of required computations.

The output of the $K$-nearest neighbors algorithm is given by $$ f(\g x) = \arg\max_{y\in \mathcal{Y}} \sum_{i \in N_K(\g x)} \I{y_i = y} $$ where $N_K(\g x)$ is the set of $K$-nearest neighbors from $\g x$ in the training set $\{\g x_i\}_{i=1}^N$ and $\I{y_i = y}$ denotes the indicator function.
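This decision rule is straightforward to implement. The sketch below (in Python, which the text does not prescribe; the helper name `knn_predict` is hypothetical) computes all distances, keeps the $K$ closest training points, and takes a majority vote over their labels:

```python
import math
from collections import Counter

def knn_predict(x, X_train, y_train, K):
    """Predict the label of x as the majority class among its K nearest
    training points, using the Euclidean distance."""
    # Distances from x to every training point.
    dists = [math.dist(x, xi) for xi in X_train]
    # Indices of the K smallest distances.
    nn = sorted(range(len(X_train)), key=lambda i: dists[i])[:K]
    # Majority vote among the neighbors' labels.
    return Counter(y_train[i] for i in nn).most_common(1)[0][0]
```

Ties in the vote are broken here by `Counter.most_common`, which is one possible convention among others.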

Depending on the choice of distance function used to compute the neighborhood $N_K(\g x)$, different variants of the algorithm can be obtained. The most common choice for Euclidean spaces $\X\subseteq \R^d$ is the Euclidean distance $$ dist(\g x, \g x_i ) = \|\g x - \g x_i\|_2 = \sqrt{(\g x - \g x_i)^T (\g x - \g x_i) } , $$ based on the Euclidean norm, but other distances or norms can also be used.

Consider a binary classification setting with $\Y= \{-1,+1\}$. In this case, the output of the $K$-nearest neighbors algorithm can be equivalently written as $$ f(\g x) = \sign\left( \frac{1}{K} \sum_{i \in N_K(\g x)} y_i\right) , $$ while the Bayes classifier can be written as $$ t(\g x) = \sign\left( \E_{Y|X=\g x} [ Y\ |\ X=\g x] \right) $$ where $ \E_{Y|X=\g x} [ Y\ |\ X=\g x]$ is known as the regression function.

With these formulations, we see that the $K$-nearest neighbors algorithm applies the same decision rule as the Bayes classifier, but with an empirical average of the observed values $y_i$ instead of the conditional expectation of $Y$. Also note that in this empirical estimate, we relax the conditioning $X=\g x$ to the condition $\g x_i\in N_K(\g x)$.
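For labels in $\{-1,+1\}$, the sign-of-average formulation above can be coded directly. In this minimal sketch (the helper name `knn_binary` is hypothetical, and a zero average is assigned to $+1$ by convention, an assumption not made in the text):

```python
import math

def knn_binary(x, X_train, y_train, K):
    """Binary K-NN with labels in {-1, +1}, written as the sign of the
    empirical average of the neighbors' labels."""
    # Indices of the K training points closest to x.
    nn = sorted(range(len(X_train)), key=lambda i: math.dist(x, X_train[i]))[:K]
    # Empirical average of the observed labels in the neighborhood.
    avg = sum(y_train[i] for i in nn) / K
    return 1 if avg >= 0 else -1
```

The average plays the role of an empirical estimate of the regression function $\E[Y \mid X=\g x]$.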

The most computationally intensive step in the $K$-nearest neighbors algorithm is the search for the nearest neighbors $N_K(\g x)$ when predicting the label of $\g x$. A straightforward search computes all distances $dist(\g x, \g x_i)$, $i=1,\dots,N$, in a number of operations that grows linearly with both $d$ and $N$.

A number of variants and techniques exist to reduce this computational cost. If the dimension $d$ is small, the training data can be stored in a structured form, typically based on a tree, in order to quickly direct the search for the nearest neighbors towards relevant subsets of points. For high-dimensional data, the trick discussed in the exercise below results in significant speed improvements.

- Write down the algorithm (in pseudo-code or any programming language you like) of the straightforward search for the $K$-nearest neighbors $N_K(\g x)$ based on the Euclidean distance.

- Let $NNindexes$ be the list of the nearest neighbors' indexes and $NNdistances$ the list of the corresponding distances, both initialized to empty lists and limited to $K$ elements (any element pushed to a list index larger than $K$ is dropped from the list).
- FOR $i=1,\dots,N$,
    - $dist_i \leftarrow 0$
    - FOR $j=1,\dots,d$,
        - $dist_i\leftarrow dist_i + (x_j - x_{ij})^2$
    - $dist_i \leftarrow \sqrt{dist_i}$
    - IF $i \leq K$ OR $dist_i < NNdistances(K)$, insert $i$ in $NNindexes$ and $dist_i$ in $NNdistances$, then sort $NNdistances$ in increasing order and reorder $NNindexes$ accordingly.
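The pseudocode above translates directly into Python. In this sketch (the helper name `knn_search` is hypothetical), a single sorted list of (distance, index) pairs plays the role of the pair of lists $NNdistances$ and $NNindexes$:

```python
import math
from bisect import insort

def knn_search(x, X_train, K):
    """Straightforward K-nearest-neighbor search: maintain the K best
    (distance, index) pairs in increasing order of distance."""
    best = []  # sorted list of (distance, index), at most K entries
    for i, xi in enumerate(X_train):
        d = math.sqrt(sum((xj - xij) ** 2 for xj, xij in zip(x, xi)))
        # Insert only if fewer than K neighbors found so far, or if x_i
        # beats the current K-th nearest neighbor.
        if len(best) < K or d < best[-1][0]:
            insort(best, (d, i))   # insertion keeps the list sorted
            del best[K:]           # drop any element beyond position K
    return [i for _, i in best]
```

`insort` performs the sorted insertion that the pseudocode expresses as "insert, then sort".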

- Optimize the algorithm obtained in question 1 such that most computations are avoided in high dimension. Hint: there is a trick that requires very few modifications of the algorithm and results in drastic speed improvement with the Euclidean distance as well as any distance based on an $\ell_p$-norm.

First, note that using squared distances instead of distances results in the same ordering of the points, so that we can omit the square root in the algorithm above (or the $1/p$ power if we consider a distance based on an $\ell_p$-norm).

To avoid most computations, we will use the partial distance trick. The trick relies on the fact that $$ \sum_{j=1}^r |x_j - x_{ij}|^p $$ is an increasing function of $r$ that equals $\|\g x - \g x_i\|_p^p$ when $r=d$. Thus, if $K$ neighbors have already been found with squared distances (or distances to the power $p$) to $\g x$ smaller than the sum above for some $r < d$, we know that $\g x_i$ cannot be one of the $K$-nearest neighbors and the remaining $d-r$ terms in the sum need not be computed.

To sum up, to apply the trick in the case of the Euclidean distance, it suffices to remove the line

- $dist_i \leftarrow \sqrt{dist_i}$

(so that $NNdistances$ now stores squared distances) and to replace the inner loop

- FOR $j=1,\dots,d$,
    - $dist_i\leftarrow dist_i + (x_j - x_{ij})^2$

by

- $j \leftarrow 1$
- WHILE $j \leq d$ AND ($i \leq K$ OR $dist_i < NNdistances(K)$),
    - $dist_i\leftarrow dist_i + (x_j - x_{ij})^2$
    - $j \leftarrow j + 1$

If the WHILE loop exits early, the partial sum already exceeds the squared distance to the current $K$-th nearest neighbor, so the subsequent IF test correctly rejects $\g x_i$ without computing the remaining terms.
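The partial distance trick can be sketched in Python as follows (the helper name `knn_search_partial` is hypothetical). All comparisons use squared distances, and the accumulation loop stops as soon as the partial sum can no longer beat the current $K$-th nearest neighbor:

```python
from bisect import insort

def knn_search_partial(x, X_train, K):
    """K-NN search with the partial distance trick: accumulate squared
    coordinate differences and stop early once the partial sum exceeds
    the K-th best squared distance found so far."""
    best = []  # sorted list of (squared distance, index), at most K entries
    for i, xi in enumerate(X_train):
        d2, j = 0.0, 0
        # Keep accumulating only while x_i can still be a K-nearest neighbor.
        while j < len(x) and (len(best) < K or d2 < best[-1][0]):
            d2 += (x[j] - xi[j]) ** 2
            j += 1
        # If the loop exited early, d2 >= best[-1][0] and this test fails.
        if len(best) < K or d2 < best[-1][0]:
            insort(best, (d2, i))
            del best[K:]  # drop any element beyond position K
    return [i for _, i in best]
```

The returned neighbor set is identical to that of the straightforward search; only the number of arithmetic operations changes.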

For the $K$-nearest neighbors algorithm, the selection of the best number of neighbors according to an estimation of the risk with the leave-one-out method can be implemented with a computational cost close to the cost of evaluating the training error.

Let $D = \{(\g x_i, y_i)\}_{i=1}^N$ be the training set and recall that, for classification, the leave-one-out estimate of the risk is $$ R_{LOO}(f) = \frac{1}{N} \sum_{i=1}^N \I{ y_i \neq f_i( \g x_i ) } , $$ where $f_i$ is the model trained on $D \setminus \{(\g x_i, y_i)\}$. For the $K$-nearest neighbors algorithm (the procedure can be similarly devised for its regression version), we have $$ f_i(\g x_i) = \arg\max_{y\in \mathcal{Y}} \sum_{j \in N_{K+1}(\g x_i) \setminus \{\g x_i\} } \I{y_j = y} $$ where $N_{K+1}(\g x_i)$ is the set of $(K+1)$-nearest neighbors from $\g x_i$ in the training set from which we remove $\g x_i$ itself.

Given an upper bound $\overline{K}$ on $K$, a standard model selection procedure for tuning $K$ is to compute $R_{LOO}(f)$ for all $K\leq \overline{K}$. The most demanding computation in such a procedure is the search for the neighbors which, implemented naively, must be carried out $\overline{K} N$ times (once per value of $K$ and per training point).

However, the implementation of the $K$-nearest neighbors search giving $N_{K+1}(\g x_i)$ can easily be made such that it returns an ordered list of neighbors with increasing distance from $\g x_i$. For any $K \leq \overline{K}$, let $N_{\overline{K}+1}(\g x_i)_{2:K+1}$ denote the list of neighbors obtained by keeping only the items of index between 2 and $K+1$ in that list. Then, the leave-one-out prediction of the $K$-nearest neighbors algorithm is given, for any $K \leq \overline{K}$, by $$ f_i(\g x_i) = \arg\max_{y\in \mathcal{Y}} \sum_{j \in N_{\overline{K}+1}(\g x_i)_{2:K+1} } \I{y_j = y} $$ with a single search for $\overline{K}+1$ neighbors.
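This procedure can be sketched in Python (the helper name `loo_select_K` is hypothetical; the sketch assumes the point itself is always its own nearest neighbor at distance zero, which holds when training points are distinct):

```python
import math
from collections import Counter

def loo_select_K(X, y, K_max):
    """Tune K by leave-one-out estimation of the risk, with a single
    search for the K_max + 1 nearest neighbors of each training point."""
    N = len(X)
    # For each point, its neighbors ordered by increasing distance;
    # position 0 is the point itself (distance 0).
    ordered = [sorted(range(N), key=lambda j: math.dist(X[i], X[j]))[:K_max + 1]
               for i in range(N)]
    risks = {}
    for K in range(1, K_max + 1):
        errors = 0
        for i in range(N):
            # Positions 2..K+1 in the text's 1-based indexing, i.e.
            # slice [1:K+1] here, which excludes the point itself.
            votes = Counter(y[j] for j in ordered[i][1:K + 1])
            if votes.most_common(1)[0][0] != y[i]:
                errors += 1
        risks[K] = errors / N  # leave-one-out risk estimate for this K
    return min(risks, key=risks.get)
```

The neighbor search (the expensive step) is performed only once per training point, after which each value of $K$ merely re-reads a prefix of the ordered list.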