The $K$-nearest neighbors algorithm implements a very natural idea for regression (a variant also exists for classification): given a point, output the label of its closest point in the training set. Since this procedure is very sensitive to the labels of individual training points, it is rather error prone with small training sets (remember that, unless we are in the deterministic case, the label of a point is only correct up to a noise term). To make the method more robust, we instead predict the label of a point as the average of the labels of a given number of its nearest neighbors.

The $K$-nearest neighbors algorithm also enjoys a more formal justification. It can be interpreted as an empirical approximation of the regression function. More precisely, the algorithm estimates the conditional expectation of the label by its average in the vicinity of the point.

The resulting model shows a staircase effect: the function is piecewise constant, with each piece corresponding to a region of the input space with a constant set of nearest neighbors.

This algorithm has one major hyperparameter: the number of neighbors $K$. When $K$ is close to one, the regression is very sensitive to the training data. Increasing $K$ smooths the prediction by averaging over a larger number of neighbors. Though the resulting model is still piecewise constant, and thus not smooth due to the jumps between pieces, the number of pieces and jumps decreases as $K$ increases.
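This effect can be observed numerically. The sketch below (a hypothetical 1-D toy example, not code from this page) counts how many distinct prediction values the model takes on a fine grid; fewer distinct values means fewer pieces in the staircase:

```python
import random

# Hypothetical toy data: noisy samples of y = x^2 on [0, 1].
random.seed(0)
X = [i / 20 for i in range(21)]
y = [x * x + random.gauss(0, 0.05) for x in X]

def knn_predict(x, X, y, K):
    """Average of the labels of the K training points closest to x."""
    order = sorted(range(len(X)), key=lambda i: abs(X[i] - x))
    return sum(y[i] for i in order[:K]) / K

# Count the distinct prediction values ("pieces") over a fine grid.
grid = [t / 200 for t in range(201)]
pieces_by_K = {}
for K in (1, 5, 15):
    pieces_by_K[K] = len({round(knn_predict(t, X, y, K), 9) for t in grid})
    print(K, pieces_by_K[K])
```

On this toy data, the number of pieces shrinks as $K$ grows, matching the smoothing described above.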

The accuracy of the $K$-nearest neighbors algorithm depends on the presence, in the training set, of data points close to the input vector for which we want to predict the label. Though it seems that large training sets should satisfy this requirement, this is not necessarily the case in high dimensions due to the so-called curse of dimensionality.

Regarding computational efficiency, the training phase of the $K$-nearest neighbors algorithm is almost instantaneous, as it only amounts to storing the training data in memory; the memory requirement is therefore linear in the number of data points. In the test phase, when the algorithm predicts the label of a test point, the computational time can be high (linear in the number of training points and their dimension), since the algorithm computes the distances between the test point and all training points. However, different techniques exist to reduce the number of required computations; see for instance the implementation section of the $K$-nearest neighbors for classification page.
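One such speedup, sketched here for the special case of 1-D inputs (an illustrative assumption, not necessarily the technique referenced above), is to sort the training inputs once and then locate the $K$ nearest neighbors of a query with a binary search followed by a two-pointer scan, in $O(\log N + K)$ time instead of $O(N)$:

```python
import bisect

def k_nearest_sorted(x, xs_sorted, K):
    """Return the K values of xs_sorted (ascending) closest to x.

    A single binary search finds the insertion point, then two pointers
    expand outwards, always taking the closer of the two candidates.
    """
    j = bisect.bisect_left(xs_sorted, x)
    lo, hi = j - 1, j
    out = []
    while len(out) < K:
        if lo < 0:                                    # nothing left on the left
            out.append(xs_sorted[hi]); hi += 1
        elif hi >= len(xs_sorted):                    # nothing left on the right
            out.append(xs_sorted[lo]); lo -= 1
        elif x - xs_sorted[lo] <= xs_sorted[hi] - x:  # left candidate is closer
            out.append(xs_sorted[lo]); lo -= 1
        else:
            out.append(xs_sorted[hi]); hi += 1
    return out

print(k_nearest_sorted(3.1, [0.0, 1.0, 2.0, 5.0, 9.0], 3))  # → [2.0, 5.0, 1.0]
```

In higher dimensions, space-partitioning structures such as k-d trees play a similar role.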

Try different values of $K$ to see its influence on the staircase effect and the "smoothing" of the function.

Given a training set $\{ ( \g x_i, y_i )\}_{i=1}^N \in (\X\times\Y)^N$, the output of the $K$-nearest neighbors algorithm for regression is given by $$ f(\g x) = \frac{1}{K} \sum_{i \in N_K(\g x)} y_i $$ where $N_K(\g x)$ is the set of indexes of the $K$ nearest neighbors of $\g x$ in the training set.
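This formula translates almost directly into code. The sketch below (with hypothetical data, and assuming the Euclidean distance) sorts the training points by distance to $\g x$ and averages the labels of the first $K$:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points given as sequences."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_regress(x, X, y, K):
    """f(x) = (1/K) * sum of y_i over the K nearest neighbors of x."""
    N_K = sorted(range(len(X)), key=lambda i: euclidean(X[i], x))[:K]
    return sum(y[i] for i in N_K) / K

# Hypothetical 2-D training set: four labeled corners of the unit square.
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
y = [0.0, 1.0, 1.0, 2.0]
print(knn_regress((0.1, 0.1), X, y, K=3))  # → 0.6666666666666666
```

Here the three neighbors nearest to $(0.1, 0.1)$ carry labels $0$, $1$ and $1$, so the prediction is their average, $2/3$.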

Depending on the choice of distance function used to compute the neighborhood $N_K(\g x)$, different variants of the algorithm can be obtained. The most common choice for Euclidean spaces $\X\subseteq \R^d$ is the Euclidean distance $$ \mathrm{dist}(\g x, \g x_i ) = \|\g x - \g x_i\|_2 = \sqrt{(\g x - \g x_i)^T (\g x - \g x_i) } , $$ but other metrics, e.g., based on other norms, can also be used.
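The metric is not a cosmetic choice: it can change which points count as neighbors. In the sketch below (with hypothetical points chosen for the purpose), the Euclidean and Manhattan distances disagree about the nearest neighbor of the same query:

```python
def euclidean(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

def manhattan(a, b):
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

def nearest(x, X, dist):
    """Index of the training point closest to x under the given metric."""
    return min(range(len(X)), key=lambda i: dist(X[i], x))

X = [(3.0, 0.0), (2.1, 2.1)]   # hypothetical training inputs
x = (0.0, 0.0)                 # query point
print(nearest(x, X, euclidean))  # → 1  (distances 3.0 vs ~2.97)
print(nearest(x, X, manhattan))  # → 0  (distances 3.0 vs 4.2)
```

Since the neighbor sets differ, so do the resulting predictions.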

The regression function is the optimal model we aim at when performing regression. It is given by the conditional expectation of the label given the input: $$ \eta(\g x) = \E_{Y|X=\g x} [ Y\ |\ X=\g x] . $$

The $K$-nearest neighbors algorithm implements an estimate of this function computed as an empirical average while relaxing the conditioning $X=\g x$ to the condition $\g x_i\in N_K(\g x)$.
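This interpretation can be checked numerically. In the sketch below (an assumed setup where the regression function is $\eta(x) = \sin(x)$ and labels carry Gaussian noise), the $K$-nearest neighbors estimate at a fixed point comes close to $\eta$ once the training set is large, because the noise averages out over the $K$ neighbors:

```python
import math
import random

random.seed(1)

def knn_estimate(x, X, y, K):
    """Empirical average of the labels of the K nearest neighbors of x."""
    idx = sorted(range(len(X)), key=lambda i: abs(X[i] - x))[:K]
    return sum(y[i] for i in idx) / K

x0 = 1.0                  # query point; eta(x0) = sin(1.0) ≈ 0.84
N = 5000
X = [random.uniform(0, 3) for _ in range(N)]
y = [math.sin(xi) + random.gauss(0, 0.3) for xi in X]
K = int(N ** 0.5)         # grow K with N, but more slowly than N

estimate = knn_estimate(x0, X, y, K)
print(abs(estimate - math.sin(x0)))  # small despite the noisy labels
```

The neighbors of $x_0$ lie in a shrinking vicinity as $N$ grows, so the relaxed condition $\g x_i \in N_K(\g x)$ approaches the exact conditioning $X = \g x$.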