The algorithm alternatively minimizes the sum of squared errors with respect to the model parameters and the classification of data points into modes. Its computational efficiency comes from the fact that the minimization with respect to the model parameters amounts to a least squares problem while the classification is explicitly given by a simple calculation.

Indeed, given the models, the classification minimizing the overall error is the one that assigns each data point to the model providing its best approximation. On the other hand, given the classification, the switching regression problem with $K$ modes amounts to $K$ independent regression problems, solved here by the least squares method.

Variants of this algorithm can be conceived in a straightforward manner by using a different loss function with a different model estimation method.

In pictures...

The $K$-LinReg algorithm step-by-step

The plot below shows how the $K$-LinReg algorithm solves a switching regression with $K=3$ modes and $N=150$ data points.

Empirical performance of $K$-LinReg

Experiment with the application below to test the performance of $K$-LinReg with multiple restarts. In particular, observe that the number of successful restarts increases with the number of data.

Simulation data:
Number of data:
Number of modes:

Gaussian noise variance:

Algorithm:
Number of models:
Number of restarts:

Estimated models:

Computing time =
Mean Squared Error =

Results over all the restarts (each bar represents the MSE obtained from one of the initializations):

green bars indicate possible successes (when MSE equals the best MSE over all restarts) red bars indicate failures (when a mode gets no data points)

In maths...

The $K$-LinReg algorithm solves switching linear regression problems by minimizing the global Mean Squared Error over $ N $ data points and $ K $ models:
$$
\min_{\{\g w_j\}_{j=1}^K}\ \frac{1}{N} \sum_{i=1}^N \min_{j=1,\dots, K} (y_i - \g w_j^T \g x_i )^2 .
$$
Since this optimization problem is nonsmooth and nonconvex, we cannot guarantee that the global optimum will be found. However, $K$-LinReg provides an efficient heuristic with satisfactory accuracy in practice, particularly when restarted from multiple random initializations.

Schematically, the algorithm iterates over the two following steps.

Given the model parameters $\{\g w_j\}_{j=1}^K$, classify the data points by associating each point with model that best approximates it:
$$
q_i = \arg\min_{j=1,\dots,K} ( y_i - \g w_j^T \g x_i )^2 ,\quad i=1,\dots, N .
$$

Given the classification of the points, estimate a model for each mode from the points associated to this mode:
$$
\g w_j = \arg\min_{\g w}\ \sum_{\{i : q_i = j\}} (y_i - \g w_j^T \g x_i )^2 ,
$$
with a least squares solution
$$
\g w_j = (\g X_j^T \g X_j)^{-1} \g X_j^T \g y_j
$$
where $\g X_j$ is the data submatrix with the $\g x_i$ such that $q_i=j$ as rows and $\g y_j$ contains the corresponding values of $y_i$.

Convergence can be detected either in terms of classification (when $q_i$ does not change for any $i$ between two iterations) or in terms of the parameter vectors (when none of the parameter vectors changed significantly).