Regression trees

In words...

Regression trees are the counterpart of decision trees for regression. They partition the input space into a set of rectangles and assign the same label (a constant predicted value) to all input vectors falling in a rectangle, thus producing a piecewise constant model. They can, however, easily be modified to produce piecewise linear models.

A partition built by a regression tree can be represented by a binary tree, in which each node splits the range of a single feature (a component of the input vector) into two regions, above and below a splitting threshold. Given an input vector, the prediction is easily computed by propagating the vector from the root node to a leaf of the tree according to the tests on the features implemented at each node. In this representation, each node corresponds to a rectangular region of the input space: the rectangles of the two child nodes form a partition of the rectangle of the parent node, and the set of all leaf node rectangles forms the final partition of the input space.
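
To make this propagation concrete, here is a minimal Python sketch (the Node class and its field names are our own illustration, not a prescribed implementation) of how a prediction is computed by walking down such a binary tree.

# Minimal sketch: prediction with a regression tree stored as a binary tree.
# Internal nodes carry a feature index j and a threshold s; leaves carry a constant label.
class Node:
    def __init__(self, j=None, s=None, left=None, right=None, label=None):
        self.j = j          # index of the tested feature
        self.s = s          # splitting threshold
        self.left = left    # child for x[j] <= s
        self.right = right  # child for x[j] > s
        self.label = label  # constant prediction (leaves only)

def predict(node, x):
    """Propagate the input vector x from the root down to a leaf."""
    while node.label is None:                 # internal node: apply the test x[j] <= s
        node = node.left if x[node.j] <= node.s else node.right
    return node.label                         # leaf: return its constant label

# Example: split feature 0 at 0.5, then feature 1 at 2.0 in the right branch.
tree = Node(j=0, s=0.5,
            left=Node(label=1.0),
            right=Node(j=1, s=2.0, left=Node(label=3.0), right=Node(label=5.0)))
print(predict(tree, [0.2, 1.0]))   # 1.0
print(predict(tree, [0.8, 2.5]))   # 5.0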

Training a regression tree requires growing the tree and assigning labels to its leaves. To grow a tree, we start from a single root node and choose a feature to be split by this node together with a threshold, thus creating a node with two branches. Then, we recursively apply this scheme to every branch of the tree. At each node, the choice of the feature to split and of the threshold is guided by the minimization of the error on the training set, as in the empirical risk minimization principle. More precisely, at each node we work with the subset of training data that propagates to this node from the root node (at the root, this is the whole training set), i.e., the data that fall within the corresponding rectangle. Based on this subset of data, we choose the feature and threshold that maximize the label coherence in the two branches, in order to quickly obtain nodes associated with subsets of data sharing approximately the same label. When this occurs, there is no need to further divide the subset and the corresponding rectangular region is assigned that label.

Since the precise values of the input variables are not explicitly used (we only test whether they are above or below a threshold), regression trees easily extend to qualitative input variables taking a finite number of discrete values.

In pictures...

Partitioning a 1D input space with a regression tree

Partitioning a 2D input space with a regression tree

Below is a regression tree trained on the data set shown on the right.

Hover over the nodes of the tree to visualize the corresponding rectangular region in input space.


You can add points to the data set (choose the label with the mouse wheel and click to add a point).

In maths...

A regression tree is a predictor that partitions the input space $\X\subset\R^d$ into a set of rectangular regions $R_j$, $j=1,\dots,r$, and assigns the label $l_j$ to all input vectors $\g x$ falling in a region $R_j$: $$ f(\g x) = \sum_{j=1}^r l_j \I{\g x \in R_j} . $$ By restricting the regions to be (hyper)rectangles, the partition can be conveniently represented by a binary tree, in which each node $n_k$ implements a test on a single feature as $$ x_{j_k} \overset{?}{\leq} s_k $$ which splits the $j_k$-axis according to the threshold $s_k$ to create two branches. Therefore, each node in the tree also corresponds to a rectangular region $R_{n_k}$ of the input space, while its two child nodes correspond to two complementary subsets of this region.
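
For example, in dimension $d=1$ with $r=2$ regions $R_1 = \{x_1 \leq 3\}$ and $R_2 = \{x_1 > 3\}$ and labels $l_1 = 2$, $l_2 = 5$, the model is $$ f(\g x) = 2\, \I{x_1 \leq 3} + 5\, \I{x_1 > 3} , $$ i.e., a piecewise constant function represented by a tree with a single root node testing $x_1 \overset{?}{\leq} 3$ and two leaves labeled $2$ and $5$.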

Propagating the training set $\{(\g x_i, y_i)\}_{i=1}^N$ down the tree, we obtain, at each node $n_k$, a subset of data from which we can compute a preliminary prediction, given by the average label of the training points falling within the corresponding region: $$ \forall \g x \in R_{n_k} ,\quad f_{n_k}(\g x) = l_{n_k} = \frac{1}{N_k} \sum_{\g x_i \in R_{n_k}} y_i $$ where $N_k = \sum_{i=1}^N \I{\g x_i\in R_{n_k}}$ is the number of training data associated with the node $n_k$.

The same rule is used to define the labels associated with the leaves of the tree, which yield the final prediction. Therefore, in order to minimize the error while growing the tree, the splitting features $j_k$ and thresholds $s_k$ are determined during the training phase such that the subsets of data associated with the branches are maximally coherent in terms of label.

Formally, we define the node impurity as the "local" mean squared error, $$ I(n_k) = \frac{1}{N_k} \sum_{\g x_i \in R_{n_k} } (y_i - l_{n_k} )^2 $$ and choose $$ (j_k, s_k) = \arg\min_{j,s}\ I(n_k^1) + I(n_k^2) $$ with $n_k^1$ and $n_k^2$ the two child nodes of $n_k$ induced by the candidate split $(j,s)$. Note that the node impurity can be defined in different ways; for classification trees, for instance, it is typically based on the Gini index or the cross-entropy.
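
As a small Python sketch (the function names are ours), the node label and the impurity of the corresponding subset of training data can be computed directly from the two formulas above.

# Sketch: node label (average of the targets) and node impurity (local mean
# squared error) for the subset of training data falling in a node's region.
def node_label(y_subset):
    """l_{n_k}: average label of the data associated with the node."""
    return sum(y_subset) / len(y_subset)

def node_impurity(y_subset):
    """I(n_k): mean squared error around the node label."""
    l = node_label(y_subset)
    return sum((y - l) ** 2 for y in y_subset) / len(y_subset)

# Toy subset of labels associated with some node.
y_subset = [1.0, 1.2, 0.8, 3.0]
print(node_label(y_subset))      # 1.5
print(node_impurity(y_subset))   # about 0.77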

For any fixed $j$, the optimal threshold $s$ can be quickly determined by noting that it suffices to test the (at most $N_k$ distinct) values of the $j$th feature observed in the subset of data associated with the node: $$ s_k \in \{ x_{ij_k} : \g x_i \in R_{n_k} \}. $$ Therefore, by scanning over all $j \in\{1,\dots,d\}$ and all $\g x_i\in R_{n_k}$, we can solve the minimization over $(j,s)$ above.
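
The following Python sketch (helper names are our own) isolates this exhaustive scan: for each feature $j$, only the values of that feature observed in the node's data are tried as thresholds, and the pair $(j, s)$ with the smallest sum of children impurities is returned.

# Sketch: exhaustive search for the best split (j, s) at a node.
# X is a list of input vectors (lists of d numbers) and y the list of labels,
# both restricted to the data falling in the node's region.
def impurity(ys):
    """Mean squared error of the labels around their mean."""
    m = sum(ys) / len(ys)
    return sum((v - m) ** 2 for v in ys) / len(ys)

def best_split(X, y):
    d = len(X[0])
    best = (None, None, float("inf"))         # (j, s, sum of children impurities)
    for j in range(d):                        # scan all features
        for s in {x[j] for x in X}:           # candidate thresholds = observed values
            left  = [yi for xi, yi in zip(X, y) if xi[j] <= s]
            right = [yi for xi, yi in zip(X, y) if xi[j] >  s]
            if not left or not right:         # skip degenerate splits with an empty child
                continue
            total = impurity(left) + impurity(right)
            if total < best[2]:
                best = (j, s, total)
    return best

X = [[0.1], [0.4], [0.9], [1.2]]
y = [1.0, 1.1, 3.0, 3.2]
print(best_split(X, y))   # (0, 0.4, ...): split feature 0 at threshold 0.4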

Training algorithm

The following pseudocode can be used to train a regression tree.
// D:  	 training set (x_i, y_i), i=1,...,N.
// epsilon: threshold on the mean square error allowed in a leaf

tree = createNode( D, [1,...,N] )


FUNCTION createNode ( D, indexes )
    // D:       entire training set (x_i, y_i), i=1,...,N.
    // indexes: list of indexes of the training data associated with the current node and to be split

    // Compute the node label as the average of the labels
    node.label = sum_{i in indexes} y_i / sum_{i in indexes} 1

    IF sum_{i in indexes} (y_i - node.label)^2 / sum_{i in indexes} 1 > epsilon    // otherwise the node is a leaf that need not be split
        // Compute the best split for this node

        best_I = Infinity    // best sum of children node impurities found so far
        FOR j = 1...d        // scan all features
            FOR i in indexes    // for each feature, test all thresholds s leading to different splits
                s = x_ij    // the thresholds can be chosen from the values of the jth feature in the data set
                [idx1, idx2] = split( D, indexes, j, s )    // split the data set according to the candidate (j,s)
                IF idx1 is not empty AND idx2 is not empty    // skip degenerate splits leaving a child empty
                    I1 = impurity(D, idx1)    // compute the impurity of the resulting children nodes
                    I2 = impurity(D, idx2)
                    IF I1 + I2 < best_I    // and keep the candidate if it is better than the previous ones
                        best_I = I1 + I2
                        best_j = j
                        best_s = s
                    ENDIF
                ENDIF
            ENDFOR
        ENDFOR
        IF best_I < Infinity    // a valid split was found
            // Create the node with its children
            node.j = best_j
            node.s = best_s
            [idx1, idx2] = split( D, indexes, best_j, best_s )    // split the data set according to the best (j,s)
            node.child1 = createNode( D, idx1 )
            node.child2 = createNode( D, idx2 )
        ENDIF
    ENDIF
RETURN node
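
For concreteness, here is a compact Python counterpart of this pseudocode (the dictionary-based tree structure and the names are our own choices, not a reference implementation); as in the pseudocode above, it skips degenerate splits that would leave a child empty.

# Sketch: training and prediction for a regression tree in Python. The tree is
# stored as nested dictionaries; leaves only carry a "label", internal nodes also
# carry the split ("j", "s") and two children. epsilon bounds the mean squared
# error allowed in a leaf.
def impurity(ys):
    m = sum(ys) / len(ys)
    return sum((v - m) ** 2 for v in ys) / len(ys)

def create_node(X, y, indexes, epsilon):
    ys = [y[i] for i in indexes]
    node = {"label": sum(ys) / len(ys)}            # node label = average of the targets
    if impurity(ys) <= epsilon:                    # pure enough: keep the node as a leaf
        return node
    d = len(X[0])
    best = (None, None, float("inf"))
    for j in range(d):                             # scan all features
        for s in {X[i][j] for i in indexes}:       # thresholds = observed feature values
            idx1 = [i for i in indexes if X[i][j] <= s]
            idx2 = [i for i in indexes if X[i][j] >  s]
            if not idx1 or not idx2:               # skip splits leaving a child empty
                continue
            total = impurity([y[i] for i in idx1]) + impurity([y[i] for i in idx2])
            if total < best[2]:
                best = (j, s, total)
    if best[0] is None:                            # no valid split found: leave as a leaf
        return node
    j, s, _ = best
    node["j"], node["s"] = j, s
    idx1 = [i for i in indexes if X[i][j] <= s]
    idx2 = [i for i in indexes if X[i][j] >  s]
    node["child1"] = create_node(X, y, idx1, epsilon)
    node["child2"] = create_node(X, y, idx2, epsilon)
    return node

def predict(node, x):
    while "j" in node:                             # internal node: apply the test x[j] <= s
        node = node["child1"] if x[node["j"]] <= node["s"] else node["child2"]
    return node["label"]                           # leaf: constant prediction

# Toy 1D example.
X = [[0.1], [0.3], [0.6], [0.9]]
y = [1.0, 1.2, 3.0, 3.1]
tree = create_node(X, y, list(range(len(X))), epsilon=0.01)
print(predict(tree, [0.2]), predict(tree, [0.8]))   # about 1.1 and 3.05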