Decision trees

In words...

Decision trees are classifiers that partition the input space into a set of rectangles and assign the same label to all input vectors falling in a rectangle. Similar ideas are used by regression trees for regression.

Such partitions can be represented by a binary tree, where each node splits a single feature (a component of the input vector) into two regions above and below a splitting threshold. Given an input vector, the classification is easily computed by propagating the vector from the root node to a leaf of the tree according to the tests on the features implemented at each node. In this representation, each node corresponds to a rectangular region of the input space: the rectangles of two children nodes form a partition of the rectangle of the parent node and the union of all leaf node rectangles form the final partition of the input space.

Training a decision tree requires to grow the tree and assign labels to its leaves. To grow a tree, we start with an empty tree and choose a feature to be split by the root node and a threshold, thus creating a node with two branches. Then, we recursively apply this scheme to every branch of the tree. At each node, the choice of feature to split and threshold is guided by the minimization of the error on the training set, as in the empirical risk minimization principle. More precisely, we work at each node with the subset of training data that propagate to this node from the root node (for which we use the whole training set), i.e., that fall within the corresponding rectangle. Based on this subset of data, we choose the feature and threshold that maximize the label coherence in the two branches, in order to quickly obtain a node associated with a subset of data with the same label. When this occurs, there is no need to further divide the subset and the corresponding rectangular region is associated with that label.

In practice, one typically allows for a certain level of node impurity at the leaves of the tree, such that the label of a leaf is the label of the majority of training data associated to this leaf.

Since the precise values of the input variables are not explicitly used (we only test whether they are above or below a threshold), decision trees easily extend to qualitative input variables taking a finite number of discrete values.

In pictures...

Partitioning a 2D input space with a decision tree

Below is a decision tree trained on the data set shown on the right.

Hover over the nodes of the tree to visualize the corresponding rectangular region in input space.

You can add points to the data set (choose the label with the mouse wheel and click to add a point) or
Then, or

In maths...

A decision tree is a classifier that partitions the input space $\X\subset\R^d$ into a set of rectangular regions $R_j$, $j=1,\dots,r$, and assigns the label $l_j$ to all input vectors $\g x_i$ falling in a region $R_j$: $$ f(\g x) = \sum_{j=1}^r l_j \I{\g x \in R_j} . $$ By restricting the regions to be (hyper)rectangles, the partition can be conveniently represented by a binary tree, in which each node $n_k$ implements a test on a single feature as $$ x_{j_k} \overset{?}{\leq} s_k $$ which splits the $j_k$-axis according to the threshold $s_k$ to create two branches. Therefore, each node in the tree already corresponds to a rectangular region $R_{n_k}$ of the input space, while its two children nodes correspond to two complementary subsets of this region.

Propagating the training set $\{(\g x_i, y_i)\}_{i=1}^N$ down the tree, we obtain, at each node $n_k$, subsets of data from which we can compute a preliminary classification based on a majority vote for the input vectors falling within the corresponding region: $$ \forall \g x \in R_{n_k} ,\quad f_{n_k}(\g x) = l_{n_k} = \arg\max_{y\in\Y}\ \sum_{\g x_i \in R_{n_k}} \I{y_i = y} %= \arg\max_{y\in\Y}\ \sum_{\g i=1}^N \I{\g x_i\in R_{n_k}} \I{y_i = y} $$ which is a single constant label.

The same rule is used to define the labels associated with the leaves of the tree yielding the final classification. Therefore, in order to minimize the error while growing the tree, the splitting features $j_k$ and thresholds $s_k$ are determined during the training phase such that the subsets of data associated to the branches are maximally coherent in terms of label.

Formally, we define the node impurity as the "local" misclassification error, $$ I(n_k) = \frac{1}{N_k} \sum_{\g x_i \in R_{n_k} } \I{y_i \neq l_{n_k}} $$ where $N_k = \sum_{i=1}^N \I{\g x_i\in R_{n_k}}$ is the number of training data associated to the node $n_k$, and use $$ (j_k, s_k) = \arg\min_{j,s}\ N_k^1 I(n_k^1) + N_k^2 I(n_k^2) $$ with $n_k^1$ and $n_k^2$ the two children nodes of $n_k$ having $N_k^1$ and $N_k^2$ associated data, respectively. Note that the node impurity can be defined in different ways, for instance using the Gini index or the cross-entropy.

For any fixed $j$, the optimal threshold $s$ can be quickly determined by noting that it suffices to test the $N_k$ values of $x_j$ found in the data set: $$ s_k \in \{ x_{ij_k} : \g x_i \in R_{n_k} \}. $$ Therefore, by scanning over all $j \in\{1,\dots,d\}$ and all $\g x_i\in R_{n_k}$, we can solve the minimization over $(j,s)$ above.

Training algorithm

The following pseudo code can be used to train a decision tree.
// D:  	 training set (x_i, y_i), i=1,...,N.
// Nmin: threshold on the number of errors allowed in a leaf

tree = createNode( D, [1,...,N] )

FUNCTION createNode ( D, indexes ) 	
	// D: 		entire training set (x_i, y_i), i=1,...,N.
	// indexes:	list of indexes of the training data associated to the current node and to be splitted
	// Compute the node label
	node.label = argmax_y sum_{i in indexes} (y_i == y)
	IF sum_{i in indexes} (y_i != node.label) > Nmin	// otherwise the node is a leaf that need not be splitted
		// Compute the best split for this node
		best_I = Infinity	// best sum of children nodes impurity found so far
		FOR j = 1...d		// scan all features
			FOR i in indexes	// for each feature, test all thresholds s leading to different splits
				s = x_ij	// the thresholds can be chosen from the values of the jth feature in the data set
				[idx1, idx2] = split( D, indexes, j, s )	// split the data set according to the candidate (j,s)
				I1 = impurity(D, idx1)	// compute node impurity of the resulting children
				I2 = impurity(D, idx2)	
				IF I1*idx1.length + I2*idx2.length < best_I	// and check if it is better than the previous ones
					best_j = j
					best_s = s
					best_I = I1*idx1.length + I2*idx2.length
		// Create the node with its children	
		node.j = best_j
		node.s = best_s
		[idx1, idx2] = split( D, indexes, best_j, best_s )	// split the data set according to the best (j,s)
		node.child1 = createNode(D, idx1 )
		node.child2 = createNode(D, idx2 )