Decision Trees

Joshua Pickard · Published in Geek Culture · Dec 27, 2021

Decision trees work by recursively splitting data on the elements of the feature vectors that carry the most information. Using ideas from information theory, we can define a simple model that is readily understandable by both humans and machines, even after training is complete.

A decision tree is a tree structure built from training data where each node denotes a test condition, the branches are defined based on the results of the tests, and the terminal nodes of the tree predict labels for data.

The test conditions are defined using information theoretic criteria. Since information theory handles both ordinal and categorical data, decision trees can be trained on either type of feature.

Example

Consider the problem of predicting whether a person is healthy based on three features: the person's age, whether they eat pizza, and whether they exercise. Age can be any real number, while eating pizza and exercising are binary (True/False) variables.

[Decision tree diagram: the root node tests A < 30; the left branch tests whether P = True (eats pizza) and the right branch tests whether E = True (exercises).]

We could use the decision tree above as a model for making these predictions. To illustrate how the tree works, consider a person described by variables A, P, and E, corresponding to age, pizza consumption, and exercise. Beginning at the root node, if A < 30, the decision tree follows the left branch to the next node, which tests whether P = True. Both outcomes of this test lead to terminal nodes, so the result is a prediction rather than another test. If P = True, the model predicts the person is unfit; otherwise, it predicts the person is fit. Returning to the root node, if A is not less than 30, the decision tree follows the right branch to a node that tests whether E = True. Again, both outcomes lead to terminal nodes: if E = True, the model predicts the person is fit; otherwise, it predicts the person is unfit.

In all cases the value of A is used by the model, but note that not all variables are considered for every person. If A < 30, the value of E is never examined; similarly, if A is not less than 30, the value of P is never examined. There is no general rule saying which variables a decision tree will use and which it will ignore; this falls out of how the tree is built, which is covered in the next sections.
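For concreteness, the tree in the example can be written out directly as a small function. This is a hand-coded sketch of the figure above, not a trained model; the argument names are chosen to mirror A, P, and E.

```python
def predict_fit(age, eats_pizza, exercises):
    """Hand-coded version of the example decision tree.

    Follows the same tests as the figure: the root tests age,
    the left subtree tests pizza consumption, and the right
    subtree tests exercise.
    """
    if age < 30:                      # root node: A < 30?
        return "unfit" if eats_pizza else "fit"
    else:
        return "fit" if exercises else "unfit"

# A 25-year-old who eats pizza is predicted unfit;
# a 40-year-old who exercises is predicted fit.
print(predict_fit(25, eats_pizza=True, exercises=False))   # unfit
print(predict_fit(40, eats_pizza=False, exercises=True))   # fit
```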

Iterative Dichotomiser 3 (ID3)

For decision trees, training is the process of building the tree based on training data. This requires recursively calculating the optimal test at each node in the tree. This is done by selecting the feature that has the largest information gain.
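To make "largest information gain" concrete, the snippet below computes the gain of one candidate split on a tiny, made-up set of labels; the data and numbers are for illustration only.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

# Toy labels: is each person fit, and do they exercise?
fit       = ["fit", "fit", "unfit", "unfit", "fit"]
exercises = [True,  True,  False,   False,   True]

# Information gain = entropy before the split - weighted entropy after it.
before = entropy(fit)
yes = [f for f, e in zip(fit, exercises) if e]
no  = [f for f, e in zip(fit, exercises) if not e]
after = (len(yes) / len(fit)) * entropy(yes) + (len(no) / len(fit)) * entropy(no)
print(before - after)  # ~0.971 bits: this split removes all uncertainty here
```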

Since building trees is a recursive process, it is important to consider the base cases of the recursion. These base cases construct terminal nodes of the tree, so they predict the labels of the data. These are the 3 base cases:

  1. If all feature vectors have the same label, then the node predicts the common label.
  2. If there are no remaining attributes to split the data on, then the node predicts the most common label.
  3. If there are no feature vectors in the node, then the node predicts the most common label seen by the node’s parent.

The ID3 algorithm, sketched below, executes the process described above. An interesting point is that only the first two base cases are checked at the start of each call; the third base case is checked just before a recursive call is made.
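The following is a minimal Python sketch of that process. It assumes categorical features stored as dictionaries, with an explicit set of possible values per attribute; these are implementation choices made for illustration, not part of the ID3 specification.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(rows, labels, attr):
    """Entropy reduction obtained by splitting the rows on one attribute."""
    total = len(labels)
    remainder = 0.0
    for value in {row[attr] for row in rows}:
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder

def build_tree(rows, labels, attributes, domains):
    """Recursively build an ID3-style tree as nested dictionaries."""
    # Base case 1: every example has the same label.
    if len(set(labels)) == 1:
        return labels[0]
    # Base case 2: no attributes left to split on -> predict the majority label.
    majority = Counter(labels).most_common(1)[0][0]
    if not attributes:
        return majority
    # Recursive case: split on the attribute with the largest information gain.
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attributes if a != best]
    tree = {best: {}}
    for value in domains[best]:
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        if not subset:
            # Base case 3, checked before recursing: no examples reach this
            # branch, so it becomes a leaf predicting the parent's majority label.
            tree[best][value] = majority
        else:
            sub_rows, sub_labels = zip(*subset)
            tree[best][value] = build_tree(list(sub_rows), list(sub_labels),
                                           remaining, domains)
    return tree

# Toy categorical data for illustration only (not the data behind the figure).
rows = [
    {"age<30": True,  "pizza": True,  "exercise": False},
    {"age<30": True,  "pizza": False, "exercise": True},
    {"age<30": False, "pizza": True,  "exercise": True},
    {"age<30": False, "pizza": False, "exercise": False},
]
labels = ["unfit", "fit", "fit", "unfit"]
domains = {"age<30": {True, False}, "pizza": {True, False}, "exercise": {True, False}}

print(build_tree(rows, labels, ["age<30", "pizza", "exercise"], domains))
# {'exercise': {False: 'unfit', True: 'fit'}}
```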

ID3 is the simplest algorithm for training decision trees. C4.5 and C5.0 are commonly used extensions of ID3.

Pruning

Decision trees are prone to overfitting. The complexity of a decision tree is a function of its depth, and, as with any model, increasing the complexity makes it more likely to overfit the training data.

To prevent this, decision trees are pruned to limit the size of the tree. Pruning can be guided by the following heuristics.

Maximum Depth

Setting a maximum depth prevents a decision tree from growing endlessly. Since the complexity of a decision tree is a function of its depth, this directly limits the complexity of the model. However, the limit can be detrimental if it keeps the model from becoming complex enough: while a maximum depth may prevent overfitting, it may also prevent the model from learning all significant aspects of the data.
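For example, scikit-learn's DecisionTreeClassifier exposes this limit through its max_depth parameter; the depth value and the synthetic data below are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the training set.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Cap the tree at depth 3 so it cannot keep growing indefinitely.
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0)
shallow_tree.fit(X, y)
print(shallow_tree.get_depth())  # at most 3
```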

Minimum Sample Size

A decision tree can be pruned so that every node is based on at least n samples from the training data. By preventing nodes from representing small subsets of the training data (subsets with fewer than n samples), the model should only capture aspects of the data that generalize beyond the training set.
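In scikit-learn, this heuristic corresponds to the min_samples_split and min_samples_leaf parameters; the thresholds below are arbitrary examples of n, not recommended values.

```python
from sklearn.tree import DecisionTreeClassifier

# Require at least 20 samples to split a node and at least 10 samples per leaf,
# so no node ends up representing only a handful of training examples.
tree = DecisionTreeClassifier(min_samples_split=20, min_samples_leaf=10)
```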

Significance Test

Significance pruning prevents a node from growing when the information criterion does not reach a threshold value. Each node in the decision tree represents a test; returning to the example, the root node's test is whether the person is under the age of 30. These tests are chosen using information theoretic criteria: ID3 uses information gain, but the Gini index and other measures can be used as well. When the maximum information gain from splitting on any feature at a given node is small, the model will not learn much from building a subtree there, so it makes sense to make the node terminal.
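scikit-learn implements a closely related idea through the min_impurity_decrease parameter, which skips a split when the weighted reduction in impurity falls below a threshold; the threshold used below is an arbitrary illustration.

```python
from sklearn.tree import DecisionTreeClassifier

# Splits that reduce the weighted impurity by less than 0.01 are not made,
# so nodes with little to gain from splitting stay terminal.
tree = DecisionTreeClassifier(criterion="entropy", min_impurity_decrease=0.01)
```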
