A Great Explanation of Decision Trees

Justin Dixon
9 min read · May 24, 2018


“A close-up of a tree and its leaves with the sun leaking through” by D. Jameson RAGE on Unsplash

Decision trees are an important machine learning technique that many would like to understand. There is a plethora of material already available to help, but it often leaves a learner more confused than when they started. At least, that was the case for me.

When attempting to learn about topics that involve mathematics, I require numerically worked examples. For a novice, worked examples illustrate the five Ws like no formal definition ever could. Unfortunately, many instructors seem to have an aversion to getting their hands dirty and actually working through a problem. This post is an attempt to help rectify that.

The structure of this post is as follows:

  • Examples of decision trees in use.
  • The advantages and disadvantages of the algorithm.
  • The presentation of the actual algorithm.
  • A hand-worked numerical example.
  • Illustrated computer package examples for both classification and regression.

Examples

Decision trees are used in a wide variety of fields. These examples are credited to Michael Dorner:

  • Astronomy: Distinguishing between stars and cosmic rays in images collected by the Hubble Space Telescope.
  • Medicine: Diagnosing ovarian cancer.
  • Economics: Stock trading.
  • Geography: To predict and correct errors in topographical and geological data.

Advantages:

  • Easy to understand: When a decision tree is built you can view the decision rules in a nice looking tree diagram, hence the name!
  • Resistant to outliers, weak features, and missing values: The splitting criterion does not care greatly how far values are from the decision boundary, and it splits on the strongest features first, which minimises the harm done by weak features.
  • Easy to implement in practice: Because the model is resistant to outliers and weak features, in practice you do not need to spend as much time testing different combinations of input features.

Disadvantages:

  • Overfits the training data: This is a very large issue but can be minimised using random forests.
  • Struggles with continuous dependent variables: Because each leaf contains several observations that are averaged, the prediction surface is not smooth. This makes highly accurate regression predictions difficult to achieve.
  • Needs important variables: Without strong predictors, tree-based methods lose many of their strengths.
  • Trees are unstable: The structure of an estimated tree can vary significantly between different estimations.

The Algorithm

  1. Start at the root node.
  2. For each input, find the potential split that minimises the sum of the node impurities in the two child nodes and from these splits choose the one that gives the overall minimum.
  3. If a stopping criterion is reached then stop growing the tree. Otherwise, apply step 2 to each child node in turn.
  4. (Optional). Once a stopping criterion has been reached you can apply a pruning criterion to cut back the splits of the tree.

Splitting criteria — Gini Impurity

There are several splitting criteria that can be used but I will be using the Gini Impurity method for this worked example. We can define the Gini Impurity with region m and class k as:
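
Following the notation of Hastie et al., let $\hat{p}_{mk}$ be the proportion of observations in region $R_m$ (containing $N_m$ observations) that belong to class $k$. The Gini Impurity of region $m$ is then

$Q_m = \sum_{k=1}^{K} \hat{p}_{mk}\,(1 - \hat{p}_{mk})$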

Credit for notation to Hastie et al (2017).

In words, the Gini Impurity measures how mixed the classes are within a region. The higher the number, the greater the impurity.
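
As a quick sanity check, this quantity can be computed directly in R from a vector of class labels (a minimal sketch; gini_impurity is a helper name introduced here, not part of any package):

gini_impurity <- function(classes) {       # classes: a vector or factor of class labels
  p <- table(classes) / length(classes)    # proportion of each class in the region
  sum(p * (1 - p))                         # Gini Impurity: sum over classes of p_k * (1 - p_k)
}
gini_impurity(c(0, 0, 1, 1))               # maximally mixed two-class region: 0.5
gini_impurity(c(0, 0, 0, 0))               # pure region: 0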

Stopping criteria — Minimum Leaf Size

There are several possible rules for when splitting should stop. These possibilities include:

  • If a split region only has identical values of the dependent variable then that region will not be split any further.
  • If all cases in a node have identical values for each predictor, the node will not be split.
  • If the current tree depth reaches the user-specified maximum tree depth limit value, the tree growing process will stop.
  • If the size of a node is less than the user-specified minimum node size value, the node will not be split.
  • If the split of a node results in a child node whose node size is less than the user-specified minimum child node size value, the node will not be split.

Other stopping criteria could include an error minimisation rule, but this tends to miss important splits that could happen. That is why it is preferable to grow the tree and then prune it back to an acceptable level. For this tutorial a minimum leaf size of 5 was chosen.
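
For reference, when we later fit trees with the rpart package, a minimum leaf size like this corresponds to the minbucket argument of rpart.control (a sketch, assuming rpart is installed):

library(rpart)                          # load the rpart package
ctrl <- rpart.control(minbucket = 5)    # no terminal node may contain fewer than 5 observations
# ctrl can then be passed to rpart() through its control argument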

Pruning Criteria — Misclassification Rate
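
Using the same notation, with $k(m)$ the majority class in region $R_m$, the misclassification rate of a region is simply the proportion of observations that do not belong to that class:

$E_m = \frac{1}{N_m} \sum_{i \in R_m} I(y_i \neq k(m)) = 1 - \hat{p}_{m\,k(m)}$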

Credit for notation to Hastie et al (2017).

Worked Example

Let us begin with a simple worked example. It always helps to build intuition by seeing the basics of the method in use.

Let us construct a simple dataset to work the problem by hand. The dataset will be ten observations of two classes, 0 or 1, with two predictors, X1 and X2.


Class <- as.factor(c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1))  # The 2-class vector
X1 <- c(4, 4.5, 5, 5.5, 3, 5.6, 6, 6.5, 6.2, 5.9)    # Values for predictor 1
X2 <- c(9, 10, 11, 10, 9, 8, 7, 8, 7, 8)             # Similarly for predictor 2
df <- cbind.data.frame(Class, X1, X2)                # Combine the class vector and the two predictors
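
To see the data, we can plot X2 against X1 and colour the points by class (one way to draw the graph referred to below, using ggplot2):

library(ggplot2)                                    # assuming ggplot2 is installed
ggplot(df, aes(x = X1, y = X2, colour = Class)) +   # the two predictors, coloured by class
  geom_point(size = 4)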

From the graph it is obvious how to split the data, but let us work it out using the algorithm to see how it operates. First, we will calculate the Gini Impurity for each possible split in the range of each predictor.


Predictor1test <- seq(from = 3, to = 7, by = 0.1)   # The potential splits at which we calculate the Gini Impurity
Predictor2test <- seq(from = 7, to = 11, by = 0.1)  # Similarly for predictor 2
CalculateP <- function(i, index, m, k) {            # Proportion of observations in split region (m) that belong to class (k)
  if (m == "L") {
    Nm <- length(df$Class[which(df[, index] <= i)]) # The number of observations in the split region Rm
    Count <- df$Class[which(df[, index] <= i)] == k # Which of those observations match class k
  } else {
    Nm <- length(df$Class[which(df[, index] > i)])
    Count <- df$Class[which(df[, index] > i)] == k
  }
  P <- length(Count[Count == TRUE]) / Nm            # Proportion calculation
  return(c(P, Nm))                                  # Returns both the proportion and the number of observations
}
CalculateGini <- function(x, index) {               # Gini Impurity for every candidate split in x
  Gini <- NULL
  for (i in x) {
    pl0 <- CalculateP(i, index, "L", 0)             # Proportion of class 0 in the left region
    pl1 <- CalculateP(i, index, "L", 1)
    GiniL <- pl0[1] * (1 - pl0[1]) + pl1[1] * (1 - pl1[1])  # The Gini for the left region
    pr0 <- CalculateP(i, index, "R", 0)
    pr1 <- CalculateP(i, index, "R", 1)
    GiniR <- pr0[1] * (1 - pr0[1]) + pr1[1] * (1 - pr1[1])
    Gini <- rbind(Gini, sum(GiniL * pl0[2] / (pl0[2] + pr0[2]),
                            GiniR * pr0[2] / (pl0[2] + pr0[2]), na.rm = TRUE))  # Weight the left and right Gini scores when combining them
  }
  return(Gini)
}
Gini <- CalculateGini(Predictor1test, 2)
Predictor1test <- cbind.data.frame(Predictor1test, Gini)
Gini <- CalculateGini(Predictor2test, 3)
Predictor2test <- cbind.data.frame(Predictor2test, Gini)
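
We can inspect which candidate split achieves the lowest Gini Impurity for each predictor (a small convenience check added here):

Predictor1test[which.min(Predictor1test[, 2]), ]    # first candidate split with the lowest Gini for X1
Predictor2test[which.min(Predictor2test[, 2]), ]    # and for X2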

We can see that the purest split for predictor X1 is at 5.5. All other splits leave some impurity in the resulting regions.

Here we can see a region where the Gini Impurity is minimised. Any value here would be suitable. Now we can observe a hand calculation of the Gini Impurity for X1 = 5.5.
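
For the split at $X1 \le 5.5$, the left region contains the five class-0 observations and the right region contains the five class-1 observations, so

$Q_L = 1\,(1-1) + 0\,(1-0) = 0, \qquad Q_R = 0\,(1-0) + 1\,(1-1) = 0$

and the weighted impurity of the split is $\frac{5}{10}\cdot 0 + \frac{5}{10}\cdot 0 = 0$: a perfectly pure split.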

With a minimum of 5 observations per leaf we have already reached the stopping criterion, but for the sake of illustration let us calculate the misclassification rate for each of our potential splits. Also, due to the purity of the split, we do not need to prune the tree.


CalculatePkm <- function(i, index, m) {             # Unlike CalculateP, this calculates the proportion belonging to the majority class only
  if (m == "L") {
    Nm <- length(df$Class[which(df[, index] <= i)])
    Km <- as.integer(names(sort(table(df$Class[which(df[, index] <= i)]), decreasing = TRUE)[1]))
    Count <- df$Class[which(df[, index] <= i)] == Km
  } else {
    Nm <- length(df$Class[which(df[, index] > i)])
    Km <- as.integer(names(sort(table(df$Class[which(df[, index] > i)]), decreasing = TRUE)[1]))
    Count <- df$Class[which(df[, index] > i)] == Km
  }
  P <- length(Count[Count == TRUE]) / Nm
  return(c(P, Nm))
}
CalculateMissClass <- function(x, index) {          # Weighted misclassification rate for every candidate split in x
  miserr <- NULL
  for (i in x) {
    pLkm <- CalculatePkm(i, index, "L")
    missclassL <- (1 - pLkm[1])
    pRkm <- CalculatePkm(i, index, "R")
    missclassR <- (1 - pRkm[1])
    miserr <- rbind(miserr, sum(missclassL * pLkm[2] / (pLkm[2] + pRkm[2]),
                                missclassR * pRkm[2] / (pLkm[2] + pRkm[2]), na.rm = TRUE))
  }
  return(miserr)
}
miserr <- CalculateMissClass(Predictor1test[, 1], 2)
Predictor1test <- cbind.data.frame(Predictor1test, miserr)
miserr <- CalculateMissClass(Predictor2test[, 1], 3)
Predictor2test <- cbind.data.frame(Predictor2test, miserr)

Similar to the Gini Impurity case, we can see the regions where the measure is minimised. Now we can go through the calculation by hand and see it in action.
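
For the same split at $X1 \le 5.5$, the majority class is 0 on the left and 1 on the right, so the misclassification rate is $1 - 1 = 0$ in each region, and the weighted misclassification error of the split is also 0.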

With such a small dataset it does not make sense to prune the tree, so let us continue to an example with real data.

Example: Classification Tree

For this example we will use the iris dataset that comes packaged with R. There are three species of iris flower: setosa, versicolor, and virginica. We will use the classification tree process to separate the species using the features Sepal Length, Sepal Width, Petal Length, and Petal Width.

Species of Iris with each Predictor Combination

From the data you can see that Setosa can be separated easily from the other two species. Try it yourself: where would you draw the line to separate Setosa?

Build the tree

Now that we have worked through how the classification tree is grown we can resort to using already established packages to estimate our decision tree for us. I will be using the rpart package for this post but there are many other packages that can estimate trees.


library(rpart)                                                       # the rpart package fits CART-style trees
Tree <- rpart(Species ~ ., data = iris, parms = list(split = "gini"))

The Splits

The tree growing process can involve many more splits than the single split from our hand worked problem. Here we can see all the splits that were done when growing this tree.
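
One way to list them is to inspect the fitted rpart object itself, either through its summary or its splits component:

summary(Tree)   # prints each node with its primary, competitor, and surrogate splits
Tree$splits     # the candidate split points and their improvement scores as a matrix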

There are quite a few splits here! So let us look at the two most important splits, Petal.Length = 2.5 and Petal.Width = 0.8.

Where the split occurs at Petal.Length = 2.5

Where the split occurs at Petal.Width = 0.8

The Final Pruned Tree

This is the final tree diagram. You may notice that the tree has been pruned to have only two split nodes. Indeed, we can see that the split for Petal.Width has moved to 1.8 instead of 0.8 in the final decision tree.
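
To reproduce a diagram like this, the base plot method for rpart objects works, and the optional rpart.plot package (an add-on, assuming it is installed) produces a prettier version:

plot(Tree, margin = 0.1)           # base R tree diagram
text(Tree, use.n = TRUE)           # add the split labels and node counts
# rpart.plot::rpart.plot(Tree)     # alternative: a nicer diagram via the rpart.plot package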

Example: Regression Tree

For this example I will be providing less explanation.

Splitting Criteria — Minimising The Sum Of Squares
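
For a candidate split of variable $j$ at point $s$, define the half-planes $R_1(j, s) = \{X \mid X_j \le s\}$ and $R_2(j, s) = \{X \mid X_j > s\}$. The split chosen is the one that solves

$\min_{j,\,s} \Big[ \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2 \Big]$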

Credit for notation to Hastie et al (2017).

For each variable, the equation calculates the sum of squared errors of every candidate split. The value of $c_{m}$ is simply the average of the observations in that region. The algorithm finds the split with the smallest sum of squares for each variable, then compares the champion split from each variable to determine the winner for the overall split. The stopping criterion will once again be a minimum terminal node size of 5.

Pruning Criteria — Cost Complexity Criteria

Let:
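
$|T|$ denote the number of terminal nodes of a subtree $T$, $N_m$ the number of observations falling in terminal node $m$, and $Q_m(T)$ the impurity (for regression, the mean squared error) of that node. The cost complexity criterion is then

$C_\alpha(T) = \sum_{m=1}^{|T|} N_m Q_m(T) + \alpha |T|$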

Credit for notation to Hastie et al (2017).

Our goal is to minimise this function. The compromise between tree size and its fit to the data is dictated by $\alpha$. Smaller values lead to larger trees and larger values lead to more pruning.

There are processes, such as cross-validation, by which to estimate $\alpha$, but we will not go into that today.

The Data: USA Arrests


library(ggplot2)                                     # for the scatter plot
ggplot(data = USArrests, aes(x = Assault, y = UrbanPop)) +
  geom_point(aes(color = Murder), size = 6, alpha = .5)

RegressionTree <- rpart(Murder ~ Assault + UrbanPop, data = USArrests)
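
As with the classification tree, the fitted object can be inspected and drawn using the same rpart tools as before (a brief sketch):

printcp(RegressionTree)               # the sequence of splits and cross-validated error at each tree size
plot(RegressionTree, margin = 0.1)    # draw the regression tree
text(RegressionTree, use.n = TRUE)    # label the splits and show node counts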

The First 4 Splits Animated

The Constructed Tree

Summary

In this post we looked at decision trees, worked through a numerical example by hand, and finally observed both classification trees and regression trees grown using computer packages.
