Understanding Decision Trees

Fé Valvekens
4 min read · Jun 28, 2023


Decision trees are among the most widely used algorithms in machine learning. They are simple to use and perform well with large datasets.

Structure

A decision tree is an algorithm used in Supervised Learning (read a simple explanation in my post on Supervised and Unsupervised Learning). It is like a flow chart with decision nodes that split into child nodes.

Decision trees are simple to understand and interpret. Each node is a test on a feature (e.g. is it morning?) and each branch is an outcome of the test (e.g. yes or no). The top node is called the root node. The bottom nodes that do not split any further are the leaves. The various paths from the root to a leaf are the decision rules.
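
For instance, here is a minimal sketch of how you could print those decision rules in code. The use of scikit-learn and its built-in iris dataset is my own illustrative choice, not something from this post:

```python
# A minimal sketch: print the decision rules of a small fitted tree.
# scikit-learn and the iris dataset are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(data.data, data.target)

# Each printed path from the first test (the root) down to a leaf
# is one decision rule.
print(export_text(tree, feature_names=list(data.feature_names)))
```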

Application

Decision trees can be used to solve both Regression and Classification problems. For example, if you want to predict the weight of a person, or any other continuous numeric quantity, then this is a regression problem. On the other hand, if you want to classify whether an animal is a bird or a reptile, or assign any other category, then this is a classification problem.
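
As a hedged sketch of the two cases (the toy data and the scikit-learn estimators below are my own illustration, not from this post), the same family of models covers both:

```python
# Illustrative toy data: the same tree family handles both problem types.
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification: predict a category from two made-up features.
X_cls = [[0, 1], [1, 0], [0, 0], [1, 1]]
y_cls = ["bird", "reptile", "bird", "reptile"]
clf = DecisionTreeClassifier(random_state=0).fit(X_cls, y_cls)
print(clf.predict([[0, 1]]))      # predicts a class label

# Regression: predict a continuous quantity (e.g. weight in kg)
# from made-up height and age values.
X_reg = [[150, 12], [165, 25], [180, 30], [175, 40]]
y_reg = [45.0, 60.0, 80.0, 78.0]
reg = DecisionTreeRegressor(random_state=0).fit(X_reg, y_reg)
print(reg.predict([[170, 28]]))   # predicts a numeric value
```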

How to build a decision tree

You pick a feature (the root node), you split the data on that feature into disjoint groups (no data point belongs to both sides of the split), and you define a new decision rule. You repeat the process until each leaf node is pure or homogeneous (all the data points in a leaf node belong to the same class). This process is called recursive partitioning.
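
Below is a simplified, self-contained sketch of recursive partitioning on numeric features. It is my own illustration rather than a reference implementation: the helper names are hypothetical, and it greedily scores candidate splits with the Gini impurity discussed in the next section.

```python
# A simplified sketch of recursive partitioning on numeric features.
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (see the Gini section below)."""
    total = len(labels)
    return 1 - sum((n / total) ** 2 for n in Counter(labels).values())

def build_tree(rows, labels):
    # Stop when the node is pure: every data point has the same class.
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}

    best = None
    n = len(rows)
    for feature in range(len(rows[0])):
        for threshold in {row[feature] for row in rows}:
            left = [i for i in range(n) if rows[i][feature] <= threshold]
            right = [i for i in range(n) if rows[i][feature] > threshold]
            if not left or not right:
                continue  # a valid split must send data to both sides
            # Score the split by the weighted impurity of the two children.
            score = (len(left) * gini([labels[i] for i in left])
                     + len(right) * gini([labels[i] for i in right])) / n
            if best is None or score < best[0]:
                best = (score, feature, threshold, left, right)

    if best is None:  # identical rows with mixed labels: fall back to a majority leaf
        return {"leaf": Counter(labels).most_common(1)[0][0]}

    _, feature, threshold, left, right = best
    return {
        "rule": (feature, threshold),  # decision rule: is feature <= threshold?
        "left": build_tree([rows[i] for i in left], [labels[i] for i in left]),
        "right": build_tree([rows[i] for i in right], [labels[i] for i in right]),
    }

# Tiny usage example with made-up data:
print(build_tree([[2.0], [3.0], [10.0]], ["reptile", "reptile", "bird"]))
```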

How to optimise the split?

In order to find the best split in a decision tree, we can use scoring metrics such as Entropy and Gini Impurity, which help us rank the candidate features.

Entropy and Information Gain

Entropy is a measure of the uncertainty in a random variable. If X is a random variable with probability mass function p(X), then the entropy of X is:

H(X) = E[−log₂ p(X)] = −Σₓ p(x) log₂ p(x)

E represents the expected value, and we use the logarithm to base 2.

In decision trees, we are interested in the entropy of the target variable for a given split. This is called the conditional entropy H(T|a): the entropy of the target T given the split on attribute a.

The Information Gain (IG) is the difference between the entropy of the parent node and the (weighted) conditional entropy of the child nodes for a given split:

IG(T, a) = H(T) − H(T|a)

The lower the entropy of the child nodes, the larger the information gain is. In other words, information gain is the reduction in entropy. We split the nodes at the most informative features by maximising the information gain at the split.
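
As a small, hedged illustration of these two quantities (the function names and toy labels below are made up for this sketch):

```python
# Entropy and information gain computed by hand on toy labels.
import math
from collections import Counter

def entropy(labels):
    """H = -sum over classes of p(c) * log2 p(c)."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(parent_labels, children_labels):
    """IG = H(parent) minus the weighted average entropy of the child nodes."""
    total = len(parent_labels)
    weighted = sum(len(child) / total * entropy(child) for child in children_labels)
    return entropy(parent_labels) - weighted

# A mixed parent node split into two pure children:
parent = ["bird", "bird", "reptile", "reptile"]
children = [["bird", "bird"], ["reptile", "reptile"]]
print(entropy(parent))                     # 1.0: maximum uncertainty for two balanced classes
print(information_gain(parent, children))  # 1.0: a perfect split removes all the uncertainty
```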

Gini Impurity

Another metric for judging how well a decision tree splits is the Gini impurity, which tells us the probability of misclassifying a randomly chosen data point if we labelled it at random according to the class distribution in the node. The lower the Gini impurity, the better the split; in other words, the lower the likelihood of misclassification. If we have a dataset with classes i = 1, …, C, and P(i) is the probability of picking a data point of class i, then the Gini impurity is:

G = Σᵢ P(i)(1 − P(i)) = 1 − Σᵢ P(i)²
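
Here is a small sketch of that formula in code (the example labels are made up for illustration):

```python
# Gini impurity of a node, matching G = 1 - sum over classes of P(i)^2.
from collections import Counter

def gini_impurity(labels):
    total = len(labels)
    return 1 - sum((n / total) ** 2 for n in Counter(labels).values())

print(gini_impurity(["bird"] * 4))                            # 0.0: a pure node
print(gini_impurity(["bird", "bird", "reptile", "reptile"]))  # 0.5: worst case for two classes
```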

Advantages

Compared to other Supervised learning algorithms, decision trees have many advantages:

  • simple to understand and explain (interpretability)
  • able to handle both numerical and categorical data
  • perform well with large datasets
  • require little data preparation
  • naturally de-emphasise irrelevant features (no need for PCA)

Limitations

However, as my teacher says, “there is no free lunch”. Here are some of the limitations of decision trees:

  • a slight change in the training data can result in a big change in the tree, and therefore in its predictions (not robust)
  • the algorithm doesn’t guarantee the globally optimal decision tree (it is a greedy algorithm)
  • overfitting: they fit the training data well but struggle on new (unseen) samples; in other words, they don’t generalise as we increase the complexity (depth of the tree, number of features); limiting the depth, as sketched just below, is one common way to rein this in
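
As a rough sketch of that last point (scikit-learn, its built-in iris dataset, and the particular depth value are my own illustrative choices, not from this post):

```python
# Comparing an unconstrained tree with a depth-limited one.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)             # grown until pure
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# The unconstrained tree typically scores perfectly on the data it has seen;
# the test scores show how well each tree generalises to unseen samples.
print(deep.score(X_train, y_train), deep.score(X_test, y_test))
print(shallow.score(X_train, y_train), shallow.score(X_test, y_test))
```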

In conclusion, it is easy to understand why decision trees are among the most popular supervised learning algorithms, given their simplicity and interpretability. Their limitations, however, lead us to my next post: Wandering in Random Forests.
