Decision Trees: Explained in Simple Steps

Manav · Published in Analytics Vidhya · Oct 25, 2020

Can’t understand this meme? Read the article :)

1. Introduction

Unlike what the meme above suggests, tree-based algorithms are pretty nifty when it comes to real-world scenarios. A Decision Tree is a supervised (labeled-data) machine learning algorithm that can be used for both classification and regression problems.

It’s similar to the tree data structure, which has a root and multiple other types of nodes (parent, child, and leaf).

In the following example, we have to approve a loan based on a person’s age, salary, and number of children. We ask a conditional question at each node and split accordingly, until we reach a decision at a leaf node (i.e. get loan / don’t get loan).

A simple decision tree
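To see why such a tree is just a chain of conditional checks, here is a minimal Python sketch of the loan tree above. The thresholds (age 30, salary 50,000, two children) are made up for illustration and are not taken from any trained model.

```python
def approve_loan(age: int, salary: float, children: int) -> str:
    """Toy decision tree for the loan example; every threshold is hypothetical."""
    if age < 30:                        # root node: question about age
        if salary > 50_000:             # child node: question about salary
            return "get loan"           # leaf node
        return "don't get loan"         # leaf node
    else:
        if children <= 2:               # child node: question about number of children
            return "get loan"           # leaf node
        return "don't get loan"         # leaf node


print(approve_loan(age=25, salary=60_000, children=1))  # -> get loan
```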

2. Why use Decision Trees?

Advantages

  • They have high interpretability, which makes them a go-to algorithm for real-world business applications: a tree can be explained as a series of questions or if-else statements.
  • The time required to make predictions is very low; a prediction is just the evaluation of a particular set of conditions for a given data point.
  • No missing-value imputation is required, as the algorithm can handle missing values on its own.

Disadvantages

  • They are unstable: a small change in the data can lead to a vastly different decision tree.
  • A single tree is often relatively inaccurate and prone to overfitting. This can be mitigated by replacing it with a random forest of decision trees, as in the sketch below.
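As a quick illustration of that last point, the sketch below fits a single decision tree and a random forest on a synthetic dataset using scikit-learn (assumed to be installed). It only shows how easily the two models are swapped; it is not a benchmark.

```python
# Sketch: swapping a single (possibly overfitting) tree for a random forest.
# The toy data is synthetic and exists purely for illustration.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print("single tree accuracy  :", tree.score(X_test, y_test))
print("random forest accuracy:", forest.score(X_test, y_test))
```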

3. Basic Intuition

Let’s try to build intuition by using an example. We’ll take a very popular one, where we have to decide whether we can play golf (target) on a particular day or not, based on the given weather conditions (predictors).

Golf and Weather example

In this example, the given predictors are categorical in nature and so is the target. The steps taken to build the decision tree were as follows:

  • First, we picked the Outlook feature as the root node and created a split for every value it takes. For the Overcast category, every outcome of the Play Golf (target) variable was “Yes”, which essentially means that whenever the Outlook was Overcast, we could play golf. Since a final decision is reached there, we make it a leaf node.
  • For the remaining categories, the outcome wasn’t as clear-cut as for Overcast, since it consisted of a mixture of yeses and nos. We split these nodes further on other columns, such as Windy or Humidity, until we were able to make a decision (i.e. whether we’ll be able to play golf or not). A short code sketch of this first split follows below.
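Here is a minimal pandas sketch of that first split. The handful of rows below is a made-up subset in the spirit of the golf table above, not the exact data from the figure; it only shows how grouping by Outlook reveals that the Overcast branch is already pure.

```python
import pandas as pd

# Hypothetical subset of the weather data, for illustration only.
golf = pd.DataFrame({
    "Outlook":   ["Sunny", "Sunny", "Overcast", "Overcast", "Rainy", "Rainy"],
    "Windy":     [False,   True,    False,      True,       False,   True],
    "Play Golf": ["Yes",   "No",    "Yes",      "Yes",      "Yes",   "No"],
})

# Class counts of the target within each Outlook branch.
print(golf.groupby("Outlook")["Play Golf"].value_counts())
# The Overcast branch contains only "Yes" -> it becomes a leaf node;
# Sunny and Rainy are still mixed -> they need further splits (here, Windy would separate them).
```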

After looking at the example above, the following questions might arise:

  • Which feature should we choose first?
  • What should be the order of splitting?
  • How do we decide the number of child nodes?
  • How can we compare two features to decide which one gives the better split?
  • How do we decide the splitting criteria?

These questions will be answered in the next section, where we will look at an algorithm called C4.5, which uses the concepts of Entropy and Information Gain.

4. Algorithm


There are multiple algorithms (with extremely long, scary names) used for building Decision Trees. Some of them are:

  • ID3 (Iterative Dichotomiser 3)
  • C4.5 (the successor of ID3)
  • CART (Classification and Regression Trees)
  • CHAID (Chi-squared Automatic Interaction Detection)

We’re going to focus on C4.5 for now, as it’s popular and can deal with both categorical and numerical features. Before we dive into the algorithm, we need to know the meaning of certain terms:

  • Root node: The topmost node in the tree.
  • Leaf/Terminal node: A node that does not split any further.
  • Splitting: The process of dividing a node into two or more sub-nodes.
  • Parent and child node: A node that is divided into sub-nodes is called the parent of those sub-nodes; the sub-nodes are its children.
  • Decision node: A sub-node that splits into further sub-nodes.
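These terms map naturally onto a small data structure. The Node class below is a hypothetical sketch (not taken from any library) showing how root, decision, and leaf nodes fit together.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class Node:
    """Hypothetical tree node, used only to illustrate the terminology above."""
    feature: Optional[str] = None                               # attribute a decision node splits on
    children: Dict[str, "Node"] = field(default_factory=dict)   # split value -> child node
    prediction: Optional[str] = None                            # set only for leaf/terminal nodes

    @property
    def is_leaf(self) -> bool:
        return not self.children       # a node that does not split is a leaf

# A root (decision) node splitting on Outlook, with one leaf child.
root = Node(feature="Outlook", children={"Overcast": Node(prediction="Yes")})
print(root.is_leaf, root.children["Overcast"].is_leaf)   # False True
```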

Some more important terms specific to the C4.5 Algorithm

Entropy

Entropy in physics is simply a metric for measuring the degree of disorder or randomness of a system.

In the context of Decision Trees, it can be thought of as a measure of disorder or uncertainty with respect to predicting the target. Take the example of red, blue, and green balls in boxes: a box with 6 blue balls has very low (in fact, zero) entropy, whereas a box with 2 blue, 2 green, and 2 red balls has relatively high entropy.

The formula for calculating Information Entropy for a dataset with C classes is:

E = − Σᵢ pᵢ log₂(pᵢ)   (the sum runs over the C classes)

where pᵢ is the probability of randomly picking an element of class i (i.e. the proportion of the dataset made up of class i).

Let’s take an example to understand this better. Consider a dataset with 1 blue, 2 green, and 3 red balls. Then:

E = −[ pb log₂(pb) + pg log₂(pg) + pr log₂(pr) ]   (the original formula expanded to show the three classes in the dataset)

We know pb = 1/6 (blues) because 1/6th of the dataset is blue. Similarly, pg = 2/6 (greens) and pr = 3/6 (reds). Thus,

E = −[ (1/6) log₂(1/6) + (2/6) log₂(2/6) + (3/6) log₂(3/6) ] ≈ 1.46

Whereas for a dataset consisting only of blue balls, the entropy would be:

E = −[ (6/6) log₂(6/6) ] = −log₂(1) = 0
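Both results are easy to verify in a few lines of Python. The entropy function below is a minimal sketch of the formula above, not a library implementation.

```python
from collections import Counter
from math import log2

def entropy(labels) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    probs = [count / total for count in Counter(labels).values()]
    return sum(-p * log2(p) for p in probs)

mixed = ["blue"] * 1 + ["green"] * 2 + ["red"] * 3
pure  = ["blue"] * 6

print(round(entropy(mixed), 3))   # 1.459 -> high disorder
print(entropy(pure))              # 0.0   -> no disorder at all
```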

Information Gain

Information gain is a metric that helps us determine which attribute in a given set of training feature vectors is most useful for discriminating between target classes. We use it to decide the ordering of attributes in the nodes of the Decision Tree.

Information Gain = E(parent) − Σⱼ wⱼ · E(childⱼ)

where E is the entropy defined above and wⱼ is the fraction of the parent’s samples that ends up in child j.

This formula for information gain is pretty intuitive: we look at the difference between the entropy of the parent and the weighted entropy of its children. A large reduction in entropy is good, as it means we’re able to distinguish between the target classes better. This implies that we choose the feature with the highest information gain.
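A minimal Python sketch of that weighted difference (the helper names and toy label lists are my own, chosen only to make the idea concrete):

```python
from collections import Counter
from math import log2

def entropy(labels) -> float:
    total = len(labels)
    return sum(-c / total * log2(c / total) for c in Counter(labels).values())

def information_gain(parent_labels, child_label_lists) -> float:
    """Entropy of the parent minus the size-weighted entropy of its children."""
    n = len(parent_labels)
    weighted_children = sum(len(child) / n * entropy(child) for child in child_label_lists)
    return entropy(parent_labels) - weighted_children

# A perfect split: each child ends up pure, so the gain equals the parent entropy.
parent   = ["Yes", "Yes", "Yes", "No", "No", "No"]
children = [["Yes", "Yes", "Yes"], ["No", "No", "No"]]
print(information_gain(parent, children))   # 1.0 bit
```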

Steps for C4.5 Algorithm

C4.5 is a recursive algorithm: it repeatedly picks the feature that gives the maximum information gain and uses it to split the tree further.

A simple flowchart explaining the steps of the algorithm
  1. Choose the initial dataset with the feature and target attributes defined.
  2. Calculate the Information gain and Entropy for each attribute.
  3. Pick the attribute with the highest information gain, and make it the decision root node.
  4. Calculate the information gain for the remaining attributes.
  5. Create child nodes by splitting at the decision node (i.e. create a separate child node for each value of the chosen attribute) and repeat the process within each child.
  6. Repeat this process until all the attributes are covered.
  7. Prune the Tree to prevent overfitting.
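To tie these steps together, here is a heavily simplified recursive sketch of the tree-building loop. It handles categorical features only and uses plain information gain, so it is closer to ID3 than to full C4.5 (no gain ratio, no numerical splits, no missing-value handling, no pruning); the data format and helper names are my own assumptions.

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return sum(-c / total * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Gain from splitting `rows` (a list of dicts) on a categorical `feature`."""
    branches = {}
    for row, label in zip(rows, labels):
        branches.setdefault(row[feature], []).append(label)
    weighted = sum(len(b) / len(labels) * entropy(b) for b in branches.values())
    return entropy(labels) - weighted

def build_tree(rows, labels, features):
    # Leaf: the node is pure, or no features are left -> predict the majority class.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    # Steps 2-3: pick the attribute with the highest information gain.
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    tree, remaining = {best: {}}, [f for f in features if f != best]
    # Steps 5-6: one child node per value of the chosen attribute, then recurse.
    for value in {row[best] for row in rows}:
        subset = [(r, lab) for r, lab in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = zip(*subset)
        tree[best][value] = build_tree(list(sub_rows), list(sub_labels), remaining)
    return tree

# Tiny illustrative run (hypothetical rows, not the full golf table):
rows = [{"Outlook": "Overcast", "Windy": False},
        {"Outlook": "Sunny", "Windy": True},
        {"Outlook": "Sunny", "Windy": False}]
labels = ["Yes", "No", "Yes"]
print(build_tree(rows, labels, ["Outlook", "Windy"]))   # e.g. {'Windy': {False: 'Yes', True: 'No'}}
```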

That was it for the C4.5 Algorithm.

The content of this article was a mixture of information from various sources on the internet, combined with my understanding of them.


Give the article a clap if you liked it and make sure to drop any feedback/suggestions/questions in the comments :)

Live long and prosper
