Decision Tree Algorithms: They are everywhere

Mrinmoy Borah
10 min read · Aug 1, 2023


Everything you want to know about decision trees: theory and implementation.

Introduction

“Two roads diverged in a wood, and I took the one less travelled by, And that has made all the difference.”

The poem “The Road Not Taken” by Robert Frost features this phrase. The poem eloquently describes how one’s decision can leave a lasting impact on their life. Aren’t decisions an integral part of our life? The paths we take and the decisions we make greatly influence our destiny and identity.

The decision tree algorithm is like this poem, showing how a decision creates a new branch in the tree; similarly, a small decision can create different routes and transform our life.

Photo by Javier Allegue Barros on Unsplash

The most basic way to understand a decision tree is as a giant structure of nested if-else statements.

Here is a small categorical dataset of 6 rows and 3 columns which we will try to solve with nested if-else statements as follows:

g = input("Enter gender (M/F): ")
print("Occupation: ")
print("1-> Student")
print("2-> Programmer")
o = int(input("Enter occupation: "))
if g == 'F':
    if o == 1:
        print('Fortnite')
    elif o == 2:
        print('GitHub')
elif g == 'M':
    if o == 1:
        print('Minecraft')
    elif o == 2:
        print('VS Code')

The first question I asked myself was: if a decision tree is just a collection of if-else statements used to solve problems, why do we need a decision tree at all? The answer is:

  1. Hand-written if-else rules are brittle: they have to be redesigned by hand whenever the data changes, whereas a decision tree learns its rules directly from the training data (see the sketch after this list).
  2. Nested if-else statements do not scale: with many features and many rows, writing and maintaining the conditions by hand quickly becomes impractical, while a decision tree can be trained on large datasets systematically.
  3. Long chains of if-else statements can be challenging to interpret and debug. Decision trees, on the other hand, are a more visual and intuitive way to represent decision rules.
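
To make this concrete, here is a minimal sketch in which a decision tree learns the same rules as the nested if-else code above. The 6-row table is a hypothetical recreation of the dataset described earlier, and sklearn's export_text is used only to print the learned rules:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

# hypothetical recreation of the 6-row gender/occupation dataset
data = pd.DataFrame({
    'gender':     ['F', 'F', 'M', 'M', 'F', 'M'],
    'occupation': ['Student', 'Programmer', 'Student', 'Programmer', 'Student', 'Programmer'],
    'app':        ['Fortnite', 'GitHub', 'Minecraft', 'VS Code', 'Fortnite', 'VS Code'],
})

# encode the categorical columns as integers
encoders = {col: LabelEncoder().fit(data[col]) for col in data.columns}
X = pd.DataFrame({
    'gender': encoders['gender'].transform(data['gender']),
    'occupation': encoders['occupation'].transform(data['occupation']),
})
y = encoders['app'].transform(data['app'])

# the tree discovers the gender/occupation splits on its own
clf = DecisionTreeClassifier().fit(X, y)
print(export_text(clf, feature_names=['gender', 'occupation']))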

Diagrammatically, this looks as follows:

Basic terminologies related to a tree data structure are as follows:

  • Root node: The root node is the topmost node. It is the starting node from which the decision tree arises.
  • Parent node: A parent node is a node that has one or more child nodes.
  • Child node: A child node is a node branched off from a parent node. The parent node is the node that makes a decision, and the child nodes are the nodes that represent the possible outcomes of that decision.
  • Leaf node: A leaf node in a decision tree is a node that does not have any child nodes. Leaf nodes represent the terminal nodes of the decision tree, and they contain the final prediction of the model.
  • Splitting: The process of dividing a node into two or more sub-nodes.
  • Pruning: The opposite of splitting; the process of removing sub-nodes from a decision node is called pruning.

What is a decision tree algorithm?

A decision tree is a supervised learning algorithm that uses a tree-like structure to make decisions based on input data. It divides data into branches and assigns outcomes to leaf nodes.

Decision trees are used for classification and regression tasks, providing efficient, accurate and easy-to-understand models.

Different libraries use different algorithms to build decision trees, but in this article, we will discuss CART: Classification And Regression Trees.

CART is a recursive algorithm that starts at the tree’s root node and splits the data into two subsets based on the value of a single feature. The process is repeated on each subset until it is pure, meaning all the data points in the subset belong to the same class, or until a stopping criterion (such as a maximum depth) is met.
CART uses impurity measures, such as entropy and Gini impurity, to decide which feature to split on at each node.

Geometrically speaking, decision trees cut the feature space with hyperplanes that run parallel to one of the axes, dividing it into hyper-cuboids (axis-aligned boxes).
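
To make the recursion concrete, here is a heavily simplified sketch of a CART-style splitter (not the actual sklearn implementation): it tries every axis-parallel threshold, keeps the one with the lowest weighted Gini impurity, and recurses until a node is pure or a maximum depth is reached.

import numpy as np

def gini(y):
    # Gini impurity of a label array: 1 - sum(p_i^2)
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    # try every axis-parallel threshold, keep the lowest weighted impurity
    best = None
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            left = X[:, feature] <= threshold
            if left.all() or (~left).all():
                continue  # this threshold does not separate anything
            score = (left.sum() * gini(y[left]) + (~left).sum() * gini(y[~left])) / len(y)
            if best is None or score < best[0]:
                best = (score, feature, threshold)
    return best

def build_tree(X, y, depth=0, max_depth=3):
    split = best_split(X, y)
    # stop when the node is pure, no split helps, or the maximum depth is reached
    if gini(y) == 0 or split is None or depth == max_depth:
        values, counts = np.unique(y, return_counts=True)
        return {'leaf': values[counts.argmax()]}  # predict the majority class
    _, feature, threshold = split
    left = X[:, feature] <= threshold
    return {'feature': feature, 'threshold': threshold,
            'left': build_tree(X[left], y[left], depth + 1, max_depth),
            'right': build_tree(X[~left], y[~left], depth + 1, max_depth)}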

Why do we need decision trees in machine learning?

  1. Simple to understand and interpret, which helps developers debug and understand the model.
  2. Robust to noise: decision trees can handle noisy data relatively well, making them a good choice for problems where the data is not perfectly clean.
  3. Decision trees are efficient to train and flexible.
  4. Decision trees are widely used in credit scoring, fraud detection, medical diagnosis, customer segmentation, etc.

Some critical concepts related to the CART algorithm are:

  1. Entropy
  2. Information Gain
  3. Gini Index

Entropy, Information gain and Gini impurity are three essential concepts in decision tree algorithms. They are used to measure the data’s impurity and determine which feature to split the data on. Decision tree algorithms are used to build accurate and reliable models using these concepts.

1. Entropy:

Entropy is a measure of impurity or disorder. In simple terms, it is a measure of randomness.

Entropy helps us identify the most important features for predicting the target variable. It helps to reduce the impurity of the data, which makes the model more accurate. The formula of entropy is as follows:

E = −P(yes) · log₂ P(yes) − P(no) · log₂ P(no)

or, for more than two classes,

E = −Σ pᵢ · log₂ pᵢ

  • P(yes): probability of yes
  • P(no): probability of no
  • pᵢ: probability of class i
  • E: Entropy

Some basic observations with entropy are as follows:

  1. For a 2-class problem (‘yes’ or ‘no’ / ‘true’ or ‘false’), the minimum entropy is 0 and the maximum entropy is 1.
  2. For more than two classes, the minimum entropy is still 0, but the maximum can be greater than 1 (it equals log₂ of the number of classes).
  3. A dataset with 0 impurity is already perfectly sorted, so splitting it teaches the model nothing new, whereas a dataset that is half “yes” and half “no” has an entropy of 1, and such datasets are the most informative to learn from.
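
As a quick sanity check, here is a small sketch of an entropy function (log base 2) that reproduces these observations:

import numpy as np

def entropy(labels):
    # entropy in bits: -sum(p_i * log2(p_i)) over the observed classes
    _, counts = np.unique(labels, return_counts=True)
    if len(counts) == 1:
        return 0.0  # a pure node has zero entropy by definition
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(entropy(['yes', 'yes', 'yes', 'yes']))  # 0.0 -> pure, minimum entropy
print(entropy(['yes', 'yes', 'no', 'no']))    # 1.0 -> maximum for two classes
print(entropy(['a', 'b', 'c', 'd']))          # 2.0 -> can exceed 1 for more classes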

2. Information Gain:

Information gain helps the algorithm decide which attribute should be selected as a decision node, i.e. which feature to split the data on. It is a metric that measures the quality of a split: information gain is the reduction in entropy achieved by splitting a node on a particular feature.

Constructing a decision tree is all about finding an attribute that returns the highest information gain. A feature that has a high information gain is a good splitter, as it will result in a more pure node. The formula for information gain is as follows:

IG = E(Parent) − Weighted Entropy of the Children
   = E(Parent) − Σ (nₖ / n) · E(childₖ)

  • E(Parent): Entropy of the parent node
  • E(childₖ): Entropy of the k-th child node
  • nₖ / n: fraction of the parent’s samples that end up in the k-th child (the weighted average)
  • IG: Information Gain

So, then, what’s the relationship between entropy and information gain?

The goal of the decision tree algorithm is to progressively unmix the labels as we proceed down the tree or, in other words, to produce the purest possible distribution of labels at each node. In decision trees, entropy and information gain determine which feature to split a node on at each step in constructing the tree. The feature with the highest information gain is chosen, as it results in the most significant reduction in entropy and the purest child nodes.

The higher the information gain, the lower the entropy after the split. Entropy is a measure of impurity: the more impure the data, the more uncertain it is. By reducing the entropy of the data, we make it more specific, which corresponds to a higher information gain.

Information gain measures how much the entropy of the data is reduced by splitting on a particular feature. The more the entropy is reduced, the more homogeneous the child nodes will be and the better the split.
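
Putting the two ideas together, here is a small sketch (reusing the entropy function from above) that computes the information gain of a candidate split:

def information_gain(parent_labels, children_labels):
    # IG = E(parent) - sum over children of (n_k / n) * E(child_k)
    n = len(parent_labels)
    weighted = sum(len(child) / n * entropy(child) for child in children_labels)
    return entropy(parent_labels) - weighted

parent = ['yes', 'yes', 'no', 'no']
# a perfect split removes all uncertainty
print(information_gain(parent, [['yes', 'yes'], ['no', 'no']]))  # 1.0
# a useless split leaves the children as mixed as the parent
print(information_gain(parent, [['yes', 'no'], ['yes', 'no']]))  # 0.0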

3. Gini Impurity/Index:

Gini impurity measures how often a randomly chosen sample from a node would be misclassified if it were labelled at random according to the class distribution of that node; it works for both binary and multi-class labels. Gini is similar to entropy, but it is much quicker to calculate. Algorithms like CART (Classification and Regression Trees) use Gini as an impurity measure.

The minimum value of the Gini index is 0, whereas its maximum value for a two-class problem is 0.5.

It is faster and computationally less expensive than entropy because it avoids logarithms; in practice the two criteria usually produce very similar trees, although the entropy criterion sometimes gives slightly better results.

Gini = 1 − P(yes)² − P(no)²

or, in general,

Gini = 1 − Σ pᵢ²

  • P(yes) = probability of yes
  • P(no) = probability of no
  • pᵢ = probability of class i

Gini impurity and Entropy; source: quantdare.com
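
A quick sketch of the Gini impurity, mirroring the entropy function above, shows why it is cheaper (no logarithms) and why its maximum for two classes is 0.5:

import numpy as np

def gini_impurity(labels):
    # Gini impurity: 1 - sum(p_i^2) over the observed classes
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity(['yes', 'yes', 'yes', 'yes']))  # 0.0 -> pure node
print(gini_impurity(['yes', 'yes', 'no', 'no']))    # 0.5 -> maximum for two classes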

How does a decision tree select a root node and perform splits?

The five columns in our dataset are outlook, temp, humidity, wind, and play. The play column contains the result of whether one will play or not, based on the four input columns.

1. Entropy of entire dataset:

The entire dataset’s entropy refers to the target column’s entropy. Here our target column is “play”.

import pandas as pd
df = pd.read_csv('play_tennis.csv')
df = df.iloc[:,1:]
print(df)

here,

E(play) = −(9/14) · log₂(9/14) − (5/14) · log₂(5/14) ≈ 0.94

9/14 = no. of “yes” outcomes / total outcomes in the sample space

5/14 = no. of “no” outcomes / total outcomes in the sample space
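
The same number can be verified in code by reusing the entropy function defined earlier (this assumes the standard 14-row play-tennis dataset with 9 “yes” and 5 “no” rows):

print(df['play'].value_counts())  # expected: 9 yes, 5 no
print(entropy(df['play']))        # expected: roughly 0.94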

2. Information Gain of every column:

As there are four input columns, “outlook”, “temp”, “humidity” and “wind”, we need to calculate the information gain for each column.

The information gain for outlook column is as follows:

print(df.groupby('outlook')['play'].value_counts())

Here we can see that in our first column outlook, there are three categorical values: Sunny, Overcast and Rain.

(a) We need to calculate the entropy of the Sunny, Overcast and Rain subsets:

(b) The information of the outlook column, which is equal to the weighted average of the children’s entropies:

(c) The information gain of the outlook column, i.e. E(play) minus the weighted entropy from step (b):

(d) The information gain of the other columns, temp, humidity and wind, is calculated in the same way; the sketch below reproduces all of these numbers.
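
Here is a small sketch that reproduces steps (a)-(d), reusing the entropy and information_gain functions defined earlier (the values in the comments are what the standard 14-row play-tennis dataset is expected to give):

def column_information_gain(df, feature, target='play'):
    # split the target labels by each category of the feature, then compute IG
    children = [group[target].tolist() for _, group in df.groupby(feature)]
    return information_gain(df[target].tolist(), children)

for col in ['outlook', 'temp', 'humidity', 'wind']:
    print(col, round(column_information_gain(df, col), 3))
# expected (approximately): outlook 0.247, temp 0.029, humidity 0.152, wind 0.048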

The column with the highest information gain is selected as the root node. Since the outlook column has the highest information gain, it is considered the root node.
Once it has successfully split the training set in two, it splits the subsets using the same logic, then the sub-subsets, and so on, recursively until it reaches the maximum depth.

Decision Tree implementation using the sklearn library:

  • One of the key points to remember before implementing a Decision Tree Classifier is that sklearn’s implementation of the CART algorithm requires all columns to be numerical, even in classification problems.
  • This is because CART uses a greedy algorithm to build the decision tree, and it needs to be able to calculate the impurity of each node in the tree.
  • Categorical variables therefore have to be converted to numerical values before training; this is easily achieved with sklearn’s LabelEncoder (one-hot encoding with OneHotEncoder is another common option).
#imported the dataset:
import pandas as pd
df = pd.read_csv('play_tennis.csv')
df = df.iloc[:, 1:6]
print(df.head())

# performed label encoding: converted the categorical columns to numerical codes
from sklearn.preprocessing import LabelEncoder
cols = ['outlook', 'temp', 'humidity', 'wind', 'play']
le = LabelEncoder()
def preprocess(df):
    for col in cols:
        df[col] = le.fit_transform(df[col])
    return df
df = preprocess(df)
print(df.head())

#selected the input and target columns:
x = df.iloc[:, 0:4].values
y = df.iloc[:, 4]

#imported Decision Tree Classifier:
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(criterion='entropy')

# trained the model:
dt.fit(x, y)
Label encoding of our dataset
# visualization of the decision tree:
from sklearn import tree
import matplotlib.pyplot as plt
tree.plot_tree(dt, feature_names=cols[:4], rounded=True, filled=True)
plt.show()
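
As an optional extra, sklearn can also print the same tree as plain text, which is handy when no plotting backend is available:

from sklearn.tree import export_text
print(export_text(dt, feature_names=cols[:4]))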

Visualisation of the Decision Tree

Hyperparameters of Decision Tree

Hyperparameters are parameters that are specified explicitly before training and control the training process. They play an essential role in model optimization.

A few of the most common hyperparameters of decision tree classifier are as follows:

  • max_depth: This hyperparameter specifies the maximum depth of the tree. A deeper tree will have more splits and will be able to capture more information about the data, but it may also be more prone to overfitting.
  • min_samples_split: This hyperparameter specifies the minimum number of samples required to split a node. A lower value will allow the tree to split more nodes, which can lead to a more complex tree.
  • min_samples_leaf: This hyperparameter specifies the minimum number of samples required in a leaf node. A lower value allows the tree to have more and smaller leaf nodes, which can fit the training data more closely but may also overfit.
  • criterion: This hyperparameter specifies the splitting standard that can be used to determine the best split for a node. The most common criteria are “gini” and “entropy.”
  • splitter: This hyperparameter specifies the splitting strategy used to find the best split for a node. The available splitters in sklearn are “best” and “random.”
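
As an illustration, here is a small sketch of setting these hyperparameters on the classifier used above (the particular values are arbitrary examples, not tuned ones):

from sklearn.tree import DecisionTreeClassifier

dt_tuned = DecisionTreeClassifier(
    criterion='gini',       # splitting criterion: 'gini' or 'entropy'
    splitter='best',        # 'best' or 'random'
    max_depth=3,            # limit the depth of the tree to reduce overfitting
    min_samples_split=2,    # minimum samples needed to split a node
    min_samples_leaf=1,     # minimum samples required in a leaf node
)
dt_tuned.fit(x, y)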

Disadvantages of Decision Trees

Decision trees are a powerful machine learning algorithm that can be used for both classification and regression tasks. However, they also have some disadvantages:

  • Overfitting: Decision trees can be prone to overfitting, meaning they fit the training data very well but generalize poorly to unseen test data.
  • Sensitivity to small changes in the data: Small changes in the data can significantly change the structure of a decision tree, which can lead to overfitting or instability.
  • Interpretability: Decision trees can be challenging to interpret, especially when they are large and complex. This can make it difficult to understand how the model works and to use it to make decisions.
  • Large decision trees are computationally expensive to train and evaluate.

Here, I am adding a GitHub link to a heart disease detection model built using the decision tree algorithm.

Thanks for reading :)
