DECISION TREE

Shubhang Agrawal · Published in Analytics Vidhya · Jan 13, 2021

In this blog I will be writing about a widely used classification (machine learning) algorithm: the Decision Tree.

Here I will explain what a Decision Tree is, the types of Decision Tree algorithms, how to create a Decision Tree, applications of Decision Trees, their advantages and disadvantages, and finally I will provide a link to my briefly explained Jupyter Notebook on implementing the decision tree algorithm from scratch.

So without any further ado, let's get started.

What is Decision Tree Algorithm?

  • The decision tree algorithm falls under the category of supervised learning. It can be used to solve both regression and classification problems.
  • A decision tree uses a tree representation to solve the problem, in which each leaf node corresponds to a class label and attributes are represented on the internal nodes of the tree.
  • We can represent any boolean function on discrete attributes using a decision tree.

Below are some assumptions that we make while using a decision tree:

  • At the beginning, we consider the whole training set as the root.
  • Feature values are preferred to be categorical. If the values are continuous then they are discretized prior to building the model.
  • Records are distributed recursively on the basis of attribute values.
  • We use statistical methods for ordering attributes as root or the internal node.

Types of Decision Trees

Decision Trees are classified into two types, based on the target variable.

  1. Categorical Variable Decision Trees: This is where the algorithm has a categorical target variable. For example, consider you are asked to predict the relative price of a computer as one of three categories: low, medium, or high. Features could include monitor type, speaker quality, RAM, and SSD. The decision tree will learn from these features and after passing each data point through each node, it will end up at a leaf node of one of the three categorical targets low, medium, or high.
  2. Continuous Variable Decision Trees: In this case the features input to the decision tree (e.g. qualities of a house) will be used to predict a continuous output (e.g. the price of that house).
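
As a quick illustration of the two cases (not taken from the original notebook), here is a minimal scikit-learn sketch; the toy feature matrices and targets below are made up purely for demonstration.

```python
# Minimal sketch: a categorical-target tree vs. a continuous-target tree.
# The toy data below is invented for illustration only.
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Categorical target: predict a price category (0=low, 1=medium, 2=high)
# from [RAM in GB, SSD in GB].
X_cls = [[4, 128], [8, 256], [16, 512], [32, 1024]]
y_cls = [0, 1, 1, 2]
clf = DecisionTreeClassifier(criterion="gini").fit(X_cls, y_cls)
print(clf.predict([[8, 512]]))      # -> one of the category labels

# Continuous target: predict a house price from [area in sq. ft, bedrooms].
X_reg = [[900, 2], [1200, 3], [1500, 3], [2000, 4]]
y_reg = [100_000, 150_000, 180_000, 250_000]
reg = DecisionTreeRegressor().fit(X_reg, y_reg)
print(reg.predict([[1300, 3]]))     # -> a numeric price
```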

Key Terminology

Let's see what a decision tree looks like, and how it works when a new input is given for prediction.

Below is an image explaining the basic structure of the decision tree. Every tree has a root node, which the inputs are passed through. The root node is further divided into decision nodes, where conditions on the features are evaluated. The process of dividing a single node into multiple nodes is called splitting. If a node doesn't split into further nodes, it is called a leaf node, or terminal node. A subsection of a decision tree is called a branch or sub-tree (e.g. the boxed region in the image below).

There is also another concept that is quite the opposite of splitting. If there are decision rules that can be eliminated, we cut them from the tree. This process is known as pruning, and it is useful for minimizing the complexity of the model.

How To Create a Decision Tree

In this section, we shall discuss the core algorithms describing how decision trees are created. Which algorithm is appropriate depends on the type of target variable, and the splitting criteria used for classification trees differ from those used for regression trees.

There are several techniques used to decide how to split the given data. The main goal of a decision tree is to make the best splits between nodes, those which optimally divide the data into the correct categories. To do this, we need to use the right decision rules, and these rules directly affect the performance of the algorithm.

There are some assumptions that need to be considered before we get started:

  • In the beginning, the whole data is considered as the root, thereafter, we use the algorithms to make a split or divide the root into subtrees.
  • The feature values are considered to be categorical. If the values are continuous, then they are discretized prior to building the model.
  • Records are distributed recursively on the basis of attribute values.
  • The ordering of attributes as root or internal node of the tree is done using a statistical approach.

Let’s get started with the commonly used techniques to split, and thereby, construct the Decision tree.

1. Information Gain

When we use a node in a decision tree to partition the training instances into smaller subsets, the entropy changes. Information gain is a measure of this change in entropy.
Definition: Suppose S is a set of instances, A is an attribute, Sv is the subset of S with A = v, and Values(A) is the set of all possible values of A, then

Gain(S, A) = Entropy(S) - Σ (over v in Values(A)) (|Sv| / |S|) * Entropy(Sv)

Entropy
Entropy is the measure of uncertainty of a random variable; it characterizes the impurity of an arbitrary collection of examples. The higher the entropy, the more the information content.
Definition: Suppose S is a set of instances and pc is the proportion of instances in S belonging to class c, then

Entropy(S) = - Σ (over classes c) pc * log2(pc)

Example:

For the set X = {a, a, a, b, b, b, b, b}
Total instances: 8
Instances of b: 5
Instances of a: 3

Entropy(X) = -[(3/8) * log2(3/8) + (5/8) * log2(5/8)]
= -[0.375 * (-1.415) + 0.625 * (-0.678)]
= -(-0.53 - 0.424)
= 0.954
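
A tiny sketch that reproduces this calculation (the helper function entropy below is my own, written just for this check):

```python
# Sketch: reproduce the entropy calculation for X = {a,a,a,b,b,b,b,b}.
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels: -sum(p * log2(p))."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

X = list("aaabbbbb")
print(round(entropy(X), 3))  # 0.954
```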

Building Decision Tree using Information Gain
The essentials:

  • Start with all training instances associated with the root node
  • Use info gain to choose which attribute to label each node with
  • Note: No root-to-leaf path should contain the same discrete attribute twice
  • Recursively construct each subtree on the subset of training instances that would be classified down that path in the tree.
  • If all positive or all negative training instances remain, label that node “yes” or “no” accordingly
  • If no attributes remain, label with a majority vote of training instances left at that node
  • If no instances remain, label with a majority vote of the parent’s training instances
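
Below is a minimal sketch of the entropy and information-gain computations behind these essentials. It assumes each training instance is a plain dict whose class is stored under a "label" key (a convention chosen only for this sketch), and the four-row toy dataset at the end is made up; it is not the dataset used in the example that follows.

```python
# Sketch: information gain for splitting a set of rows on one attribute.
from collections import Counter
from math import log2

def entropy(rows, target="label"):
    n = len(rows)
    counts = Counter(r[target] for r in rows)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def information_gain(rows, attribute, target="label"):
    n = len(rows)
    gain = entropy(rows, target)
    for value in {r[attribute] for r in rows}:
        subset = [r for r in rows if r[attribute] == value]
        gain -= (len(subset) / n) * entropy(subset, target)
    return gain

# Made-up toy data: attribute A1 perfectly separates the two classes.
rows = [{"A1": 1, "A2": 1, "label": "I"},
        {"A1": 1, "A2": 0, "label": "I"},
        {"A1": 0, "A2": 1, "label": "II"},
        {"A1": 0, "A2": 0, "label": "II"}]
print(information_gain(rows, "A1"))  # 1.0 -> A1 would be chosen for the node
```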

Example:
Now, let's draw a Decision Tree for the following data using Information Gain.

Training set: 3 features and 2 classes

Here, we have 3 features and 2 output classes.
To build a decision tree using Information Gain, we take each feature and calculate the information gain for that feature.

Split on feature X

Split on feature Y

Split on feature Z

From the above images we can see that the information gain is maximum when we make a split on feature Y. So, the best suited feature for the root node is feature Y. We can also see that when splitting the dataset by feature Y, each child contains a pure subset of the target variable, so we don't need to split the dataset any further.

The final tree for the above dataset would look like this:

2. Gini Index

  • Gini Index is a metric to measure how often a randomly chosen element would be incorrectly identified.
  • This means that an attribute with a lower Gini Index should be preferred.
  • Sklearn supports the "gini" criterion for the Gini Index, and it is the default value of the criterion parameter.
  • The formula for calculating the Gini Index is given below:

Gini = 1 - Σ (over classes i) (pi)²

where pi is the probability of an instance being classified into class i.
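
As a rough sketch of this formula for the two-class case used in the next example (the gini helper below is my own, not a library function):

```python
# Sketch: Gini Index of a node with pos positive and neg negative instances.
def gini(pos, neg):
    total = pos + neg
    return 1 - (pos / total) ** 2 - (neg / total) ** 2

print(round(gini(5, 7), 3))  # ~0.486, as used for attribute A in the example below
```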

Example:
Let's consider the dataset in the image below and draw a decision tree using the Gini Index.

(Full dataset image available on GeeksforGeeks.)

In the dataset above there are 5 attributes, of which attribute E is the target feature, containing 2 classes (Positive and Negative). We have an equal proportion of both classes.
To compute the Gini Index, we have to choose some threshold values to categorize each attribute. The values chosen for this dataset are:

A        B         C         D
>= 5     >= 3.0    >= 4.2    >= 1.4
< 5      < 3.0     < 4.2     < 1.4

Calculating Gini Index for Var A:
Value >= 5: 12
Attribute A >= 5 & class = positive: 5/12
Attribute A >= 5 & class = negative: 7/12
Gini(5, 7) = 1 - [(5/12)² + (7/12)²] = 0.486

Value < 5: 4
Attribute A < 5 & class = positive: 3/4
Attribute A < 5 & class = negative: 1/4
Gini(3, 1) = 1 - [(3/4)² + (1/4)²] = 0.375

Weighting and summing each of the Gini indices:
gini(Target, A) = (12/16)*(0.486) + (4/16)*(0.375) = 0.45825

Calculating Gini Index for Var B:
Value >= 3: 12
Attribute B >= 3 & class = positive: 8/12
Attribute B >= 3 & class = negative: 4/12
Gini(8, 4) = 1 - [(8/12)² + (4/12)²] = 0.4444

Value < 3: 4
Attribute B < 3 & class = positive: 0/4
Attribute B < 3 & class = negative: 4/4
Gini(0, 4) = 1 - [(0/4)² + (4/4)²] = 0

Weighting and summing each of the Gini indices:
gini(Target, B) = (12/16)*(0.4444) + (4/16)*(0) = 0.3333
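
The two split scores can be reproduced with a short sketch (the gini and weighted_gini helpers are my own, written only to check the arithmetic):

```python
# Sketch: reproduce the weighted Gini scores for attributes A and B.
def gini(pos, neg):
    total = pos + neg
    return 1 - (pos / total) ** 2 - (neg / total) ** 2

def weighted_gini(groups):
    """groups: one (positive_count, negative_count) pair per branch of the split."""
    total = sum(p + n for p, n in groups)
    return sum(((p + n) / total) * gini(p, n) for p, n in groups)

print(weighted_gini([(5, 7), (3, 1)]))  # ~0.4583 for attribute A (0.45825 above, from rounding)
print(weighted_gini([(8, 4), (0, 4)]))  # ~0.3333 for attribute B
```

Since a lower Gini Index is preferred, attribute B scores better than attribute A here; the remaining attributes C and D would be evaluated in the same way before choosing the root.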

Applications of Decision Trees

Decision Tree is one of the basic and widely used algorithms in the field of Machine Learning. It is used across different areas in classification and regression modeling. Due to its ability to produce visual output, one can easily draw insights from the modeling process. Here are a few areas where Decision Trees can be used:

  • Business Management
  • Customer Relationship Management
  • Fraudulent Statement Detection
  • Energy Consumption
  • Healthcare Management
  • Fault Diagnosis

Advantages and Disadvantages

There are a few pros and cons that come along with decision trees. Let's discuss the advantages first. Decision trees take very little time to process the data when compared to other algorithms. A few preprocessing steps like normalization, transformation, and scaling of the data can be skipped. Even if there are missing values in the dataset, the performance of the model is not strongly affected. A Decision Tree model is intuitive and easy to explain to technical teams and stakeholders, and can be implemented across several organizations.

Now for the disadvantages. In decision trees, small changes in the data can cause a large change in the structure of the tree, which in turn leads to instability. Training time increases drastically with the size of the dataset. In some cases, the calculations can become complex compared to other traditional algorithms.

Building Decision Tree Model from Scratch

Step 1: Import important libraries and load the data
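
A minimal sketch of this step; the file name "data.csv" and the assumption that the last column holds the class label are placeholders, not details taken from my notebook.

```python
# Step 1 sketch: imports and data loading (file name and layout are placeholders).
import pandas as pd

df = pd.read_csv("data.csv")          # hypothetical path to your dataset
features = list(df.columns[:-1])      # assume the last column is the class label
target = df.columns[-1]
```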

Step 2: Calculate the entropy for each attribute and select the attribute with the maximum information gain as the root node
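
Continuing the sketch from Step 1, a rough pandas-based version of this step might look like the following (entropy and information_gain are my own helper names):

```python
# Step 2 sketch: entropy and information gain on a DataFrame, then pick the
# attribute with the maximum gain as the root.
from math import log2

def entropy(series):
    """Entropy of a pandas Series of class labels."""
    probs = series.value_counts(normalize=True)
    return -sum(p * log2(p) for p in probs)

def information_gain(df, attribute, target):
    gain = entropy(df[target])
    for _, subset in df.groupby(attribute):
        gain -= (len(subset) / len(df)) * entropy(subset[target])
    return gain

root_attribute = max(features, key=lambda a: information_gain(df, a, target))
```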

Step 3: Build the Decision Tree
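
Continuing the sketch, here is a simple recursive builder in the spirit of the essentials listed earlier. A node is represented as either a class label (leaf) or a dict of the form {attribute: {value: subtree}}; this representation is a choice made only for this sketch.

```python
# Step 3 sketch: recursive tree construction using information gain.
def build_tree(df, features, target):
    labels = df[target]
    if labels.nunique() == 1:        # pure node -> return the class label
        return labels.iloc[0]
    if not features:                 # no attributes left -> majority vote
        return labels.mode()[0]
    best = max(features, key=lambda a: information_gain(df, a, target))
    remaining = [f for f in features if f != best]   # never reuse an attribute on a path
    node = {best: {}}
    for value, subset in df.groupby(best):
        node[best][value] = build_tree(subset, remaining, target)
    return node

tree = build_tree(df, features, target)
```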

Step 4: Make predictions
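
And a sketch of prediction, walking the nested-dict tree built above one row at a time:

```python
# Step 4 sketch: predict by walking the tree for one row (dict or pandas Series).
def predict(tree, row):
    while isinstance(tree, dict):
        attribute = next(iter(tree))     # the attribute this node splits on
        branches = tree[attribute]
        value = row[attribute]
        if value not in branches:
            return None                  # attribute value unseen during training
        tree = branches[value]
    return tree                          # a leaf: the predicted class label

predictions = [predict(tree, row) for _, row in df.iterrows()]
```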

For the complete code, visit the link below to my Jupyter notebook.

I tried to provide all the important information on getting started with Decision Trees and their implementation. I hope you find something useful here. Thank you for reading till the end.
