All About Decision Trees: Part I

Gaurav Kumar
Published in Geek Culture · Jun 27, 2021
Source:- scikit-learn

The Decision Tree is a popular Machine Learning algorithm that is used for both regression and classification problems.

The algorithm works by asking a question about the data and, depending on whether the answer is True or False, arriving at different conclusions.

Below is the diagram of a very basic decision tree.

Basic Decision Tree(Source)

By convention, when the condition is True the sub-branch is drawn on the left, and when the condition is False it is drawn on the right.

So, if a diagram does not explicitly label True/False on its branches, you can assume these default directions.

Basic Terminologies related to Decision Trees:

  1. Root Node:- This is the topmost node of the tree, from which the whole population starts splitting based on different features.
  2. Branch Node:- This is a node below the root node that splits further. In the flowchart, a branch node is one that has arrows both pointing towards it and moving away from it.
  3. Leaf Node:- A node that can’t be split further to make decisions is a leaf node. In the flowchart, it is a node that has arrows pointing towards it but no arrows moving away from it.

The below picture can help you understand about these parts of the tree.

With respect to the target variable, there are two types of decision trees:

  1. Regression Trees: When the target/dependent variable is a continuous numeric variable, the decision tree used to predict that numeric feature is called a Regression Tree.
  2. Classification Trees: When the target/dependent variable is a discrete categorical variable, the decision tree used to predict that target feature is called a Classification Tree.

Classification Trees:

Now, let’s try to make a classification tree.

Credit:- StatQuest

Now, let’s try to make a decision tree from scratch using the data provided here.

In this data, ‘Loves Cool As Ice’ is the target variable and others are the independent variables.

Here, we are trying to predict whether an individual loves ‘Cool As Ice’ (the target variable) based on the other parameters, using a decision tree.

So, the first step is to decide which variable is the most suitable for the root node. This is done by checking how well each variable, on its own, is able to separate the target variable.

Let’s consider Loves Popcorn as the root node:

Classification tree made after considering Loves Popcorn as the root node.

As you can see, in this case the ‘Loves Popcorn’ feature is unable to completely separate the users who love ‘Cool As Ice’ from those who don’t.

So, now let’s try another feature as a root node.

Let’s try Loves Soda as a root node.

Classification tree made after considering Loves Soda as the root node.

Although in this case everyone who doesn’t love soda also doesn’t love ‘Cool As Ice’, among those who do love soda we are still unable to completely separate the users who love ‘Cool As Ice’ from those who don’t.

Since a single variable can rarely separate the target variable completely, we should choose the variable that separates the target variable best out of all the independent features.

For that purpose, we first need to understand the term impurity:

In the above two decision trees, the leaves containing both YES and NO records are impure, while the leaves containing only YES or only NO records are pure.

The left leaf is impure and the right leaf is pure

Hence, the variable that produces the least impurity in the target feature is the one used as the root node.

Now, there are several measures to quantify the impurity of a leaf, such as Entropy, Information Gain, and Gini Impurity. Here I shall be using Gini Impurity to explain classification trees.

The formula to calculate the Gini Impurity of a leaf is:

Gini Impurity of a leaf = 1 − (probability of “Yes”)² − (probability of “No”)²
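
Put into code, a minimal Python sketch of this leaf-level calculation might look like the following (the function name gini_leaf and the example counts are only for illustration):

```python
def gini_leaf(num_yes, num_no):
    """Gini Impurity of a single leaf, given its class counts."""
    total = num_yes + num_no
    if total == 0:
        return 0.0
    p_yes = num_yes / total
    p_no = num_no / total
    return 1 - p_yes ** 2 - p_no ** 2


# Example: a leaf holding 1 "Yes" and 3 "No" records
print(gini_leaf(1, 3))  # 1 - 0.25**2 - 0.75**2 = 0.375
```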

Calculating the Gini Impurity of a single leaf is not enough, because we need the Gini Impurity of the whole split made by the node. So the Total Gini Impurity of the split is the weighted average of the leaf Gini Impurities, where each leaf is weighted by the fraction of samples that fall into it:

Total Gini Impurity = Σ (samples in leaf ÷ total samples) × Gini Impurity of that leaf
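
As a rough sketch, here is the same weighted average in Python, with each leaf passed as a (yes, no) count pair (the counts below are arbitrary examples, not values taken from the figures):

```python
def total_gini(leaves):
    """Weighted average of leaf Gini Impurities.

    `leaves` is a list of (num_yes, num_no) tuples, one tuple per leaf.
    """
    total = sum(yes + no for yes, no in leaves)
    impurity = 0.0
    for yes, no in leaves:
        size = yes + no
        p_yes = yes / size
        leaf_gini = 1 - p_yes ** 2 - (1 - p_yes) ** 2
        impurity += (size / total) * leaf_gini
    return impurity


# Two leaves: one with 1 Yes / 3 No, one with 2 Yes / 1 No
print(round(total_gini([(1, 3), (2, 1)]), 3))  # ~0.405
```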

So, the Total Gini Impurity is compared for trees built with different candidate root nodes, and the variable giving the lowest Total Gini Impurity is chosen as the root node.

Now a question arises: since there is also a numeric column (Age) in the above data, how do we calculate the Gini Impurity for numeric data?

And before that, another question: on the basis of which numerical value should the branches be created? Should it be Age < 12, Age < 18, Age < 38, or something else?

Mean of Each Pair

To decide the numerical threshold on which branches should be created, first sort the ages and calculate the mean of each adjacent pair. Then, treating each mean as a candidate threshold, calculate the Total Gini Impurity, and the mean value with the lowest Gini Impurity is used as the Age split, as sketched in the snippet below.
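
Here is one way this could be sketched in Python. The ages and labels below are my assumption of a table consistent with the worked example (they reproduce the 0.343 figure mentioned later), not a copy of the original image:

```python
def gini_leaf(num_yes, num_no):
    total = num_yes + num_no
    if total == 0:
        return 0.0
    return 1 - (num_yes / total) ** 2 - (num_no / total) ** 2


def total_gini_for_threshold(ages, labels, threshold):
    """Total Gini Impurity of the split 'Age < threshold'."""
    left = [lab for age, lab in zip(ages, labels) if age < threshold]
    right = [lab for age, lab in zip(ages, labels) if age >= threshold]
    n = len(labels)
    impurity = 0.0
    for side in (left, right):
        if side:
            impurity += (len(side) / n) * gini_leaf(side.count("Yes"), side.count("No"))
    return impurity


# Assumed data, sorted by Age
ages = [7, 12, 18, 35, 38, 50, 83]
labels = ["No", "No", "Yes", "Yes", "Yes", "No", "No"]  # Loves Cool As Ice

# Candidate thresholds: the mean of each adjacent pair of ages
candidates = [(a + b) / 2 for a, b in zip(ages, ages[1:])]
for t in candidates:
    print(t, round(total_gini_for_threshold(ages, labels, t), 3))
# With these assumed values, Age < 15 and Age < 44 tie for the
# lowest Total Gini Impurity of about 0.343.
```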

So, as shown below, the Total Gini Impurity of each case is calculated and compared.

Gini Impurity in each case

So, as we can see here, the Gini Impurity of 0.343 is the minimum, and in this case either 15 or 44 can be chosen as the threshold value.

So, now, in order to decide the final root node, the Total Gini Impurities of all the independent features are compared.

Total Gini Impurity of all columns

So, in this case the ‘Loves Soda’ becomes the root node of the decision tree.
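
If you would rather let a library do the bookkeeping, scikit-learn’s DecisionTreeClassifier uses Gini Impurity by default. Below is a rough sketch on an assumed toy table shaped like the one above (Yes/No encoded as 1/0); with this assumed data, the learned top split also lands on ‘Loves Soda’, matching the hand calculation:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumed toy data; columns: Loves Popcorn, Loves Soda, Age (Yes/No as 1/0)
X = np.array([
    [1, 1,  7],
    [1, 0, 12],
    [0, 1, 18],
    [0, 1, 35],
    [1, 1, 38],
    [1, 0, 50],
    [0, 0, 83],
])
y = np.array([0, 0, 1, 1, 1, 0, 0])  # Loves Cool As Ice: 1 = Yes, 0 = No

clf = DecisionTreeClassifier(criterion="gini", random_state=0)
clf.fit(X, y)

# Show the learned splits; with this assumed data the root split is 'Loves Soda'
print(export_text(clf, feature_names=["Loves Popcorn", "Loves Soda", "Age"]))
```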

But we can’t stop here yet!

We have only found the root node; the whole tree hasn’t been built yet.

After deciding the root node, now we are supposed to decide the branches.

Deciding the parameter for branch node

Now, to choose the parameter for further splits, the whole comparison of Gini Impurities described above is repeated, but keeping in mind the condition of the root node, i.e. only those rows that satisfy the root node’s condition are considered.

For example, in the above diagram, in order to decide whether ‘Age’ or ‘Loves Popcorn’ should go into the branch node, one has to consider only those rows where the value of ‘Loves Soda’ is ‘Yes’.

Hence, the table for further processing will look like the one below:

Those who don’t love soda are in the leaf node

So, keeping the target variable the same, we consider only the rows where ‘Loves Soda’ = ‘Yes’ and then compare the Gini Impurities of the other columns with respect to ‘Loves Cool As Ice’, as in the snippet below.
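
In pandas, this filtering step might look roughly like the following (the table values are the same assumed toy data used earlier, with the article’s column names):

```python
import pandas as pd

# Assumed toy table using the article's column names
df = pd.DataFrame({
    "Loves Popcorn":     ["Yes", "Yes", "No", "No", "Yes", "Yes", "No"],
    "Loves Soda":        ["Yes", "No", "Yes", "Yes", "Yes", "No", "No"],
    "Age":               [7, 12, 18, 35, 38, 50, 83],
    "Loves Cool As Ice": ["No", "No", "Yes", "Yes", "Yes", "No", "No"],
})

# Keep only the rows that satisfy the root node's condition
soda_yes = df[df["Loves Soda"] == "Yes"]

# The Gini Impurities of the remaining candidates are then recomputed
# on this subset only, e.g. class counts per 'Loves Popcorn' value:
print(soda_yes.groupby("Loves Popcorn")["Loves Cool As Ice"].value_counts())
```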

Hence, after comparing the Gini Impurities of the other columns, the final tree would look like the one below (you can try it yourself as well):

Looking at the above results, the final tree can be interpreted as:

Decision Tree

Now, when new data comes in, one can predict the target variable by walking it down the above tree.

As you can see, in the current case only one record reaches the leaf that says the person doesn’t love ‘Cool As Ice’, so it is possible that our tree is memorizing, i.e. overfitting, the data.

So, to prevent this problem there are two ways (a rough scikit-learn sketch of both is shown after the list):

  1. Pruning.
  2. Setting a threshold so that a leaf is created only if it contains, say, 3 or more records.
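
In scikit-learn terms, these two ideas roughly map to cost-complexity pruning (the ccp_alpha parameter) and the min_samples_leaf parameter. A brief sketch, reusing the assumed toy data from the earlier examples:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Same assumed toy data as in the earlier sketches
X = np.array([[1, 1, 7], [1, 0, 12], [0, 1, 18], [0, 1, 35],
              [1, 1, 38], [1, 0, 50], [0, 0, 83]])
y = np.array([0, 0, 1, 1, 1, 0, 0])

# Way 2: require every leaf to contain at least 3 records
small_leaf_tree = DecisionTreeClassifier(min_samples_leaf=3).fit(X, y)

# Way 1: cost-complexity pruning; a larger ccp_alpha prunes more aggressively
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.05).fit(X, y)

print(small_leaf_tree.get_depth(), pruned_tree.get_depth())
```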
