Decision Tree Algorithm

Uphaar Darbari
Published in Analytics Vidhya · Jul 22, 2020

The decision tree is one of the most popular machine learning algorithms and a stepping stone to understanding tree-based ensemble techniques.

The Decision Tree algorithm is also a frequent topic in data science interviews.

Understanding the Decision Tree

At its core, a Decision Tree resembles a management tool that many professionals use to make decisions about resource costs and courses of action, based on a sequence of filters (conditions) applied to the situation at hand.

The best part of a Decision Tree is that it is non-parametric: there are no underlying assumptions about the distribution of the data or of the errors. The model is constructed purely from the observed data.

Decision trees are adaptable to both kinds of supervised problem (classification and regression), which is why these algorithms are commonly referred to as CART (Classification and Regression Trees).

Common terms used with Decision trees:

  1. Root Node: It represents the entire population or sample, which then gets divided into two or more homogeneous sets.
  2. Splitting: The process of dividing a node into two or more sub-nodes.
  3. Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.
  4. Leaf/Terminal Node: Nodes that do not split further are called leaf or terminal nodes.
  5. Max_Depth: The length of the longest path in the tree, from the root down to a leaf node.
  6. Branch/Sub-Tree: A sub-section of the entire tree is called a branch or sub-tree.
  7. Parent and Child Node: A node that is divided into sub-nodes is called the parent node of those sub-nodes, and the sub-nodes are its children.
[Figure: a classic example to demonstrate a Decision Tree]

How a Decision Tree works!

  1. First, we iterate through all possible splits and calculate the purity of each one.
  2. The purity is compared across all candidate splits and the best one is chosen; the feature behind this split becomes the root node, i.e. the best predictor.
  3. The algorithm is recursive: each group formed can be sub-divided in the same way, and the process repeats until the tree is fully grown (a minimal sketch of this recursion follows below).
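To make the recursion concrete, here is a minimal Python sketch of that greedy, grow-until-pure procedure, assuming categorical features stored as a list of dicts and entropy as the purity measure. The helper names (`best_split`, `build_tree`) are purely illustrative, not part of any library.

```python
from collections import Counter
import math

def entropy(labels):
    """Impurity of a list of class labels: 0 for a pure node, 1 for a 50/50 binary mix."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(rows, features, target):
    """Try every feature and keep the one whose children are purest on average."""
    best_feature, best_score = None, float("inf")
    for f in features:
        groups = {}
        for row in rows:
            groups.setdefault(row[f], []).append(row[target])
        # weighted impurity of the children produced by splitting on f
        score = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
        if score < best_score:
            best_feature, best_score = f, score
    return best_feature

def build_tree(rows, features, target):
    labels = [r[target] for r in rows]
    # stop when the node is pure or there is nothing left to split on
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]   # leaf = majority class
    f = best_split(rows, features, target)
    branches = {}
    for value in {r[f] for r in rows}:                # one branch per value of the chosen feature
        subset = [r for r in rows if r[f] == value]
        branches[value] = build_tree(subset, [x for x in features if x != f], target)
    return {f: branches}
```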

Main Decision Areas:

  1. Determine the best split:

Nodes with a homogeneous (single-class) distribution are preferred.

2. Measures of Node Impurity: the common impurity measures are listed below (a small code sketch follows the list).

(a). Gini Index

(b). Entropy

(c). Misclassification error
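As a quick reference, here is a small Python sketch of the three impurity measures, each written for a list of class proportions; the function names are illustrative only.

```python
import math

def gini(p):
    """Gini index for a list of class proportions (0 = pure, 0.5 = worst for binary)."""
    return 1 - sum(pi ** 2 for pi in p)

def entropy(p):
    """Entropy in bits (0 = pure, 1 = worst for binary)."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def misclassification_error(p):
    """One minus the share of the majority class (0 = pure, 0.5 = worst for binary)."""
    return 1 - max(p)

# A node with a 50/50 class mix scores worst on every measure:
print(gini([0.5, 0.5]), entropy([0.5, 0.5]), misclassification_error([0.5, 0.5]))
# -> 0.5 1.0 0.5
```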

Understanding each of these terms with an example:

Let us take a weather dataset. Below is a snapshot of the header of the data:
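The original snapshot is not reproduced here, so here is a sketch of the classic play-tennis weather data that this walkthrough assumes; the exact column names, values and row order in the author's file may differ.

```python
import pandas as pd

# The classic "play tennis" weather data (14 rows).
weather = pd.DataFrame(
    [
        ("sunny",    "hot",  "high",   False, "no"),
        ("sunny",    "hot",  "high",   True,  "no"),
        ("overcast", "hot",  "high",   False, "yes"),
        ("rainy",    "mild", "high",   False, "yes"),
        ("rainy",    "cool", "normal", False, "yes"),
        ("rainy",    "cool", "normal", True,  "no"),
        ("overcast", "cool", "normal", True,  "yes"),
        ("sunny",    "mild", "high",   False, "no"),
        ("sunny",    "cool", "normal", False, "yes"),
        ("rainy",    "mild", "normal", False, "yes"),
        ("sunny",    "mild", "normal", True,  "yes"),
        ("overcast", "mild", "high",   True,  "yes"),
        ("overcast", "hot",  "normal", False, "yes"),
        ("rainy",    "mild", "high",   True,  "no"),
    ],
    columns=["outlook", "temperature", "humidity", "windy", "play"],
)
print(weather.head())
```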

Now, following the algorithm outlined above and the decision points to be considered, we need to find the feature whose split provides the maximum information gain.

Note: at the root node, the impurity level is at its maximum and no information has been gained yet. As we go down the tree, the entropy reduces and the information gain is maximized. Therefore, at each step we choose the feature that achieves the maximum gain.

Therefore, let us calculate the impurity measure for the weather dataset using entropy:
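A minimal worked calculation, assuming the 9 "yes" / 5 "no" split of the play column in the data sketched above:

```python
import math

# Entropy of the target column "play": 9 "yes" and 5 "no" out of 14 rows
p_yes, p_no = 9 / 14, 5 / 14
entropy_play = -(p_yes * math.log2(p_yes) + p_no * math.log2(p_no))
print(round(entropy_play, 3))   # -> 0.94
```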

For every feature we calculate the weighted entropy of the resulting split; for example, outlook and windy are calculated as below:
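The author's calculation images are not reproduced here, so the sketch below redoes the arithmetic in Python, assuming the standard class counts of the play-tennis data (2 yes / 3 no for sunny, 4/0 for overcast, 3/2 for rainy, and 3/3 vs 6/2 for windy true/false):

```python
import math

def entropy(yes, no):
    total = yes + no
    return -sum(c / total * math.log2(c / total) for c in (yes, no) if c)

h_parent = entropy(9, 5)                                    # ~0.940 for the whole dataset

# outlook: sunny -> 2 yes / 3 no, overcast -> 4 / 0, rainy -> 3 / 2
h_outlook = (5/14) * entropy(2, 3) + (4/14) * entropy(4, 0) + (5/14) * entropy(3, 2)
print("gain(outlook) =", round(h_parent - h_outlook, 3))    # -> 0.247

# windy: True -> 3 yes / 3 no, False -> 6 yes / 2 no
h_windy = (6/14) * entropy(3, 3) + (8/14) * entropy(6, 2)
print("gain(windy) =", round(h_parent - h_windy, 3))        # -> 0.048
```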

After calculating this for all features, the feature that yields the maximum reduction in impurity (i.e. the maximum information gain) is selected for the root node.

Below is the summary for all the features:

So our root node is Outlook.

We repeat the same procedure for the sub-trees until the full tree is grown. Below is the final Decision Tree:
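As a rough substitute for the missing figure, the sketch below fits and draws a tree with scikit-learn, reusing the `weather` DataFrame from the earlier snippet. Note that scikit-learn needs numeric inputs and always makes binary splits, so the drawn tree may differ in shape from the multiway tree in the original article.

```python
import matplotlib.pyplot as plt
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier, plot_tree

features = ["outlook", "temperature", "humidity", "windy"]

# `weather` is the DataFrame sketched earlier; encode the categorical columns numerically
X = OrdinalEncoder().fit_transform(weather[features])
y = weather["play"]

clf = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

plot_tree(clf, feature_names=features, class_names=clf.classes_, filled=True)
plt.show()
```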

Gini Index:

In scikit-learn, the Gini index is the default criterion used to evaluate impurity. In practice, the final outcome hardly differs between entropy and Gini, but since they are two different measures, it is worth knowing both.

The Gini index for a binary target variable is: Gini = 1 - (P(class 1))² - (P(class 0))²

A Gini score tells how good a split is by how mixed the classes are in the two groups it creates. A perfect separation results in a Gini score of 0, whereas the worst-case split, with a 50/50 class mix in each group, results in a score of 0.5.

We calculate it for every candidate split point (every value of every feature), choose the split with the lowest weighted Gini, and divide the data accordingly in our binary tree. We repeat this process recursively.
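A small sketch of scoring one candidate split this way; `gini_of_split` is an illustrative helper, not a library function.

```python
def gini_of_split(groups):
    """Weighted Gini index of a candidate split; `groups` is a list of label lists."""
    n = sum(len(g) for g in groups)
    score = 0.0
    for g in groups:
        if not g:
            continue
        proportions = [g.count(c) / len(g) for c in set(g)]
        score += (1 - sum(p * p for p in proportions)) * len(g) / n
    return score

print(gini_of_split([["yes", "yes"], ["no", "no"]]))   # perfect separation -> 0.0
print(gini_of_split([["yes", "no"], ["yes", "no"]]))   # 50/50 in both groups -> 0.5
```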

For a binary target variable, the maximum Gini index value is:

= 1 - (1/2)² - (1/2)²
= 1 - 2*(1/2)²
= 1 - 2*(1/4)
= 1 - 0.5
= 0.5

Plot to demonstrate the Entropy and Gini index for better understanding:
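Since the original plot is not reproduced here, below is a short matplotlib sketch that generates the same kind of curve: the impurity of a binary node as a function of the positive-class proportion.

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 0.999, 200)                 # proportion of the positive class
entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
gini = 1 - p ** 2 - (1 - p) ** 2

plt.plot(p, entropy, label="Entropy")
plt.plot(p, gini, label="Gini index")
plt.xlabel("proportion of positive class in the node")
plt.ylabel("impurity")
plt.legend()
plt.show()
```

Both curves peak at a 50/50 mix, where entropy reaches 1.0 and the Gini index reaches 0.5.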

Information Gain:

A less impure node requires less information to describe it, and a more impure node requires more. Entropy is the measure of this degree of disorganization in a system; information gain is the reduction in entropy achieved by a split. If the sample is completely homogeneous, the entropy is zero; if the sample is equally divided (50/50), the entropy is one. The algorithm chooses the split whose resulting nodes have lower entropy than the parent node and than the other candidate splits. The lower the entropy, the better.

Note: hyper-parameter tuning is a key step in any decision tree algorithm.

Main Hyper-parameters needed to be tuned:

  1. max_depth: Defines the maximum depth of the tree. It should generally be tuned, otherwise the tree will grow until it over-fits.
  2. min_samples_leaf: The minimum number of samples required at a leaf node. A split point at any depth is only considered if it leaves at least min_samples_leaf samples in each branch.
  3. max_leaf_nodes: Grows a tree with at most this many leaf nodes, keeping the best splits first.

The main reason for tuning these hyper-parameters is that if we do not control the growth of the tree, every leaf node will eventually hold a single sample at a very large depth (especially when the dataset has many features). This causes severe over-fitting, which reduces the accuracy of the model while increasing its complexity.
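A hedged sketch of tuning these three hyper-parameters with a grid search; the dataset (scikit-learn's built-in breast-cancer data) and the parameter grid are illustrative choices only, and the best values will depend on your own data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Search over the three hyper-parameters discussed above; an unconstrained tree
# would keep splitting until many leaves hold a single sample.
param_grid = {
    "max_depth": [3, 5, 7, None],
    "min_samples_leaf": [1, 5, 10],
    "max_leaf_nodes": [None, 10, 20],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```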

Advantages of a Decision Tree:

  • Decision trees are easy to interpret.
  • Building a decision tree requires little data preparation from the user, as there is no need to normalize or scale the data.

Disadvantages of a Decision Tree:

  • A Decision Tree generally tends to over-fit the data, which increases the model's complexity as well as its variance.
  • The Decision Tree is a greedy algorithm (it picks the locally best split at each step) and it is unstable: a small change in the dataset can have a large impact on the overall model.

Nevertheless, the decision tree is a base model that is always useful for machine learning professionals, as it also helps to visualize the structure of the dataset and tells us which features matter most.
