Decision Tree Algorithm for Classification: A Basic Introduction with an Example in Python

Kinsuk Ghatak
8 min read · May 11, 2022

Imagine you have received a new job offer and want to decide whether to accept or reject it. You weigh countless factors: salary, distance and commute time to the office, other perks and benefits, professional growth, and so on. But while making the final decision, you do not consider all of these factors at once. In your mind, you process the information through a sequence of if-else branches, like the picture below:

Decision tree in real life (Image Ref: https://regenerativetoday.com/)

You think about the salary first; it becomes the most important criterion, the starting point of your analysis. If the salary criterion is met, say it is above $50K, you then ask whether the travel time to the office is more or less than an hour. If the office is close by and easy to reach, you start considering whether it offers free coffee and other perks. If all of these conditions are met, you finally accept the offer.

Definition of Decision tree algorithm

The decision tree algorithm works in the same way. It mimics how the human brain thinks and decides: by asking a series of clarifying questions, one at a time.

Definition: In machine learning, a decision tree is a supervised learning algorithm that uses a sequence of if-else conditions to obtain predictions from a series of feature-based splits. It starts at a root node and ends with decisions made at the leaf nodes.

As in the job-offer example we discussed at the beginning, the root node is the salary criterion and the terminal nodes, or leaf nodes, represent the final decision (whether or not you accept the offer).

Some quick definitions

· Root Node — the node at the top of a decision tree, from which the population starts getting divided according to various features.

· Decision Nodes — the nodes obtained by splitting the root node (or another internal node) are called decision nodes.

· Leaf Nodes — the nodes where further splitting is not possible are called leaf nodes or terminal nodes.

A decision tree is, in general, termed a Classification and Regression Tree (CART) and can be used both for predicting continuous variables and for classification problems. However, in this article, we will confine ourselves to a real-world classification example.

Growing a tree
A decision tree algorithm begins at the root node and proceeds downward in search of the most homogeneous sets of points. The objective of the algorithm is to create splits at each node such that the resulting nodes (sets of observations) are as homogeneous, or pure, as possible.

As can be seen in the figure below, node A is the most impure node, with an equal mix of blue and yellow dots; node C is all blue and hence the purest; and node B falls somewhere in between.

Three nodes with three different purity levels

Concept of entropy and splits

Let us now try to understand what exactly is meant by entropy. Think back to your Physics 101 lesson on thermodynamics. Remember what entropy means there?

As stated in Wikipedia: “Entropy is a scientific concept as well as a measurable physical property that is most commonly associated with a state of disorder, randomness, or uncertainty”. (Ref.: https://en.wikipedia.org/wiki/Entropy)

Similar to the concept we learned in physics, in the context of decision trees entropy measures the degree of randomness, or impurity, in a node.

Entropy is calculated as:

E = − Σ pᵢ · log₂(pᵢ)

Here pᵢ is the proportion of a certain class of data within a node. For example, the proportion of blue dots in node B in the picture above is 12/15, or 4/5, and the proportion of yellow dots is 3/15, or 1/5.

Hence the entropy at node B is E(B) = −(4/5)·log₂(4/5) − (1/5)·log₂(1/5) ≈ 0.26 + 0.46 ≈ 0.72.
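
Since entropy is just a one-line formula, the following short Python sketch (not part of the original post; the counts used for nodes A and C are illustrative) reproduces the calculation:

import math

def entropy(class_counts):
    """Entropy (in bits) of a node, given the number of points in each class."""
    total = sum(class_counts)
    ent = 0.0
    for count in class_counts:
        if count == 0:              # convention: 0 * log(0) = 0
            continue
        p = count / total
        ent -= p * math.log2(p)
    return ent

print(round(entropy([12, 3]), 2))   # node B (12 blue, 3 yellow)  -> 0.72
print(round(entropy([8, 8]), 2))    # an even 50/50 node like A   -> 1.0
print(round(entropy([15, 0]), 2))   # a pure node like C          -> 0.0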

Methods of splitting and growing a tree (concept of information gain)

While constructing a decision tree model, it is vital to select the right predictor for each split as the tree grows from top to bottom. To choose among the candidate features, the idea of information gain is used, which is built on the principle of maximum entropy reduction while traversing from the root node down to the leaf nodes using an appropriate set of independent variables, or features.


Growing a Tree (Ref.: https://www.istockphoto.com)

The concept of information gain is presented below:

Information Gain = E(parent) − Σ (nᵢ / N) · E(childᵢ)

where N is the number of observations in the parent node, nᵢ is the number of observations in the i-th child node, and E(·) is the entropy defined above. In other words, information gain is the entropy of the parent minus the weighted average entropy of its children.

Say, in a practical exercise, you want to figure out which factor is more important for going to the gym: energy level or motivation. While exploring this, the following set of responses, organised in the form of a decision tree, was collected.

Say we asked 30 people; 16 of them will go to the gym on a particular day and 14 said they will not. When we asked how motivated they felt about going to the gym, 8 said they were not motivated (of these 8, 7 will still go to the gym and 1 will not), 10 said they were neutral, and 12 said they felt highly motivated.

A similar question was asked on the energy level front. The response is represented below as a tree.

Using a similar set of calculations, it can be shown that the information gain if we split on the feature Motivation would be about 0.13.

Therefore, it is evident that the information gain, or reduction in entropy, would be higher if we selected Energy, and for this reason the tree would choose “Energy” as the next splitting criterion.

The split with the highest information gain is taken as the first split, and the process continues until all child nodes are pure or until the information gain is 0. This is why decision tree algorithms are also called greedy algorithms: at each step they take the locally best split and keep growing the tree until every node becomes pure.
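
To make the information-gain calculation concrete, here is a small Python sketch. Only the overall counts (16 will go, 14 will not) and the breakdown of the “not motivated” group (7 will go, 1 will not) are given above; the breakdowns assumed for the “neutral” and “highly motivated” groups are illustrative guesses.

import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def information_gain(parent_counts, children_counts):
    """Entropy of the parent minus the weighted average entropy of its children."""
    n_parent = sum(parent_counts)
    weighted = sum(sum(child) / n_parent * entropy(child)
                   for child in children_counts)
    return entropy(parent_counts) - weighted

parent = [16, 14]                        # [will go, will not go]
motivation = [[7, 1], [4, 6], [5, 7]]    # not motivated / neutral / highly motivated
                                         # (the last two breakdowns are assumed)
print(round(information_gain(parent, motivation), 2))   # ~0.14, close to the ~0.13 quoted above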

However, growing a tree until every node is pure is not always feasible, owing to the computational cost and the risk of over-fitting the training data. That is why the idea of pruning originated. The growth of the decision tree can be limited by trimming branches, either through hyper-parameter tuning or through cost-complexity pruning. We will not discuss those procedures here, as they deserve a separate discussion. However, we should acknowledge that growing a tree beyond a reasonable depth will over-fit the training data, and the model will then perform poorly on the validation set: a classic example of the bias-variance trade-off.
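
Although a full treatment of pruning is out of scope here, a brief sketch (assuming scikit-learn and a synthetic data set, not part of the original post) shows what the two approaches look like in code:

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=1)

# 1) Pre-pruning via hyper-parameters: cap the depth and the minimum leaf size.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=10).fit(X, y)

# 2) Post-pruning via cost-complexity pruning: larger ccp_alpha values
#    remove more branches; cost_complexity_pruning_path() lists candidate alphas.
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X, y)
post_pruned = DecisionTreeClassifier(ccp_alpha=path.ccp_alphas[-2],
                                     random_state=1).fit(X, y)

print(pre_pruned.get_depth(), post_pruned.get_depth())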

Example with python implementation

We will now focus on a real-world problem and an implementation of the decision tree algorithm using the scikit-learn package in Python.

We use a data set from the UCI Machine Learning Repository.

This data set is taken from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the model would be to predict whether or not a patient has diabetes, based on certain features included in the data set.

The dependent variable in this case is a binary 0/1 flag, where 0 stands for a non-diabetic patient and 1 means the patient is diabetic.

More details on the data can be found in this link

First, we import the required libraries and set the column names for the data set for easier interpretation.
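
The original snippet appears as an image in the post; a representative sketch (the file name diabetes.csv and the short column names are assumptions) might look like this:

import pandas as pd

col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin',
             'bmi', 'pedigree', 'age', 'label']
# header=0 assumes the CSV ships with its own header row, which we replace
pima = pd.read_csv("diabetes.csv", header=0, names=col_names)
print(pima.head())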

The first 5 rows of the data set look like below:

First 5 rows of the diabetes data set

After loading the data, we examine its structure and variables, identify the target and feature variables, and divide the data into training and testing sets in a 70:30 ratio.

Code snippet for train-test split
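
Since that snippet is also an image, here is a representative sketch, assuming scikit-learn and the column names defined earlier:

from sklearn.model_selection import train_test_split

feature_cols = ['pregnant', 'glucose', 'bp', 'skin', 'insulin',
                'bmi', 'pedigree', 'age']
X = pima[feature_cols]   # features
y = pima['label']        # target (0 = non-diabetic, 1 = diabetic)

# 70% of the data for training, 30% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)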

After the original data is divided into the training and validation parts, we train the decision tree model. The model is then used to predict the results for the testing set.

While training the model, we also perform some hyper-parameter tuning; for example, in this case we restrict the max_depth option to 3.

Code snippet for model development and validation
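
A representative sketch of that snippet (the entropy criterion and random_state are assumptions; max_depth=3 is stated in the text):

from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=1)
clf = clf.fit(X_train, y_train)

# Evaluate on both the training and the testing sets
print("Train accuracy:", metrics.accuracy_score(y_train, clf.predict(X_train)))
print("Test accuracy:", metrics.accuracy_score(y_test, clf.predict(X_test)))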

The snippet below shows the model accuracy achieved on the training and testing data sets:

Accuracy of the decision tree model on training and testing data sets

In the final step, we visualize the model using the pydotplus and graphviz packages alongside scikit-learn.

Code snippet for visualization of the decision tree model
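
A representative sketch of the visualization code (assuming the pydotplus and graphviz packages are installed, and the clf and feature_cols defined above):

from io import StringIO
from sklearn.tree import export_graphviz
import pydotplus

dot_data = StringIO()
export_graphviz(clf, out_file=dot_data, filled=True, rounded=True,
                special_characters=True, feature_names=feature_cols,
                class_names=['non-diabetic', 'diabetic'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('diabetes_tree.png')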

The final structure of the model looks like below:

Visual representation of the decision tree model

Interpretation of the model

As observed, the splitting of the tree starts with the variable “glucose”. Depending on the level of glucose in a person’s blood, the next variable the decision tree model takes into consideration is “BMI”, or Body Mass Index. Similarly, the third important feature here is “age”.
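
One quick way to check this ordering numerically (not shown in the original post) is to look at the fitted model’s impurity-based feature importances:

import pandas as pd

importances = pd.Series(clf.feature_importances_, index=feature_cols)
print(importances.sort_values(ascending=False))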

Conclusion

In this article, we have introduced the concept of a decision tree for classification problems and touched upon how a tree is constructed, using a real example. I hope this is a good starting point for anyone who wants to study and apply the decision tree algorithm and begin their ML journey. Further reading and study material can be explored on hyper-parameter tuning and on the application of tree-based methods to continuous target variables as well.

Disclaimer: Views and opinions expressed in this blog are personal, and belong solely to the author, and not necessarily to the author’s employer, organization, committee or other group or individual.


Kinsuk Ghatak

A risk analytics and data science professional with over 8 years of experience across retail, marketing and BFSI sectors. Currently working with a Big 4 firm.