An Intro to Machine Learning — Decision Tree Classifier

Gilang Satria Prayoga
7 min read · Mar 17, 2022


Explaining a tree-based classifier used in Supervised Machine Learning classification problems.

Photo by Pauline Bernfeld on Unsplash

Decision Tree is a Supervised Learning technique that can be used for classification problems. It is a tree-structured classifier, like the picture below.

Figure 1. Decision Tree. image by the Author

Before we dive deeper into decision trees, let’s get familiar with some of the terminologies.

  1. Root Node — The root node is where the decision tree starts. It is the node at the top of the tree, and from this node the population starts dividing according to various features.
  2. Decision Nodes — The nodes we get after splitting the root node are called decision nodes.
  3. Terminal/Leaf Nodes — Terminal/leaf nodes are the final output nodes; the tree cannot be split further after reaching a leaf node.
  4. Sub-tree/Branch — A small portion of the tree.
  5. Pruning — Pruning is the process of removing a sub-tree/branch.

Let’s see an example so we can understand better. Let’s say Bob, a 35-year-old man, comes into a hospital with a sore throat. Imagine you are the doctor attending to Bob. You would run a COVID-19 test, but the test kits are out of stock. You need an accurate way to decide whether Bob has a COVID-19 infection or not. You have a dataset of COVID-19 patients from the hospital.

Table 1. Imaginary COVID-19 dataset

Your task as a doctor (and a data scientist) is to build a decision tree and predict whether Bob should be treated as a COVID-19 patient or not. To build a tree, you always need a root. But how do we determine the root node of the tree? To do that, we need to understand the concept of impurity.

Impurity

To understand impurity, imagine you have three boxes, each containing 4 balls. The first box has 4 black balls, the second box has 3 black balls and 1 white ball, and the last box has 2 black balls and 2 white balls. Suppose a ball is randomly drawn from each box. How much information do you need to accurately tell the color of the ball?

From an information-measurement point of view, the first box needs the least information, because all the balls inside it are black. The second box needs more information, because it has 1 white ball, and the last box needs the maximum information, as the numbers of balls of both colors are equal. Since the information needed is a measure of purity, we can say the first box is pure, the second box is less impure, and the third box is the most impure. So, how is impurity related to decision trees? The decision tree algorithm splits the samples based on a measurement of impurity: the feature whose split produces the purest child nodes becomes the root node.

So, how do we measure impurity? Here are two among many ways:

  1. Entropy
  2. Gini Impurity

Entropy

Entropy is a common concept used in science, engineering, and information theory. In simple terms, entropy is a measurement of “surprise”. Let’s revisit our example of three boxes. The probability of drawing a black ball from box 1 is 1, because all the balls are black; there is no surprise in the outcome, so the entropy of the first box is zero. In general, if a sample is homogeneous, the entropy is 0, and if a sample is equally divided between two classes, the entropy is at its maximum of 1. So the first box has the lowest entropy, the second box has more entropy, and the third box has the highest entropy. Mathematically, entropy is written as

Entropy = $-\sum_{i=1}^{c} p_i \log_2 p_i$

Equation 1. Entropy, where $p_i$ is the proportion of samples belonging to class $i$ and $c$ is the number of classes
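As a quick sanity check, here is a minimal sketch of this formula in Python; the entropy() helper below is hand-rolled for this article, not a library function:

import numpy as np

def entropy(counts):
    # Shannon entropy (in bits) of a list of class counts
    p = np.asarray(counts, dtype=float) / sum(counts)
    p = p[p > 0]  # convention: 0 * log2(0) = 0
    return -(p * np.log2(p)).sum()

print(entropy([4, 0]))  # box 1: all black -> 0.0
print(entropy([3, 1]))  # box 2: 3 black, 1 white -> ~0.811
print(entropy([2, 2]))  # box 3: equally divided -> 1.0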

Gini Impurity

Gini impurity also measures the impurity of a sample. It has a value between 0 and 1 and is calculated as:

Gini Impurity = $1 - \sum_{i=1}^{c} p_i^2$

Equation 2. Gini Impurity, where $p_i$ is the proportion of samples belonging to class $i$

A Gini impurity of 0 means the sample is perfectly homogeneous, as we saw with the first box of the example, whereas larger values mean greater inequality among the elements. For a two-class problem, the maximum is 0.5, reached when the two classes are equally divided.
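As with entropy, a minimal hand-rolled sketch (again our own helper, not a library function) confirms this on the three boxes:

import numpy as np

def gini(counts):
    # Gini impurity of a list of class counts
    p = np.asarray(counts, dtype=float) / sum(counts)
    return 1.0 - (p ** 2).sum()

print(gini([4, 0]))  # box 1 -> 0.0
print(gini([3, 1]))  # box 2 -> 0.375
print(gini([2, 2]))  # box 3 -> 0.5, the maximum for two classes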

Splitting the Nodes

Let’s calculate the impurity of our dataset. For simplicity, we’re going to use the Gini impurity method.

For the Sore Throat feature we have

Table 2. Sore throat frequencies

The Gini impurity values for Sore Throat are:

Gini Impurity(Sore Throat = Yes) = 1 - (5/8)² - (3/8)² = 0.469

Gini Impurity(Sore Throat = No) = 1 - (3/5)² - (2/5)² = 0.480

The weighted sum of Gini impurity for sore throat is

Gini Impurity(Sore Throat) = (8/13) * 0.469 + (5/13) * 0.480 = 0.473

For Temperature, we define 37.5 as the threshold.

Table 3. Temperature frequencies

The Gini impurity values for Temperature are:

Gini Impurity(Temperature > 37.5) = 1 - (6/8)² - (2/8)² = 0.375

Gini Impurity(Temperature ≤ 37.5) = 1 - (2/5)² - (3/5)² = 0.480

The weighted sum of Gini impurity for temperature is

Gini Impurity(Temperature) = (8/13) * 0.375 + (5/13) * 0.480 = 0.415

For Cough we have

Table 4. Cough frequencies

The Gini impurity values for Cough are:

Gini Impurity(Cough = Dry) = 1 - (3/6)² - (3/6)² = 0.5

Gini Impurity(Cough = Phlegm) = 1 - (5/7)² - (2/7)² = 0.408

The weighted sum of Gini impurity for cough is

Gini Impurity(Cough) = (6/13) * 0.5 + (7/13) * 0.408 = 0.450

The final results are:

Gini Impurity(Sore Throat) = 0.473

Gini Impurity(Temperature) = 0.415

Gini Impurity(Cough) = 0.450

From the results, the Gini impurity for Temperature is the lowest. Thus, the temperature feature becomes the root node.

Figure 2. Temperature as a root node. image by the Author
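Before moving on, we can reproduce all three weighted Gini values in a few lines of Python. This is a minimal sketch: the (covid, not covid) class counts per branch are taken from the frequency tables above, and gini() is the same hand-rolled helper as before, repeated here so the snippet runs on its own.

import numpy as np

def gini(counts):
    # Gini impurity of a list of class counts
    p = np.asarray(counts, dtype=float) / sum(counts)
    return 1.0 - (p ** 2).sum()

def weighted_gini(groups):
    # groups: one [covid, not-covid] count pair per branch of the split
    total = sum(sum(g) for g in groups)
    return sum(sum(g) / total * gini(g) for g in groups)

print(weighted_gini([[5, 3], [3, 2]]))  # Sore Throat -> ~0.473
print(weighted_gini([[6, 2], [2, 3]]))  # Temperature -> ~0.415
print(weighted_gini([[3, 3], [5, 2]]))  # Cough       -> ~0.450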

Now let’s focus on the temperature > 37.5 branch. We need to find the Gini impurity for sore throat and cough given temperature > 37.5.

Table 5. COVID-19 table for temperature above 37.5

For Sore Throat we have

Table 6. Frequency of sore throat for temperature above 37.5

The Gini impurity values for Sore Throat given temperature > 37.5 are:

Gini Impurity(Temperature > 37.5, Sore Throat = Yes) = 1 - (5/6)² - (1/6)² = 0.278

Gini Impurity(Temperature > 37.5, Sore Throat = No) = 1 - (1/2)² - (1/2)² = 0.5

The weighted sum of Gini impurity for temperature > 37.5 and sore throat is

Gini Impurity(Temperature > 37.5, Sore Throat) = (6/8) * 0.278 + (2/8) * 0.5 = 0.333

For Cough we have

Table 7. Frequency of cough for temperature above 37.5

The Gini impurity values for Cough given temperature > 37.5 are:

Gini Impurity(Temperature > 37.5, Cough = Dry) = 1 - (3/3)² - (0/3)² = 0

Gini Impurity(Temperature > 37.5, Cough = Phlegm) = 1 - (3/5)² - (2/5)² = 0.48

The weighted sum of Gini impurity for temperature > 37.5 and cough is

Gini Impurity(Temperature > 37.5, Cough) = (3/8) * 0 + (5/8) * 0.48 = 0.3

The final results are:

Gini Impurity(Temperature > 37.5, Sore Throat) = 0.333

Gini Impurity(Temperature > 37.5, Cough) = 0.30

From the results, the Gini impurity for Temperature > 37.5 and Cough is the lowest. Thus, the cough feature becomes the decision node for this branch.
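Running the weighted_gini() helper from the earlier sketch on just these 8 samples (counts taken from Tables 6 and 7) reproduces both numbers:

print(weighted_gini([[5, 1], [1, 1]]))  # Sore Throat given temp > 37.5 -> ~0.333
print(weighted_gini([[3, 0], [3, 2]]))  # Cough given temp > 37.5 -> 0.3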

Now let’s see the table given temperature > 37.5 and dry cough.

Table 8. Frequency of dry cough for temperature above 37.5

The decision is always “Yes” for COVID-19 when the cough is dry, so we have reached a leaf node. The tree now has temperature at the root, cough as a decision node, and a “Yes” leaf under the dry-cough branch.

This should be enough to understand the whole process of impurity measurement for a decision tree. From here, we could continue with Sore Throat and phlegm cough, and we could determine the decision nodes for the temperature ≤ 37.5 branch through the same process described above. For now, let’s try to implement the decision tree algorithm in Python.

Implementation

For the implementation in Python, we are going to use the scikit-learn module with the iris dataset.

The first thing we do is import the Python modules we need

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree

Load the dataset

iris = load_iris()
X, y = iris.data, iris.target

Train and test split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)

The Decision Tree Classifier instance (you could use entropy for the impurity measurement by adding the argument criterion='entropy'; the default is Gini impurity):

clf = DecisionTreeClassifier()  # criterion='gini' by default

Then you can train the classifier on the training data (this is the tree-building process)

clf.fit(X_train, y_train)

You can plot the fitted tree using

plot_tree(clf, filled=True, rounded=True, feature_names=iris.feature_names, class_names=iris.target_names)
plt.show()

Now we can predict on the test dataset

clf.predict(X_test)
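To get a rough sense of how well the tree does on the held-out data, we can compare these predictions against y_test using accuracy_score from sklearn.metrics:

from sklearn.metrics import accuracy_score

y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))  # fraction of test samples classified correctly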

Conclusion

In this article, we have learned:

  1. The structure of a decision tree
  2. The concept of impurity and how to measure it for the splitting process in the decision tree
  3. Implementing the Decision Tree classifier in Python.
