An Intro to Machine Learning — Decision Tree Classifier
Explaining a tree-based classifier used in Supervised Machine Learning classification problems.
Decision Tree is a supervised learning technique that can be used for classification problems. It is a tree-structured classifier, like the picture below.
Before we dive deeper into decision trees, let’s get familiar with some of the terminologies.
- Root Node — The root node is where the decision tree starts. It is the node at the top of the tree; from this node, the population starts dividing according to various features.
- Decision Nodes — The nodes we get after splitting the root node are called decision nodes.
- Terminal/Leaf Nodes — Terminal or leaf nodes are the final output nodes; the tree cannot be split further after reaching a leaf node.
- Sub-tree/Branch — A small portion of the tree.
- Pruning — Pruning is the process of removing a sub-tree/branch.
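To make these terms concrete, here is a hypothetical sketch of the structure behind them in Python (the `Node` class, its fields, and the example labels are illustrative assumptions, not part of any library):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    """A hypothetical sketch of a decision-tree node."""
    feature: Optional[str] = None      # the question asked at this node; None for a leaf
    left: Optional["Node"] = None      # left sub-tree/branch
    right: Optional["Node"] = None     # right sub-tree/branch
    prediction: Optional[str] = None   # set only on leaf nodes

    def is_leaf(self) -> bool:
        # a leaf node asks no further question
        return self.feature is None

# root node -> decision node -> leaf nodes
root = Node(feature="Temperature > 37.5",
            left=Node(feature="Cough = Dry",
                      left=Node(prediction="Covid-19"),
                      right=Node(prediction="Not Covid-19")),
            right=Node(prediction="Not Covid-19"))
print(root.is_leaf(), root.left.left.is_leaf())
```

Pruning, in these terms, would simply mean replacing a decision node's sub-tree with a single leaf node.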
Let’s look at an example to understand this better. Let’s say Bob, a 35-year-old man, comes into a hospital with a sore throat. Imagine you are the doctor attending to Bob. You would run a Covid-19 test, but the test kits are out of stock. You need an accurate way to decide whether Bob has a Covid-19 infection or not. You have a dataset of Covid-19 patients from the hospital.
Your task as a doctor (and a data scientist) is to create a decision tree and predict whether Bob should be treated as a Covid-19 patient or not. To build a tree, you always need a root. But how do we determine the root node of the tree? To do that, we need to understand the concept of impurity.
Impurity
To understand impurity, imagine you have three boxes, each containing 4 balls. The first box has 4 black balls, the second box has 3 black balls and 1 white ball, and the last box has 2 black balls and 2 white balls. Suppose a ball is randomly drawn from each box: how much information do you need to accurately tell the color of the ball?
From an information-measurement point of view, the first box needs the least information, because all of the balls inside it are black. The second box needs more information, because it has 1 white ball, and the last box needs the maximum information, as the number of balls of each color is the same. Since information is a measure of purity, we can say the first box is pure, the second box is less impure, and the third box is the most impure. So, how is impurity related to decision trees? The decision tree algorithm splits the samples based on a measurement of impurity: the feature whose split produces the purest child nodes becomes the root node.
So, how do we measure impurity? Here are two among many ways:
- Entropy
- Gini Impurity
Entropy
Entropy is a common concept in science, engineering, and information theory. In simple terms, entropy is a measurement of “surprise”. Consider our example of three boxes. The probability of drawing a black ball from box 1 is 1, because all the balls are black; the outcome is certain, so there is no surprise, and the entropy of the first box is zero. If a sample is homogeneous, the entropy is 0; if a sample is equally divided between the classes, the entropy reaches its maximum of 1. So the first box has the lowest entropy, the second box has more entropy, and the third box has the highest entropy. Mathematically, entropy is written as
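For a sample whose classes occur with probabilities $p_i$, the standard information-theory formula is:

$$\mathrm{Entropy} = -\sum_{i} p_i \log_2 p_i$$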
Gini Impurity
Gini impurity measures the impurity of a sample. It takes a value between 0 and 1 and is calculated as:
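For a sample whose classes occur with probabilities $p_i$, the standard formula is:

$$\mathrm{Gini} = 1 - \sum_{i} p_i^2$$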
A Gini impurity of 0 means the sample is perfectly homogeneous, as we saw with the first box of the example, whereas higher values mean greater inequality among the elements (a value of 1 is only approached as the number of classes grows; for two classes the maximum is 0.5).
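As a quick check of both measures, here is a minimal Python sketch, assuming the three-box example above with (black, white) ball counts:

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a tuple of class counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

def gini(counts):
    """Gini impurity of a tuple of class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# The three boxes as (black, white) ball counts
boxes = {"box 1": (4, 0), "box 2": (3, 1), "box 3": (2, 2)}
for name, counts in boxes.items():
    print(name, round(entropy(counts), 3), round(gini(counts), 3))
```

Both measures agree on the ordering: box 1 is pure, box 2 is less impure, and box 3 is the most impure.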
Splitting the Nodes
Let’s calculate the impurity of our dataset. For simplicity, we’re going to use the Gini impurity method.
For the Sore Throat feature we have
The Gini impurity values for Sore Throat are:
Gini Impurity(Sore Throat=Yes)=1-(5/8)²-(3/8)²=0.468
Gini Impurity(Sore Throat=No)=1-(3/5)²-(2/5)²=0.480
The weighted sum of Gini impurity for Sore Throat is
Gini Impurity(Sore Throat)=(8/13)*0.468 + (5/13)*0.480=0.473
For Temperature, we use 37.5 as the threshold.
The Gini impurity values for Temperature are:
Gini Impurity(Temperature > 37.5)=1-(6/8)²-(2/8)²=0.375
Gini Impurity(Temperature ≤ 37.5)=1-(2/5)²-(3/5)²=0.480
The weighted sum of Gini impurity for Temperature is
Gini Impurity(Temperature)=(8/13)*0.375 + (5/13)*0.480=0.415
For Cough we have
The Gini impurity values for Cough are:
Gini Impurity(Cough=Dry)=1-(3/6)²-(3/6)²=0.5
Gini Impurity(Cough=Phlegm)=1-(5/7)²-(2/7)²=0.408
The weighted sum of Gini impurity for Cough is
Gini Impurity(Cough)=(6/13)*0.5 + (7/13)*0.408=0.450
The final results are:
Gini Impurity(Sore Throat)=0.473
Gini Impurity(Temperature)=0.415
Gini Impurity(Cough)=0.450
From the results, the Gini impurity for Temperature is the lowest. Thus, the Temperature feature is the root node.
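The calculations above can be reproduced in a few lines of Python, using the class counts (Covid-19 = Yes, No) implied by the formulas in each step:

```python
def gini(counts):
    """Gini impurity of a tuple of class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def weighted_gini(groups):
    """Weighted Gini impurity of a split; groups is a list of (yes, no) counts."""
    n = sum(sum(g) for g in groups)
    return sum(sum(g) / n * gini(g) for g in groups)

# (Covid-19 = Yes, No) counts for each side of each candidate split
splits = {
    "Sore Throat": [(5, 3), (3, 2)],  # Yes / No
    "Temperature": [(6, 2), (2, 3)],  # > 37.5 / <= 37.5
    "Cough":       [(3, 3), (5, 2)],  # Dry / Phlegm
}
for feature, groups in splits.items():
    print(feature, round(weighted_gini(groups), 3))
```

The feature with the lowest weighted impurity, Temperature, is the one chosen as the root.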
Now let’s focus on the Temperature feature. We need to find the Gini impurity of Sore Throat and Cough for the samples with temperature > 37.5.
For Sore Throat we have
The Gini impurity values for Temperature > 37.5 and Sore Throat are:
Gini Impurity(Temperature > 37.5, Sore Throat=Yes)=1-(5/6)²-(1/6)²=0.278
Gini Impurity(Temperature > 37.5, Sore Throat=No)=1-(1/2)²-(1/2)²=0.5
The weighted sum of Gini impurity for Temperature > 37.5 and Sore Throat is
Gini Impurity(Temperature > 37.5, Sore Throat)=(6/8)*0.278 + (2/8)*0.5=0.333
For Cough we have
The Gini impurity values for Temperature > 37.5 and Cough are:
Gini Impurity(Temperature > 37.5, Cough=Dry)=1-(3/3)²-(0/3)²=0
Gini Impurity(Temperature > 37.5, Cough=Phlegm)=1-(3/5)²-(2/5)²=0.48
The weighted sum of Gini impurity for Temperature > 37.5 and Cough is
Gini Impurity(Temperature > 37.5, Cough)=(3/8)*0 + (5/8)*0.48=0.3
The final results are:
Gini Impurity(Temperature > 37.5, Sore Throat)=0.333
Gini Impurity(Temperature > 37.5, Cough)=0.30
From the results, the Gini impurity for Temperature > 37.5 and Cough is the lowest. Thus, the Cough feature is the decision node.
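The same computation can be repeated on the Temperature > 37.5 branch; a short sketch, using the class counts implied by the calculations above:

```python
def gini(counts):
    """Gini impurity of a tuple of class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def weighted_gini(groups):
    """Weighted Gini impurity of a split; groups is a list of (yes, no) counts."""
    n = sum(sum(g) for g in groups)
    return sum(sum(g) / n * gini(g) for g in groups)

# (Covid-19 = Yes, No) counts within the Temperature > 37.5 branch only
subsplits = {
    "Sore Throat": [(5, 1), (1, 1)],  # Yes / No
    "Cough":       [(3, 0), (3, 2)],  # Dry / Phlegm
}
for feature, groups in subsplits.items():
    print(feature, round(weighted_gini(groups), 3))
```

Cough gives the lower weighted impurity on this branch, so it becomes the decision node.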
Now let’s see the table for temperature > 37.5 and dry cough.
The decision is always ‘Yes’ for Covid-19 when the cough is dry, so we get a leaf node. The decision tree can now be viewed as
This should be enough to understand the whole process of impurity measurement for a decision tree. From here, we could continue with Sore Throat and phlegm cough, and we could also determine the decision nodes for the temperature ≤ 37.5 branch through the same process described above. For now, let’s try to implement the decision tree algorithm in Python.
Implementation
For the implementation in Python, we are going to use the Scikit-Learn module with the iris dataset.
The first thing we do is import the required Python modules.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import plot_confusion_matrix
from sklearn.datasets import load_iris
Load the dataset
iris = load_iris()
X, y = iris.data, iris.target
Train and test split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=1234)
Create the Decision Tree Classifier instance (you can use entropy for the impurity measurement by adding the argument criterion='entropy'; the default is Gini impurity)
clf = DecisionTreeClassifier()
Then you can train the Classifier (this is building the tree process)
clf.fit(X_train,y_train)
You can plot the created tree using
plot_tree(clf,filled=True,rounded=True,feature_names=iris.feature_names,class_names=iris.target_names)
Now for predicting the test dataset
clf.predict(X_test)
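To see how well the classifier does on the held-out data, here is a minimal end-to-end sketch (it repeats the steps above so it runs on its own; `accuracy_score` is scikit-learn's standard accuracy helper, and the exact score depends on the random split):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the data and split it, as in the steps above
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=1234)

# Build the tree on the training set and predict the test set
clf = DecisionTreeClassifier(random_state=1234).fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))
```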
Conclusion
In this article, we have learned:
- The Structure of Decision Tree
- The concept of impurity and how to measure it for the splitting process in the decision tree
- Implementing the Decision Tree classifier in Python.