Step by Step Decision Tree: ID3 Algorithm From Scratch in Python [No Fancy Library]

Tareq Rahman Joy · Published in Geek Culture · Mar 27, 2021 · 9 min read

We all know about the decision tree algorithm ID3. Some of us may have already worked through it mathematically for academic purposes. If you have not, no problem: we will cover the basics here as well. The algorithm is quite easy and straightforward. So, let's start!

Objectives:

  1. Knowing the basics of the ID3 algorithm
  2. Loading CSV data in Python (using the pandas library)
  3. Training and building a decision tree using the ID3 algorithm from scratch
  4. Predicting from the tree
  5. Finding out the accuracy

Step 1: Observing The dataset

First, we should look at our dataset, ‘Play Tennis’. It is a very famous dataset for worked examples. Yes! You guessed right! We are going to use this easy dataset so that you can understand the Python implementation more easily.
Link of the dataset: https://www.kaggle.com/tareqjoy/trainplaytennis

Table 1: Play Tennis Dataset

After observing the dataset we can say that:
Features: Outlook, Temperature, Humidity, Wind
Label: Play Tennis (The output feature that we want to predict)
Class: Yes, No (Unique values of the label)

Step 2: Importing the necessary basic python libraries

We are going to use pandas for manipulating the dataset and numpy for the mathematical calculations.
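A minimal import cell for this would be:

import pandas as pd   # dataframe loading and manipulation
import numpy as np    # log2 for the entropy calculations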

Step 3: Reading the dataset

We are going to read the dataset (CSV file) and load it into a pandas dataframe. As you can see below, train_data_m is our dataframe.
With the head() method of the dataframe we can view the first 5 rows.
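A minimal sketch of the loading code (the file name PlayTennis.csv is an assumption; use the name of the CSV downloaded from the Kaggle link):

import pandas as pd

# load the training dataset (CSV) into a pandas dataframe
train_data_m = pd.read_csv("PlayTennis.csv")

train_data_m.head()  # view the first 5 rows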

The output will be:

The output of head() method

Step 4: Calculating the entropy of the whole dataset

Mathematically, we first have to calculate the entropy of the whole dataset, like below:

Total rows = 14
Rows with "Yes" class = 9
Rows with "No" class = 5

Complete entropy of the dataset:
H(S) = - p(Yes) * log2(p(Yes)) - p(No) * log2(p(No))
     = - (9/14) * log2(9/14) - (5/14) * log2(5/14)
     = - (-0.41) - (-0.53)
     = 0.94

So, here we will do the same as the above equation.
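A possible implementation, sketched to match the method description below (the function name calc_total_entropy is an assumed name):

import numpy as np

def calc_total_entropy(train_data, label, class_list):
    total_row = train_data.shape[0]  # total number of rows in the dataset
    total_entr = 0

    for c in class_list:  # for each class ("Yes", "No") of the label
        total_class_count = train_data[train_data[label] == c].shape[0]  # rows with that class
        if total_class_count != 0:
            # - p(c) * log2(p(c)), summed over the classes
            total_entr += - (total_class_count / total_row) * np.log2(total_class_count / total_row)

    return total_entr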

Method description:
Calculates the entropy of the whole dataset.
train_data: a pandas dataframe
label: string, name of the label of the dataframe (=Play Tennis)
class_list: list, unique classes of the label (=[Yes, No]).
returns: float, calculated entropy of the whole dataframe (=0.94)

Step 5: Calculating the entropy for the filtered dataset

Using the table from Step 1, the mathematical calculation for the feature Outlook will be like below:

Categorical values of Outlook: Sunny, Overcast and Rain

Total count of rows containing:
Sunny = 5
Sunny & Yes = 2
Sunny & No = 3
>> H(Outlook=Sunny) = -(2/5)*log2(2/5) - (3/5)*log2(3/5) = 0.971

Total count of rows containing:
Rain = 5
Rain & Yes = 3
Rain & No = 2
>> H(Outlook=Rain) = -(3/5)*log2(3/5) - (2/5)*log2(2/5) = 0.971

Total count of rows containing:
Overcast = 4
Overcast & Yes = 4
Overcast & No = 0
>> H(Outlook=Overcast) = -(4/4)*log2(4/4) - 0 = 0

Note: We have to do the same for all features like Wind, Humidity etc.
The equivalent Python implementation will be like below:
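A sketch matching the method description below (the name calc_entropy is assumed):

import numpy as np

def calc_entropy(feature_value_data, label, class_list):
    class_count = feature_value_data.shape[0]  # number of rows that have this feature value
    entropy = 0

    for c in class_list:
        # rows that have this feature value and belong to class c
        label_class_count = feature_value_data[feature_value_data[label] == c].shape[0]
        if label_class_count != 0:
            probability_class = label_class_count / class_count
            entropy += - probability_class * np.log2(probability_class)

    return entropy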

Method description:
Calculates entropy of a specific feature = value.
feature_value_data: a pandas dataframe, which contains data that has a specific value of a feature (Ex. Only data with Outlook = Sunny)
label: string, name of the label of the dataframe (=Play Tennis)
class_list: list, unique classes of the label (=[Yes, No]).
returns: float, calculated entropy of the feature value dataframe (Ex. for Outlook = Sunny, returns 0.971)

Step 6: Calculating information gain for a feature

After calculating the entropy, we have to calculate the information gain of that feature. Mathematically, we first have to calculate the weighted information of that feature like this (for the feature Outlook):

I(Outlook) = p(Sunny) * H(Outlook=Sunny) + p(Rain) * H(Outlook=Rain) + p(Overcast) * H(Outlook=Overcast)
= (5/14)*0.971 + (5/14)*0.971 + (4/14)*0
= 0.693

Then, we subtract this from the total entropy of the dataset; the result is the information gain of the feature.

Information Gain = H(S) - I(Outlook)
= 0.94 - 0.693
= 0.247

In Python, we do it like this:
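A sketch matching the method description below; it reuses the calc_total_entropy and calc_entropy sketches from Steps 4 and 5, and the name calc_info_gain is assumed:

def calc_info_gain(feature_name, train_data, label, class_list):
    feature_value_list = train_data[feature_name].unique()  # unique values of the feature (e.g. Sunny, Rain, Overcast)
    total_row = train_data.shape[0]
    feature_info = 0.0

    for feature_value in feature_value_list:
        feature_value_data = train_data[train_data[feature_name] == feature_value]  # rows with this feature value
        feature_value_count = feature_value_data.shape[0]
        feature_value_entropy = calc_entropy(feature_value_data, label, class_list)  # entropy of the subset (Step 5)
        feature_info += (feature_value_count / total_row) * feature_value_entropy    # weighted entropy I(feature)

    # information gain = entropy of the whole dataset - weighted entropy of the feature
    return calc_total_entropy(train_data, label, class_list) - feature_info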

Method description:
Calculates information gain of a feature.
feature_name: string, the name of the feature that we want to find the information gain (Ex. Outlook)
train_data: a pandas dataframe
label: string, name of the label of the dataframe (=Play Tennis)
class_list: list, unique classes of the label (=[Yes, No]).
returns: float, calculated information gain of the feature (Ex. for Outlook, returns 0.247)

Step 7: Finding the most informative feature (feature with highest information gain)

Just as we did for the Outlook feature, we have to calculate the information gain for every feature in the dataset and then select the one with the highest information gain. Calculating mathematically, we get the values below:

Information gain:
Outlook = 0.247 (Highest value)
Temperature = 0.0292
Humidity = 0.153
Wind = 0.048

As the feature Outlook has the highest information gain, it will be selected as our tree node.
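In code, the selection can be sketched like this (the name find_most_informative_feature and the reuse of calc_info_gain from Step 6 are assumptions):

def find_most_informative_feature(train_data, label, class_list):
    feature_list = train_data.columns.drop(label)  # candidate features (the label column is excluded)

    max_info_gain = -1
    max_info_feature = None

    for feature in feature_list:
        feature_info_gain = calc_info_gain(feature, train_data, label, class_list)  # Step 6
        if feature_info_gain > max_info_gain:  # keep the feature with the highest gain
            max_info_gain = feature_info_gain
            max_info_feature = feature

    return max_info_feature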

Method description:
Finds the most informative feature from the current dataset.
train_data: a pandas dataframe
label: string, name of the label of the dataframe (=Play Tennis)
class_list: list, unique classes of the label (=[Yes, No]).
returns: string, the feature name

Step 8: Adding a node to the tree

As we have found the feature with the highest information gain, we have to generate a node in the tree with that feature's values as branches. For example, we have selected Outlook, so we add Outlook as a node in the tree and its values Sunny, Rain and Overcast as branches.

If all rows with a particular value of the feature belong to only one class (Ex. only rows with Play Tennis = ‘Yes’, or only ‘No’), we can say that the feature value represents a pure class. If the value does not represent a pure class, we have to extend the branch further until we find a pure class.

Outlook is selected as Node.
(Outlook = Sunny): Not pure class, contains both class Yes and No
(Outlook = Overcast): Pure class, contains only one class Yes
(Outlook = Rain): Not pure class, contains both class Yes and No
The generated tree after finding Outlook

After finding a pure class, we have to remove from the dataset the rows corresponding to that feature value. Ex: we have to remove the rows with Outlook = Overcast. The resultant dataset will look like below:

The updated dataset (Without Outlook = Overcast)

N.B. We have to use the updated dataset for the next iterations, e.g. it becomes train_data when we calculate the next information gain.

The equivalent code will be:
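A sketch matching the method description below (the name generate_sub_tree is assumed):

def generate_sub_tree(feature_name, train_data, label, class_list):
    feature_value_count_dict = train_data[feature_name].value_counts(sort=False)  # count of each value of the feature
    tree = {}  # sub tree / node

    for feature_value, count in feature_value_count_dict.items():
        feature_value_data = train_data[train_data[feature_name] == feature_value]  # rows with this feature value

        assigned_to_node = False  # flag: has this value been turned into a pure-class leaf?
        for c in class_list:
            class_count = feature_value_data[feature_value_data[label] == c].shape[0]

            if class_count == count:  # every row with this value has class c => pure class
                tree[feature_value] = c  # add a leaf node
                train_data = train_data[train_data[feature_name] != feature_value]  # remove those rows from the dataset
                assigned_to_node = True
        if not assigned_to_node:
            tree[feature_value] = "?"  # impure: mark the branch as expandable

    return tree, train_data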

Method description:
Generates subtree of a feature and removes the feature = value from the dataset. The tree might contain ‘?’ as a value if the tree node isn’t a pure class.
feature_name: string, the name of the feature that we want to add to tree and shrink dataset (Ex. Outlook)
train_data: a pandas dataframe
label: string, name of the label of the dataframe (=Play Tennis)
class_list: list, unique classes of the label (=[Yes, No]).
returns: tuple (dictionary, dataframe), the tree node with its branches and the updated dataset

Step 9: Performing ID3 Algorithm and generating Tree

Now, we should assemble the methods, which will recursively perform Steps 4 to 8 (a code sketch follows the lists below). So, the overall steps are:

  • Finding the most informative feature
  • Making a tree node with a feature name and feature values as branches
    - If pure class, adding leaf node (= Class) to the tree node
    - If impure class, adding an expandable node (= ‘?’) to the tree node
  • Shrinking/Updating the dataset according to the pure class
  • Adding the node with branches into a tree
  • Expanding the branch of the next impure class (= ‘?’) with the updated dataset

The recursion endpoints:

  • The dataset becomes empty after updating
  • There is no expandable branch (= all branches are pure classes)
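A sketch of the recursive method, matching the method description below (the name make_tree and the reuse of the Step 7 and Step 8 sketches are assumptions):

def make_tree(root, prev_feature_value, train_data, label, class_list):
    if train_data.shape[0] == 0:  # recursion endpoint: the dataset became empty
        return

    max_info_feature = find_most_informative_feature(train_data, label, class_list)  # Step 7
    tree, train_data = generate_sub_tree(max_info_feature, train_data, label, class_list)  # Step 8

    if prev_feature_value is not None:  # attach the node under the branch of the previous feature value
        root[prev_feature_value] = {max_info_feature: tree}
        next_root = root[prev_feature_value][max_info_feature]
    else:  # first call: this node becomes the root of the tree
        root[max_info_feature] = tree
        next_root = root[max_info_feature]

    for node, branch in list(next_root.items()):  # walk the branches of this node
        if branch == "?":  # expandable branch: recurse with the rows having this feature value
            feature_value_data = train_data[train_data[max_info_feature] == node]
            make_tree(next_root, node, feature_value_data, label, class_list)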

Method description:
Generates a tree using a dictionary of dictionaries. A leaf node of the tree is a class name stored under a feature value. The resultant tree is returned by reference.
root: dictionary, which will contain the resultant tree through recursive subtree, initially it should be an empty dictionary, after the recursive calls it should contain the result
prev_feature_value: Any datatype (Int or Float or String etc.) depending on the datatype of the previous feature, the previous value of the pointed node/feature, initially it should be None
train_data: a pandas dataframe
label: string, name of the label of the dataframe (=Play Tennis)
class_list: list, unique classes of the label (=[Yes, No])
returns: None

Step 10: Finding unique classes of the label and Starting the algorithm

First, let's see what the expected tree should look like:

The complete tree

We can start calling the recursive tree building algorithm of ID3 after finding the class names.
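A sketch of this starter method, matching the method description below (the name id3 is assumed):

def id3(train_data_m, label):
    train_data = train_data_m.copy()          # work on a copy so the original dataframe stays untouched
    tree = {}                                 # the resultant tree, filled by reference
    class_list = train_data[label].unique()   # unique classes of the label (['Yes', 'No'])
    make_tree(tree, None, train_data, label, class_list)  # start the recursive tree building (Step 9)
    return tree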

Method description:
Generates id3 tree.
train_data_m: a pandas dataframe
label: string, name of the label of the dataframe (=Play Tennis)
returns: (nested) dictionary, the decision tree

This is how we can start the id3:
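A sketch of that call (assuming the dataframe and function names used above):

tree = id3(train_data_m, 'Play Tennis')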

If we print the tree, we will see something like this:

{
    'Outlook': {
        'Rain': {
            'Wind': {
                'Strong': 'No',
                'Weak': 'Yes'
            }
        },
        'Sunny': {
            'Humidity': {
                'High': 'No',
                'Normal': 'Yes'
            }
        },
        'Overcast': 'Yes'
    }
}

The root of the tree/dictionary is Outlook. The value of that key is another dictionary whose keys correspond to the values of the feature (= ‘Rain’, ‘Overcast’, ‘Sunny’), and so on. At the end of each nested dictionary is a class name (= ‘Yes’, ‘No’).
We can see that the generated tree is identical to the expected output.
So, our algorithm is working like a charm!

Step 11: Predicting from the tree

We have generated the tree; now we can use it for prediction. We will recursively traverse the nested dictionary until a leaf node (= class) is found. Remember that the key of a dictionary is a feature name (a string), and its value is either another dictionary, which means we have to traverse further, or a basic datatype such as string, int or float (depending on the datatype of the class), which means it is a leaf node.

If we want to predict for Outlook = ‘Rain’ and Wind = ‘Strong’, we have to traverse the dictionary like this:

tree[‘Outlook’][‘Rain’][‘Wind’][‘Strong’]
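A sketch matching the method description below (the name predict is assumed):

def predict(tree, instance):
    if not isinstance(tree, dict):  # leaf node reached: it holds the class name
        return tree

    root_node = next(iter(tree))          # the feature this node tests (e.g. 'Outlook')
    feature_value = instance[root_node]   # value of that feature in the instance
    if feature_value in tree[root_node]:  # follow the matching branch
        return predict(tree[root_node][feature_value], instance)
    return None                           # feature value not seen during training: no prediction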

Method description:
Predicts from the generated tree using the feature set/instance
tree: dictionary (of dictionaries), a decision tree
instance: a row/snapshot of the features of the dataset. The row does not need to contain the label.
returns: Any datatype (Int or Float or String etc.) depending on the datatype of the class, the predicted class

Step 12: Evaluating test dataset

To evaluate the model, i.e. the tree, we need a labeled test dataset. After predicting, we can calculate how often the predicted value matches the actual value, expressed as a fraction (the accuracy).
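A sketch matching the method description below (the name evaluate is assumed; the label column name is passed explicitly here as an extra parameter):

def evaluate(tree, test_data_m, label):
    correct_predict = 0
    wrong_predict = 0

    for index, row in test_data_m.iterrows():  # predict every row of the test dataset
        result = predict(tree, row)            # predicted class (Step 11)
        if result == row[label]:               # compare with the actual class
            correct_predict += 1
        else:
            wrong_predict += 1

    return correct_predict / (correct_predict + wrong_predict)  # accuracy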

Method description:
Evaluates the accuracy of an id3 tree by testing against the expected results.
tree: dictionary (of dictionaries), a decision tree
test_data_m: a pandas dataframe/test dataset
returns: float, the accuracy of the tree

Step 13: Checking test dataset and Evaluating it

The test dataset looks like below:

The test dataset

Link of the test dataset: https://www.kaggle.com/tareqjoy/testplaytennis
Here, the Play Tennis label acts as the expected output, which will be compared with the predicted output.
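A sketch of loading the test dataset and evaluating the tree (the file name PlayTennis_Test.csv is an assumption; use the CSV from the test dataset link above):

import pandas as pd

test_data_m = pd.read_csv("PlayTennis_Test.csv")  # load the labeled test dataset

accuracy = evaluate(tree, test_data_m, 'Play Tennis')
print(accuracy)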

If we print the accuracy, we will get 1.0, which means our predictions are 100% correct. But on a larger dataset, the accuracy may not be this high.
