Decision Tree — Implementation From Scratch in Python.

To make a decision … ask a tree.

ranga_vamsi
The Startup
9 min read · Jul 14, 2020


Photo by Oliver Roos on Unsplash

Introduction

The Decision Tree algorithm belongs to the family of non-parametric supervised Machine Learning algorithms used for solving both regression and classification tasks. It is commonly referred to as Classification and Regression Tree (CART).

As per Wikipedia, A decision tree is a flowchart-like structure in which each internal node represents a “test” on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents the outcome of the test, and each leaf node represents a class label (decision taken after computing all attributes). The paths from the root to the leaf represent classification rules.

Basic Flow Design of Decision Tree

The algorithm builds the decision tree by breaking the data down into increasingly homogeneous partitions using recursive binary partitioning. The most discriminative feature is selected first, as the root node, to partition the data into branch nodes by learning decision rules. The final result is a tree with internal (decision) nodes and leaf nodes. Decision nodes have two or more branches representing attribute (or feature) values, each branch represents a decision rule, and each leaf node corresponds to a class label (classification) or a decision value (regression).

How does the tree make decisions?

If the dataset consists of N features, how does the tree decide which attribute to place at the root node or at the internal nodes, and how does it arrive at the leaf nodes?

Attribute Selection Measures

An attribute selection measure is a heuristic for selecting the splitting criterion that partitions the data in the best possible way. Such measures are also known as decision rules or splitting rules. Common selection measures include Information Gain, Gain Ratio, Gini Index, Reduction in Variance, and Chi-Square.

1. Information Gain(IG)

Entropy measures the impurity, disorder, or uncertainty in the data. It is the quantity the decision tree uses to control its split decisions.

Mathematically,

Info(D) = − Σ pi · log₂(pi)    (summing over the m classes Ci in D)

Information Gain is a statistical property that measures how much information a feature gives about the class. It is the decrease in entropy produced by a split: the difference between the entropy before the split and the weighted average entropy after the dataset is split on a given attribute value.

Mathematically,

InfoA(D) = Σ (|Dj|/|D|) · Info(Dj)    (summing over the v partitions of D induced by attribute A)
Gain(A) = Info(D) − InfoA(D)
i.e., Information Gain = entropy(parent) − weighted average of entropy(children)

Where,

  • pi is the probability that an arbitrary tuple in D belongs to class Ci.
  • Info(D) is the mean amount of information required to identify the class of a tuple in D.
  • |Dj|/|D| is the weight of the jth split.
  • v is the number of discrete values in attribute A.
  • InfoA(D) is the expected information required to classify a tuple from D based on the partitioning by attribute A.

The attribute (A) with the highest Information Gain (Gain(A)) is chosen as the splitting attribute.

The ID3 (Iterative Dichotomiser 3) decision tree algorithm uses Information Gain.

2. Gain Ratio

Information Gain is biased towards choosing attributes with many distinct values. Gain Ratio overcomes this problem by normalizing the Information Gain with the Split Information.

Mathematically,

SplitInfoA(D) = − Σ (|Dj|/|D|) · log₂(|Dj|/|D|)    (Split Information for attribute A)
GainRatio(A) = Gain(A) / SplitInfoA(D)    (Gain Ratio for attribute A)

The attribute (A) with the highest Gain Ratio (GainRatio(A)) is chosen as the splitting attribute.

C4.5, an improvement of ID3, uses the Gain ratio.

3. Gini Index

Gini Index is calculated by subtracting the sum of the squared probabilities of each class from one. Mathematically,

Gini(D) = 1 − Σ pi²    where pi is the probability that a tuple in the dataset belongs to class Ci.

The Gini Index performs a binary split for each attribute. It favors larger partitions whereas information gain favors smaller partitions with distinct values.

Mathematically,

GiniA(D) = (|D1|/|D|) · Gini(D1) + (|D2|/|D|) · Gini(D2)    (Gini index of D, if a binary split on attribute A partitions D into D1 and D2)
ΔGini(A) = Gini(D) − GiniA(D)    (reduction in impurity offered by attribute A)

The attribute (A) with the minimum Gini index GiniA(D), i.e. the maximum reduction in impurity ΔGini(A), is chosen as the splitting attribute.

CART (Classification and Regression Tree) uses the Gini method.

Decision Tree Classification algorithm

I would like to walk you through a simple example along with the Python code.

Step 1: We start by importing the dataset and the necessary dependencies

We will be using the weather dataset for the training phase. This dataset includes the features [Outlook, Temp, Humidity, Windy] and the corresponding target variable ‘Play’. We need to predict whether the players will play or not based on the given weather conditions.

Loading Dataset
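A minimal sketch of this step, assuming the data lives in a CSV file named weather.csv (the file name and the pandas-based loading are my assumptions, not necessarily the author's exact code):

```python
import numpy as np
import pandas as pd

# Load the weather dataset; columns: Outlook, Temp, Humidity, Windy, Play.
df = pd.read_csv("weather.csv")   # assumed file name
print(df.head())
```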

Step 2: Build the Decision Tree

We will be using Information Gain as the attribute selection measure for partitioning the dataset. We need to go through each feature/column and check which feature gives the best split.

2.1 Find the attribute with the highest information gain.
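A hedged sketch of how this check could look, treating every feature as categorical for simplicity (entropy and info_gain are illustrative helper names, not the author's exact functions):

```python
import numpy as np
import pandas as pd

df = pd.read_csv("weather.csv")   # the DataFrame from Step 1 (assumed file name)

def entropy(target: pd.Series) -> float:
    """Info(D): impurity of the class distribution in this partition."""
    probs = target.value_counts(normalize=True)
    return float(-(probs * np.log2(probs)).sum())

def info_gain(data: pd.DataFrame, feature: str, target: str = "Play") -> float:
    """Gain(A) = Info(D) - weighted average entropy of the children after splitting on A."""
    parent_entropy = entropy(data[target])
    weighted_child_entropy = sum(
        (len(subset) / len(data)) * entropy(subset[target])
        for _, subset in data.groupby(feature)
    )
    return parent_entropy - weighted_child_entropy

# Compute the gain of every candidate feature and keep the best one.
features = ["Outlook", "Temp", "Humidity", "Windy"]
gains = {feature: info_gain(df, feature) for feature in features}
best_feature = max(gains, key=gains.get)
print(gains, "->", best_feature)
```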

Now, we can see that the attribute Outlook has the highest information gain.

2.2 Make the attribute with the highest information gain as a decision node and split the dataset accordingly.

Now, we make the attribute ‘Outlook’ a decision node and partition the dataset into smaller subsets.
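A possible sketch of this partitioning step, continuing from the previous snippet (partition is an illustrative helper name; the features are still treated as categorical):

```python
def partition(data: pd.DataFrame, feature: str) -> dict:
    """Split the data into one child subset per distinct value of the feature,
    dropping the used feature so it is not reused further down that branch."""
    return {value: subset.drop(columns=[feature])
            for value, subset in data.groupby(feature)}

children = partition(df, best_feature)   # best_feature is "Outlook" here
for value, subset in children.items():
    print(value, "->", list(subset["Play"].unique()))
```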

Since node ‘Child-1’ has all rows with the same target value, we can consider Child-1 a pure node, or leaf node.
Nodes Child-2 and Child-3 are impure since they consist of rows with mixed target values. So, we can either proceed to partition these nodes further until we get pure nodes, or we can specify stopping parameters such as max_depth (the maximum depth to which you want the tree to grow), min_sample_leaf (the minimum number of samples each node should have), and so on.

2.3 Repeat Steps 2.1 and 2.2 recursively until we get leaf nodes.

We can consider a node as a leaf node, when

  • we have all tuples in the node with the same target value, or
  • there are no remaining attributes, or
  • the tree we are building reaches its maximum depth, or
  • the count of the samples is less than the minimum sample count, or

we can also define many other termination criteria for leaf nodes; a sketch of the recursive build step follows below.
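Below is a sketch of that recursive build, reusing the info_gain and partition helpers from Steps 2.1 and 2.2 and the leaf conditions listed above (the max_depth and min_sample_leaf defaults are arbitrary, and the exact implementation is an assumption rather than the author's code):

```python
def build_tree(data, target="Play", depth=0, max_depth=5, min_sample_leaf=1):
    """Recursively grow the tree until one of the leaf conditions above is met."""
    # Leaf conditions: pure node, no attributes left, depth limit, or too few samples.
    if (data[target].nunique() == 1
            or data.shape[1] == 1
            or depth >= max_depth
            or len(data) <= min_sample_leaf):
        return data[target].mode()[0]          # majority class becomes the leaf value

    features = [col for col in data.columns if col != target]
    best = max(features, key=lambda f: info_gain(data, f, target))
    return {best: {value: build_tree(child, target, depth + 1, max_depth, min_sample_leaf)
                   for value, child in partition(data, best).items()}}

tree = build_tree(df)
print(tree)   # nested dictionaries: {feature: {value: subtree or leaf label}}
```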

After Step 2, our decision tree looks like this:

Step 3. Start Making Predictions

Prediction in a decision tree simply means following the path in the constructed tree from the root node to a leaf node, obeying the decision rules at the internal nodes. Once you arrive at the leaf node, return the value of that node.

Suppose we want to predict whether the players will play or not when the weather conditions are [Outlook=Rainy, Temp=35.5, Humidity=Normal, Windy=t].
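A sketch of that lookup, assuming the nested-dictionary tree produced by the build sketch above (the behaviour for branch values never seen in training is my own simplification):

```python
def predict(tree, row: dict):
    """Walk from the root to a leaf, following the branch that matches the row."""
    node = tree
    while isinstance(node, dict):
        feature = next(iter(node))        # the attribute tested at this decision node
        branches = node[feature]
        if row.get(feature) not in branches:
            return None                   # value never seen in training: no rule to follow
        node = branches[row[feature]]
    return node                           # leaf value, e.g. "Yes" or "No"

sample = {"Outlook": "Rainy", "Temp": 35.5, "Humidity": "Normal", "Windy": "t"}
print(predict(tree, sample))
```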

Our decision tree predicted that the players will not play in the given weather conditions.

Now, we have everything we need for this beautiful algorithm. Let’s put everything together!

Assemble Code Branches

Decision Tree Classification
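Since the original listing is embedded as a gist, here is a compact, self-contained sketch that assembles the same pieces into one class; the class and method names are illustrative, not the author's exact API:

```python
import numpy as np
import pandas as pd

class DecisionTreeClassifierScratch:
    """ID3-style classifier assembled from the helpers sketched in Step 2."""

    def __init__(self, max_depth=5, min_sample_leaf=1):
        self.max_depth = max_depth
        self.min_sample_leaf = min_sample_leaf
        self.tree_ = None

    @staticmethod
    def _entropy(target):
        probs = target.value_counts(normalize=True)
        return float(-(probs * np.log2(probs)).sum())

    def _info_gain(self, data, feature, target):
        parent = self._entropy(data[target])
        children = sum((len(g) / len(data)) * self._entropy(g[target])
                       for _, g in data.groupby(feature))
        return parent - children

    def _build(self, data, target, depth):
        # Leaf conditions: pure node, no features left, depth or size limit reached.
        if (data[target].nunique() == 1 or data.shape[1] == 1
                or depth >= self.max_depth or len(data) <= self.min_sample_leaf):
            return data[target].mode()[0]
        features = [c for c in data.columns if c != target]
        best = max(features, key=lambda f: self._info_gain(data, f, target))
        return {best: {v: self._build(g.drop(columns=[best]), target, depth + 1)
                       for v, g in data.groupby(best)}}

    def fit(self, data, target="Play"):
        self.tree_ = self._build(data, target, depth=0)
        return self

    def predict_row(self, row):
        node = self.tree_
        while isinstance(node, dict):
            feature = next(iter(node))
            node = node[feature].get(row.get(feature))   # None if the value was never seen
        return node

# Usage (weather.csv is an assumed file name):
# model = DecisionTreeClassifierScratch().fit(pd.read_csv("weather.csv"))
# print(model.tree_)
# print(model.predict_row({"Outlook": "Rainy", "Temp": 35.5, "Humidity": "Normal", "Windy": "t"}))
```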

Running the assembled code prints the constructed tree and the prediction for the sample weather conditions above.

Decision Tree Regression Algorithm

We can start by initializing a class and importing the dataset and the necessary dependencies.
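A possible skeleton for that setup; the hyper-parameter names mirror the ones mentioned later in this section, but the exact signature and default values are assumptions:

```python
import numpy as np
import pandas as pd

class DecisionTreeRegressorScratch:
    """Skeleton of the from-scratch regressor described in this section."""

    def __init__(self, max_depth_tree=5, min_sample_leaf=5, min_cv=0.1):
        self.max_depth_tree = max_depth_tree    # maximum depth of the tree
        self.min_sample_leaf = min_sample_leaf  # minimum number of samples in each node
        self.min_cv = min_cv                    # minimum coefficient of variation of a node
        self.tree_ = None

# The training data can be any DataFrame with numeric features and a numeric target, e.g.:
# df = pd.read_csv("some_regression_dataset.csv")   # assumed file name
```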

Then we can define the _find_best_split method to show how the split is done:
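A hedged sketch of what such a method could look like; it belongs inside the class initialized above, and the scoring it relies on is detailed in the next two methods:

```python
def _find_best_split(self, data, target):
    """Try every candidate feature and keep the one with the lowest split score."""
    best_feature, best_cutoff, best_score = None, None, float("inf")
    for feature in [col for col in data.columns if col != target]:
        score, cutoff = self._find_feature_split(data, feature, target)
        if score < best_score:              # let the best split win
            best_feature, best_cutoff, best_score = feature, cutoff, score
    return best_feature, best_cutoff, best_score
```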

In the above method, we try every candidate feature to split on and let the best split win.

We use the method _find_feature_split to get the split score and the cutoff value (for numerical columns) for each feature.

We want to split on the feature and cutoff that leave the target with a low standard deviation. The method _find_score returns the weighted sum of the standard deviations of the left split and the right split, and we pick the cutoff value that gives us the minimum score.
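A sketch of those two helpers under the same assumptions (numeric features, standard-deviation scoring); both are methods of the class sketched earlier:

```python
def _find_feature_split(self, data, feature, target):
    """Scan candidate cutoff values for one feature and return (best score, best cutoff)."""
    best_score, best_cutoff = float("inf"), None
    for cutoff in data[feature].unique():
        score = self._find_score(data, feature, target, cutoff)
        if score < best_score:
            best_score, best_cutoff = score, cutoff
    return best_score, best_cutoff

def _find_score(self, data, feature, target, cutoff):
    """Weighted sum of the target's standard deviation in the left and right splits."""
    left = data[data[feature] <= cutoff]
    right = data[data[feature] > cutoff]
    if left.empty or right.empty:
        return float("inf")                 # degenerate split: never chosen
    n = len(data)
    # ddof=0 (population std) avoids NaN scores on single-row splits.
    return ((len(left) / n) * left[target].std(ddof=0)
            + (len(right) / n) * right[target].std(ddof=0))
```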

Let’s look at the actual logic for building a Decision Tree!

After finding the best feature to split the dataset into subsets, we recursively find the best split on each subset until we are left with nodes that cannot be partitioned any further.

Here, we use max_depth_tree (the maximum depth of the tree), min_sample_leaf (the minimum number of samples in each node), and a minimum coefficient of variation as the termination conditions for a node to become a leaf node.
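A sketch of that recursive build under those termination conditions; the dictionary-based node representation and the exact checks are my assumptions rather than the author's implementation:

```python
def _build_tree(self, data, target, depth=0):
    """Recursively split the data until a termination condition makes the node a leaf."""
    mean = data[target].mean()
    std = data[target].std(ddof=0)
    cv = std / abs(mean) if mean != 0 else 0.0       # coefficient of variation
    if (depth >= self.max_depth_tree
            or len(data) <= self.min_sample_leaf
            or cv <= self.min_cv):
        return {"leaf": True, "value": mean}          # a leaf predicts the mean target

    feature, cutoff, score = self._find_best_split(data, target)
    if feature is None:                               # no usable split was found
        return {"leaf": True, "value": mean}

    left = data[data[feature] <= cutoff]
    right = data[data[feature] > cutoff]
    return {"leaf": False, "feature": feature, "cutoff": cutoff,
            "left": self._build_tree(left, target, depth + 1),
            "right": self._build_tree(right, target, depth + 1)}

def fit(self, data, target):
    self.tree_ = self._build_tree(data, target)
    return self
```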

Finally, let's look at the prediction method:
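A sketch of prediction for the dictionary-based tree above: follow the cutoff rules from the root down to a leaf and return that leaf's mean value (method names are illustrative):

```python
def predict_row(self, row):
    """Follow the cutoff rules from the root down to a leaf and return its mean value."""
    node = self.tree_
    while not node["leaf"]:
        branch = "left" if row[node["feature"]] <= node["cutoff"] else "right"
        node = node[branch]
    return node["value"]

def predict(self, data):
    """Predict every row of a feature DataFrame."""
    return np.array([self.predict_row(row) for _, row in data.iterrows()])
```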

Let’s put everything together

Assemble Code Branches

Decision Tree Regressor Algorithm
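A tiny end-to-end usage sketch on synthetic data, assuming the class and methods sketched above have been assembled into one file (the data and numbers are made up purely for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.uniform(0, 10, 200),
                   "x2": rng.uniform(0, 5, 200)})
df["y"] = 3.0 * df["x1"] + rng.normal(0.0, 1.0, 200)   # synthetic numeric target

model = DecisionTreeRegressorScratch(max_depth_tree=4, min_sample_leaf=5, min_cv=0.05)
model.fit(df, target="y")
print(model.predict(df.drop(columns="y").head()))
```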

You can also get the complete code for both classification and regression from my GitHub repo.

Leaf Segment

Had fun implementing a decision tree from scratch? I hope this blog helps you understand how to build a decision tree classifier and regressor from scratch in Python. If anything is unclear, if I missed something, if something was inaccurate, or if you have any other feedback, please let me know. And 💛 this if it was a good read.

Thanks for your time :)

Cheers !!

Happy Learning 😃

