Decision Tree Classifier from Scratch

Prudvi RajKumar · Published in The Startup · May 19, 2019

A binary Decision Tree classifier implementation from scratch in Python

Complete working code can be found at https://github.com/prudvinit/MyML/blob/master/lib/tree/DecisionTreeClassifier.py

In this post, I'm going to write a binary Decision Tree classifier from scratch in Python. I'm not going to use any external machine learning libraries like scikit-learn. This is a simple brute-force implementation; I haven't tried to optimize it. The goal of this post is not to teach Decision Trees, as there are plenty of resources available online, and I assume the reader is fairly familiar with them. Here I'm only going to give a walkthrough of my implementation.

We need some data to get started. I'm going to use sample data taken from https://medium.com/@rakendd/decision-tree-from-scratch-9e23bcfb4928.

dataset = {'Taste':['Salty','Spicy','Spicy','Spicy','Spicy','Sweet','Salty','Sweet','Spicy','Salty'],
           'Temperature':['Hot','Hot','Hot','Cold','Hot','Cold','Cold','Hot','Cold','Hot'],
           'Texture':['Soft','Soft','Hard','Hard','Hard','Soft','Soft','Soft','Soft','Hard'],
           'Eat':[False, False, True, False, True, True, False, True, True, True]}
Kids' eating preferences

This is the dummy data that I've taken from the other post. It represents the eating preferences of 10 kids. I'm going to build a Decision Tree classifier on this data to predict whether a kid will eat a food with the given attributes (Taste, Temperature and Texture). The output attribute is 'Eat'. Note that all the data in this dataset is categorical. A Decision Tree can also be built with continuous data, but for the sake of simplicity I'm using only categorical attributes.

Let’s add all necessary import statements first

import pandas as pd
import numpy as np
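
The tree code below assumes the data is a pandas DataFrame (it relies on data.columns, data[feature].unique() and boolean indexing), so here is a minimal sketch of converting the dict above; the name df is just my choice for these examples:

df = pd.DataFrame(dataset)   # 10 rows; columns: Taste, Temperature, Texture, Eat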

A tree is built from nodes, right? So, let’s define our Node class first.

class Node:
    def __init__(self):
        self.split = None            # feature this node splits on
        self.feature_values = None   # values routed to the left subtree
        self.leaf = False
        self.left = None
        self.right = None
        self.result = None
        self.gini = 0
        self.data = None
        self.level = 0

    def __repr__(self):
        return '[' + str(self.split) + ' ' + str(self.feature_values) + ']'

There are quite a few attributes. Below is an explanation of each:

split : The feature on which the node is split. E.g., a node can be split on Taste, Texture or Temperature.

feature_values : The values to be considered while splitting the node. E.g., if the split is based on the feature 'Taste', feature_values can be ['Sweet', 'Salty']. So, if the feature is 'Taste' and feature_values is ['Sweet', 'Salty'], then all the foods whose taste is either Sweet or Salty go to the left subtree, and all the foods with taste 'Spicy' go to the right subtree.

leaf : Tells whether the node is a leaf node or not. A leaf node is a node that does not have any child nodes.

left and right : The left and right child nodes. Both are None for a leaf node.

result : Tells whether the output is True or False at this node. It is used when predicting the output value for a new sample.

gini : In this tutorial, I'm using the Gini index to decide the best split. This attribute holds the value of the Gini index at this node.

data : The data at this node.

level : The depth of this node in the tree.

Now, let’s see the DecisionTree class

class DecisionTree:
    #takes data and target column
    def __init__(self,data,target='Eat',depth=None,v=0):
        self.data = data
        self.target = target
        self.root = None
        self.depth = depth
        self.v = v

    #This helper function calculates gini index of the dataset
    @staticmethod
    def giniIndex(dset,target):
        return 1-(len(dset[dset[target]==True])/len(dset))**2-(len(dset[dset[target]==False])/len(dset))**2

    #This method builds the Decision tree
    def build(self,data,level=0):
        if self.v==1:
            print('======================================')
            print('Building Tree for the data')
            print('======================================')
            print(data)
        left_dataset = None
        right_dataset = None
        min_gini = np.inf
        best_feature = None
        feature_values = None

        gini = len(data)*DecisionTree.giniIndex(data,self.target)
        if self.v==1:
            print('GiniIndex for total data = ',gini)
        node = Node()
        node.data = data
        node.level = level
        node.gini = gini

        #If max depth is reached, make it a leaf node
        if self.depth is not None and level==self.depth:
            node.leaf = True
            node.result = len(data[data[self.target]==True])>=len(data[data[self.target]==False])
            return node

        #If there is no impurity in this data, make it a leaf node
        if gini==0:
            if self.v==1:
                print('The data is pure, no split is needed')
            node.gini = 0
            node.leaf = True
            node.result = len(data[data[self.target]==True])>=len(data[data[self.target]==False])
            return node

        for feature in data.columns:
            #We need not split on the target column
            if feature == self.target:
                continue

            #Find all unique values for the feature
            unique = data[feature].unique()
            if self.v==1:
                print('________________________________________________________')
                print('Evaluating all possible splits for the feature "',feature,'"')
                print('________________________________________________________')
                print('All the values for this feature are ',unique)

            #Initialize gini, left and right datasets and best feature values
            tmngini = np.inf
            tldset = None
            trdset = None
            tbftr = None
            #We can't split on a single value, there must be at least 2 unique values to be able to split
            if len(unique)==1:
                print('Ignoring this feature as it has only a single unique value')
                continue

            #To find the best values for a split on the given feature
            for st in range(1,2**len(unique)-1):
                #Pick the subset of values whose bit is set in st
                lvals = [unique[x] for x in [t[0] for t in enumerate(list(bin(st)[2:])[::-1]) if t[1]=='1']]
                #Find left data set
                lset = data[data[feature].isin(lvals)]
                rvals = list(set(unique)-set(lvals))
                #Find right data set
                rset = data[data[feature].isin(rvals)]
                #Avoid dealing with duplicate (mirror-image) splits
                if len(lvals)>len(rvals):
                    continue
                #Find gini index for the left split
                lgini = DecisionTree.giniIndex(lset,self.target)
                #Find gini impurity for the right split
                rgini = DecisionTree.giniIndex(rset,self.target)
                #Find the total weighted gini
                tgini = len(lset)*lgini+len(rset)*rgini
                if self.v==1:
                    print(' ******************************************')
                    print(' ***** split based on values ',lvals,'*****')
                    print(' ******************************************')
                    print('-----------------------')
                    print('Left dataset')
                    print('-----------------------')
                    print(lset)
                    print('-----------------------')
                    print('Right dataset')
                    print('-----------------------')
                    print(rset)
                    print('Weighted Gini for this split ',tgini)
                #Update the minimum gini for this feature
                if tgini<tmngini:
                    tmngini = tgini
                    tldset = lset
                    trdset = rset
                    tbftr = lvals
            if self.v==1:
                print('Best gini for ',feature,' is ',tmngini)

            #Update the overall minimum gini
            if tmngini<min_gini:
                min_gini = tmngini
                left_dataset = tldset
                right_dataset = trdset
                feature_values = tbftr
                best_feature = feature

        #No valid split was found, make it a leaf node
        if best_feature is None:
            node.leaf = True
            node.result = len(data[data[self.target]==True])>=len(data[data[self.target]==False])
            return node

        node.min_gini = min_gini
        node.feature_values = feature_values
        node.split = best_feature
        if self.v==1:
            print('Best split is "',best_feature,'" values are ',feature_values,' and GiniIndex is ',min_gini)

        #Build tree for the left dataset
        if left_dataset is not None:
            node.left = self.build(left_dataset,level+1)

        #Build tree for the right dataset
        if right_dataset is not None:
            node.right = self.build(right_dataset,level+1)

        #If neither subtree was built, this has to be a leaf
        if node.left is None and node.right is None:
            node.leaf = True

        return node

    def fit(self):
        self.root = self.build(self.data)
        return self.root

    def __predict__(self,s,root):
        if root is None:
            return False
        if root.leaf:
            return root.result
        if s[root.split] in root.feature_values:
            return self.__predict__(s,root.left)
        return self.__predict__(s,root.right)

    def predict(self,s):
        return self.__predict__(s,self.root)

I'll try to explain each method below.

def __init__(self,data,target='Eat',depth=None,v=0):
    self.data = data
    self.target = target
    self.root = None
    self.depth = depth
    self.v = v

This is the constructor. We initialize the tree with the data and the target variable; in this case, our target variable is 'Eat'. depth is the maximum allowed depth while building the tree, which is used to avoid overfitting. v stands for verbose: if we want to see the tree-building process, we pass v=1.
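
As a quick usage sketch (assuming the DataFrame df built earlier from the sample dict; the name tree is mine):

tree = DecisionTree(df, target='Eat', v=0)
root = tree.fit()   # builds the tree and returns the root node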

@staticmethod
def giniIndex(dset,target):
    return 1-(len(dset[dset[target]==True])/len(dset))**2-(len(dset[dset[target]==False])/len(dset))**2

This is the helper function used to calculate the Gini index of a dataset. We'll use the Gini index to find the best split. The formula for the Gini index is:

Gini = 1 − Σ p_i² (summed over the C classes), where p_i is the fraction of samples belonging to class i. (Formula image taken from http://www.learnbymarketing.com/481/decision-tree-flavors-gini-info-gain/)

C represents no. of classes. In our case C = 2 since it is a binary classifier.
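
As a quick sanity check, here is a sketch of the Gini index of the full sample dataset, which has 6 True and 4 False values in the 'Eat' column (again using the hypothetical df from earlier):

p_true = len(df[df['Eat'] == True]) / len(df)    # 6/10
p_false = len(df[df['Eat'] == False]) / len(df)  # 4/10
gini = 1 - p_true**2 - p_false**2                # 1 - 0.36 - 0.16 = 0.48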

The next method is the tree-building method. It's a bit large, so I'll give a high-level walkthrough of it.

left_dataset = None
right_dataset = None
min_gini = np.inf
best_feature = None
feature_values = None

Here I'm initializing variables for the left dataset (left_dataset), the right dataset (right_dataset), the best gini index (min_gini), the best feature (best_feature) and its feature values (feature_values).

gini = len(data)*DecisionTree.giniIndex(data,self.target)

Find the Gini index for the entire dataset, weighted by the number of rows (the same weighting is used later when comparing candidate splits).

node = Node()
node.data = data
node.level = level
node.gini = gini

Create a new node, set data, level and gini index

if self.depth is not None and level==self.depth:
node.leaf = True
node.result = len(data[data[self.target]==True]) >= len(data[data[self.target]==False])
return node

If the max depth is reached, we make this node a leaf. Here, result is True if there are more True-valued targets than False-valued targets, and False otherwise. This will be used when predicting the target for new data: while evaluating the target value for a new sample, the predicted value is simply the 'result' value of the leaf node we end up in.
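
For example, to cap the tree at two levels of splits (a sketch using the hypothetical df name from earlier):

shallow_tree = DecisionTree(df, target='Eat', depth=2)
shallow_tree.fit()   # any node that reaches level 2 becomes a leaf via this majority-vote rule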

#If there is no impurity in this data, make it a leaf node
if gini==0:
node.leaf = True
node.result = len(data[data[self.target]==True])>=len(data[data[self.target]==False])
return node

If the Gini index of the data is 0, the data is pure: all the target values are the same, so the node has to be a leaf since there is nothing useful left to split. We also set the result based on the count of target values: if there are more True-valued targets, result is set to True; otherwise it is set to False.

========================================

I think the snippet below is the most complicated of all.

To evaluate the best split, we need to iterate through all features.

for feature in data.columns:
    #We need not split on the target column
    if feature == self.target:
        continue

If feature name is same as the target, we can skip it.

unique = data[feature].unique()

Find all the unique values for a feature. E.g., data['Taste'].unique() gives ['Salty','Spicy','Sweet']. Now we need to evaluate which split will be the best by iterating through all the splits. There are 6 possible splits for the Taste feature.

They are ['Salty'], ['Spicy'], ['Sweet'], ['Salty' or 'Spicy'], ['Salty' or 'Sweet'] and ['Spicy' or 'Sweet']. We need to evaluate these 6 splits and find the best one.
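
The implementation enumerates these subsets with a bitmask trick: each integer st from 1 to 2^n − 2 selects the values whose bit is set. A standalone sketch of just that enumeration (assuming the same three Taste values):

unique = ['Salty', 'Spicy', 'Sweet']
for st in range(1, 2**len(unique) - 1):
    # bin(st)[2:] is the binary string; reverse it so bit i maps to unique[i]
    bits = list(bin(st)[2:])[::-1]
    lvals = [unique[i] for i, b in enumerate(bits) if b == '1']
    print(st, lvals)
# 1 ['Salty']
# 2 ['Spicy']
# 3 ['Salty', 'Spicy']
# 4 ['Sweet']
# 5 ['Salty', 'Sweet']
# 6 ['Spicy', 'Sweet']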

#Initialize gini, left and right datasets and best feature values
tmngini = np.inf
tldset = None
trdset = None
tbftr = None

Initialize temporary variables to store gini index, left data set, right data set and feature values of best split

lvals = [unique[x] for x in [t[0] for t in enumerate(list(bin(st)[2:])[::-1]) if t[1]=='1']]
#Find left data set
lset = data[data[feature].isin(lvals)]
rvals = list(set(unique)-set(lvals))
#Find right data set
rset = data[data[feature].isin(rvals)]
#Avoid dealing with duplicate sets
if len(lvals)>len(rvals):
continue
#Find gini index for left split
lgini = DecisionTree.giniIndex(lset,self.target)
#Find gini impurity for right split
rgini = DecisionTree.giniIndex(rset,self.target)
#Find the total weighted gini.
tgini = len(lset)*lgini+len(rset)*rgini

For all the available feature value subsets, split the data into left data and right data. Find the weighted gini index by combining gini indices for both left and right subsets.

Find the feature subset that gives the minimum weighted Gini. In the given example, the Taste feature with feature_values ['Sweet'] gives the best split, with a weighted Gini of 4.0 (the left subset is pure). By best, I mean the minimum weighted Gini.
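
Here is a sketch of that calculation for the ['Sweet'] split, worked out on the sample data (a standalone snippet reusing the giniIndex helper and the hypothetical df):

lset = df[df['Taste'].isin(['Sweet'])]        # 2 rows, both Eat == True
rset = df[~df['Taste'].isin(['Sweet'])]       # 8 rows, 4 True and 4 False
lgini = DecisionTree.giniIndex(lset, 'Eat')   # 0.0 (pure)
rgini = DecisionTree.giniIndex(rset, 'Eat')   # 0.5
tgini = len(lset)*lgini + len(rset)*rgini     # 2*0.0 + 8*0.5 = 4.0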

#Build tree for left dataset
if left_dataset is not None:
node.left = self.build(left_dataset,level+1)

#Build tree for right dataset
if right_dataset is not None:
node.right = self.build(right_dataset,level+1)

Now, repeat tree building process for left dataset and right data set recursively.

def __predict__(self,s,root):
if root is None:
return False
if root.leaf:
return root.result
if s[root.split] in root.feature_values:
return self.__predict__(s,root.left)
return self.__predict__(s,root.right)

This is the predict method that predicts the target value for new data. We simply traverse from the root of the tree based on each node's split and feature_values. Once we hit a leaf node, the 'result' variable of that leaf gives our prediction.
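
For example, a quick sketch of predicting for a new, unseen food (the sample values here are my own, not from the post):

sample = {'Taste': 'Sweet', 'Temperature': 'Cold', 'Texture': 'Hard'}
print(tree.predict(sample))   # returns True or False, depending on the learned tree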

========================================

I tried to find the accuracy by feeding the tree the same dataset that I trained it with (resubstitution error). I got 100% accuracy, because I overfit the model by growing the tree down to the deepest leaf nodes. This clearly must not be done in practice; my only intention is to show that the tree has been built successfully.

screenshot of the output
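
A sketch of that resubstitution check, using the hypothetical df and tree names from the earlier snippets:

correct = sum(tree.predict(row) == row['Eat'] for _, row in df.iterrows())
print('Accuracy:', correct / len(df))   # 1.0 on the training data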

code link : https://github.com/prudvinit/MyML/blob/master/lib/tree/DecisionTreeClassifier.py
