Decision Tree Classifier from Scratch

Prudvi RajKumar · Published in The Startup · May 19, 2019

A binary Decision Tree classifier implementation from scratch in Python

Complete working code can be found at https://github.com/prudvinit/MyML/blob/master/lib/tree/DecisionTreeClassifier.py

In this post, I'm going to write a binary Decision Tree classifier from scratch in Python. I'm not going to use any external machine learning libraries like scikit-learn. This is a simple brute-force implementation; I haven't tried to optimize it. The goal of this post is not to teach Decision Trees, as there are plenty of resources available online, and I assume the reader is fairly familiar with them. Here I'm only going to give a walkthrough of my implementation.

We need some data to get started. I'm going to use sample data taken from https://medium.com/@rakendd/decision-tree-from-scratch-9e23bcfb4928.

dataset = {'Taste':['Salty','Spicy','Spicy','Spicy','Spicy','Sweet','Salty','Sweet','Spicy','Salty'],
           'Temperature':['Hot','Hot','Hot','Cold','Hot','Cold','Cold','Hot','Cold','Hot'],
           'Texture':['Soft','Soft','Hard','Hard','Hard','Soft','Soft','Soft','Soft','Hard'],
           'Eat':[False, False, True, False, True, True, False, True, True, True]}
Kids' eating preferences

This is the dummy data that I've taken from the other post. It represents the eating preferences of 10 kids. I'm going to build a Decision Tree classifier on this data to predict whether a kid will eat a food with the given attributes (Taste, Temperature and Texture). The output attribute is 'Eat'. Note that all the data in this dataset is categorical. A Decision Tree can also be built with continuous data, but for the sake of simplicity I'm using only categorical attributes.

Let’s add all necessary import statements first

import pandas as pd
import numpy as np
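
The tree code below assumes the data is a pandas DataFrame (it relies on data.columns, data[feature].unique() and boolean indexing), so here is a minimal sketch of converting the dict above; the name df is just my choice for these examples:

df = pd.DataFrame(dataset)   # 10 rows; columns: Taste, Temperature, Texture, Eat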

A tree is built from nodes, right? So, let’s define our Node class first.

class Node:
    def __init__(self):
        self.split = None            # feature this node splits on
        self.feature_values = None   # values routed to the left subtree
        self.leaf = False
        self.left = None
        self.right = None
        self.result = None
        self.gini = 0
        self.data = None
        self.level = 0

    def __repr__(self):
        return '[' + str(self.split) + ' ' + str(self.feature_values) + ']'

There are quite a few attributes. Below is an explanation of each:

split : The feature on which the node is split. E.g., a node can be split on Taste, Texture or Temperature.

feature_values : The values to be considered while splitting the node. E.g., if the split is based on the feature 'Taste', feature_values can be ['Sweet', 'Salty']. So, if the feature is 'Taste' and feature_values is ['Sweet', 'Salty'], then all the foods whose taste is either Sweet or Salty go to the left subtree, and all the foods with taste 'Spicy' go to the right subtree.

leaf : Tells whether the node is a leaf node or not. A leaf node is a node that does not have any child nodes.

left and right : The left and right child nodes. Both are None for a leaf node.

result : Tells whether the output is True or False at this node. It is used when predicting the output value for a new sample.

gini : In this tutorial, I'm using the Gini index to decide the best split. This attribute holds the value of the Gini index at this node.

data : The data at this node.

level : The depth of this node in the tree.

Now, let’s see the DecisionTree class

class DecisionTree:
    #takes data and target column
    def __init__(self,data,target='Eat',depth=None,v=0):
        self.data = data
        self.target = target
        self.root = None
        self.depth = depth
        self.v = v

    #This helper function calculates gini index of the dataset
    @staticmethod
    def giniIndex(dset,target):
        return 1-(len(dset[dset[target]==True])/len(dset))**2-(len(dset[dset[target]==False])/len(dset))**2

    #This method builds the Decision tree
    def build(self,data,level=0):
        if self.v==1:
            print('======================================')
            print('Building Tree for the data')
            print('======================================')
            print(data)
        left_dataset = None
        right_dataset = None
        min_gini = np.inf
        best_feature = None
        feature_values = None

        gini = len(data)*DecisionTree.giniIndex(data,self.target)
        if self.v==1:
            print('GiniIndex for total data = ',gini)
        node = Node()
        node.data = data
        node.level = level
        node.gini = gini

        #If max depth is reached, make it a leaf node
        if self.depth is not None and level==self.depth:
            node.leaf = True
            node.result = len(data[data[self.target]==True])>=len(data[data[self.target]==False])
            return node

        #If there is no impurity in this data, make it a leaf node
        if gini==0:
            if self.v==1:
                print('The data is pure, no split is needed')
            node.gini = 0
            node.leaf = True
            node.result = len(data[data[self.target]==True])>=len(data[data[self.target]==False])
            return node

        for feature in data.columns:
            #We need not split on the target column
            if feature == self.target:
                continue

            #Find all unique values for the feature
            unique = data[feature].unique()
            if self.v==1:
                print('________________________________________________________')
                print('Evaluating all possible splits for the feature "',feature,'"')
                print('________________________________________________________')
                print('All the values for this feature are ',unique)

            #Initialize gini, left and right datasets and best feature values
            tmngini = np.inf
            tldset = None
            trdset = None
            tbftr = None
            #We can't split on a single value, there must be at least 2 unique values to be able to split
            if len(unique)==1:
                print('Ignoring this feature as it has only a single unique value')
                continue

            #To find the best values for a split on the given feature
            for st in range(1,2**len(unique)-1):
                #Pick the subset of values whose bit is set in st
                lvals = [unique[x] for x in [t[0] for t in enumerate(list(bin(st)[2:])[::-1]) if t[1]=='1']]
                #Find left data set
                lset = data[data[feature].isin(lvals)]
                rvals = list(set(unique)-set(lvals))
                #Find right data set
                rset = data[data[feature].isin(rvals)]
                #Avoid dealing with duplicate (mirror-image) splits
                if len(lvals)>len(rvals):
                    continue
                #Find gini index for the left split
                lgini = DecisionTree.giniIndex(lset,self.target)
                #Find gini impurity for the right split
                rgini = DecisionTree.giniIndex(rset,self.target)
                #Find the total weighted gini
                tgini = len(lset)*lgini+len(rset)*rgini
                if self.v==1:
                    print(' ******************************************')
                    print(' ***** split based on values ',lvals,'*****')
                    print(' ******************************************')
                    print('-----------------------')
                    print('Left dataset')
                    print('-----------------------')
                    print(lset)
                    print('-----------------------')
                    print('Right dataset')
                    print('-----------------------')
                    print(rset)
                    print('Weighted Gini for this split ',tgini)
                #Update the minimum gini for this feature
                if tgini<tmngini:
                    tmngini = tgini
                    tldset = lset
                    trdset = rset
                    tbftr = lvals
            if self.v==1:
                print('Best gini for ',feature,' is ',tmngini)

            #Update the overall minimum gini
            if tmngini<min_gini:
                min_gini = tmngini
                left_dataset = tldset
                right_dataset = trdset
                feature_values = tbftr
                best_feature = feature

        #No valid split was found, make it a leaf node
        if best_feature is None:
            node.leaf = True
            node.result = len(data[data[self.target]==True])>=len(data[data[self.target]==False])
            return node

        node.min_gini = min_gini
        node.feature_values = feature_values
        node.split = best_feature
        if self.v==1:
            print('Best split is "',best_feature,'" values are ',feature_values,' and GiniIndex is ',min_gini)

        #Build tree for the left dataset
        if left_dataset is not None:
            node.left = self.build(left_dataset,level+1)

        #Build tree for the right dataset
        if right_dataset is not None:
            node.right = self.build(right_dataset,level+1)

        #If neither subtree was built, this has to be a leaf
        if node.left is None and node.right is None:
            node.leaf = True

        return node

    def fit(self):
        self.root = self.build(self.data)
        return self.root

    def __predict__(self,s,root):
        if root is None:
            return False
        if root.leaf:
            return root.result
        if s[root.split] in root.feature_values:
            return self.__predict__(s,root.left)
        return self.__predict__(s,root.right)

    def predict(self,s):
        return self.__predict__(s,self.root)

I'll try to explain each method below.

def __init__(self,data,target='Eat',depth=None,v=0):
    self.data = data
    self.target = target
    self.root = None
    self.depth = depth
    self.v = v

This is the constructor. We initialize the tree with the data and the target variable; in this case, our target variable is 'Eat'. depth is the maximum allowed depth while building the tree, which is used to avoid overfitting. v stands for verbose: if we want to see the tree-building process, we pass v=1.
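
As a quick usage sketch (assuming the DataFrame df built earlier from the sample dict; the name tree is mine):

tree = DecisionTree(df, target='Eat', v=0)
root = tree.fit()   # builds the tree and returns the root node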

@staticmethod
def giniIndex(dset,target):
    return 1-(len(dset[dset[target]==True])/len(dset))**2-(len(dset[dset[target]==False])/len(dset))**2

This is the helper function used to calculate the Gini index of a dataset. We'll use the Gini index to find the best split. The formula for the Gini index is:

Gini = 1 − Σ p_i² (summed over the C classes), where p_i is the fraction of samples belonging to class i. (Formula image taken from http://www.learnbymarketing.com/481/decision-tree-flavors-gini-info-gain/)

C represents no. of classes. In our case C = 2 since it is a binary classifier.
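
As a quick sanity check, here is a sketch of the Gini index of the full sample dataset, which has 6 True and 4 False values in the 'Eat' column (again using the hypothetical df from earlier):

p_true = len(df[df['Eat'] == True]) / len(df)    # 6/10
p_false = len(df[df['Eat'] == False]) / len(df)  # 4/10
gini = 1 - p_true**2 - p_false**2                # 1 - 0.36 - 0.16 = 0.48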

The next method is the tree-building method. It's a bit large, so I'll give a high-level walkthrough of it.

left_dataset = None
right_dataset = None
min_gini = np.inf
best_feature = None
feature_values = None

Here I'm initializing variables for the left dataset (left_dataset), the right dataset (right_dataset), the best gini index (min_gini), the best feature (best_feature) and its feature values (feature_values).

gini = len(data)*DecisionTree.giniIndex(data,self.target)

Find the Gini index for the entire dataset, weighted by the number of rows (the same weighting is used later when comparing candidate splits).

node = Node()
node.data = data
node.level = level
node.gini = gini

Create a new node, set data, level and gini index

if self.depth is not None and level==self.depth:
node.leaf = True
node.result = len(data[data[self.target]==True]) >= len(data[data[self.target]==False])
return node

If the max depth is reached, we make this node a leaf. Here, result is True if there are more True-valued targets than False-valued targets, and False otherwise. This will be used when predicting the target for new data: while evaluating the target value for a new sample, the predicted value is simply the 'result' value of the leaf node we end up in.
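
For example, to cap the tree at two levels of splits (a sketch using the hypothetical df name from earlier):

shallow_tree = DecisionTree(df, target='Eat', depth=2)
shallow_tree.fit()   # any node that reaches level 2 becomes a leaf via this majority-vote rule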

#If there is no impurity in this data, make it a leaf node
if gini==0:
node.leaf = True
node.result = len(data[data[self.target]==True])>=len(data[data[self.target]==False])
return node

If the Gini index of the data is 0, the data is pure: all the target values are the same, so the node has to be a leaf since there is nothing useful left to split. We also set the result based on the count of target values: if there are more True-valued targets, result is set to True; otherwise it is set to False.

========================================

I think the snippet below is the most complicated of all.

To evaluate the best split, we need to iterate through all features.

for feature in data.columns:
    #We need not split on the target column
    if feature == self.target:
        continue

If feature name is same as the target, we can skip it.

unique = data[feature].unique()

Find all the unique values for a feature. E.g., data['Taste'].unique() gives ['Salty','Spicy','Sweet']. Now we need to evaluate which split will be the best by iterating through all the splits. There are 6 possible splits for the Taste feature.

They are ['Salty'], ['Spicy'], ['Sweet'], ['Salty' or 'Spicy'], ['Salty' or 'Sweet'] and ['Spicy' or 'Sweet']. We need to evaluate these 6 splits and find the best one.
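
The implementation enumerates these subsets with a bitmask trick: each integer st from 1 to 2^n − 2 selects the values whose bit is set. A standalone sketch of just that enumeration (assuming the same three Taste values):

unique = ['Salty', 'Spicy', 'Sweet']
for st in range(1, 2**len(unique) - 1):
    # bin(st)[2:] is the binary string; reverse it so bit i maps to unique[i]
    bits = list(bin(st)[2:])[::-1]
    lvals = [unique[i] for i, b in enumerate(bits) if b == '1']
    print(st, lvals)
# 1 ['Salty']
# 2 ['Spicy']
# 3 ['Salty', 'Spicy']
# 4 ['Sweet']
# 5 ['Salty', 'Sweet']
# 6 ['Spicy', 'Sweet']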

#Initialize gini, left and right datasets and best feature values
tmngini = np.inf
tldset = None
trdset = None
tbftr = None

Initialize temporary variables to store gini index, left data set, right data set and feature values of best split

lvals = [unique[x] for x in [t[0] for t in enumerate(list(bin(st)[2:])[::-1]) if t[1]=='1']]
#Find left data set
lset = data[data[feature].isin(lvals)]
rvals = list(set(unique)-set(lvals))
#Find right data set
rset = data[data[feature].isin(rvals)]
#Avoid dealing with duplicate sets
if len(lvals)>len(rvals):
continue
#Find gini index for left split
lgini = DecisionTree.giniIndex(lset,self.target)
#Find gini impurity for right split
rgini = DecisionTree.giniIndex(rset,self.target)
#Find the total weighted gini.
tgini = len(lset)*lgini+len(rset)*rgini

For all the available feature value subsets, split the data into left data and right data. Find the weighted gini index by combining gini indices for both left and right subsets.

Find the feature subset that gives the minimum weighted Gini. In the given example, the Taste feature with feature_values ['Sweet'] gives the best split, with a weighted Gini of 4.0 (the left subset is pure). By best, I mean the minimum weighted Gini.
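
Here is a sketch of that calculation for the ['Sweet'] split, worked out on the sample data (a standalone snippet reusing the giniIndex helper and the hypothetical df):

lset = df[df['Taste'].isin(['Sweet'])]        # 2 rows, both Eat == True
rset = df[~df['Taste'].isin(['Sweet'])]       # 8 rows, 4 True and 4 False
lgini = DecisionTree.giniIndex(lset, 'Eat')   # 0.0 (pure)
rgini = DecisionTree.giniIndex(rset, 'Eat')   # 0.5
tgini = len(lset)*lgini + len(rset)*rgini     # 2*0.0 + 8*0.5 = 4.0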

#Build tree for left dataset
if left_dataset is not None:
node.left = self.build(left_dataset,level+1)

#Build tree for right dataset
if right_dataset is not None:
node.right = self.build(right_dataset,level+1)

Now, repeat tree building process for left dataset and right data set recursively.

def __predict__(self,s,root):
if root is None:
return False
if root.leaf:
return root.result
if s[root.split] in root.feature_values:
return self.__predict__(s,root.left)
return self.__predict__(s,root.right)

This is the predict method that predicts the target value for new data. We simply traverse from the root of the tree based on each node's split and feature_values. Once we hit a leaf node, the 'result' variable of that leaf gives our prediction.
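
For example, a quick sketch of predicting for a new, unseen food (the sample values here are my own, not from the post):

sample = {'Taste': 'Sweet', 'Temperature': 'Cold', 'Texture': 'Hard'}
print(tree.predict(sample))   # returns True or False, depending on the learned tree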

========================================

I tried to find the accuracy by feeding the tree the same dataset that I trained it with (resubstitution error). I got 100% accuracy, because I overfit the model by growing the tree down to the deepest leaf nodes. This clearly must not be done in practice; my only intention is to show that the tree has been built successfully.

screenshot of the output
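
A sketch of that resubstitution check, using the hypothetical df and tree names from the earlier snippets:

correct = sum(tree.predict(row) == row['Eat'] for _, row in df.iterrows())
print('Accuracy:', correct / len(df))   # 1.0 on the training data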

code link : https://github.com/prudvinit/MyML/blob/master/lib/tree/DecisionTreeClassifier.py
