Decision Trees Made Simple, Part II

Jose · The Startup · Dec 16, 2020

Implementing decision trees with Python


Last time, we explored the basic math foundation behind the decision tree algorithm: we went over concepts such as entropy and information gain, and how these ideas tie together in the ID3 algorithm to form the basis of decision trees. Now we are going to walk through a simple Python implementation that shows, at a basic level, how the algorithm solves a simple classification problem.

The Data

In order to track the results of the algorithm, it is always best to have a small toy dataset that makes it easier to visualize and understand the algorithm's output. For this exercise, I'm going to use a very simple dataset taken from the Fundamentals of Machine Learning book (2015) that is set up specifically to test an ID3 implementation.

Implementation dataset:

ID  Stream  Slope     Elevation  Vegetation
1   false   steep     high       chapparal
2   true    moderate  low        riparian
3   true    steep     medium     riparian
4   false   steep     medium     chapparal
5   false   flat      high       conifer
6   true    steep     highest    conifer
7   true    steep     high       chapparal

The goal here is to predict which type of vegetation we are going to observe (chapparal, riparian or conifer) based on certain descriptive features of the terrain (stream, slope, elevation). From this, we understand that our training variables are going to be STREAM, SLOPE and ELEVATION, while VEGETATION is going to be our prediction target variable.

In Python, there are several ways to load the data to be used by the algorithm. Since this is a very small dataset, we can split it into three distinct Python lists for the sake of simplicity:

# Training variable names/tags
trainingTags=["Stream","Slope","Elevation"]
# Training variable data (one nested list per column of the table)
trainingSet=[["false","true","true","false","false","true","true"],
             ["steep","moderate","steep","steep","flat","steep","steep"],
             ["high","low","medium","medium","high","highest","high"]]
# Target variable data
targetSet=["chapparal","riparian","riparian","chapparal","conifer",
           "conifer","chapparal"]

Note that the trainingSet list consists of nested lists; each nested list corresponds to a column in the data table. It is important to realize that this is not the only way to store the training data: other options, such as dictionaries or nested lists in which each list corresponds to a row of the table instead of a column, are also possible. Depending on how the data is stored, the implementation needs to be modified to ensure that the algorithm accesses the observations correctly.
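
For illustration, a row-oriented alternative could look like the sketch below. This variant is not used in the rest of the article; it only shows how the same toy data could be stored with one inner list per observation, and how to convert it back to the column-oriented layout used here:

# Row-oriented storage: each inner list is one observation (Stream, Slope, Elevation, Vegetation)
rows=[["false","steep","high","chapparal"],
      ["true","moderate","low","riparian"],
      ["true","steep","medium","riparian"],
      ["false","steep","medium","chapparal"],
      ["false","flat","high","conifer"],
      ["true","steep","highest","conifer"],
      ["true","steep","high","chapparal"]]
# Transpose back into the column-oriented lists used by the implementation
columns=list(map(list,zip(*rows)))
trainingSet,targetSet=columns[:3],columns[3]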

Designing the Solution

In the last entry, I briefly noted how the ID3 algorithm performs its classification task. Although the procedure is relatively simple, it does require some effort to implement, mainly to incorporate the entropy-based dataset partitioning. To achieve this, I designed a Python class called DataSet that handles most of the operations related to partitioning and data reading, while an external function runs the ID3 procedure over the DataSet class. DataSet will, in turn, call three independent functions: entropy, rem_entropy and partition.

Solution Architecture

Let’s take a closer look at the partition functions. Why do we need two different functions, entropy and rem_entropy?

import math

# Calculates the entropy of a single feature
def entropy(feature):
    frequencies={}
    categories=[]
    prob_distribution={}
    feature_entropy=0
    # The feature is classified according to its possible instances
    for i in feature:
        if i not in categories:
            categories.append(i)
            frequencies[i]=1
            prob_distribution[i]=0
        else:
            frequencies[i]+=1
    # We calculate the probability distribution of the feature
    # and the feature entropy
    for i in frequencies:
        prob_i=frequencies[i]/len(feature)
        prob_distribution[i]=prob_i
        feature_entropy+=-(prob_distribution[i]*math.log(prob_distribution[i],2))
    return feature_entropy

# Calculates the remaining entropy of a given descriptive feature
def rem_entropy(feature,partitions):
    categories=[]
    frequencies={}
    prob_distribution=[]
    for i in feature:
        if i not in categories:
            categories.append(i)
            frequencies[i]=1
        else:
            frequencies[i]+=1
    for i in frequencies:
        prob_distribution.append(frequencies[i]/len(feature))
    total_entropy=0
    # Weighted sum of each partition's entropy, weighted by how often
    # the corresponding feature value appears
    for i in range(len(partitions)):
        total_entropy+=entropy(partitions[i])*prob_distribution[i]
    return total_entropy

The entropy function calculates the total entropy of a single variable with respect to itself (regardless of whether it is a training or a target variable), whereas the rem_entropy function calculates the remaining entropy of a partitioned dataset. Recall that the remaining entropy is the weighted sum of the entropies of the partitions obtained by splitting the dataset on the value range of one of the training variables STREAM, SLOPE or ELEVATION; subtracting it from the dataset's entropy gives the information gain for that variable. The partition function handles the process of generating the partitioned datasets for all value levels.
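
As a quick sanity check (a minimal usage sketch with the toy lists defined earlier): the full target variable contains three chapparal, two riparian and two conifer observations, so its entropy comes out to roughly 1.557 bits:

# H(VEGETATION) = -(3/7*log2(3/7) + 2/7*log2(2/7) + 2/7*log2(2/7)) ≈ 1.557 bits
print(entropy(targetSet))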

# The partition function splits the data set for a given descriptive feature into
# subsets of the target, to be analyzed individually for entropy
def partition(feature,target):
    categories=[]
    partitions=[]
    # We evaluate how many partitions can be made based on the values of the
    # descriptive feature, and create as many (empty) subsets as we need
    for i in feature:
        if i not in categories:
            categories.append(i)
    for i in categories:
        partitions.append([])
    # We populate the subsets with the corresponding target values
    for i in range(len(feature)):
        partition_set=categories.index(feature[i])
        partitions[partition_set].append(target[i])
    return partitions
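
Putting the two functions together (again, only a sketch on the toy data): partitioning the target by the Elevation column and passing the result to rem_entropy gives a remaining entropy of about 0.68 bits, which corresponds to an information gain of roughly 0.88 bits, the highest of the three descriptive features:

elevation=trainingSet[2]                   # the Elevation column
subsets=partition(elevation,targetSet)     # target values grouped by elevation level
rem=rem_entropy(elevation,subsets)         # ≈ 0.679 bits
info_gain=entropy(targetSet)-rem           # ≈ 0.877 bits
print(subsets,rem,info_gain)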

To better handle the partition data, I created an auxiliary class called Partition that stores both the training and target data, along with the partition's entropy and the feature value that defines the partition.

class Partition():
    def __init__(self):
        self.Feature=""
        self.Training=[]
        self.Target=[]
        self.TrainingTags=[]
        self.Entropy=0
    def getTraining(self):
        return self.Training
    def getTarget(self):
        return self.Target
    def getTrainingTags(self):
        return self.TrainingTags
    def getFeature(self):
        return self.Feature

Combining all of the above, the DataSet class handles the behavior of selecting the feature that will be used to partition the dataset, based on the entropy information gain.

class DataSet():
    def __init__(self):
        self.Parent=None
        self.Partitions=[]
        self.Children={}
        self.TrainingSet=[]
        self.TargetData=[]
        self.TrainingTags=[]
        self.PartitionFeature=""
        self.leaf=False
        self.isEmpty=False
        self.classified=False

    def selectPartitionFeature(self,setEntropy):
        training=self.TrainingSet
        target=self.TargetData
        selectedIndex=0
        highestInfoGain=0
        for i in training:
            setPartition=partition(i,target)
            partitionEntropy=rem_entropy(i,setPartition)
            infoGain=setEntropy-partitionEntropy
            if infoGain>highestInfoGain:
                highestInfoGain=infoGain
                selectedIndex=training.index(i)
        self.PartitionFeature=self.TrainingTags[selectedIndex]
        return self.PartitionFeature

    def createChildren(self):
        for i in self.Partitions:
            child=DataSet()
            child.Parent=self
            child.TrainingSet=i.Training
            child.TargetData=i.Target
            child.TrainingTags=i.TrainingTags
            child.Feature=i.Feature
            self.Children[i.Feature]=child
        return self.Children

    def createPartitions(self):
        if self.PartitionFeature!="":
            partitionIndex=self.TrainingTags.index(self.PartitionFeature)
        else:
            partitionIndex=0
        categories=[]
        # Creates one Partition object per distinct value
        # of the selected feature
        for i in self.TrainingSet[partitionIndex]:
            if i not in categories:
                categories.append(i)
                new_partition=Partition()
                new_partition.Feature=i
                self.Partitions.append(new_partition)
        # Populates each partition's training and target data based on the
        # selected feature, appending one list per remaining training feature
        # (the feature used for the split is removed from the children's data)
        featureData=self.TrainingSet[partitionIndex]
        # Work on copies so that removing the used feature does not mutate
        # the parent's lists, which may be shared with sibling branches
        filteredData=list(self.TrainingSet)
        filteredTags=list(self.TrainingTags)
        del filteredData[partitionIndex]
        del filteredTags[partitionIndex]

        for i in self.Partitions:
            i.TrainingTags=filteredTags
            feature=i.Feature
            k=0
            for m in filteredData:
                i.Training.append([])
            while k<len(featureData):
                if featureData[k]==feature:
                    for j in range(len(filteredData)):
                        trainingValue=filteredData[j][k]
                        i.Training[j].append(trainingValue)
                    targetValue=self.TargetData[k]
                    i.Target.append(targetValue)
                k+=1
        return self.Partitions
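
To see the selection step in isolation (a small usage sketch of my own, assuming the toy lists from the beginning of the article), we can build a DataSet by hand and ask it which feature it would split on; given the information gains computed earlier, it should pick Elevation:

root=DataSet()
root.TrainingSet=[list(col) for col in trainingSet]   # copies of the toy columns
root.TrainingTags=list(trainingTags)
root.TargetData=list(targetSet)
print(root.selectPartitionFeature(entropy(targetSet)))  # -> Elevation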

The method createChildren creates a tree structure, with a reference to the unpartitioned dataset as the Parent and each of the partitions as Children. This will facilitate the construction of the decision tree, as we will see shortly.
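
Continuing the same sketch, calling createPartitions and then createChildren on that root should produce one child DataSet per elevation level:

root.createPartitions()
children=root.createChildren()
print(list(children.keys()))   # -> ['high', 'low', 'medium', 'highest']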

ID3 Implementation

After designing the methods to partition any given dataset using the information gain criterion, all that remains is to recursively repeat the partition procedure until the partitions reach their state of minimum entropy (i.e., they are completely classified, or as close to it as they can be).

def ID3(dataSet,setEntropy,Tags,tree=None):
    dataSet.selectPartitionFeature(setEntropy)
    node=dataSet.PartitionFeature
    dataSet.createPartitions()
    children=dataSet.createChildren()
    if tree is None:
        tree={}
        tree[node]={}
    for i in children:
        # A partition with a single target entry is fully classified: store it as a leaf
        if len(children[i].TargetData)==1:
            tree[node][i]=children[i].TargetData
        else:
            # Otherwise, keep partitioning the child dataset recursively
            children[i].TrainingTags=Tags
            tree[node][i]=ID3(children[i],setEntropy,Tags)
    return tree

ID3 takes a DataSet object, the entropy of its target and the variable names as inputs, and then begins the recursive partitioning process. The recursion stops when a particular partition has only one target data entry (meaning that the set is completely classified and that the child is a leaf node of the tree). If that’s not the case, the function calls itself with the partition as the new dataset.
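
Putting everything together, a minimal driver could look like the sketch below (this wiring is my own, not necessarily the linked full implementation): build a fresh root DataSet from the toy lists, compute the entropy of the target, and hand both to ID3. The result is a nested dictionary whose outermost key should be Elevation, the feature with the highest information gain, and whose leaves hold the classified vegetation labels.

# Hypothetical driver, assuming the functions and classes defined above
root=DataSet()
root.TrainingSet=[list(col) for col in trainingSet]
root.TrainingTags=list(trainingTags)
root.TargetData=list(targetSet)

setEntropy=entropy(targetSet)              # entropy of the unpartitioned target, ~1.557 bits
tree=ID3(root,setEntropy,list(trainingTags))
print(tree)                                # nested dict rooted at "Elevation"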

Final thoughts

In my opinion, the real challenge of implementing ID3 comes from the need to accurately partition the datasets and to create the parent-child references that allow the tree to be built. This implementation is also the simplest, most rudimentary version of the algorithm.

Other variations of the algorithm may use a different feature selection metric instead of information gain, such as the Gini index or the information gain ratio. All of these metrics have advantages and drawbacks that should be weighed when running a machine learning model, to see which one gives the best prediction results. Furthermore, the algorithm’s behavior can be improved by using tree pruning techniques and by producing ensemble models; a single run of ID3 is unlikely to solve a real-world problem in a comprehensive manner.
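
For reference, a Gini-based impurity measure could be dropped in where the entropy function is used; the following is only a sketch of the idea, not part of the original implementation:

# Gini impurity: 1 minus the sum of squared class probabilities
def gini(feature):
    frequencies={}
    for i in feature:
        frequencies[i]=frequencies.get(i,0)+1
    return 1-sum((count/len(feature))**2 for count in frequencies.values())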

Typically, decision trees are limited to handling categorical features with a finite value range, but they can also be used on continuous numerical variables, provided that the variable is previously binned and converted into a categorical variable before running it through the algorithm.
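
As a purely hypothetical illustration (both the bin boundaries and the sample values below are made up for the example), a numeric elevation in metres could be discretized into the categorical levels used above before being fed to the same ID3 code:

# Made-up bin boundaries, for illustration only
def bin_elevation(metres):
    if metres<500:
        return "low"
    elif metres<1500:
        return "medium"
    elif metres<2500:
        return "high"
    return "highest"

elevation_column=[bin_elevation(m) for m in [3900,300,1500,1200,4450,5000,3000]]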

Full implementation can be found here:
