Decision Trees Made Simple, Part II

Jose · The Startup · Dec 16, 2020

Implementing decision trees with Python


Last time, we explored the basic math foundation behind the decision tree algorithm: we went over concepts such as entropy and information gain, and how these ideas tie together in the ID3 algorithm to form the basis of decision trees. Now we are going to walk through a simple Python implementation that shows, at a basic level, how the algorithm solves a simple classification problem.

The Data

In order to track the results of the algorithm, it is always best to have a small toy dataset that makes it easier to visualize and understand the algorithm's output. For this exercise, I'm going to use a very simple dataset taken from the Fundamentals of Machine Learning book (2015) that is set up specifically to test an ID3 implementation.

Implementation dataset:

ID  Stream  Slope     Elevation  Vegetation
1   false   steep     high       chapparal
2   true    moderate  low        riparian
3   true    steep     medium     riparian
4   false   steep     medium     chapparal
5   false   flat      high       conifer
6   true    steep     highest    conifer
7   true    steep     high       chapparal

The goal here is to predict which type of vegetation we are going to observe (chapparal, riparian or conifer) based on certain descriptive features of the terrain (stream, slope, elevation). From this, we understand that our training variables are going to be STREAM, SLOPE and ELEVATION, while VEGETATION is going to be our prediction target variable.

In Python, there are several ways to load the data to be used by the algorithm. Since this is a very small dataset, we can split it into three distinct Python lists for the sake of simplicity:

# Training variable names/tags
trainingTags=["Stream","Slope","Elevation"]
# Training variable data (one nested list per column of the table)
trainingSet=[["false","true","true","false","false","true","true"],
             ["steep","moderate","steep","steep","flat","steep","steep"],
             ["high","low","medium","medium","high","highest","high"]]
# Target variable data
targetSet=["chapparal","riparian","riparian","chapparal","conifer",
           "conifer","chapparal"]

Note that the trainingSet list consists of nested lists; each nested list corresponds to a column in the data table. It is important to realize that this is not the only way to store the training data: other options, such as dictionaries or nested lists in which each list corresponds to a row of the table instead of a column, are also possible. Depending on how the data is stored, the implementation needs to be modified to ensure that the algorithm accesses the observations correctly.
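
For illustration, a row-oriented alternative could look like the sketch below. This variant is not used in the rest of the article; it only shows how the same toy data could be stored with one inner list per observation, and how to convert it back to the column-oriented layout used here:

# Row-oriented storage: each inner list is one observation (Stream, Slope, Elevation, Vegetation)
rows=[["false","steep","high","chapparal"],
      ["true","moderate","low","riparian"],
      ["true","steep","medium","riparian"],
      ["false","steep","medium","chapparal"],
      ["false","flat","high","conifer"],
      ["true","steep","highest","conifer"],
      ["true","steep","high","chapparal"]]
# Transpose back into the column-oriented lists used by the implementation
columns=list(map(list,zip(*rows)))
trainingSet,targetSet=columns[:3],columns[3]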

Designing the Solution

In the last entry, I briefly noted how the ID3 algorithm performs its classification task. Although the procedure is relatively simple, it does require some effort to implement, mainly to incorporate the entropy-based dataset partitioning. To achieve this, I designed a Python class called DataSet that handles most of the operations related to partitioning and data reading, while an external function runs the ID3 procedure over the DataSet class. DataSet will, in turn, call three independent functions: entropy, rem_entropy and partition.

Solution Architecture

Let’s take a closer look at the partition functions. Why do we need two different functions, entropy and rem_entropy?

import math

# Calculates the entropy of a single feature
def entropy(feature):
    frequencies={}
    categories=[]
    prob_distribution={}
    feature_entropy=0
    # The feature is classified according to its possible instances
    for i in feature:
        if i not in categories:
            categories.append(i)
            frequencies[i]=1
            prob_distribution[i]=0
        else:
            frequencies[i]+=1
    # We calculate the probability distribution of the feature
    # and the feature entropy
    for i in frequencies:
        prob_i=frequencies[i]/len(feature)
        prob_distribution[i]=prob_i
        feature_entropy+=-(prob_distribution[i]*math.log(prob_distribution[i],2))
    return feature_entropy

# Calculates the remaining entropy of a given descriptive feature
def rem_entropy(feature,partitions):
    categories=[]
    frequencies={}
    prob_distribution=[]
    for i in feature:
        if i not in categories:
            categories.append(i)
            frequencies[i]=1
        else:
            frequencies[i]+=1
    for i in frequencies:
        prob_distribution.append(frequencies[i]/len(feature))
    total_entropy=0
    # Weighted sum of each partition's entropy, weighted by how often
    # the corresponding feature value appears
    for i in range(len(partitions)):
        total_entropy+=entropy(partitions[i])*prob_distribution[i]
    return total_entropy

The entropy function calculates the total entropy of a single variable with respect to itself (regardless of whether it is a training or a target variable), whereas the rem_entropy function calculates the remaining entropy of a partitioned dataset. Recall that the remaining entropy is the weighted sum of the entropies of the partitions obtained by splitting the dataset on the value range of one of the training variables STREAM, SLOPE or ELEVATION; subtracting it from the dataset's entropy gives the information gain for that variable. The partition function handles the process of generating the partitioned datasets for all value levels.
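
As a quick sanity check (a minimal usage sketch with the toy lists defined earlier): the full target variable contains three chapparal, two riparian and two conifer observations, so its entropy comes out to roughly 1.557 bits:

# H(VEGETATION) = -(3/7*log2(3/7) + 2/7*log2(2/7) + 2/7*log2(2/7)) ≈ 1.557 bits
print(entropy(targetSet))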

# The partition function splits the data set for a given descriptive feature into
# subsets of the target, to be analyzed individually for entropy
def partition(feature,target):
    categories=[]
    partitions=[]
    # We evaluate how many partitions can be made based on the values of the
    # descriptive feature, and create as many (empty) subsets as we need
    for i in feature:
        if i not in categories:
            categories.append(i)
    for i in categories:
        partitions.append([])
    # We populate the subsets with the corresponding target values
    for i in range(len(feature)):
        partition_set=categories.index(feature[i])
        partitions[partition_set].append(target[i])
    return partitions
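
Putting the two functions together (again, only a sketch on the toy data): partitioning the target by the Elevation column and passing the result to rem_entropy gives a remaining entropy of about 0.68 bits, which corresponds to an information gain of roughly 0.88 bits, the highest of the three descriptive features:

elevation=trainingSet[2]                   # the Elevation column
subsets=partition(elevation,targetSet)     # target values grouped by elevation level
rem=rem_entropy(elevation,subsets)         # ≈ 0.679 bits
info_gain=entropy(targetSet)-rem           # ≈ 0.877 bits
print(subsets,rem,info_gain)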

To better handle the partition data, I created an auxiliary class called Partition that stores both the training and target data, along with the partition's entropy and the feature value that defines the partition.

class Partition():
    def __init__(self):
        self.Feature=""
        self.Training=[]
        self.Target=[]
        self.TrainingTags=[]
        self.Entropy=0
    def getTraining(self):
        return self.Training
    def getTarget(self):
        return self.Target
    def getTrainingTags(self):
        return self.TrainingTags
    def getFeature(self):
        return self.Feature

Combining all of the above, the DataSet class handles the behavior of selecting the feature that will be used to partition the dataset, based on the entropy information gain.

class DataSet():
    def __init__(self):
        self.Parent=None
        self.Partitions=[]
        self.Children={}
        self.TrainingSet=[]
        self.TargetData=[]
        self.TrainingTags=[]
        self.PartitionFeature=""
        self.leaf=False
        self.isEmpty=False
        self.classified=False

    def selectPartitionFeature(self,setEntropy):
        training=self.TrainingSet
        target=self.TargetData
        selectedIndex=0
        highestInfoGain=0
        for i in training:
            setPartition=partition(i,target)
            partitionEntropy=rem_entropy(i,setPartition)
            infoGain=setEntropy-partitionEntropy
            if infoGain>highestInfoGain:
                highestInfoGain=infoGain
                selectedIndex=training.index(i)
        self.PartitionFeature=self.TrainingTags[selectedIndex]
        return self.PartitionFeature

    def createChildren(self):
        for i in self.Partitions:
            child=DataSet()
            child.Parent=self
            child.TrainingSet=i.Training
            child.TargetData=i.Target
            child.TrainingTags=i.TrainingTags
            child.Feature=i.Feature
            self.Children[i.Feature]=child
        return self.Children

    def createPartitions(self):
        if self.PartitionFeature!="":
            partitionIndex=self.TrainingTags.index(self.PartitionFeature)
        else:
            partitionIndex=0
        categories=[]
        # Creates one Partition object per distinct value
        # of the selected feature
        for i in self.TrainingSet[partitionIndex]:
            if i not in categories:
                categories.append(i)
                new_partition=Partition()
                new_partition.Feature=i
                self.Partitions.append(new_partition)
        # Populates each partition's training and target data based on the
        # selected feature, appending one list per remaining training feature
        # (the feature used for the split is removed from the children's data)
        featureData=self.TrainingSet[partitionIndex]
        # Work on copies so that removing the used feature does not mutate
        # the parent's lists, which may be shared with sibling branches
        filteredData=list(self.TrainingSet)
        filteredTags=list(self.TrainingTags)
        del filteredData[partitionIndex]
        del filteredTags[partitionIndex]

        for i in self.Partitions:
            i.TrainingTags=filteredTags
            feature=i.Feature
            k=0
            for m in filteredData:
                i.Training.append([])
            while k<len(featureData):
                if featureData[k]==feature:
                    for j in range(len(filteredData)):
                        trainingValue=filteredData[j][k]
                        i.Training[j].append(trainingValue)
                    targetValue=self.TargetData[k]
                    i.Target.append(targetValue)
                k+=1
        return self.Partitions
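
To see the selection step in isolation (a small usage sketch of my own, assuming the toy lists from the beginning of the article), we can build a DataSet by hand and ask it which feature it would split on; given the information gains computed earlier, it should pick Elevation:

root=DataSet()
root.TrainingSet=[list(col) for col in trainingSet]   # copies of the toy columns
root.TrainingTags=list(trainingTags)
root.TargetData=list(targetSet)
print(root.selectPartitionFeature(entropy(targetSet)))  # -> Elevation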

The method createChildren creates a tree structure, with a reference to the unpartitioned dataset as the Parent and each of the partitions as Children. This will facilitate the construction of the decision tree, as we will see shortly.
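
Continuing the same sketch, calling createPartitions and then createChildren on that root should produce one child DataSet per elevation level:

root.createPartitions()
children=root.createChildren()
print(list(children.keys()))   # -> ['high', 'low', 'medium', 'highest']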

ID3 Implementation

After designing the methods to partition any given dataset using the information gain criterion, all that remains is to recursively repeat the partition procedure until the partitions reach their state of minimum entropy (i.e., they are completely classified, or as close to it as they can be).

def ID3(dataSet,setEntropy,Tags,tree=None):
    dataSet.selectPartitionFeature(setEntropy)
    node=dataSet.PartitionFeature
    dataSet.createPartitions()
    children=dataSet.createChildren()
    if tree is None:
        tree={}
        tree[node]={}
    for i in children:
        # A partition with a single target entry is fully classified: store it as a leaf
        if len(children[i].TargetData)==1:
            tree[node][i]=children[i].TargetData
        else:
            # Otherwise, keep partitioning the child dataset recursively
            children[i].TrainingTags=Tags
            tree[node][i]=ID3(children[i],setEntropy,Tags)
    return tree

ID3 takes a DataSet object, the entropy of its target and the variable names as inputs, and then begins the recursive partitioning process. The recursion stops when a particular partition has only one target data entry (meaning that the set is completely classified and that the child is a leaf node of the tree). If that’s not the case, the function calls itself with the partition as the new dataset.
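
Putting everything together, a minimal driver could look like the sketch below (this wiring is my own, not necessarily the linked full implementation): build a fresh root DataSet from the toy lists, compute the entropy of the target, and hand both to ID3. The result is a nested dictionary whose outermost key should be Elevation, the feature with the highest information gain, and whose leaves hold the classified vegetation labels.

# Hypothetical driver, assuming the functions and classes defined above
root=DataSet()
root.TrainingSet=[list(col) for col in trainingSet]
root.TrainingTags=list(trainingTags)
root.TargetData=list(targetSet)

setEntropy=entropy(targetSet)              # entropy of the unpartitioned target, ~1.557 bits
tree=ID3(root,setEntropy,list(trainingTags))
print(tree)                                # nested dict rooted at "Elevation"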

Final thoughts

In my opinion, the real challenge of implementing ID3 comes from the need to accurately partition the datasets and to create the parent-child references that allow the tree to be built. This implementation is also the simplest, most rudimentary version of the algorithm.

Other variations of the algorithm may use a different feature selection metric instead of information gain, such as the Gini index or the information gain ratio. All of these metrics have advantages and drawbacks that should be weighed when running a machine learning model, to see which one gives the best prediction results. Furthermore, the algorithm’s behavior can be improved by using tree pruning techniques and by producing ensemble models; a single run of ID3 is unlikely to solve a real-world problem in a comprehensive manner.
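
For reference, a Gini-based impurity measure could be dropped in where the entropy function is used; the following is only a sketch of the idea, not part of the original implementation:

# Gini impurity: 1 minus the sum of squared class probabilities
def gini(feature):
    frequencies={}
    for i in feature:
        frequencies[i]=frequencies.get(i,0)+1
    return 1-sum((count/len(feature))**2 for count in frequencies.values())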

Typically, decision trees are limited to handling categorical features with a finite value range, but they can also be used on continuous numerical variables, provided that the variable is previously binned and converted into a categorical variable before running it through the algorithm.
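
As a purely hypothetical illustration (both the bin boundaries and the sample values below are made up for the example), a numeric elevation in metres could be discretized into the categorical levels used above before being fed to the same ID3 code:

# Made-up bin boundaries, for illustration only
def bin_elevation(metres):
    if metres<500:
        return "low"
    elif metres<1500:
        return "medium"
    elif metres<2500:
        return "high"
    return "highest"

elevation_column=[bin_elevation(m) for m in [3900,300,1500,1200,4450,5000,3000]]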

Full implementation can be found here:
