On Decision Trees and Entropy
A look into Decision Tree Algorithm & Entropy functions
In the realm of Predictive Analytics, the Decision Tree is one of the algorithms that can be applied to both Regression & Classification tasks.
The idea behind Decision Trees is to recursively build an upside-down, tree-like structure from the features of the dataset, based on how much each feature contributes to the response variable at that point. At each iteration, a feature is selected such that the resultant model minimizes the cost function.
The structure starts with the root node at the top, which branches out & connects to other nodes, ultimately leading to the terminal nodes or leaves of the tree. Each node in the tree represents a feature, each link or branch represents a decision, & each leaf represents an outcome (a category or continuous value of the response variable).
Pros & Cons
The simplicity of Decision Trees lies in the way the model is created: by determining the most significant feature at each point. Since the algorithm assumes no particular relationship between the variables, linear or otherwise, it can be applied to virtually any set of data. In addition, extensive data manipulation is not required before applying Decision Trees, unlike with many other algorithms.
It is sometimes referred to as a greedy algorithm, because at every point it tries to minimize the cost function to the maximum possible extent. This aggressive minimization can lead to overfitting of the training data, which in turn leads to high variance when predicting on the test data. Techniques such as Pruning or Bagging are often applied to account for this.
Types of Decision Trees
Depending on the cost minimization technique being used, Decision Trees can be classified into several types, two significant ones being:
- CART (Classification and Regression Tree) — Uses Gini Impurity measure to calculate the Information Gain at each iteration
- ID3 (Iterative Dichotomiser 3) — Uses Entropy function to calculate the Information Gain metric
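As a quick illustration of the CART criterion above, the Gini Impurity of a node can be sketched as follows (a minimal helper of my own, assuming class counts as input — not a function from any library):

```python
def gini_impurity(counts):
    """Gini Impurity of a node: 1 - sum(p_i^2) over its class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# 0 for a pure node, maximal (0.5 for two classes) for an even split
print(gini_impurity([20, 0]))   # 0.0
print(gini_impurity([10, 10]))  # 0.5
```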
Here, we will take a look at the Entropy function for ID3 Decision Trees & devise an algorithm to calculate the entropy at any iteration.
Entropy & Information Gain
The entropy of each unique value of a feature is calculated as:

E(S) = -Σ pᵢ log(pᵢ)

where pᵢ is the proportion of records in subset S that belong to class i (the natural logarithm is used throughout this article).

The Information Gain for the feature is then calculated as:

IG(T, X) = E(T) - Σ (|Tᵥ| / |T|) * E(Tᵥ)

where E(T) is the entropy of the response variable & Tᵥ is the subset of records for which feature X takes value v.
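A minimal sketch of these two formulas in Python (assuming class counts as input & the natural logarithm, which is what the worked numbers in this article use):

```python
import numpy as np

def entropy(counts):
    """E(S) = -sum(p_i * ln(p_i)), where p_i = class count / subset size."""
    n = sum(counts)
    return -sum((c / n) * np.log(c / n) for c in counts if c > 0)

def information_gain(target_entropy, branches):
    """IG = E(T) - sum((|T_v| / |T|) * E(T_v)) over each branch's class counts."""
    total = sum(sum(b) for b in branches)
    info = sum((sum(b) / total) * entropy(b) for b in branches)
    return target_entropy - info

# Counts from the worked "color" example below: 19/13 TRUE/FALSE among
# YELLOW balloons & 12/16 among PURPLE, with E(Inflated) = 0.693
print(round(entropy([19, 13]), 3))  # 0.675
```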
Implementation
We will use the Balloons dataset from the UCI data repository here. It represents the different conditions of an experiment, used to determine the response variable “Inflated” in terms of the 4 predictor features: color, size, act & age.
# data   = Balloons dataset (as a pandas DataFrame)
# N      = number of columns
# target = response variable
# total  = total number of records
# en     = entropy of the target variable
# cats   = dictionary of counts of unique values for the response variable
for i in range(0, N - 1):
    x = data.columns[i]
    vals = data[x].value_counts().to_dict()  # counts of unique values for current feature
    info = 0
    for k, v in vals.items():
        ent = 0
        for k1 in cats.keys():
            n = data.loc[(data[target] == k1) & (data[x] == k), x].count()
            if n > 0:
                ent = ent + -(n / v) * np.log(n / v)  # entropy contribution of class k1
        info = info + (v / total) * ent  # weighted entropy (Information) for the feature
    gain = en - info  # Information Gain
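To see the loop in action end-to-end, here is a self-contained sketch run on a toy stand-in for the dataset (the column names follow the article, but the values below are illustrative, not the actual Balloons records):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Balloons dataset
data = pd.DataFrame({
    "color":    ["YELLOW", "YELLOW", "PURPLE", "PURPLE", "YELLOW", "PURPLE"],
    "inflated": ["T",      "T",      "F",      "F",      "F",      "T"],
})

target = "inflated"
total = len(data)
cats = data[target].value_counts().to_dict()
en = -sum((c / total) * np.log(c / total) for c in cats.values())  # E(target)

gains = {}
for x in [c for c in data.columns if c != target]:
    vals = data[x].value_counts().to_dict()  # counts of unique values for feature x
    info = 0
    for k, v in vals.items():
        ent = 0
        for k1 in cats:
            n = data.loc[(data[target] == k1) & (data[x] == k), x].count()
            if n > 0:
                ent += -(n / v) * np.log(n / v)
        info += (v / total) * ent
    gains[x] = en - info  # Information Gain of feature x w.r.t. the target

print(gains)
```

The feature with the largest value in `gains` would become the node for that iteration.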
Behind the first iteration
Let’s take a look at how Entropy & Information Gain are calculated for the 1st iteration, using the above function:
- Calculate Entropy & Information Gain w.r.t. “Inflated”
Column “color”:
‘YELLOW’: 32, ‘PURPLE’: 28
“Color” YELLOW with “Inflated” TRUE — 19
“Color” YELLOW with “Inflated” FALSE — 13
“Color” PURPLE with “Inflated” TRUE — 12
“Color” PURPLE with “Inflated” FALSE — 16
E(YELLOW) = (-19/32)*log(19/32) + (-13/32)*log(13/32) = 0.675
E(PURPLE) = (-12/28)*log(12/28) + (-16/28)*log(16/28) = 0.682
I(Color) = (32/60)*0.675 + (28/60)*0.682 = 0.678
IG(Color) = E(Inflated) - I(Color) = 0.693 - 0.678 = 0.0149

- Similarly, calculate Entropy & Information Gain for the remaining columns:
IG(Size) = 0.0148
IG(Act) = 0.131
IG(Age) = 0.130

- Column “Act” is selected as the root node, since it has the highest Information Gain
Next Steps
The algorithm would then perform the following steps recursively to construct the Decision Tree (beyond the scope of this article):
- The feature with the highest Information Gain would be assigned as the node for that iteration
- The branches from that node would be formed from each unique value (condition/decision) possible at that node
- The branches would lead to other nodes, depending on the subsequent features & conditions
- In case there are no further possible features or conditions, leaf nodes would be created & no further branching would be done
In this way, the Decision Tree would be built recursively. The model can then be applied to predict values or categories for the response variable