On Decision Trees and Entropy
A look into Decision Tree Algorithm & Entropy functions
In the realm of Predictive Analytics, the Decision Tree is one of the algorithms that can be applied to both Regression & Classification tasks.
The idea behind Decision Trees is to recursively build an upside-down, tree-like structure from the features of the dataset, based on how much each feature contributes to the response variable at that point. At each iteration, a feature is selected such that the resultant model minimizes the cost function.
The structure starts with the root node at the top, which branches out & connects to other nodes, ultimately leading to the terminal nodes or leaves of the tree. Each node in the tree represents a feature, each link or branch represents a decision, & each leaf represents an outcome (a category or continuous value of the response variable).
Pros & Cons
The simplicity of Decision Trees lies in the way the model is created: by determining the most significant feature at each point. Since the algorithm assumes no particular relationship between the variables, linear or otherwise, it can be applied to virtually any set of data. In addition, extensive data manipulation is not required before applying Decision Trees, unlike with many other algorithms.
It is sometimes referred to as a greedy algorithm, because at every point it tries to minimize the cost function to the maximum possible extent. This aggressive minimization can lead to overfitting of the training data, which in turn leads to high variance when predicting on the test data. Techniques such as Pruning or Bagging are often applied to account for this.
Types of Decision Trees
Depending on the cost minimization technique being used, Decision Trees can be classified into several types, two significant ones being:
- CART (Classification and Regression Tree) — Uses Gini Impurity measure to calculate the Information Gain at each iteration
- ID3 (Iterative Dichotomiser 3) — Uses Entropy function to calculate the Information Gain metric
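As a quick illustration of the CART criterion above, the Gini Impurity of a node can be sketched as follows (a minimal helper of my own, assuming class counts as input — not a function from any library):

```python
def gini_impurity(counts):
    """Gini Impurity of a node: 1 - sum(p_i^2) over its class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# 0 for a pure node, maximal (0.5 for two classes) for an even split
print(gini_impurity([20, 0]))   # 0.0
print(gini_impurity([10, 10]))  # 0.5
```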
Here, we will take a look at the Entropy function for ID3 Decision Trees & devise an algorithm to calculate the entropy at any iteration.
Entropy & Information Gain
The entropy of each unique value of a feature is calculated as:

E(S) = -Σ pᵢ log(pᵢ)

where pᵢ is the proportion of records in subset S that belong to class i (the natural logarithm is used throughout this article).

The Information Gain for the feature is then calculated as:

IG(T, X) = E(T) - Σ (|Tᵥ| / |T|) * E(Tᵥ)

where E(T) is the entropy of the response variable & Tᵥ is the subset of records for which feature X takes value v.
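A minimal sketch of these two formulas in Python (assuming class counts as input & the natural logarithm, which is what the worked numbers in this article use):

```python
import numpy as np

def entropy(counts):
    """E(S) = -sum(p_i * ln(p_i)), where p_i = class count / subset size."""
    n = sum(counts)
    return -sum((c / n) * np.log(c / n) for c in counts if c > 0)

def information_gain(target_entropy, branches):
    """IG = E(T) - sum((|T_v| / |T|) * E(T_v)) over each branch's class counts."""
    total = sum(sum(b) for b in branches)
    info = sum((sum(b) / total) * entropy(b) for b in branches)
    return target_entropy - info

# Counts from the worked "color" example below: 19/13 TRUE/FALSE among
# YELLOW balloons & 12/16 among PURPLE, with E(Inflated) = 0.693
print(round(entropy([19, 13]), 3))  # 0.675
```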
Implementation
We will use the Balloons dataset from the UCI data repository here. It represents the different conditions of an experiment, used to determine the response variable “Inflated” in terms of the 4 predictor features: color, size, act & age.
# data   = Balloons dataset (as a pandas DataFrame)
# N      = number of columns
# target = response variable
# total  = total number of records
# en     = entropy of the target variable
# cats   = dictionary of counts of unique values for the response variable
for i in range(0, N - 1):
    x = data.columns[i]
    vals = data[x].value_counts().to_dict()  # counts of unique values for current feature
    info = 0
    for k, v in vals.items():
        ent = 0
        for k1 in cats.keys():
            n = data.loc[(data[target] == k1) & (data[x] == k), x].count()
            if n > 0:
                ent = ent + -(n / v) * np.log(n / v)  # entropy contribution of class k1
        info = info + (v / total) * ent  # weighted entropy (Information) for the feature
    gain = en - info  # Information Gain
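To see the loop in action end-to-end, here is a self-contained sketch run on a toy stand-in for the dataset (the column names follow the article, but the values below are illustrative, not the actual Balloons records):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Balloons dataset
data = pd.DataFrame({
    "color":    ["YELLOW", "YELLOW", "PURPLE", "PURPLE", "YELLOW", "PURPLE"],
    "inflated": ["T",      "T",      "F",      "F",      "F",      "T"],
})

target = "inflated"
total = len(data)
cats = data[target].value_counts().to_dict()
en = -sum((c / total) * np.log(c / total) for c in cats.values())  # E(target)

gains = {}
for x in [c for c in data.columns if c != target]:
    vals = data[x].value_counts().to_dict()  # counts of unique values for feature x
    info = 0
    for k, v in vals.items():
        ent = 0
        for k1 in cats:
            n = data.loc[(data[target] == k1) & (data[x] == k), x].count()
            if n > 0:
                ent += -(n / v) * np.log(n / v)
        info += (v / total) * ent
    gains[x] = en - info  # Information Gain of feature x w.r.t. the target

print(gains)
```

The feature with the largest value in `gains` would become the node for that iteration.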
Behind the first iteration
Let’s take a look at how Entropy & Information Gain are calculated for the 1st iteration, using the above function:
- Calculate Entropy & Information Gain w.r.t. “Inflated”
Column “color”:
‘YELLOW’: 32, ‘PURPLE’: 28
“Color” YELLOW with “Inflated” TRUE — 19
“Color” YELLOW with “Inflated” FALSE — 13
“Color” PURPLE with “Inflated” TRUE — 12
“Color” PURPLE with “Inflated” FALSE — 16
E(YELLOW) = (-19/32)*log(19/32) + (-13/32)*log(13/32) = 0.675
E(PURPLE) = (-12/28)*log(12/28) + (-16/28)*log(16/28) = 0.682
I(Color) = (32/60)*0.675 + (28/60)*0.682 = 0.678
IG(Color) = E(Inflated) - I(Color) = 0.693 - 0.678 = 0.0149

- Similarly, calculate Entropy & Information Gain for the remaining columns:
IG(Size) = 0.0148
IG(Act) = 0.131
IG(Age) = 0.130

- Column “Act” is selected as the root node, since it has the highest Information Gain
Next Steps
The algorithm would then perform the following steps recursively to construct the Decision Tree (beyond the scope of this article):
- The feature with the highest Information Gain would be assigned as the node for that iteration
- The branches from that node would be formed from each unique value (condition/decision) possible at that node
- The branches would lead to other nodes, depending on the subsequent features & conditions
- In case there are no further possible features or conditions, leaf nodes would be created & no further branching would be done
In this way, the Decision Tree would be built recursively. The model can then be applied to predict values or categories for the response variable