A very basic introduction to Decision Trees and its Math.
What is covered :
- Decision Trees
- ID3 algorithm
- Mathematical calculations for calculating Entropy and Information Gain.
What is a Decision tree:
Decision trees are extensively used classifiers in data mining to predict the target variable for a given input variable.
It provides wonderful transparency for human interpretation by graphically representing the process of decision-making. Using this, we can predict both categorical variable (classification tree) and a continuous variable (regression tree).
It does this by creating a flow chart like structure, and an if-else condition is applied at the nodes on the attributes (if male or female?). From here the condition outcome is represented by branches and class label is represented by the leaf node. So, a rule is created at the decision nodes and a result is delivered at the leaf node. The entire flow is from root to leaf.
As shown in the top right side, we divide the predictor space into N distinct and non-overlapping regions. For any test observation, if we want to predict the target variable, we will simply consider the mean of the training values that fall in this region. This is for the regression tree.
For classification tree, we take the mode.
There is a wide variety of decision trees, such as ID3 (Iterative Dichotomiser 3), CART (Classification and Regression Tree), CHAID (Chi-squared Automatic Interaction Detector), etc. Not to get intimidated by their names, they’ve small variations, like one is the updated version of the other (ID3 and C4.5) and can handle numeric variables.
How does it works?
- Starts at the root node
- Splits data into groups (based on some criteria, we will see this later)
- Set a decision at node
- Move the data along the respective branches
- Repeat the process until a stopping criterion is met (max levels/depth reached, min samples left to split, nothing left to split, etc)
How to choose root node:
This is slightly different in regression and classification trees.
In regression trees, we chose a splitting point such that there is the greatest reduction in RSS (Residual Sum of Squares).
Or we can calculate standard deviation reduction of the feature with respect to the training data. Here YR1 and YR2 are mean responses of region 1 and 2. Once we train the tree, we predict the response for a test data using the mean of the training observations in that group.
In Classification trees:
We use Entropy and Information Gain (in ID3). Gini Index for classification in the CART.
Entropy (Shannon’s Entropy) quantifies the uncertainty of chaos in the group. Higher entropy means higher the disorder. It is denoted by H(x), where x is a vector with probabilities p1, p2, p3…..
From the adjacent figure, we can see that the entropy (uncertainty) is highest (1) when the probability is 0.5, i.e. 50–50 chances. And entropy is lowest when the probability is 0 or 1, i.e. there is no uncertainty or high chance of occurrence.
So, entropy is maximum if in a class there are an equal number of objects from different attributes (like the group has 50 cats and 50 dogs), and this is minimum if the node is pure (like the group has only 100 cats or only 100 dogs). We ultimately want to have minimum entropy for the tree, i.e. pure or uniform classes at the leaf nodes.
S — Current group for which we are interested in calculating entropy.
Pi — Probability of finding that system in ith state, or this turns to the proportion of a number of elements in that split group to the number of elements in the group before splitting(parent group).
In this classification tree, while splitting the tree we select those attributes that achieves the greatest reduction in entropy. Now, this reduction (or change) in entropy is measured by Information Gain
We consider some toy problem. The problem is about predicting whether some kid is going to eat a particular type of food given that kids prior eating habits.
(As a side note, there is no taste like spicy, but just let’s consider that in this stupid kid’s bizarre eating habits preferences. 😀 )
From the above chart, we can see that the food preferences Taste, Temperature and Texture are exploratory variables and Eat (Yes/No) is target variable.
Now, we need to construct a top-down decision tree that splits the dataset and finally form a pure group, so we can predict for a new test variable if the kid eats or not.
We are going to use the ID3 algorithm for this. First, we look at the math involved in this, later we develop an algorithm from scratch.
The Math behind scenes:
Entropy of the total attribute at the root node
No. of ‘NO’ → 4
No. of ‘YES’ → 6
No. of objects → 10
Ni→No. of ith elements in group
N →Total no. of elements in group
For E_Cold, same number of ‘YES’ and ‘NO’ are there. So, entropy will be very high → 1.
So, after our calculations, it seems that we should split the data based on Taste, as it has highest information gain.
Now we need to find the attributes to split at second level nodes for the Salty and the Spicy branches.
For Temperature(2nd level) in the Salty branch:
For Texture (2nd level) in the Salty branch:
So, splitting based on texture sounds a good option, as it has higher information gain.
For Temperature(2nd level) in the Spicy branch:
For Texture(2nd level) in the Spicy branch:
Both the attributes generated same Information Gain. So, we can split with any attribute.
We’ll choose Temperature to split.
Then the only option left is Texture to split.
Tables left after Temperature split, for both the branches after Texture split it will be pure homogeneous groups.
Finally, what we can conclude is, if the food is sweet the kid is not caring about its Temperature or Texture, he is eating.
If the food is Salty he is eating only if the texture is hard. And if the food is Spicy he eats if it is Hot and Hard or Cold and Soft. Bizarre Kid. 😈
Next post will write building this tree using Python. Stay tuned.
Thank you for your patience 😊.