Understanding the Splitting Mechanism of Decision Trees in Machine Learning and their Relationship with Entropy in Thermodynamics
Did you know that Decision Trees, along with Bagging and Boosting, are among the most widely used techniques in data science (excluding Deep Learning)? Decision Trees can handle both regression and classification problems, making them a versatile tool for data analysis.
At its core, a decision tree starts with the original data as the root node and grows by creating nodes that group similar outputs together. The algorithm uses measures such as Gini, Information Gain, and Chi-Square to determine the best variable to split on. Let’s take a closer look at how this works for both classification and regression modeling.
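In practice these criteria are rarely computed by hand. Here is a minimal sketch of how they map onto scikit-learn (assuming scikit-learn is installed; note that Chi-Square splitting, as used in CHAID, is not available there, and the tiny feature matrix and targets below are invented purely for illustration):

```python
# A minimal sketch of how the splitting criteria map onto scikit-learn (assumed installed).
# The tiny feature matrix and targets below are invented purely for illustration.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = np.array([[21, 0], [25, 1], [30, 0], [35, 1]])   # e.g. Age and an encoded Sex column
y_class = np.array([0, 0, 1, 1])                     # categorical target (e.g. graduated or not)
y_cont = np.array([0.0, 0.0, 1.0, 1.0])              # continuous target

# Categorical target: the split criterion can be Gini or entropy (Information Gain)
clf = DecisionTreeClassifier(criterion="gini", max_depth=3).fit(X, y_class)

# Continuous target: the tree splits by variance (squared-error) reduction
reg = DecisionTreeRegressor(criterion="squared_error", max_depth=3).fit(X, y_cont)
```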
In real-life situations, problems can be complex and classification may depend on multiple independent variables. For example, consider using Age, Sex, and Location to determine if someone has completed their graduation. While it might seem tempting to split the data based on Sex, we quickly realize that we need to further split the data on other variables to get a better classification.
As we continue to split the data, we aim to create “perfect splits” where each bucket contains only one type of output. This is called a stable state, where there is no further chance of becoming more stable. Once we find a perfect split, we no longer need to consider the other variables as one variable is enough to classify or predict the output.
By understanding the splitting criteria for decision trees, we can make more informed decisions about how to analyze and interpret our data.
Splitting Criteria of a Decision Tree:
The decision tree’s splitting criteria is a crucial aspect of data analysis that demands our full attention. Picture this: you have several variables that dictate the target variable, but it’s difficult to determine the output classification using just one of them. If one variable can perfectly classify the output, the other variables become redundant. However, the real world is complex, and classification often depends on multiple independent variables.
For instance, let’s consider Age, Sex, and Location as independent variables, with the output being whether a person has completed graduation or not. Initially, you might be tempted to split the data based on Sex. However, upon examination, you’ll realize that this does not produce two distinct buckets, one containing only those who completed graduation and the other only those who didn’t. This lack of a perfect split means that further splitting on other variables is necessary for a better classification.
Next, let’s explore splitting the data based on Age. Suppose we find that if Age <22, no one has completed graduation, while if Age>22, everyone has completed graduation. This is what we call a perfect split, where each bucket has only one type of output. It’s a perfectly stable state with zero energy, meaning there’s no further chance of becoming more stable. With this perfect split, we don’t need to consider other variables for classifying or predicting the output since one variable is enough.
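The idea of a perfect split can be checked directly in code: a bucket is pure when every record in it carries the same label. A small sketch with made-up ages and graduation labels:

```python
# Tiny illustration of a "perfect split": both buckets end up with a single label.
# The ages and graduation labels here are invented for illustration only.
people = [(18, "no"), (20, "no"), (21, "no"), (24, "yes"), (30, "yes"), (45, "yes")]

left = [label for age, label in people if age < 22]    # all "no"
right = [label for age, label in people if age >= 22]  # all "yes"

def is_pure(labels):
    """A bucket is pure when it contains only one distinct label."""
    return len(set(labels)) == 1

print(is_pure(left), is_pure(right))  # True True -> a perfect split on Age
```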
Understanding the decision tree’s splitting criteria is fundamental to unraveling complex data structures and making sound decisions based on data analysis.
First Split on Age:
Now consider a new example of 100 people where the target variable is whether a person is married or not (50 married and 50 unmarried). Splitting on Age gives:
Bucket 1: Age < 28 has 60 people, of which 40 are unmarried and 20 are married.
Bucket 2: Age >= 28 has 40 people, of which 30 are married and 10 are unmarried.
Why have I taken Age = 28 as the boundary? Park this question for now; it is answered at the end.
This split has certainly released some energy, otherwise it could not have happened. Imagine the energy level has dropped to 70% of the original; exactly how much it drops is described later. Because the split releases energy, prediction/classification becomes easier: the probability of predicting the majority class with certainty rises from 0.5 to 0.67 (40/60) in one bucket and from 0.5 to 0.75 (30/40) in the other.
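These “surety” probabilities fall straight out of the bucket counts. A short sketch reproducing them (the counts are the ones from the Age split above):

```python
# Reproducing the "surety" probabilities quoted above from the bucket counts.
bucket1 = {"unmarried": 40, "married": 20}   # Age < 28
bucket2 = {"married": 30, "unmarried": 10}   # Age >= 28

def majority_probability(counts):
    """Probability of the majority class inside a bucket."""
    return max(counts.values()) / sum(counts.values())

print(round(majority_probability(bucket1), 2))  # 40/60 ~ 0.67
print(round(majority_probability(bucket2), 2))  # 30/40 = 0.75
```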
The next split happens on Sex or Location in each bucket, depending on the dataset, and these splits reduce the energy of the buckets further, say to 50%. Every further split must decrease the energy again. When the splitting stops can be controlled externally through constraints; if no constraint is given, the tree keeps splitting until there is no further benefit from a split (no further reduction in the mix of outputs).
How to decide which variable the split will happen on first, and in which order:
The first split happens on the variable with the maximum capability to reduce the energy/variance/mix of outputs. How much each variable reduces it is measured numerically by one of four standard techniques.
If the target variable is categorical, the methods are: i) Gini, ii) Information Gain, iii) Chi-Square.
If the target variable is continuous, the method is Variance Reduction.
Let’s discuss the Gini Index: this, again, expresses numerically the reduction in energy/the reduction in the mix of outputs.
In this method we calculate the Gini of the child nodes produced by splitting on each variable, one variable at a time; the split happens on the variable that gives the maximum weighted Gini.
Formula:
Gini = p² + q²
p = probability of the favorable outcome in the bucket.
q = probability of the unfavorable outcome in the bucket.
Weight: the fraction of the data that falls into each bucket after the split.
(With this definition, a purer bucket gives a higher Gini, so the best split is the one with the maximum weighted Gini.)
Let’s calculate Gini for our example:
Bucket 1: Age < 28 has 60 people, of which 40 are unmarried and 20 are married.
Bucket 2: Age >= 28 has 40 people, of which 30 are married and 10 are unmarried.
For Bucket 1: p = 1/3 (married) and q = 2/3 (unmarried), Gini = 1/9 + 4/9 = 5/9
Weight = 60/100 = 3/5
Weighted Gini = 3/5 * 5/9 = 1/3 ≈ 0.33
For Bucket 2: p = 3/4 (married), q = 1/4 (unmarried)
Gini = 9/16 + 1/16 = 10/16
Weight = 40/100 = 2/5
Weighted Gini = 2/5 * 10/16 = 1/4 = 0.25
Weighted Gini for the Age split = 0.33 + 0.25 ≈ 0.58
Similarly, the weighted Gini is calculated for the split on every variable, and the variable with the maximum weighted Gini is the champion and splits the dataset first.
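A short sketch reproducing the weighted Gini for the Age split, using the definition above (Gini = p² + q², so a purer bucket scores higher):

```python
# Weighted Gini for the Age split, using the definition above: Gini = p^2 + q^2.
def gini(p):
    """Gini score of a two-class bucket, where p is the probability of the favorable class."""
    q = 1 - p
    return p ** 2 + q ** 2

n1, n2 = 60, 40                       # bucket sizes for Age < 28 and Age >= 28
g1 = gini(20 / 60)                    # 5/9  ~ 0.556 (20 of 60 are married)
g2 = gini(30 / 40)                    # 10/16 = 0.625 (30 of 40 are married)
weighted = (n1 / 100) * g1 + (n2 / 100) * g2
print(round(weighted, 2))             # ~0.58; the variable with the largest weighted Gini splits first
```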
Information Gain: (concept of Entropy)
In thermodynamics, Entropy measures the randomness of a system: the higher the entropy, the greater the randomness. In a decision tree, randomness means more variance in the target variable, or a richer mix of the different categories/levels of the target variable. If, after a split, a bucket contains only one category/level, its randomness is zero and hence its entropy is also zero.
Before the first split the system has some entropy, say E; when the split produces two or more buckets (say two, for simplicity), their entropies are E1 and E2. For the split to be worthwhile, the weighted sum of E1 and E2 must be less than E; there is no point in a split that increases the entropy of the system. Now, what is Information Gain? Consider our example again, where the target variable is whether a person is married or not. We have 50 married and 50 unmarried people; this is the case of maximum entropy, 1, because it is an equal mix of both outcomes. A split on Age, Sex, or Location should only happen if the weighted sum of the entropies of the resulting buckets is less than the entropy of the parent node.
If the entropy of the parent node is at its maximum, i.e., 1, there is more room for a split that removes a good amount of entropy. The difference between the entropy of the parent node and the weighted sum of the entropies of the child nodes is termed Information Gain. It signifies how much information a particular split has gained; the larger the difference in entropy, the more information we have gained.
Now let’s look at it as a thermodynamic system. Let the randomness (entropy) of the system before the split be E, and let its content be distributed into two individual systems with entropies E1 and E2, where m, m1, and m2 are the amounts of content (the weights) of the parent and the two new systems. If mE - (m1E1 + m2E2) > 0, the combined randomness of the two individual systems is less than that of the parent. When the entropy of a system decreases, we are reaching a more stable state (less randomness), which is our goal, and the same holds for tree-based modeling in data science.
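The acceptance condition above, split only when the weighted entropy of the children is lower than the entropy of the parent, can be wrapped in a small helper. A minimal sketch (the function names are my own):

```python
# The split-acceptance condition: a split is only worthwhile when the size-weighted
# entropy of the child nodes is lower than the entropy of the parent node,
# i.e. when the information gain is positive. Helper names here are my own.
from math import log2

def entropy(counts):
    """Entropy of a node from its class counts: -sum(p * log2(p))."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

# Accept the Age split only if the gain is positive (it is, for the buckets above).
print(information_gain([50, 50], [[40, 20], [10, 30]]) > 0)  # True
```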
Let’s Calculate Information Gain in our Example:
Entropy of a node = -p*log2(p) - q*log2(q)
Bucket 1: Age < 28 has 60 people, of which 40 are unmarried and 20 are married.
Bucket 2: Age >= 28 has 40 people, of which 30 are married and 10 are unmarried.
Entropy for Bucket 1:
p = 40/60 = 2/3, q = 20/60 = 1/3
Entropy = -2/3*log2(2/3) - 1/3*log2(1/3) ≈ 0.918
Weighted Entropy = 60/100 * 0.918 ≈ 0.551
Entropy for Bucket 2:
p = 3/4, q = 1/4
Entropy = -3/4*log2(3/4) - 1/4*log2(1/4) ≈ 0.811
Weighted Entropy = 40/100 * 0.811 ≈ 0.324
Sum of weighted Entropies = 0.551 + 0.324 = 0.875
And hence the Information Gain = Parent Node Entropy (which is 1, a perfect mix) - Sum of weighted Entropies = 1 - 0.875 = 0.125.
Let’s calculate the Entropy of the Parent Node:
Parent Node: 50 Married and 50 Unmarried
p = 50/100, q = 50/100
Entropy = -0.5*log2(0.5) - 0.5*log2(0.5) = 1
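A short sketch reproducing the entropy and Information Gain numbers above for the Age split:

```python
# Reproducing the entropy and information-gain numbers above for the Age split.
from math import log2

def node_entropy(p, q):
    return -p * log2(p) - q * log2(q)

e_parent = node_entropy(0.5, 0.5)       # 1.0    (50 married, 50 unmarried)
e_bucket1 = node_entropy(2 / 3, 1 / 3)  # ~0.918 (40 unmarried, 20 married)
e_bucket2 = node_entropy(3 / 4, 1 / 4)  # ~0.811 (30 married, 10 unmarried)

weighted_children = 0.6 * e_bucket1 + 0.4 * e_bucket2   # ~0.875
info_gain = e_parent - weighted_children                # ~0.125
print(round(weighted_children, 3), round(info_gain, 3))
```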
The above two techniques apply to categorical target variables; for a continuous target variable they don’t work. In that case the split is based on the reduction of variance in the target variable.
Now consider the same Married/Unmarried example but treat the target variable as continuous, i.e., Married has the value 1 and Unmarried has the value 0.
Again consider the same split on Age, with the same buckets (nodes):
Bucket 1: Age < 28 has 60 people, of which 40 are unmarried and 20 are married.
Bucket 2: Age >= 28 has 40 people, of which 30 are married and 10 are unmarried.
Let’s Calculate the Variance of the Parent Node.
Mean of Parent Node = (50*1 + 50*0)/100 = 0.5
Variance = (50*(1 - 0.5)² + 50*(0 - 0.5)²)/100 = 0.25
Now the Variance of Bucket 1:
Mean = (40*0 + 20*1)/60 = 1/3 ≈ 0.33
Variance = (40*(0 - 0.33)² + 20*(1 - 0.33)²)/60 ≈ 0.22
Weighted Variance = 60/100 * 0.22 ≈ 0.133
The Variance of Bucket 2:
Mean = (30*1 + 10*0)/40 = 0.75
Variance = (30*(1 - 0.75)² + 10*(0 - 0.75)²)/40 = 0.1875
Weighted Variance = 40/100 * 0.1875 = 0.075
Sum of weighted Variances of the child nodes = 0.133 + 0.075 ≈ 0.208
The sum of the weighted variances of the child nodes (≈ 0.208) is less than the variance of the parent node (0.25). The algorithm tests each variable and splits the dataset on the variable for which the reduction in variance is largest. The greater the reduction in variance, the closer we move towards a stable system, which is more desirable and reduces the randomness.
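A short sketch reproducing the variance-reduction numbers above (married coded as 1, unmarried as 0):

```python
# Reproducing the variance-reduction numbers above (married coded as 1, unmarried as 0).
def variance(values):
    """Population variance of a list of numeric target values."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

parent = [1] * 50 + [0] * 50      # 50 married, 50 unmarried
bucket1 = [0] * 40 + [1] * 20     # Age < 28
bucket2 = [1] * 30 + [0] * 10     # Age >= 28

weighted_children = (len(bucket1) / 100) * variance(bucket1) \
                  + (len(bucket2) / 100) * variance(bucket2)
print(variance(parent))               # 0.25
print(round(weighted_children, 3))    # ~0.208, so the split reduces the variance
```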
Some Important Points of the Decision Tree:
- When a split occurs on a categorical variable, subsequent splits on that variable cannot happen further down the same path, because all of its levels have already been distributed into different nodes. The variable therefore loses its potential for further splitting.
- If a split occurs on a continuous variable, further splits can happen on the same variable, but not consecutively. The algorithm has already created nodes at the previous split, and an immediate second split on the same variable down the same path would be redundant, so it looks for other variables or branches to perform further splits if required.
While splitting on a continuous variable, how does the algorithm choose the boundary to split at?
When a decision tree algorithm splits on a continuous variable, it searches for the best boundary at which to split. For example, if the variable is Age, the algorithm evaluates candidate boundary values using the chosen criterion (Gini, Information Gain, reduction in variance, or Chi-Square) and picks the value that optimizes it. In our example, the algorithm chose Age = 28 as the best boundary for the split between married and unmarried individuals. If the algorithm later finds another desirable point of division, such as Age = 18, it can also split at Age = 18 further down the tree. It’s important to note that the algorithm avoids consecutive splits on the same continuous variable, as this would lead to overfitting.
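Here is a sketch of how such a boundary search might look: scan the midpoints between consecutive sorted values and keep the one with the highest Information Gain. This is an illustrative implementation, not the exact routine of any particular library, and the toy data at the bottom is invented:

```python
# A sketch of a boundary search for a continuous variable: try every midpoint between
# consecutive sorted values and keep the one with the highest information gain.
# Illustrative only; the toy ages/labels are invented, not taken from the example above.
from math import log2

def entropy(labels):
    total = len(labels)
    probs = [labels.count(c) / total for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def best_threshold(ages, labels):
    parent = entropy(labels)
    pairs = sorted(zip(ages, labels))
    best_gain, best_t = 0.0, None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                                 # no boundary between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2      # candidate midpoint
        left = [l for a, l in pairs if a < t]
        right = [l for a, l in pairs if a >= t]
        gain = parent - (len(left) / len(pairs)) * entropy(left) \
                      - (len(right) / len(pairs)) * entropy(right)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

ages = [20, 22, 24, 26, 27, 29, 31, 35, 40, 45]
labels = ["u", "u", "u", "m", "u", "m", "m", "u", "m", "m"]
print(best_threshold(ages, labels))   # the midpoint with the largest information gain
```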
Conclusion
Understanding the splitting mechanism of decision trees is essential for building accurate and efficient models in machine learning. By choosing the best variable to split on and the optimal boundary for the split, the algorithm can effectively reduce the energy or variance of the data and increase the accuracy of the predictions. Additionally, the concept of entropy in thermodynamics provides a useful analogy for understanding the decision tree algorithm and the process of reducing the mix of output. By applying these concepts, we can build powerful decision tree models that can effectively solve a wide range of prediction and classification problems.