Mathematical Insights: Decoding the Inner Workings of Decision Trees for Classification Problems
Unveiling the Mathematical Framework: A Deep Dive into Decision Trees for Accurate Classification in Machine Learning
What is a Decision Tree?
A decision tree is a supervised learning method that offers comparatively good accuracy and a self-explanatory model. It is built top-down with a greedy algorithm, and it can be applied to both Classification and Regression problems.
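In practice you rarely grow a tree by hand; as a quick illustration, here is a minimal sketch using scikit-learn's DecisionTreeClassifier (the tiny feature matrix and labels are made up purely for demonstration):

```python
# Minimal sketch: fitting a decision tree classifier with scikit-learn.
# The toy feature matrix and labels below are made up for demonstration only.
from sklearn.tree import DecisionTreeClassifier

X = [[0, 1], [1, 1], [1, 0], [0, 0]]   # two binary features
y = [1, 1, 0, 0]                        # binary target

clf = DecisionTreeClassifier(criterion="gini", max_depth=2)  # Gini split criterion
clf.fit(X, y)
print(clf.predict([[1, 0]]))            # predict the class of a new sample
```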
Terminologies for Decision Tree
1. Nodes: There are three types of nodes, namely:
- Root node: the topmost node, which holds the whole data-set or a sample of it. It splits into two sub-nodes.
- Child node: a sub-node of a parent node.
- Terminal node: a bottom node that cannot be split further.
2. Splitting: The nodes are split into sub-nodes using the Gini index, Information Gain, or Chi-square methods, which are discussed in the later sections.
Types of Decision Tree
Decision Trees work with two types of input variables:
1. Categorical input variables
2. Continuous input variables
Based on the target variable, there are two types of Decision Tree:
1. Categorical Variable Decision Tree: the probability of a sample belonging to each category is calculated, and the category is assigned accordingly.
2. Continuous Variable Decision Tree: the nearest category for the sample's value is found and assigned.
How to split the nodes for Classification problems?
1. Gini Index and Gini Impurity
The Gini index can be defined as the probability of a label being wrongly classified when it is chosen randomly from a feature.
Gini Index = 1 - ∑(p)², where p is the probability of each label in the feature.
The Gini index only produces binary splits; the result is either Success or Failure.
The lower the value of the index, the higher the homogeneity. Hence the feature with the lowest Gini index is selected as the root node.
Gini impurity measures how often a new random sample would be classified incorrectly if it were labelled at random according to the distribution of labels in the column.
Gini impurity = ∑ p(i) * (1 - p(i)), where i ranges over the labels.
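As a rough sketch in plain Python (the function names gini_index and gini_impurity are my own; probs is simply the list of label probabilities p):

```python
# Gini Index = 1 - sum(p^2) over the label probabilities.
def gini_index(probs):
    return 1 - sum(p ** 2 for p in probs)

# Gini impurity = sum(p * (1 - p)); algebraically this equals 1 - sum(p^2).
def gini_impurity(probs):
    return sum(p * (1 - p) for p in probs)

print(round(gini_index([0.5, 0.5]), 2))     # 0.5  -> a perfectly mixed node
print(round(gini_impurity([0.6, 0.4]), 2))  # 0.48 -> matches the Trading column example below
```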
Sounds pretty complicated, doesn't it?
Let us work through an example with a small data-set of 10 observations, where the features are Trading (High/Low), Trend (+/-) and Recession, and the target column Bitcoin values is either Increase or Decrease.
Gini index and Gini Impurity for Trading column
1. P(Trading = High) = 6/10, P(Trading = Low) = 4/10
2. Label 'High' in the Trading column with respect to the Target column:
P(Trading = High & Bitcoin values = Increase) = 3/6
P(Trading = High & Bitcoin values = Decrease) = 3/6
Gini Index for label 'High' = 1 - [(3/6)² + (3/6)²] = 0.5
3. Label 'Low' in the Trading column with respect to the Target column:
P(Trading = Low & Bitcoin values = Increase) = 3/4
P(Trading = Low & Bitcoin values = Decrease) = 1/4
Gini Index for label 'Low' = 1 - [(3/4)² + (1/4)²] = 0.375
Gini Index of the Trading column = (6/10 * 0.5) + (4/10 * 0.375) = 0.45
Gini Impurity of the Trading column = [(6/10) * (1 - 6/10)] + [(4/10) * (1 - 4/10)] = 0.48
There is a 48% chance of incorrectly classifying new random data using the Trading column alone.
Gini index and Gini Impurity for Trend column
1. P(Trend = +) = 7/10, P(Trend = -) = 3/10
2. Label '+' in the Trend column with respect to the Target column:
P(Trend = + & Bitcoin values = Increase) = 4/7
P(Trend = + & Bitcoin values = Decrease) = 3/7
Gini Index for label '+' = 1 - [(4/7)² + (3/7)²] = 0.490
3. Label '-' in the Trend column with respect to the Target column:
P(Trend = - & Bitcoin values = Increase) = 2/3
P(Trend = - & Bitcoin values = Decrease) = 1/3
Gini Index for label '-' = 1 - [(2/3)² + (1/3)²] = 0.444
Gini Index of the Trend column = (7/10 * 0.490) + (3/10 * 0.444) = 0.476
Gini Impurity of the Trend column = [(7/10) * (1 - 7/10)] + [(3/10) * (1 - 3/10)] = 0.42
There is a 42% chance of incorrectly classifying new random data using the Trend column alone.
We should choose the column with the minimum Gini index, and 0.45 < 0.476, so Trading is the best option for the split.
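These numbers are easy to verify in code; below is a small self-contained check, assuming the class counts implied by the probabilities above ("Inc"/"Dec" stand for Increase/Decrease):

```python
# Gini of a node = 1 - sum of squared class proportions.
def gini(labels):
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

# Weighted Gini of a split: each child node's Gini weighted by its share of the rows.
def weighted_gini(children):
    total = sum(len(c) for c in children)
    return sum(len(c) / total * gini(c) for c in children)

# Trading: High -> 3 Increase / 3 Decrease, Low -> 3 Increase / 1 Decrease
trading = [["Inc"] * 3 + ["Dec"] * 3, ["Inc"] * 3 + ["Dec"] * 1]
# Trend: '+' -> 4 Increase / 3 Decrease, '-' -> 2 Increase / 1 Decrease
trend = [["Inc"] * 4 + ["Dec"] * 3, ["Inc"] * 2 + ["Dec"] * 1]

print(round(weighted_gini(trading), 3))  # 0.45
print(round(weighted_gini(trend), 3))    # 0.476 -> Trading has the lower Gini index
```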
2. Information Gain
Entropy is a measure of how heterogeneous the data in a system is.
- Completely homogeneous data has an entropy of 0.
- A data-set split exactly half and half between two classes has an entropy of 1 (with logarithms taken to base 2; the worked example below uses base 10, where the maximum for two classes is 0.301).
The entropy of the target column is calculated in order to obtain the information gained from each feature.
Information gain = Entropy(target) - Weighted Entropy(feature)
The feature with the higher information gain will be selected for the split.
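Sketched in Python (the names entropy and information_gain are my own; log base 10 is used here only to match the worked example below, base 2 is just as common):

```python
import math

# Entropy of a set of class labels (log base 10, matching the calculations below).
def entropy(labels):
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log10(p) for p in probs if p > 0)

# Information gain of a split = parent entropy minus the size-weighted child entropy.
def information_gain(parent, children):
    total = sum(len(c) for c in children)
    weighted = sum(len(c) / total * entropy(c) for c in children)
    return entropy(parent) - weighted
```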
Let's do some math to get a good understanding:
From the above data-set:
1. Entropy(target)
P(Increase) = 6/10
P(Decrease) = 4/10
Entropy(Bitcoin values) = - [ (6/10) log(6/10) + (4/10) log(4/10) ] = 0.292
2. Entropy(features)
- Trading
P(High) = 6/10
P(Low) = 4/10
Calculate the entropy of C1 (the child node where Trading = High):
P(Increase) = 3/6 = 0.5
P(Decrease) = 3/6 = 0.5
Entropy(C1) = - [0.5 log(0.5) + 0.5 log(0.5)] = 0.301
Calculate the entropy of C2 (the child node where Trading = Low):
P(Increase) = 3/4 = 0.75
P(Decrease) = 1/4 = 0.25
Entropy(C2) = - [0.75 log(0.75) + 0.25 log(0.25)] = 0.244
Weighted Entropy(Trading) = [ (6/10) * 0.301 ] + [ (4/10) * 0.244 ] = 0.278
Information gain(Trading)
= Entropy(Bitcoin values) - Weighted Entropy(Trading)
= 0.292 - 0.278
= 0.014
- Trend
P(+) = 7/10
P(-) = 3/10
Calculate the entropy of C3 (the child node where Trend = +):
P(Increase) = 4/7 = 0.571
P(Decrease) = 3/7 = 0.429
Entropy(C3) = - [0.571 log(0.571) + 0.429 log(0.429)] = 0.297
Calculate the entropy of C4 (the child node where Trend = -):
P(Increase) = 2/3 = 0.667
P(Decrease) = 1/3 = 0.333
Entropy(C4) = - [0.667 log(0.667) + 0.333 log(0.333)] = 0.276
Weighted Entropy(Trend) = [ (7/10) * 0.297 ] + [ (3/10) * 0.276 ] = 0.291
Information gain(Trend)
= Entropy(Bitcoin values) - Weighted Entropy(Trend)
= 0.292 - 0.291
≈ 0.002 (about 0.0017 before rounding the intermediate values)
Information gain is highest for Trading (0.014 versus 0.002), hence it should be selected for the split.
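As a quick self-contained check of these figures (the helper H below is my own shorthand for entropy over a list of probabilities, again with log base 10):

```python
import math

def H(probs):
    # entropy with log base 10, as in the calculations above
    return -sum(p * math.log10(p) for p in probs if p > 0)

target = H([0.6, 0.4])                                        # Entropy(Bitcoin values)

gain_trading = target - (0.6 * H([3/6, 3/6]) + 0.4 * H([3/4, 1/4]))
gain_trend   = target - (0.7 * H([4/7, 3/7]) + 0.3 * H([2/3, 1/3]))

print(round(gain_trading, 3))  # ~0.014
print(round(gain_trend, 3))    # ~0.002 -> Trading gives the larger gain
```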
3. Chi-Square
The Chi-square test acts as a measure of the statistical significance of a feature: the higher the test value, the higher the statistical significance.
Chi-square = √((Actual - Expected)² / Expected), computed for each class in each child node and then summed.
Unlike the Gini index, it can produce multi-way splits.
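A rough sketch of how this score can be computed for a candidate split (the function name is my own; the expected counts assume an even 50/50 split of the target within each child node, which is what the worked example below appears to use):

```python
import math

# Chi-square score of a split: for every class in every child node, add
# sqrt((actual - expected)^2 / expected). Expected counts here assume an
# even split of the target classes inside each child node.
def chi_square_split(children):
    score = 0.0
    for counts in children:                 # e.g. (3, 1): 3 Increase, 1 Decrease
        expected = sum(counts) / len(counts)
        for actual in counts:
            score += math.sqrt((actual - expected) ** 2 / expected)
    return score
```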
Here is an example:
The Chi-square test is performed on the Recession column:
- Chi-Square(Increase) for Past = √((3 - 2)² / 2) = 0.707
- Chi-Square(Decrease) for Present = √((1 - 1.5)² / 1.5) = 0.408
Chi-Square is calculated in the same way for all the labels.
Chi-Square(Recession) = 0.707 + 0.707 + 0.408 + 0.408 + 0.408 + 0.408 = 3.046
Chi-square test on Trading:
Chi-Square(Trading) = 0 + 0 + 0.707 + 0.707 = 1.414
Chi-square test on Trend:
Chi-Square(Trend) = 0.267 + 0.267 + 0.408 + 0.408 = 1.35
The Recession feature has the highest Chi-square value, and hence the highest statistical significance, so it is preferred for the split.
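These totals can be double-checked with a short script; the per-node class counts below are inferred from the probabilities and figures above (the breakdown of the Recession column in particular is an inference, not taken verbatim from the original table):

```python
import math

def chi(actual, expected):
    # per-class value: sqrt((actual - expected)^2 / expected)
    return math.sqrt((actual - expected) ** 2 / expected)

# Recession child nodes: (3 Inc, 1 Dec), (2 Inc, 1 Dec), (1 Inc, 2 Dec); expected = half the node size
recession = chi(3, 2) + chi(1, 2) + chi(2, 1.5) + chi(1, 1.5) + chi(1, 1.5) + chi(2, 1.5)
# Trading child nodes: (3 Inc, 3 Dec), (3 Inc, 1 Dec)
trading = chi(3, 3) + chi(3, 3) + chi(3, 2) + chi(1, 2)
# Trend child nodes: (4 Inc, 3 Dec), (2 Inc, 1 Dec)
trend = chi(4, 3.5) + chi(3, 3.5) + chi(2, 1.5) + chi(1, 1.5)

print(round(recession, 2), round(trading, 2), round(trend, 2))  # ~3.05 1.41 1.35
```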
I'll be posting Decision Tree - Part 2, on splitting nodes for Regression problems.