A Detailed Explanation of the Math Behind Building Decision Trees (for intermediate-level learners)
Before diving into decision trees, you should be familiar with 2 things:
- What is classification?
- What is regression?
Just go through my previous post to get familiar with these 2 topics.
What is a decision tree?
A decision tree is a supervised machine learning algorithm with a tree-like structure that can be used for both classification and regression problems.
So now you know which category the decision tree belongs to and what it is…
As shown in the diagram above, the topmost node is the root node. Decision nodes are also called internal nodes or just ‘nodes’. Terminal nodes are also called leaf nodes.
A classic way to build a decision tree is the ID3 algorithm. According to this algorithm, the tree is built by choosing splits that decrease entropy, which is the same as increasing the information gain (I’ll explain what that is).
But here, I am not explaining from the perspective of information gain (with the formula of I.G.), nor from the perspective of CHAID trees (for more details about CHAID trees, go through this thing); I’ll use Gini impurity instead.
Note:- From here, the practical part begins. Keep a sheet of paper beside you and work out every step I explain. In this article I do only a small amount of the math myself and leave much of the implementation for you to work out on paper; if you don’t work it out yourself, you won’t get much use from this article, so follow each step thoroughly and concentrate while solving the math.
Let’s jump into the implementation now:
Here, I’ll explain it with an example. Let’s consider 4 categorical features that predict whether a person gets a bank loan or not. (I am using only categorical features to make the concept easy to understand; in the real world, you’ll rarely encounter a dataset with only categorical features, so I’ll explain how to deal with a mix of categorical and numerical features in my future posts.)
Let’s take features like Educated, Gender, Married, and Self-employed…
So, for understanding, just keep this in mind:
Information gain:- a quantity that should be high for a feature in order to place that feature near the top of the tree. You’ll get more information about information gain here.
So for implementation, let’s consider-
- Educated (classes: Yes, No)
- Married (classes: Yes, No)
- Self-employed (classes: Yes, No)
- Gender (classes: Male, Female)
- Will get a loan or not (classes: Yes, No) (our output)
Now, from the training data, count for every class of every feature (Yes/No, Male/Female) how many people get a loan and how many don’t.
Say Educated has 6 Yes’s and 6 No’s. Among the 6 Yes’s of Educated, suppose 4 people get a loan and 2 don’t. Among the 6 No’s of Educated, suppose 3 people get a loan and 3 don’t.
Like this, split the output (getting a loan or not) with respect to every feature (Educated, Married, Self-employed, Gender).
When we do this, we get 4 pairings (Educated-output, Married-output, Self-employed-output, Gender-output), i.e., the output counts broken down by the classes of each feature.
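As a quick sketch of this counting step, here is plain Python on made-up toy rows that reproduce the hypothetical Educated counts above (the data itself is invented for illustration):

```python
from collections import Counter

# Toy, made-up (educated, gets_loan) pairs matching the Educated example:
# 4 loan-yes / 2 loan-no for Educated = Yes, 3 / 3 for Educated = No.
rows = [("Yes", "Yes")] * 4 + [("Yes", "No")] * 2 \
     + [("No", "Yes")] * 3 + [("No", "No")] * 3

# Count loan outcomes per class of the Educated feature.
counts = {}
for educated, loan in rows:
    counts.setdefault(educated, Counter())[loan] += 1

print(counts["Yes"])  # Counter({'Yes': 4, 'No': 2})
print(counts["No"])   # Counter({'Yes': 3, 'No': 3})
```

The same loop works for Married, Self-employed, and Gender; only the first element of each pair changes.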
Now, we need to judge which of these 4 pairings gives the best feature, the one for which information gain is highest (entropy is lowest).
The quantity we measure here is impurity.
So, for judging this, we have 3 measures:
- Entropy (used for information gain).
- Gini impurity.
- Chi-square test (used for CHAID decision trees).
Here, we’ll use one of the most popular measures, i.e., Gini impurity.
Let’s apply the formula below to the Educated-output diagram for its 2 leaves.
The Gini impurity formula is: Gini = 1 - (probability of ‘yes’)² - (probability of ‘no’)².
Here, ‘yes’ and ‘no’ refer to getting the loan or not.
So, if we calculate:
For the left leaf (Educated = Yes, with 4 yes’s and 2 no’s of getting the loan),
Gini = 1 - (4/(4+2))² - (2/(4+2))²
Calculate for the right leaf as well.
Now, calculate the weighted average of the impurities of the 2 leaves:
W.A. = (left-leaf samples / total samples) * (impurity of left leaf) + (right-leaf samples / total samples) * (impurity of right leaf)
Here, the left leaf holds the Educated = Yes samples (6 of them) and the right leaf holds the Educated = No samples (6 of them).
You’ll then get a value of W.A. for Educated-output.
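The calculation above can be sketched in a few lines of Python, using the hypothetical counts from the Educated example:

```python
# Gini impurity of one leaf from its yes/no counts.
def gini(yes, no):
    total = yes + no
    return 1 - (yes / total) ** 2 - (no / total) ** 2

left = gini(4, 2)    # Educated = Yes leaf: 4 get the loan, 2 don't
right = gini(3, 3)   # Educated = No leaf: 3 get the loan, 3 don't

# Weight each leaf by its share of the 12 samples (6 in each leaf).
wa = (6 / 12) * left + (6 / 12) * right
print(round(left, 4), round(right, 4), round(wa, 4))  # 0.4444 0.5 0.4722
```

Note that the right leaf, with a perfect 3/3 split, hits the maximum impurity of 0.5 for two classes.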
Do the same procedure for the remaining pairs as well (work it out first).
The pair with the lowest impurity (highest information gain) gives the feature that becomes the root node of the decision tree.
Let’s say the impurity of Educated-output turned out to be the lowest compared to the other pairs.
Then the root node of the decision tree is Educated.
You know that the Educated column has 6 Yes’s and 6 No’s.
So in the tree being built, the left branch (Yes’s) gets 6 samples and the right branch (No’s) gets 6 samples.
Now repeat the same procedure, from forming the pairs (Married-output, Self-employed-output, Gender-output) onward, for the left and right branches of the root node (Educated), and keep splitting only as long as the impurity after a split is lower than the impurity before it…
This is the procedure for building the decision tree.
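The whole procedure can be sketched as a small recursive function. This is a toy illustration for categorical features only; the function names and the stopping rule are my own for this sketch, not a library API:

```python
from collections import Counter

def gini_of(labels):
    # Gini impurity of a list of class labels.
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_gini(rows, labels, feature):
    # Split labels by the feature's classes, then weight each
    # group's impurity by its share of the samples.
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    n = len(labels)
    return sum(len(g) / n * gini_of(g) for g in groups.values())

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def build(rows, labels, features):
    # Leaf: node is pure or no features remain.
    if not features or gini_of(labels) == 0:
        return majority(labels)
    # Pick the feature whose split gives the lowest weighted impurity.
    best = min(features, key=lambda f: weighted_gini(rows, labels, f))
    if weighted_gini(rows, labels, best) >= gini_of(labels):
        return majority(labels)  # splitting no longer lowers impurity
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[best], []).append((row, label))
    rest = [f for f in features if f != best]
    subtree = {}
    for value, pairs in groups.items():
        subtree[value] = build([r for r, _ in pairs],
                               [l for _, l in pairs], rest)
    return {best: subtree}
```

Calling `build(rows, labels, ["Educated", "Married", "Self-employed", "Gender"])` with rows as dicts of feature values returns a nested dict whose top-level key is the chosen root feature.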
Code snippets for the Decision tree:-
As said in the introduction, a decision tree can work as both a classifier and a regressor. Here are code snippets showing how to fit a decision tree on training data and predict on test data.
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
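Putting it together on a tiny made-up dataset (the 0/1 values are an invented encoding of the four categorical features above, purely for illustration):

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Columns: Educated, Married, Self-employed, Gender (1 = Yes/Male, 0 = No/Female).
# The data is fabricated just to make the snippet runnable.
X = [[1, 1, 0, 1], [1, 0, 0, 0], [0, 1, 1, 1], [0, 0, 1, 0],
     [1, 1, 1, 1], [0, 0, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1]]
y = [1, 1, 0, 0, 1, 0, 1, 0]   # 1 = gets the loan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(len(y_pred) == len(y_test))  # True
```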
I hope you understood the core concept of building a decision tree.
I’ll explain the applying of a decision tree with a mix of categorical and numerical features in my future posts.
Happy Reading! ✌✌