Classification in Decision Tree — A Step by Step CART (Classification And Regression Tree)
Decision Tree Algorithms — Part 2
1. Introduction
CART (Classification And Regression Tree) is a variation of the decision tree algorithm introduced in the previous article — The Basics of Decision Trees. Decision trees are a non-parametric supervised learning approach, and CART can be applied to both regression and classification problems[1].
Data scientists often use decision trees to solve regression and classification problems, and most of them rely on scikit-learn for the implementation. According to its documentation, scikit-learn uses an optimised version of the CART algorithm.
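As a quick illustration, here is a minimal sketch of fitting scikit-learn's CART-based classifier, assuming scikit-learn is installed. The tiny dataset is made up and merely stands in for binary attributes like Sex, Fbs, and Exang:

```python
# Hypothetical toy data -- NOT the Heart Disease dataset used below.
from sklearn.tree import DecisionTreeClassifier

# columns: [sex, fbs, exang] -- illustrative binary features
X = [[1, 1, 0], [0, 1, 1], [1, 0, 0], [0, 0, 1], [1, 1, 1], [0, 0, 0]]
y = [1, 1, 0, 0, 1, 0]   # 1 = heart disease, 0 = healthy

# criterion="gini" selects Gini Impurity, the measure used throughout
# this article; scikit-learn builds the tree with an optimised CART.
clf = DecisionTreeClassifier(criterion="gini", random_state=0)
clf.fit(X, y)
print(clf.predict([[1, 1, 0]]))
```

Because this toy data is perfectly separable on the second feature, the fitted tree classifies all six training rows correctly.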
2. How Does CART Work in Classification?
In the previous article, it was explained that CART uses Gini Impurity to split the dataset when building a decision tree.
Mathematically, we can write Gini Impurity as follows:

Gini = 1 − Σᵢ pᵢ²

where pᵢ is the proportion of samples belonging to class i in the node.
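The formula translates directly into a few lines of Python; this helper is an illustrative sketch, not code from the article:

```python
from collections import Counter

def gini_impurity(labels):
    """Gini Impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

# A pure node scores 0; a perfectly mixed binary node scores 0.5.
print(gini_impurity([1, 1, 1, 1]))  # 0.0
print(gini_impurity([0, 0, 1, 1]))  # 0.5
```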
How does CART split the dataset?
This simulation uses the Heart Disease dataset, which has 303 rows and 13 attributes. The target consists of 138 rows with value 0 and 165 rows with value 1.
In this simulation, we use only the Sex, Fbs (fasting blood sugar), Exang (exercise induced angina), and target attributes.
Classification
Measure Gini Impurity in Sex
Measure Gini Impurity in Fbs (fasting blood sugar)
Measure Gini Impurity in Exang (exercise induced angina)
Fbs (fasting blood sugar) has the lowest Gini Impurity, so we'll use it as the Root Node.
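Selecting the root node can be sketched as follows. Each attribute's split is scored by the weighted Gini Impurity of its two child nodes; the per-attribute counts below are hypothetical (they only preserve the dataset's 165/138 class totals), not the article's actual numbers:

```python
def gini(pos, neg):
    """Gini Impurity of a node holding `pos` and `neg` samples."""
    n = pos + neg
    if n == 0:
        return 0.0
    return 1.0 - (pos / n) ** 2 - (neg / n) ** 2

def weighted_gini(left, right):
    """left/right are (pos, neg) counts for the two child nodes."""
    n_left, n_right = sum(left), sum(right)
    total = n_left + n_right
    return (n_left / total) * gini(*left) + (n_right / total) * gini(*right)

# Hypothetical splits (pos = heart disease, neg = healthy):
splits = {
    "Sex":   ((90, 60), (75, 78)),
    "Fbs":   ((120, 30), (45, 108)),
    "Exang": ((80, 70), (85, 68)),
}
best = min(splits, key=lambda a: weighted_gini(*splits[a]))
print(best)  # the attribute with the lowest weighted Gini becomes the root
```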
Now that Fbs is the Root Node, when we divide all of the patients using Fbs (fasting blood sugar), we end up with "impure" leaf nodes: each leaf contains patients both with and without heart disease.
Next, we need to figure out how well Sex and Exang separate the patients in the left node of Fbs.
Exang (exercise induced angina) has the lowest Gini Impurity, so we will use it at this node to separate the patients.
In the left node of Exang (exercise induced angina), how well does it separate these 49 patients (24 with heart disease and 25 without)? Since only the Sex attribute is left, we put Sex in the left node of Exang.
As we can see, we have final leaf nodes on this branch. But why is the circled leaf node also a final node?
Note: in the circled leaf node, 89% of the patients don't have heart disease.
Do these new leaves separate patients better than what we had before?
To answer that question, we must compare the Gini Impurity after using the Sex attribute with the Gini Impurity before using it to separate the patients.
The Gini Impurity before using Sex to separate the patients is lower, so we don't split this node on Sex; it becomes a final leaf node on this branch of the tree.
Doing the same on the right branch, the end result of the tree in this case is:
Main points when splitting the dataset:
1. Calculate all of the Gini Impurity scores.
2. Compare the Gini Impurity scores before and after using a new attribute to separate the data. If the node itself has the lowest score, there is no point in separating the data.
3. If separating the data results in an improvement, pick the separation with the lowest impurity score.
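The three rules above can be sketched as a single decision: split a node only when the best candidate split's weighted Gini beats the node's own impurity. This is an illustrative simplification, not the article's original code:

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split_or_leaf(labels, candidate_splits):
    """candidate_splits: list of (left_labels, right_labels) partitions.

    Returns the best (weighted_gini, left, right) split, or None when the
    node's own impurity is already the lowest (rule 2: keep it as a leaf).
    """
    node_gini = gini(labels)
    best = None
    for left, right in candidate_splits:                      # rule 1
        n = len(left) + len(right)
        w = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
        if best is None or w < best[0]:                       # rule 3
            best = (w, left, right)
    if best is not None and best[0] < node_gini:              # rule 2
        return best
    return None

# A perfect split beats the mixed node's own impurity (0.5), so it is taken:
print(best_split_or_leaf([0, 0, 1, 1], [([0, 0], [1, 1])]))
```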
Bonus
How do we calculate Gini Impurity for continuous data?
For example, take weight as one of the attributes used to determine heart disease.
Step 1 : Sort the data in ascending order.
Step 2 : Calculate the average weight of each adjacent pair of values.
Step 3 : Calculate the Gini Impurity for each average weight (candidate cutoff).
The lowest Gini Impurity occurs at Weight < 205, so this is the cutoff, and its impurity value is what we use when comparing this attribute against the others.
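The three steps can be sketched in code. The weights and labels below are hypothetical stand-ins (chosen so the best cutoff lands at 205, matching the example above), not the article's actual table:

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_cutoff(values, labels):
    """Return (cutoff, weighted Gini) for the best threshold on a
    continuous attribute, scanning averages of adjacent sorted values."""
    pairs = sorted(zip(values, labels))              # Step 1: sort ascending
    best = None
    for i in range(len(pairs) - 1):
        t = (pairs[i][0] + pairs[i + 1][0]) / 2      # Step 2: adjacent average
        left = [y for x, y in pairs if x < t]
        right = [y for x, y in pairs if x >= t]
        n = len(pairs)                               # Step 3: score the cutoff
        w = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
        if best is None or w < best[1]:
            best = (t, w)
    return best

weights = [225, 155, 220, 190, 260, 180]  # hypothetical patient weights
disease = [1, 0, 1, 0, 1, 0]              # hypothetical heart-disease labels
print(best_cutoff(weights, disease))      # (205.0, 0.0)
```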
How do we calculate Gini Impurity for categorical data?
Suppose we have a favorite color attribute used to determine a person's gender.
To find the Gini Impurity of this attribute, calculate an impurity score for each category as well as for each possible combination of categories.
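Enumerating those combinations can be sketched as follows: every subset of colors goes to the left child and the rest to the right, and each partition is scored by weighted Gini. The (male, female) counts per color are made up for illustration:

```python
from itertools import combinations

def gini(pos, neg):
    n = pos + neg
    if n == 0:
        return 0.0
    return 1.0 - (pos / n) ** 2 - (neg / n) ** 2

# (male, female) counts per favorite color -- hypothetical data
counts = {"Blue": (8, 2), "Green": (3, 7), "Red": (4, 6)}
colors = list(counts)

best = None
for r in range(1, len(colors)):
    for subset in combinations(colors, r):
        left = [counts[c] for c in subset]
        right = [counts[c] for c in colors if c not in subset]
        lp, ln = map(sum, zip(*left))     # totals sent to the left child
        rp, rn = map(sum, zip(*right))    # totals sent to the right child
        n = lp + ln + rp + rn
        w = ((lp + ln) / n) * gini(lp, ln) + ((rp + rn) / n) * gini(rp, rn)
        if best is None or w < best[0]:
            best = (w, subset)
print(best[1])  # the combination of colors with the lowest weighted Gini
```

Note this loop scores each partition twice (a subset and its complement give the same split), which is harmless for a small illustration.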
Continue Learning — How Does CART work in Regression
Regression in Decision Tree — A Step by Step CART (Classification And Regression Tree) — Part 3
About Me
I’m a Data Scientist focusing on Machine Learning and Deep Learning. You can reach me on Medium and LinkedIn.
My Website : https://komuternak.com/
References
- https://medium.com/@arifromadhan19/the-basics-of-decision-trees-e5837cc2aba7
- An Introduction to Statistical Learning
- Raschka, Sebastian. Python Machine Learning
- https://en.wikipedia.org/wiki/Decision_tree_learning
- Bonaccorso, Giuseppe. Machine Learning Algorithms
- Adapted from the YouTube channel “StatQuest with Josh Starmer”