Classification in Decision Tree — A Step by Step CART (Classification And Regression Tree)

Arif Romadhan
Published in Analytics Vidhya
5 min read · Apr 5, 2020

Decision Tree Algorithms — Part 2

1. Introduction

CART (Classification And Regression Tree) is a variation of the decision tree algorithm introduced in the previous article — The Basics of Decision Trees. Decision trees are a non-parametric supervised learning approach, and CART can be applied to both regression and classification problems [1].

As we know, data scientists often use decision trees to solve regression and classification problems, and most of them use scikit-learn for the implementation. According to its documentation, scikit-learn uses an optimised version of the CART algorithm.
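For example, a minimal sketch of fitting scikit-learn's classifier with Gini Impurity; the tiny data set here is made up for illustration, not the article's actual data:

```python
# Fit scikit-learn's DecisionTreeClassifier, which implements an
# optimised version of CART. Feature columns: Sex, Fbs, Exang.
from sklearn.tree import DecisionTreeClassifier

X = [[1, 0, 0], [0, 1, 1], [1, 1, 0], [0, 0, 1]]  # made-up patients
y = [1, 0, 1, 0]                                   # target (heart disease)

clf = DecisionTreeClassifier(criterion="gini")  # Gini Impurity is the default
clf.fit(X, y)
print(clf.predict([[1, 0, 0]]))
```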

2. How Does CART Work in Classification?

In the previous article, it was explained that CART uses Gini Impurity in the process of splitting the dataset into a decision tree.

Mathematically, we can write Gini Impurity as follows:

Gini = 1 − Σ (pᵢ)²

where the sum runs over all classes i and pᵢ is the proportion of samples in the node that belong to class i.
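As a minimal sketch in Python, the formula translates to:

```python
# Gini Impurity: 1 minus the sum of squared class proportions.
def gini_impurity(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    probs = [labels.count(c) / n for c in set(labels)]
    return 1 - sum(p ** 2 for p in probs)

print(gini_impurity([0, 0, 0]))     # a pure node: 0.0
print(gini_impurity([0, 1, 0, 1]))  # a 50/50 binary node: 0.5
print(round(gini_impurity([0] * 138 + [1] * 165), 3))  # whole data set: 0.496
```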

How does CART split the dataset?

This simulation uses the Heart Disease data set, which has 303 rows and 13 attributes. The target consists of 138 rows with value 0 and 165 rows with value 1.

In this simulation, we only use the Sex, Fbs (fasting blood sugar), Exang (exercise induced angina), and target attributes.


Measure Gini Impurity in Sex

Measure Gini Impurity in Fbs (fasting blood sugar)

Measure Gini Impurity in Exang (exercise induced angina)

Fbs (fasting blood sugar) has the lowest Gini Impurity, so we'll use it at the root node.
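This comparison can be reproduced in code. The per-branch (disease, no-disease) counts below are illustrative assumptions chosen to sum to the data set's 165/138 class totals, not the exact figures from the article's tables:

```python
def gini(counts):
    """Gini Impurity of a node given its class counts, e.g. (yes, no)."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts) if n else 0.0

def weighted_gini(left_counts, right_counts):
    """Weighted average impurity of the two leaves produced by a split."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    return n_left / n * gini(left_counts) + n_right / n * gini(right_counts)

# Hypothetical (disease, no-disease) counts for each branch of each split:
splits = {
    "Sex":   ((114, 72), (51, 66)),
    "Fbs":   ((150, 40), (15, 98)),
    "Exang": ((132, 66), (33, 72)),
}
for name, (left, right) in splits.items():
    print(name, round(weighted_gini(left, right), 3))
# With these illustrative counts, Fbs has the lowest weighted impurity,
# so it becomes the root node, matching the article's conclusion.
```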

Now that we have Fbs as the root node, when we divide all of the patients using Fbs (fasting blood sugar), we end up with "impure" leaf nodes: each leaf contains patients both with and without heart disease.

We need to figure out how well Sex and Exang separate the patients in the left node of Fbs.

Exang (exercise induced angina) has the lowest Gini Impurity, so we will use it at this node to separate patients.

In the left node of Exang (exercise induced angina), how well does it separate these 49 patients (24 with heart disease and 25 without)? Since only the Sex attribute is left, we put Sex in the left node of Exang.

As we can see, we have final leaf nodes on this branch. But why is the circled leaf node also a final node?

Note: in the circled leaf node, 89% of the patients don't have heart disease.

Do these new leaves separate patients better than what we had before?

To answer that question, we must compare the Gini Impurity after using the Sex attribute with the Gini Impurity before using Sex to separate patients.

The Gini Impurity before using Sex to separate patients is the lowest, so we don't separate this node using Sex. It is the final leaf node on this branch of the tree.
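This stopping rule can be checked numerically. The class counts below are hypothetical (the article's actual counts are in its figures), chosen so that splitting by Sex leaves the class mix unchanged:

```python
def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts) if n else 0.0

def should_split(parent_counts, left_counts, right_counts):
    """Split only if the weighted child impurity beats the parent's own."""
    n = sum(parent_counts)
    child = (sum(left_counts) / n * gini(left_counts)
             + sum(right_counts) / n * gini(right_counts))
    return child < gini(parent_counts)

# Hypothetical node where ~89% of patients don't have heart disease;
# a further split by Sex keeps the same proportions, so we stop.
print(should_split((4, 32), (2, 16), (2, 16)))  # False -> leaf node
```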

Doing the same thing on the right branch gives the end result: the full tree for this case.

Main points when splitting the dataset:

1. Calculate all of the Gini Impurity scores.

2. Compare the Gini Impurity scores after and before using a new attribute to separate the data. If the node itself has the lowest score, then there is no point in separating the data.

3. If separating the data results in an improvement, then pick the separation with the lowest impurity score.
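The three steps above can be sketched as a recursive procedure for binary attributes (a simplified illustration, not scikit-learn's actual implementation):

```python
def gini(labels):
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels)) if n else 0.0

def build_tree(rows, labels, attributes):
    """rows: list of dicts of binary attributes; labels: class per row."""
    best = None
    for attr in attributes:                        # step 1: score every split
        left = [y for r, y in zip(rows, labels) if r[attr] == 0]
        right = [y for r, y in zip(rows, labels) if r[attr] == 1]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[1]:
            best = (attr, score)
    # step 2: if the node itself has the lowest score, stop splitting
    if best is None or best[1] >= gini(labels):
        return max(set(labels), key=labels.count)  # leaf: majority class
    attr = best[0]                                 # step 3: take the best split
    rest = [a for a in attributes if a != attr]
    side = lambda v: ([r for r in rows if r[attr] == v],
                      [y for r, y in zip(rows, labels) if r[attr] == v])
    return {attr: {v: build_tree(*side(v), rest) for v in (0, 1)}}

# Made-up patients where Fbs perfectly separates the classes:
rows = [{"Sex": 1, "Fbs": 0}, {"Sex": 0, "Fbs": 0},
        {"Sex": 1, "Fbs": 1}, {"Sex": 0, "Fbs": 1}]
print(build_tree(rows, [1, 1, 0, 0], ["Sex", "Fbs"]))
```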

Bonus

How do we calculate Gini Impurity for continuous data?

Take Weight, for example, as one of the attributes used to determine heart disease. Suppose we have a Weight attribute with one value per patient.

Step 1: Sort the data in ascending order.

Step 2: Calculate the average of each pair of adjacent weights.

Step 3: Calculate the Gini Impurity value for each average weight, treated as a candidate cutoff.

The lowest Gini Impurity is for Weight < 205, so this is the cutoff (and impurity value) to use when we compare Weight with other attributes.
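A sketch of these three steps in Python, with hypothetical weights chosen so that 205 is the winning cutoff, as in the article:

```python
def gini(labels):
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels)) if n else 0.0

def best_cutoff(values, labels):
    """Try the midpoint of each adjacent pair of sorted values as a cutoff."""
    pairs = sorted(zip(values, labels))            # step 1: sort ascending
    best = None
    for (a, _), (b, _) in zip(pairs, pairs[1:]):
        cut = (a + b) / 2                          # step 2: adjacent average
        left = [y for v, y in pairs if v < cut]    # step 3: impurity per cutoff
        right = [y for v, y in pairs if v >= cut]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if best is None or score < best[1]:
            best = (cut, score)
    return best

# Hypothetical weights and heart-disease labels:
weights = [155, 180, 190, 220, 225, 310]
labels  = [0,   0,   0,   1,   1,   1]
print(best_cutoff(weights, labels))   # -> (205.0, 0.0)
```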

How do we calculate Gini Impurity for categorical data?

Suppose we have a favorite color attribute used to determine a person's gender.

To find the Gini Impurity of this attribute, calculate an impurity score for each color as well as for each possible combination of colors.
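A sketch in Python: enumerate every non-empty proper subset of colors as a candidate yes/no split and score each one. The color and gender data below are hypothetical:

```python
from itertools import combinations

def gini(labels):
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels)) if n else 0.0

def best_category_split(colors, labels):
    """Score every non-empty proper subset of colors as a yes/no question."""
    values = sorted(set(colors))
    best = None
    for k in range(1, len(values)):
        for subset in combinations(values, k):
            left = [y for c, y in zip(colors, labels) if c in subset]
            right = [y for c, y in zip(colors, labels) if c not in subset]
            score = (len(left) * gini(left)
                     + len(right) * gini(right)) / len(labels)
            if best is None or score < best[1]:
                best = (subset, score)
    return best

# Hypothetical favorite-color data with gender as the target:
colors = ["blue", "green", "blue", "red", "green", "red"]
gender = ["M",    "F",     "M",    "F",   "F",     "M"]
print(best_category_split(colors, gender))
```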

Continue Learning — How Does CART Work in Regression

Regression in Decision Tree — A Step by Step CART (Classification And Regression Tree) — Part 3

About Me

I'm a Data Scientist focused on Machine Learning and Deep Learning. You can reach me on Medium and LinkedIn.

My Website : https://komuternak.com/

Reference

  1. https://medium.com/@arifromadhan19/the-basics-of-decision-trees-e5837cc2aba7
  2. James, Gareth, et al. An Introduction to Statistical Learning
  3. Raschka, Sebastian. Python Machine Learning
  4. https://en.wikipedia.org/wiki/Decision_tree_learning
  5. Bonaccorso, Giuseppe. Machine Learning Algorithms
  6. Adapted from the YouTube channel "StatQuest with Josh Starmer"
