Classification in Decision Tree — A Step by Step CART (Classification And Regression Tree)

Arif Romadhan
Published in Analytics Vidhya
5 min read · Apr 5, 2020

Decision Tree Algorithms — Part 2

1. Introduction

CART (Classification And Regression Tree) is a variation of the decision tree algorithm introduced in the previous article — The Basics of Decision Trees. Decision trees are a non-parametric supervised learning approach, and CART can be applied to both regression and classification problems [1].

As we know, data scientists often use decision trees to solve regression and classification problems, and most of them use scikit-learn for the implementation. According to its documentation, scikit-learn uses an optimised version of the CART algorithm.
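For example, a minimal sketch of fitting scikit-learn's classifier with Gini Impurity; the tiny data set here is made up for illustration, not the article's actual data:

```python
# Fit scikit-learn's DecisionTreeClassifier, which implements an
# optimised version of CART. Feature columns: Sex, Fbs, Exang.
from sklearn.tree import DecisionTreeClassifier

X = [[1, 0, 0], [0, 1, 1], [1, 1, 0], [0, 0, 1]]  # made-up patients
y = [1, 0, 1, 0]                                   # target (heart disease)

clf = DecisionTreeClassifier(criterion="gini")  # Gini Impurity is the default
clf.fit(X, y)
print(clf.predict([[1, 0, 0]]))
```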

2. How Does CART Work in Classification?

In the previous article, it was explained that CART uses Gini Impurity in the process of splitting the dataset into a decision tree.

Mathematically, we can write Gini Impurity as follows:

Gini = 1 − Σ (pᵢ)²

where the sum runs over all classes i and pᵢ is the proportion of samples in the node that belong to class i.
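As a minimal sketch in Python, the formula translates to:

```python
# Gini Impurity: 1 minus the sum of squared class proportions.
def gini_impurity(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    probs = [labels.count(c) / n for c in set(labels)]
    return 1 - sum(p ** 2 for p in probs)

print(gini_impurity([0, 0, 0]))     # a pure node: 0.0
print(gini_impurity([0, 1, 0, 1]))  # a 50/50 binary node: 0.5
print(round(gini_impurity([0] * 138 + [1] * 165), 3))  # whole data set: 0.496
```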

How does CART split the dataset?

This simulation uses the Heart Disease data set, which has 303 rows and 13 attributes. The target consists of 138 rows with value 0 and 165 rows with value 1.

In this simulation, we only use the Sex, Fbs (fasting blood sugar), Exang (exercise induced angina), and target attributes.


Measure Gini Impurity in Sex

Measure Gini Impurity in Fbs (fasting blood sugar)

Measure Gini Impurity in Exang (exercise induced angina)

Fbs (fasting blood sugar) has the lowest Gini Impurity, so we'll use it at the root node.
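This comparison can be reproduced in code. The per-branch (disease, no-disease) counts below are illustrative assumptions chosen to sum to the data set's 165/138 class totals, not the exact figures from the article's tables:

```python
def gini(counts):
    """Gini Impurity of a node given its class counts, e.g. (yes, no)."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts) if n else 0.0

def weighted_gini(left_counts, right_counts):
    """Weighted average impurity of the two leaves produced by a split."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    return n_left / n * gini(left_counts) + n_right / n * gini(right_counts)

# Hypothetical (disease, no-disease) counts for each branch of each split:
splits = {
    "Sex":   ((114, 72), (51, 66)),
    "Fbs":   ((150, 40), (15, 98)),
    "Exang": ((132, 66), (33, 72)),
}
for name, (left, right) in splits.items():
    print(name, round(weighted_gini(left, right), 3))
# With these illustrative counts, Fbs has the lowest weighted impurity,
# so it becomes the root node, matching the article's conclusion.
```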

Now that we have Fbs as the root node, when we divide all of the patients using Fbs (fasting blood sugar), we end up with "impure" leaf nodes: each leaf contains patients both with and without heart disease.

We need to figure out how well Sex and Exang separate the patients in the left node of Fbs.

Exang (exercise induced angina) has the lowest Gini Impurity, so we will use it at this node to separate patients.

In the left node of Exang (exercise induced angina), how well does it separate these 49 patients (24 with heart disease and 25 without)? Since only the Sex attribute is left, we put Sex in the left node of Exang.

As we can see, we have final leaf nodes on this branch. But why is the circled leaf node also a final node?

Note: in the circled leaf node, 89% of the patients don't have heart disease.

Do these new leaves separate patients better than what we had before?

To answer that question, we must compare the Gini Impurity after using the Sex attribute with the Gini Impurity before using Sex to separate patients.

The Gini Impurity before using Sex to separate patients is the lowest, so we don't separate this node using Sex. It is the final leaf node on this branch of the tree.
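This stopping rule can be checked numerically. The class counts below are hypothetical (the article's actual counts are in its figures), chosen so that splitting by Sex leaves the class mix unchanged:

```python
def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts) if n else 0.0

def should_split(parent_counts, left_counts, right_counts):
    """Split only if the weighted child impurity beats the parent's own."""
    n = sum(parent_counts)
    child = (sum(left_counts) / n * gini(left_counts)
             + sum(right_counts) / n * gini(right_counts))
    return child < gini(parent_counts)

# Hypothetical node where ~89% of patients don't have heart disease;
# a further split by Sex keeps the same proportions, so we stop.
print(should_split((4, 32), (2, 16), (2, 16)))  # False -> leaf node
```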

Doing the same thing on the right branch gives the end result: the full tree for this case.

Main points when splitting the dataset:

1. Calculate all of the Gini Impurity scores.

2. Compare the Gini Impurity scores after and before using a new attribute to separate the data. If the node itself has the lowest score, then there is no point in separating the data.

3. If separating the data results in an improvement, then pick the separation with the lowest impurity score.
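The three steps above can be sketched as a recursive procedure for binary attributes (a simplified illustration, not scikit-learn's actual implementation):

```python
def gini(labels):
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels)) if n else 0.0

def build_tree(rows, labels, attributes):
    """rows: list of dicts of binary attributes; labels: class per row."""
    best = None
    for attr in attributes:                        # step 1: score every split
        left = [y for r, y in zip(rows, labels) if r[attr] == 0]
        right = [y for r, y in zip(rows, labels) if r[attr] == 1]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[1]:
            best = (attr, score)
    # step 2: if the node itself has the lowest score, stop splitting
    if best is None or best[1] >= gini(labels):
        return max(set(labels), key=labels.count)  # leaf: majority class
    attr = best[0]                                 # step 3: take the best split
    rest = [a for a in attributes if a != attr]
    side = lambda v: ([r for r in rows if r[attr] == v],
                      [y for r, y in zip(rows, labels) if r[attr] == v])
    return {attr: {v: build_tree(*side(v), rest) for v in (0, 1)}}

# Made-up patients where Fbs perfectly separates the classes:
rows = [{"Sex": 1, "Fbs": 0}, {"Sex": 0, "Fbs": 0},
        {"Sex": 1, "Fbs": 1}, {"Sex": 0, "Fbs": 1}]
print(build_tree(rows, [1, 1, 0, 0], ["Sex", "Fbs"]))
```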

Bonus

How do we calculate Gini Impurity for continuous data?

Take Weight, for example, as one of the attributes used to determine heart disease. Suppose we have a Weight attribute with one value per patient.

Step 1: Sort the data in ascending order.

Step 2: Calculate the average of each pair of adjacent weights.

Step 3: Calculate the Gini Impurity value for each average weight, treated as a candidate cutoff.

The lowest Gini Impurity is for Weight < 205, so this is the cutoff (and impurity value) to use when we compare Weight with other attributes.
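A sketch of these three steps in Python, with hypothetical weights chosen so that 205 is the winning cutoff, as in the article:

```python
def gini(labels):
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels)) if n else 0.0

def best_cutoff(values, labels):
    """Try the midpoint of each adjacent pair of sorted values as a cutoff."""
    pairs = sorted(zip(values, labels))            # step 1: sort ascending
    best = None
    for (a, _), (b, _) in zip(pairs, pairs[1:]):
        cut = (a + b) / 2                          # step 2: adjacent average
        left = [y for v, y in pairs if v < cut]    # step 3: impurity per cutoff
        right = [y for v, y in pairs if v >= cut]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if best is None or score < best[1]:
            best = (cut, score)
    return best

# Hypothetical weights and heart-disease labels:
weights = [155, 180, 190, 220, 225, 310]
labels  = [0,   0,   0,   1,   1,   1]
print(best_cutoff(weights, labels))   # -> (205.0, 0.0)
```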

How do we calculate Gini Impurity for categorical data?

Suppose we have a favorite color attribute used to determine a person's gender.

To find the Gini Impurity of this attribute, calculate an impurity score for each color as well as for each possible combination of colors.
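A sketch in Python: enumerate every non-empty proper subset of colors as a candidate yes/no split and score each one. The color and gender data below are hypothetical:

```python
from itertools import combinations

def gini(labels):
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels)) if n else 0.0

def best_category_split(colors, labels):
    """Score every non-empty proper subset of colors as a yes/no question."""
    values = sorted(set(colors))
    best = None
    for k in range(1, len(values)):
        for subset in combinations(values, k):
            left = [y for c, y in zip(colors, labels) if c in subset]
            right = [y for c, y in zip(colors, labels) if c not in subset]
            score = (len(left) * gini(left)
                     + len(right) * gini(right)) / len(labels)
            if best is None or score < best[1]:
                best = (subset, score)
    return best

# Hypothetical favorite-color data with gender as the target:
colors = ["blue", "green", "blue", "red", "green", "red"]
gender = ["M",    "F",     "M",    "F",   "F",     "M"]
print(best_category_split(colors, gender))
```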

Continue Learning — How Does CART Work in Regression

Regression in Decision Tree — A Step by Step CART (Classification And Regression Tree) — Part 3

About Me

I'm a Data Scientist focused on Machine Learning and Deep Learning. You can reach me on Medium and LinkedIn.

My Website : https://komuternak.com/

Reference

  1. https://medium.com/@arifromadhan19/the-basics-of-decision-trees-e5837cc2aba7
  2. James, Gareth, et al. An Introduction to Statistical Learning
  3. Raschka, Sebastian. Python Machine Learning
  4. https://en.wikipedia.org/wiki/Decision_tree_learning
  5. Bonaccorso, Giuseppe. Machine Learning Algorithms
  6. Adapted from the YouTube channel "StatQuest with Josh Starmer"
