Step by Step Approach to Building Classification and Regression Trees (CART)

Bernard Baah Abboah · Published in The Startup · Feb 25, 2021

Classification and regression tree (CART) is a decision tree learning algorithm used for building predictive models from datasets. The models are built by recursively splitting the dataset with true/false questions that provide thresholds for the split. The splitting aims to produce a distribution of pure labels as the tree grows. We will explain what we mean by “pure labels” as we read on.

Let’s dive straight in and walk through a step-by-step method for building a tree from this dataset.
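The original fruit table appears as an image in the article and is not reproduced here, so the small Python stand-in below uses hypothetical values (color, diameter in cm, mass in g, label). They are my assumption, chosen only so that the numbers quoted later in the text work out; they are not the author’s exact data.

```python
# Hypothetical stand-in for the article's fruit table (the original image is
# not reproduced here). Columns: color, diameter (cm), mass (g), label.
header = ["color", "diameter", "mass", "label"]
dataset = [
    ["yellow", 3, 150, "Apple"],
    ["green",  3, 140, "Apple"],
    ["red",    1,   7, "Grape"],
    ["red",    1,  10, "Grape"],
    ["yellow", 3,  65, "Lemon"],
]
```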

1. Calculate the impurity of the dataset

A dataset is considered “impure” when its labels consist of different classes. The dataset presented above falls into this category (impure) because it consists of different types of fruit. To calculate the impurity, we use a metric called Gini. Gini ranges between 0 and 1, with zero considered “perfectly pure” and one “perfectly impure”.

A Gini of 0.64 should give you an idea of how impure the dataset is, given that Gini ranges between 0 (perfectly pure) and 1 (perfectly impure).
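As a minimal sketch, assuming the stand-in table above, the Gini impurity is one minus the sum of the squared class proportions of the labels:

```python
from collections import Counter

def gini(rows):
    """Gini impurity: 1 minus the sum of squared class proportions of the labels."""
    counts = Counter(row[-1] for row in rows)   # the label is the last column
    total = len(rows)
    return 1 - sum((n / total) ** 2 for n in counts.values())

# With 2 Apples, 2 Grapes and 1 Lemon out of 5 rows:
# 1 - (0.4**2 + 0.4**2 + 0.2**2) = 0.64
print(gini(dataset))
```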

2. Generate True/False questions from the dataset

The next objective is to make the dataset “purer”. This objective is achieved by generating a list of true/false questions which will serve as thresholds to partition the dataset. To know what questions to ask, we iterate over every value for every feature of the dataset.

  1. color == yellow? - “Is the color of the fruit yellow?”
  2. diameter >= 3? - “Is the diameter of the fruit greater than or equal to 3?”
  3. mass >= 65? - “Is the mass of the fruit greater than or equal to 65?”
  4. color == red? - “Is the color of the fruit red?”

Observe that the questions were generated from values of features of the dataset. It is worth reiterating that questions should be generated from “every” value of “every” feature of the dataset. We should end up with 10 questions if every value is considered.
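A sketch of this question-generating step, continuing from the snippets above (the Question class and generate_questions are my own illustrative names, not the author’s):

```python
class Question:
    """A split threshold: feature == value for strings, feature >= value for numbers."""
    def __init__(self, column, value):
        self.column, self.value = column, value

    def match(self, row):
        value = row[self.column]
        if isinstance(self.value, (int, float)):
            return value >= self.value
        return value == self.value

    def __repr__(self):
        op = ">=" if isinstance(self.value, (int, float)) else "=="
        return f"{header[self.column]} {op} {self.value}?"

def generate_questions(rows):
    """One question per unique value of every feature column (the label column is skipped)."""
    questions = []
    for col in range(len(rows[0]) - 1):
        for value in sorted({row[col] for row in rows}, key=str):
            questions.append(Question(col, value))
    return questions

for question in generate_questions(dataset):
    print(question)   # 10 questions for the stand-in table
```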

3. Decide which question to use first

Each question provides a threshold that can split the dataset. A question that fails to split the dataset is discarded. Let’s illustrate:

True/False question about the color of the fruits.
True/False question about the diameter of the fruits.

The decision on which of the generated questions to use first as a threshold is based on a concept called Information Gain. Information Gain gives us an idea about how much each question can reduce impurity. The best question to ask at a given node is the one with the highest Information Gain.

To calculate Information Gain, we first find the Gini impurity of the parent dataset and of the resulting child datasets after a split. Then we calculate the weighted average impurity of the child datasets. The difference between the impurity of the parent dataset and the weighted impurity of the child datasets is the Information Gain for that particular question. Let’s illustrate:

The question “color == yellow?” gives an Information Gain value of 0.17
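For example, with the hypothetical stand-in table above, “color == yellow?” sends 2 of the 5 fruits to the true branch (impurity 0.5) and 3 to the false branch (impurity ≈ 0.44). The weighted average is 2/5 × 0.5 + 3/5 × 0.44 ≈ 0.47, so the Information Gain is 0.64 - 0.47 ≈ 0.17.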

As stated above, the best question to ask is the one with the “highest” Information Gain. Among the generated questions, four had an Information Gain (IG) of 0.37, one had an IG of 0.17, three had an IG of 0.14, and two questions were discarded because they failed to produce a split. We proceed by picking any of the questions with an IG of 0.37 (the highest Information Gain) as our split threshold.

The question “mass >= 65?” gives an Information Gain value of 0.37
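Here is a sketch of the split-and-score step, continuing from the snippets above; partition, info_gain, and find_best_split are my own names for the operations the article describes:

```python
def partition(rows, question):
    """Split rows into those that answer the question True and those that answer False."""
    true_rows = [row for row in rows if question.match(row)]
    false_rows = [row for row in rows if not question.match(row)]
    return true_rows, false_rows

def info_gain(parent_rows, true_rows, false_rows):
    """Parent impurity minus the weighted average impurity of the two child sets."""
    p = len(true_rows) / len(parent_rows)
    children = p * gini(true_rows) + (1 - p) * gini(false_rows)
    return gini(parent_rows) - children

def find_best_split(rows):
    """Score every candidate question and keep the one with the highest gain."""
    best_gain, best_question = 0.0, None
    for question in generate_questions(rows):
        true_rows, false_rows = partition(rows, question)
        if not true_rows or not false_rows:
            continue                            # the question failed to split; discard it
        gain = info_gain(rows, true_rows, false_rows)
        if gain > best_gain:
            best_gain, best_question = gain, question
    return best_gain, best_question

# The two examples from the text, computed on the stand-in table:
for question in (Question(0, "yellow"), Question(2, 65)):
    true_rows, false_rows = partition(dataset, question)
    print(question, round(info_gain(dataset, true_rows, false_rows), 2))
# color == yellow? 0.17
# mass >= 65? 0.37

print(find_best_split(dataset))   # one of the questions tied at a gain of about 0.37
```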

4. Recursively split the resulting datasets using the Gini and Information Gain metrics until no further questions can be generated

Set C is considered “pure” since its labels consist of a single type of fruit (Grape). At this point, we add a leaf node at the branch’s tail.

We can observe that the mass values for the examples of set C are different. This implies we could generate further questions to split the data, e.g. “mass >= 10?” or “mass >= 7?”. It is worth remembering that the primary objective of generating questions is to produce “pure labels”. Since this objective has been achieved, we end the splitting process on that branch and add a leaf node.

Set B is still “impure”. As a result, we must generate questions from its feature values to further split the dataset.
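Putting the pieces together, here is a minimal sketch of the recursive step, continuing from the snippets above (Leaf and DecisionNode are illustrative names, not the author’s):

```python
class Leaf:
    """Terminal node: stores the class counts of the rows that reached it."""
    def __init__(self, rows):
        self.predictions = Counter(row[-1] for row in rows)

class DecisionNode:
    """Internal node: a question plus subtrees for its True and False answers."""
    def __init__(self, question, true_branch, false_branch):
        self.question = question
        self.true_branch = true_branch
        self.false_branch = false_branch

def build_tree(rows):
    """Recursively split until no question improves purity, then add a leaf."""
    gain, question = find_best_split(rows)
    if gain == 0:                               # rows are pure, or no question splits them
        return Leaf(rows)
    true_rows, false_rows = partition(rows, question)
    return DecisionNode(question, build_tree(true_rows), build_tree(false_rows))

tree = build_tree(dataset)
print(tree.question)   # the root question chosen for the stand-in table
```

Note that the grape rows stop splitting even though their masses differ: any further question would leave the Gini at 0, so the gain is 0 and a leaf is added, exactly as described above.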

Final Tree

It would help a lot if you replaced the dataset, or made modifications to it, and then tried working through the steps yourself.

Hope this helps someone. You can check out the Google Developers channel on YouTube (Josh Gordon, “Let’s write a decision tree from scratch”) for more clarity. Thanks for reading!
