Building Decision Trees and the Math Behind Them

Rakend Dubba
Jun 10, 2018 · 7 min read
Decision Tree from Scratch (Photo by Evie Shaffer on Unsplash)

A very basic introduction to decision trees and the math behind them.

What is covered :

  1. Decision Trees
  2. ID3 algorithm
  3. Calculating Entropy and Information Gain

What is a Decision tree:

Decision trees are classifiers used extensively in data mining to predict a target variable from a set of input variables.

They provide wonderful transparency for human interpretation by graphically representing the decision-making process. Using them, we can predict both a categorical variable (classification tree) and a continuous variable (regression tree).

They do this by creating a flowchart-like structure in which an if-else condition on an attribute is applied at each node (e.g. is the person male or female?). Each outcome of the condition is represented by a branch, and a class label is represented by a leaf node. So, a rule is created at each decision node and a result is delivered at a leaf node. The entire flow is from root to leaf.

Decision Trees animation (source)

As shown in the animation, we divide the predictor space into N distinct, non-overlapping regions. To predict the target variable for any test observation, we simply take the mean of the training values that fall in that region. This is for the regression tree.

For classification tree, we take the mode.
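These two prediction rules can be sketched in a few lines of Python (the region values below are purely illustrative, not from the article's example):

```python
from statistics import mean, mode

# Training responses that fell into one region of the predictor space
# (illustrative values only)
region_responses = [3, 5, 4, 4]          # numeric targets (regression)
region_labels = ["cat", "dog", "cat"]    # class labels (classification)

# Regression tree: predict the mean of the region's training responses
regression_prediction = mean(region_responses)

# Classification tree: predict the mode (most common class) in the region
classification_prediction = mode(region_labels)  # "cat"

print(regression_prediction, classification_prediction)
```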

Types of decision trees:

There is a wide variety of decision tree algorithms, such as ID3 (Iterative Dichotomiser 3), CART (Classification and Regression Tree), CHAID (Chi-squared Automatic Interaction Detector), etc. Don't be intimidated by their names; they have only small variations. For example, C4.5 is an updated version of ID3 that can also handle numeric variables.

How does it work?

  1. Starts at the root node
  2. Splits the data into groups (based on some criterion, which we will see later)
  3. Sets a decision at the node
  4. Moves the data along the respective branches
  5. Repeats the process until a stopping criterion is met (max depth reached, too few samples left to split, nothing left to split, etc.)
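The five steps above can be sketched as a recursive function. This is a minimal skeleton, not a full implementation: `choose_split` here is a naive placeholder, where a real tree would search for the split that maximizes information gain (classification) or RSS reduction (regression):

```python
def choose_split(rows):
    # Placeholder criterion: split on the first feature's first value.
    # A real implementation would search for the best attribute/value.
    return 0, rows[0][0]

def build_tree(rows, depth=0, max_depth=3, min_samples=2):
    labels = [row[-1] for row in rows]  # class label is the last column
    # Step 5: stop if max depth reached, too few samples, or the node is pure
    if depth >= max_depth or len(rows) < min_samples or len(set(labels)) == 1:
        return {"leaf": max(set(labels), key=labels.count)}  # majority class
    feature, value = choose_split(rows)              # steps 2-3: set a rule
    left = [r for r in rows if r[feature] == value]  # step 4: route the data
    right = [r for r in rows if r[feature] != value]
    if not left or not right:                        # degenerate split
        return {"leaf": max(set(labels), key=labels.count)}
    return {"split": (feature, value),
            "left": build_tree(left, depth + 1, max_depth, min_samples),
            "right": build_tree(right, depth + 1, max_depth, min_samples)}

rows = [("Sweet", "Yes"), ("Sweet", "Yes"), ("Salty", "No")]
print(build_tree(rows))
```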

How to choose the root node:

This is slightly different in regression and classification trees.

In regression trees, we choose the splitting point that gives the greatest reduction in RSS (Residual Sum of Squares):

RSS = Σ (yi − ŷR1)² + Σ (yi − ŷR2)²

Alternatively, we can use the standard deviation reduction of the feature with respect to the training data. Here ŷR1 and ŷR2 are the mean responses of regions 1 and 2, and each sum runs over the training observations falling in that region. Once the tree is trained, we predict the response for a test observation using the mean of the training observations in its group.
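For a single numeric feature, the RSS-minimizing split point can be found by brute force over candidate thresholds. A minimal sketch, with illustrative data:

```python
def rss(values):
    # Residual sum of squares around the group mean
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split_rss(x, y):
    """Return the threshold on feature x that minimizes the total RSS of y."""
    best_t, best_rss = None, float("inf")
    for t in sorted(set(x))[1:]:  # candidate thresholds
        left = [yi for xi, yi in zip(x, y) if xi < t]
        right = [yi for xi, yi in zip(x, y) if xi >= t]
        total = rss(left) + rss(right)
        if total < best_rss:
            best_t, best_rss = t, total
    return best_t, best_rss

# Illustrative data: the response jumps between x=2 and x=3
x = [1, 2, 3, 4]
y = [1.0, 1.1, 5.0, 5.1]
print(best_split_rss(x, y))  # splits at x = 3
```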

In Classification trees:

In ID3 we use Entropy and Information Gain; CART uses the Gini Index for classification.

Entropy (Shannon's entropy) quantifies the uncertainty, or chaos, in a group. The higher the entropy, the higher the disorder. It is denoted by H(x), where x is a vector of probabilities p1, p2, p3, …:

H(x) = −(p1·log2 p1 + p2·log2 p2 + p3·log2 p3 + …)

For a binary outcome, the entropy (uncertainty) is highest (1) when the probability is 0.5, i.e. 50-50 chances, and lowest when the probability is 0 or 1, i.e. when there is no uncertainty about the outcome.

So, entropy is maximum if a group contains equal numbers of objects from different classes (like a group with 50 cats and 50 dogs), and minimum if the node is pure (like a group with only 100 cats or only 100 dogs). We ultimately want minimum entropy for the tree, i.e. pure, uniform classes at the leaf nodes.

S — the current group whose entropy we are calculating.

pi — the probability of finding the system in the ith state; in practice, the proportion of elements of the ith class in the group.

In a classification tree, while splitting we select the attribute that achieves the greatest reduction in entropy. This reduction (or change) in entropy is measured by Information Gain:

Gain(S, A) = H(S) − Σv (|Sv| / |S|) · H(Sv)

where the sum runs over the values v of attribute A, and Sv is the subset of S for which A = v.
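Both quantities are straightforward to compute. A minimal sketch: `entropy` takes a list of class labels, and `information_gain` takes the parent's labels plus the label lists of the child groups after a split:

```python
from math import log2

def entropy(labels):
    """Shannon entropy H(S) of a group, from its list of class labels."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def information_gain(parent_labels, child_groups):
    """IG = H(parent) - weighted average entropy of the child groups."""
    n = len(parent_labels)
    weighted = sum(len(g) / n * entropy(g) for g in child_groups)
    return entropy(parent_labels) - weighted

# A 50-50 group has maximum entropy:
print(entropy(["cat", "dog"]))  # 1.0

# The article's root node: 6 'YES' and 4 'NO'
print(round(entropy(["yes"] * 6 + ["no"] * 4), 3))  # 0.971
```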


Let's consider a toy problem: predicting whether a kid is going to eat a particular type of food, given the kid's prior eating habits.

Eating habits of a kid

(As a side note, "spicy" is not really a taste, but let's allow it in this kid's bizarre eating preferences. 😀)


From the above chart, we can see that the food properties Taste, Temperature and Texture are the explanatory variables and Eat (Yes/No) is the target variable.

Now, we need to construct a top-down decision tree that splits the dataset until it finally forms pure groups, so that we can predict whether or not the kid will eat a new test example.

We are going to use the ID3 algorithm for this. First, we look at the math involved; later, we will develop the algorithm from scratch.
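As a preview of what that algorithm will do, ID3's core step, picking the attribute with the highest information gain, can be sketched as follows. The four rows below are purely illustrative, not the article's actual table:

```python
from math import log2

def entropy(labels):
    # Shannon entropy of a list of class labels
    n = len(labels)
    return -sum(labels.count(c) / n * log2(labels.count(c) / n)
                for c in set(labels))

def best_attribute(rows, attributes, target):
    """Return the attribute whose split gives the highest information gain."""
    parent = [row[target] for row in rows]
    def gain(attr):
        values = {row[attr] for row in rows}
        children = sum(
            len([r for r in rows if r[attr] == v]) / len(rows)
            * entropy([r[target] for r in rows if r[attr] == v])
            for v in values)
        return entropy(parent) - children
    return max(attributes, key=gain)

# Illustrative rows (not the article's dataset): Taste alone decides Eat
rows = [
    {"Taste": "Sweet", "Temperature": "Hot",  "Eat": "Yes"},
    {"Taste": "Sweet", "Temperature": "Cold", "Eat": "Yes"},
    {"Taste": "Salty", "Temperature": "Hot",  "Eat": "No"},
    {"Taste": "Salty", "Temperature": "Cold", "Eat": "No"},
]
print(best_attribute(rows, ["Taste", "Temperature"], "Eat"))  # Taste
```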

The math behind the scenes:

Entropy of the target attribute at the root node:

No. of 'NO' → 4

No. of 'YES' → 6

No. of objects → 10

H(S) = −(6/10)·log2(6/10) − (4/10)·log2(4/10) ≈ 0.971

For Taste:

Ni → number of elements in the ith child group

N → total number of elements in the parent group


For Temperature:

For E_Cold, there are equal numbers of 'YES' and 'NO'. So, its entropy will be the maximum → 1.

For Texture:


So, after our calculations, it seems that we should split the data on Taste, as it has the highest information gain.


Now we need to find the attributes to split on at the second-level nodes of the Salty and Spicy branches.

Table after the Salty branch

For Temperature (2nd level) in the Salty branch:


For Texture (2nd level) in the Salty branch:


So, splitting on Texture sounds like a good option, as it has the higher information gain.

Table after the Spicy branch

For Temperature (2nd level) in the Spicy branch:


For Texture (2nd level) in the Spicy branch:


Both attributes generate the same Information Gain, so we can split on either one.

We’ll choose Temperature to split.

Then the only option left is Texture to split.

These are the tables left after the Temperature split; after the subsequent Texture split, both branches become pure, homogeneous groups.

Table : Spicy-Temperature-Hot path
Table : Spicy-Temperature-Cold path

Finally, what we can conclude is: if the food is Sweet, the kid doesn't care about its Temperature or Texture; he eats it.

If the food is Salty, he eats it only if the Texture is Hard. And if the food is Spicy, he eats it if it is Hot and Hard, or Cold and Soft. Bizarre kid. 😈

In the next post, I will build this tree using Python. Stay tuned.
Thank you for your patience 😊.
