Chapter 3: Decision Tree Classifier — Theory

Savan Patel
Machine Learning 101
6 min read · May 11, 2017

Welcome to the third basic classification algorithm of supervised learning: Decision Trees. Like the previous chapters (Chapter 1: Naive Bayes and Chapter 2: SVM Classifier), this chapter is also divided into two parts: theory and coding exercise.

In this part we shall discuss the theory and the working behind decision trees, including the mathematical aspects of the algorithm: entropy and information gain. In the second part we modify the spam classification code for a decision tree classifier from the sklearn library and compare its accuracy against Naive Bayes and SVM.

Dark side of rejection and hiring! :D

0. Motivation

Suppose we have the following plot for two classes, represented by black circles and blue squares. Is it possible to draw a single separating line? Perhaps not.

Can you draw a single dividing line for these classes?

We will need more than one line to divide these into classes, something similar to the following image:

We need two lines here: one separating according to a threshold value of x, and the other according to a threshold value of y.

By now you may have guessed what a decision tree tries to do.

A Decision Tree Classifier repetitively divides the working area (plot) into subparts by identifying lines (repetitively because there may be two distant regions of the same class, divided by other regions, as shown in the image below).

So when does it terminate?

  1. It has divided the data into classes that are pure (containing only members of a single class), or
  2. Some stopping criteria on the classifier's attributes are met.

We shall see these two points shortly.

In the following sections we define a few terms related to decision trees and then perform those calculations on a sample example.

1. Impurity

In the above division we had a clear separation of classes. But what if we had the following case?

Impurity is when traces of one class are mixed into the division of another. This can arise for the following reasons:

  1. We run out of available features to divide the classes on.
  2. We tolerate some percentage of impurity (we stop further division) for faster performance. (There is always a trade-off between accuracy and performance.)

For example, in the second case we may stop dividing when a node has fewer than some number x of elements left. A commonly used measure of this impurity is the Gini impurity, sketched in code just below.

Division based on some features.
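Here is a minimal sketch of how such an impurity measure can be computed; the helper name gini_impurity is just for illustration, and the formula is the standard 1 - Σ p(i)^2:

from collections import Counter

def gini_impurity(labels):
    """Gini impurity = 1 - sum of squared class probabilities."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

# A pure node has impurity 0; a 50/50 mix of two classes has impurity 0.5.
print(gini_impurity(['Y', 'Y', 'Y']))       # 0.0
print(gini_impurity(['Y', 'Y', 'N', 'N']))  # 0.5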

2. Entropy

Entropy is the degree of randomness of the elements, or in other words, it is a measure of impurity. Mathematically, it can be calculated with the help of the probability of the items as:

H = - Σ p(x) * log(p(x))

where p(x) is the probability of item x. It is the negative summation, over all items, of the probability of each item times the log of that probability. (Log base 10 is used in the examples in this article.)

For example,
if our items are the dice faces observed in a throw event, 1 1 2 3,
the entropy is
p(1) = 0.5
p(2) = 0.25
p(3) = 0.25
entropy = - (0.5 * log(0.5)) - (0.25 * log(0.25)) - (0.25 * log(0.25))
= 0.45
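Here is a minimal sketch of this entropy calculation in Python; the helper name entropy is just for illustration, and log base 10 is used to match the numbers above:

import math
from collections import Counter

def entropy(items):
    """H = -sum(p(x) * log10(p(x))) over the distinct values in items."""
    total = len(items)
    return -sum((c / total) * math.log10(c / total)
                for c in Counter(items).values())

print(round(entropy([1, 1, 2, 3]), 2))  # 0.45, as computed above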

3. Information Gain

Suppose we have multiple features on which to divide the current working set. Which feature should we select for the division? Perhaps the one that gives us the least impurity.

Suppose we divide the items at a node into multiple branches as follows. The information gain at that node is defined as:

Information Gain = Entropy(parent) - Σ (weight of child) * Entropy(child)

where each child's weight is the fraction of the parent's items that falls into that child.

This needs a bit of explanation!

Suppose we initially have the following set of items to work with:

112234445

Suppose we divide them based on the property: divisible by 2.

Entropy at the root level: 0.66
Entropy of the left child (1, 1, 3, 5, the items not divisible by 2): 0.45, weighted value = (4/9) * 0.45 = 0.2
Entropy of the right child (2, 2, 4, 4, 4, the items divisible by 2): 0.29, weighted value = (5/9) * 0.29 = 0.16
Information Gain = 0.66 - [0.2 + 0.16] = 0.3

Check what information gain we get if we instead split on the property "is a prime number" rather than "divisible by 2". Which one is better in this case?

At every stage the decision tree selects the split that gives the best information gain. An information gain of 0 means the feature does not divide the working set at all.
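Here is a minimal sketch that reproduces the numbers above and lets you try the prime-number split from the exercise; the helper names entropy and information_gain are just for illustration, with log base 10 as before:

import math
from collections import Counter

def entropy(items):
    """H = -sum(p(x) * log10(p(x))) over the distinct values in items."""
    total = len(items)
    return -sum((c / total) * math.log10(c / total)
                for c in Counter(items).values())

def information_gain(parent, children):
    """Entropy of the parent minus the weighted entropies of its children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = [1, 1, 2, 2, 3, 4, 4, 4, 5]

# Split on the property "divisible by 2"
odd = [x for x in parent if x % 2 != 0]    # 1, 1, 3, 5
even = [x for x in parent if x % 2 == 0]   # 2, 2, 4, 4, 4
print(round(information_gain(parent, [odd, even]), 2))  # ~0.3, as above

# Exercise: split on "is a prime number" instead (2, 3 and 5 are the primes here)
prime = [x for x in parent if x in (2, 3, 5)]
composite = [x for x in parent if x not in (2, 3, 5)]
print(round(information_gain(parent, [prime, composite]), 2))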

If you are enjoying this, do hit the heart (❤) icon and spread the word.

Let's solve an example

Now that you know the basic ideas behind decision trees, let's solve an example and see how it works.

Suppose we have the following data for playing golf under various weather conditions.

Now if the weather condition is given as :

Outlook : Rainy, Temperature: Cool, Humidity: High, Windy: False

Should we play golf?

We have the outcomes at the beginning as NNYYYNYN (Y = Yes and N = No), taken in the given order. The entropy at this root node is 0.3. Now try to divide on the various predictors: Outlook, Temperature, Humidity and Windy.
Calculate the information gain in each case. Which one has the highest information gain?
For example, if we divide based on Outlook, we have the divisions
Rainy : NNN (entropy = 0)
Sunny : YYN (entropy = 0.28)
Overcast : YY (entropy = 0)
So information gain = 0.3 - [0 + (3/8)*0.28 + 0]
= 0.2
Try out the other cases. The information gain is maximum when we divide based on Outlook.
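As a quick sanity check on these numbers (again with log base 10), here is a small sketch:

import math

def entropy2(p):
    """Binary entropy with log base 10, for a node where one class has proportion p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log10(p) + (1 - p) * math.log10(1 - p))

root = entropy2(4 / 8)           # NNYYYNYN -> 0.30
sunny = entropy2(2 / 3)          # YYN      -> 0.28
gain = root - (3 / 8) * sunny    # the Rainy and Overcast branches are pure
print(round(root, 2), round(sunny, 2), round(gain, 2))  # 0.3 0.28 0.2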
Now the impurity for the Rainy and Overcast branches is 0, so we stop there.
Next we need to separate the Sunny branch.
If we divide it by Windy, we get the maximum information gain.
Sunny : YYN
Windy = Yes : N
Windy = No : YY
So the decision tree looks something like the image shown below. Now the prediction data is
Outlook : Rainy, Temperature: Cool, Humidity: High, Windy: False
Flowing down the tree according to these values, we first check the Outlook. It is Rainy, and the Rainy branch is pure (every Rainy day in our data was a No).
So the answer is No, we don't play golf.
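As a rough preview of the coding part, here is a sketch of this example with sklearn's DecisionTreeClassifier. Since the original table is shown only as an image, the rows below are a simplified reconstruction from the splits described above: only the Outlook and Windy features are encoded (Temperature and Humidity are omitted because the tree never splits on them), and the Windy values outside the Sunny branch are assumed to be False.

from sklearn.tree import DecisionTreeClassifier

# Features: [is_rainy, is_sunny, is_overcast, windy]
X = [
    [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0],  # Rainy    -> N, N, N
    [0, 1, 0, 0], [0, 1, 0, 0], [0, 1, 0, 1],  # Sunny    -> Y, Y, N (the windy one)
    [0, 0, 1, 0], [0, 0, 1, 0],                # Overcast -> Y, Y
]
y = ['N', 'N', 'N', 'Y', 'Y', 'N', 'Y', 'Y']

clf = DecisionTreeClassifier(criterion='entropy')  # split using information gain
clf.fit(X, y)

# Outlook: Rainy, Windy: False -> the Rainy branch is pure 'N'
print(clf.predict([[1, 0, 0, 0]]))  # ['N'], i.e. don't play golf

sklearn builds binary splits on these one-hot features, so the learned tree's exact shape may differ from the three-branch Outlook split drawn above, but the prediction for this query comes out the same.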

Final Thoughts

Dividing efficiently based on maximum information gain is the key to the decision tree classifier. However, in the real world, with millions of data points, dividing all the way down to pure classes is practically not feasible (it may take a very long training time), so we stop splitting a node once certain parameters are satisfied (for example, an impurity percentage). We shall see this in the coding exercise.
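For instance, sklearn's DecisionTreeClassifier exposes several such stopping parameters; the values below are purely illustrative:

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    criterion='entropy',         # split using information gain
    max_depth=5,                 # never grow deeper than 5 levels
    min_samples_leaf=10,         # don't create leaves with fewer than 10 samples
    min_impurity_decrease=0.01,  # skip splits that barely reduce impurity
)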

In the next part, we shall code a decision tree classifier in Python using the sklearn library. We shall tune some parameters to gain more accuracy by tolerating some impurity.

I hope that this section was helpful in understanding the working behind the decision tree classifier. Comment your thoughts, feedback or suggestions below. If you liked this post, share it with your friends, subscribe to Machine Learning 101 and click the heart (❤) icon. Follow me on Facebook, Twitter and LinkedIn. Peace!
