Decision Trees For Classification (ID3) | Machine Learning

Overview of Decision Trees and How to Build One

Ashwin Prasad
Analytics Vidhya
9 min read · Jul 5, 2021


A decision tree is a machine learning algorithm that can be used for both classification and regression (in the latter case, it is called a regression tree). This blog concentrates on decision trees for classification.

What is a Decision Tree?

A decision tree is similar to a tree in computer science: it has a hierarchical structure made of nodes connected by edges. A decision tree classifies data by asking a question at each node (in a typical setup, if the answer is yes, go to the right child; if not, go to the left child).

fig 1.1 : an example decision tree

fig 1.1 represents a simple decision tree used for the classification task of deciding whether a customer gets a loan or not. The input features are the person's salary, number of children, and age. The decision tree uses these attributes and asks the right question at the right node so as to classify whether the loan should be granted or not.
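
To make the structure concrete, here is a minimal sketch of how a tree like the one in fig 1.1 could be represented in code. The salary threshold below is hypothetical; only the root question ("age over 30?") is spelled out above.

```python
# A minimal sketch of a decision tree like fig 1.1.
# The salary threshold is hypothetical; only the root question is given above.

class Node:
    def __init__(self, question=None, left=None, right=None, label=None):
        self.question = question  # function: features -> bool (None for a leaf)
        self.left = left          # child followed when the answer is "no"
        self.right = right        # child followed when the answer is "yes"
        self.label = label        # output label held at a leaf node

def classify(node, features):
    # Walk from the root to a leaf, asking the question at each node.
    if node.question is None:
        return node.label
    if node.question(features):
        return classify(node.right, features)
    return classify(node.left, features)

# Hypothetical loan tree: the root asks "age over 30?".
loan_tree = Node(
    question=lambda f: f["age"] > 30,
    left=Node(label="Don't get Loan"),
    right=Node(
        question=lambda f: f["salary"] > 50000,  # hypothetical threshold
        left=Node(label="Don't get Loan"),
        right=Node(label="Get Loan"),
    ),
)

print(classify(loan_tree, {"age": 35, "salary": 60000}))  # Get Loan
```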

Terminologies

There are a few important terms to understand before going further. They are explained with reference to fig 1.1.

Node: The blue coloured rectangles shown above are the nodes of the tree. In a decision tree, a question is asked at each node and, based on the answer, the data is passed down to the corresponding child (or an outcome is given at a leaf).

Root Node or Root: In a decision tree, the topmost node is called the root node. In the above tree, the node that asks "age over 30?" is the root node.

Leaf node: Nodes that do not have any children are called leaf nodes ("Get Loan", "Don't get Loan"). Leaf nodes hold the output labels.

The height of the above decision tree is 3: the topmost level holds the root node and the bottom-most level holds the leaf nodes. Even though all the leaf nodes are on the same level in fig 1.1, that is not always the case. In the above decision tree, every internal node has 2 children. This is common but does not always hold: if a categorical feature with more than 2 possible values is chosen to split the instances at a node, that node can have more than 2 children.

Entropy and Information Gain

Before going into how to build a decision tree, we need to understand what entropy and information gain are.

Entropy

Entropy is a measure of the impurity of a node. By impurity, we mean the heterogeneity of the classes at a particular node.
For example:
Assume that a set contains 50 red balls and 50 blue balls. The proportions of the two colours are equal, so the entropy is 1, which means the set is maximally impure.
But if the set instead has 98 red balls and 2 blue balls (the same logic applies to 98 blue balls and 2 red balls; which category dominates does not matter, only that one category dominates the other), then the entropy is low, somewhere close to 0. This is because the set is now mostly pure: it mostly contains balls of one category, so the heterogeneity is reduced.

fig 2.1 : Entropy formula

fig 2.1 shows the entropy function, where C is the number of classes and pi is the proportion of the i-th class in the set. This function turns out to be a good measure of heterogeneity.
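
Written out (fig 2.1 is an image, so this is a reconstruction of the standard entropy formula used by ID3, with a base-2 logarithm):

```latex
% Entropy of a set S with C classes, where p_i is the proportion of class i
E(S) = -\sum_{i=1}^{C} p_i \log_2 p_i
```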

fig 2.2 : Entropy graph for 2 classes.

Let us take an example with 2 classes. fig 2.2 shows how the entropy changes as the proportion of instances belonging to one class changes.

Note: in a 2-class problem, the proportion of one class is always 1 minus the proportion of the other class, so the proportion of a single class is enough to draw this entropy graph.

Let's assume that the x-axis of fig 2.2 represents the proportion of class A in a set. When it is close to 0, we know that the other class (class B) dominates the set, so the set is close to homogeneous (pure) and the entropy is low. The same holds when the proportion of class A is close to 1. But when both classes have roughly equal proportions, the set is highly heterogeneous (impure), and hence the entropy is high.
Because this function behaves correctly in all these cases, it can be used as a good measure of impurity for a given set.
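
As a quick sanity check, here is a minimal Python sketch of the two ball examples above (the entropy helper is written from scratch, not taken from any library):

```python
from math import log2

def entropy(proportions):
    """Entropy of a set, given the proportion of each class in it."""
    return -sum(p * log2(p) for p in proportions if p > 0)

# 50 red / 50 blue: maximally impure, entropy is 1
print(entropy([0.5, 0.5]))    # 1.0

# 98 red / 2 blue: almost pure, entropy is close to 0
print(entropy([0.98, 0.02]))  # ~0.141
```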

Information Gain

The decision tree does not ask random questions at each node to reach its conclusions. It asks the right question at each node so that the entropy is reduced as much as possible at that level. This is done in order to reach homogeneity at the leaf nodes as quickly as possible and arrive at a good classification tree that is also simple.

To put it simply, our goal is to use entropy to come up with a small, simple tree that classifies well. This is done by reducing the overall entropy at the next immediate step, and to reduce that entropy we need to know which attribute is best to ask about at a particular node. This is measured using information gain.

fig 2.3 : Information Gain Formula

At a particular node, the information gain is calculated for every feature in the dataset to check which feature reduces the entropy at the next level the most. This is done by subtracting the weighted sum of the entropies of the immediate children produced by a particular feature from the entropy of the parent.
The feature, attribute, or question that gets the highest information gain at a particular node is chosen as the question that splits the data at that node.
The weight assigned to each child's entropy is the proportion of the parent's instances that fall into that child node. With this in mind, it is easy to see that the formula does what it is meant to do.
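
Assuming fig 2.3 shows the usual ID3 formulation, the information gain of splitting a set S on an attribute A can be written as:

```latex
% Information gain of splitting set S on attribute A,
% where S_v is the subset of S for which A takes the value v
IG(S, A) = E(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} \, E(S_v)
```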

Building a Decision Tree For Better Understanding

fig 3.1 : dataset

Now that the dataset is prepared, let's build the tree. As the first step, we need to choose the root node, so the information gain for X, Y and Z needs to be calculated.

Splitting On X

When we use X as the root node, this is what we get.

fig 3.2 : splitting on X

Splitting on Y

fig 3.3 : Splitting on Y

Splitting on Z

fig 3.4 : Splitting on Z

From the 3 splits above, we can see that the information gain is maximum when we split the dataset on the feature Y, so we use Y at the root node. When we split on Y, both leaf nodes become pure (each contains objects of only one class, either 0 or 1), so we can stop building the tree at this stage. The other features have become redundant, as Y alone classifies all the data perfectly.

Hence, the resulting decision tree is the one shown in fig 3.3. Given new data, this tree checks whether the attribute Y is 0 or 1 and classifies the data accordingly.

If Y had not been able to perfectly classify the data and one of the leaf nodes were impure, we would continue splitting that node on the other 2 features until we got pure leaf nodes.
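
To make the walkthrough concrete, here is a rough Python sketch of the root-node selection. The table below is a hypothetical stand-in for fig 3.1 (the original values are in an image), chosen so that Y perfectly separates the classes, as in the walkthrough above:

```python
from math import log2

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    props = [labels.count(c) / total for c in set(labels)]
    return -sum(p * log2(p) for p in props if p > 0)

def information_gain(rows, labels, feature):
    """Parent entropy minus the weighted entropy of the children."""
    parent = entropy(labels)
    weighted = 0.0
    for value in set(row[feature] for row in rows):
        child = [lab for row, lab in zip(rows, labels) if row[feature] == value]
        weighted += len(child) / len(labels) * entropy(child)
    return parent - weighted

# Hypothetical stand-in for fig 3.1: three binary features and a binary label,
# with Y chosen so that it separates the classes perfectly.
rows = [
    {"X": 0, "Y": 0, "Z": 1},
    {"X": 1, "Y": 0, "Z": 0},
    {"X": 0, "Y": 1, "Z": 1},
    {"X": 1, "Y": 1, "Z": 0},
]
labels = [0, 0, 1, 1]

for feature in ("X", "Y", "Z"):
    print(feature, round(information_gain(rows, labels, feature), 3))
# X 0.0, Y 1.0, Z 0.0 -> Y is chosen at the root and both children are pure.
```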

Continuous Variables

So far, we have seen how decision trees use entropy and information gain on categorical features in order to choose the best split. But how does this work with continuous variables?

fig 4.1: Sample Dataset

Let's consider the dataset in fig 4.1, where X is a continuous variable and Y is the dependent variable. Here, the decision tree chooses certain threshold points to ask questions at its nodes.

Before choosing the points, the algorithm first sorts the dataset by X. The candidate split points are the midpoints between consecutive values of X. For the above example they would be: (13+20)/2, (20+25)/2, (25+40)/2, i.e. 16.5, 22.5, 32.5.

At the root node, we then evaluate which point the dataset should be split on, similar to how we choose among categorical features. For example, whether the data should be split on (x ≤ 16.5), (x ≤ 22.5), or (x ≤ 32.5) is decided by which split provides the highest information gain. In this example, it is not difficult to see that x ≤ 22.5 would be chosen, as it splits the dependent variable best.

Since we have to evaluate every candidate threshold and calculate the information gain for each, this process becomes computationally costly when the dataset is huge.
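
Here is a minimal sketch of that procedure for the example above. The X values (13, 20, 25, 40) follow from the midpoints listed earlier; the labels are a hypothetical assignment consistent with x ≤ 22.5 being the best split:

```python
from math import log2

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    props = [labels.count(c) / total for c in set(labels)]
    return -sum(p * log2(p) for p in props if p > 0)

def gain_for_threshold(xs, ys, t):
    """Information gain of splitting on x <= t versus x > t."""
    left = [y for x, y in zip(xs, ys) if x <= t]
    right = [y for x, y in zip(xs, ys) if x > t]
    weighted = (len(left) / len(ys)) * entropy(left) \
             + (len(right) / len(ys)) * entropy(right)
    return entropy(ys) - weighted

# X values from fig 4.1; the labels are a hypothetical assignment.
xs = [13, 20, 25, 40]
ys = [0, 0, 1, 1]

# Candidate thresholds are midpoints between consecutive sorted X values.
xs_sorted = sorted(xs)
thresholds = [(a + b) / 2 for a, b in zip(xs_sorted, xs_sorted[1:])]

for t in thresholds:
    print(t, round(gain_for_threshold(xs, ys, t), 3))
# 16.5 -> 0.311, 22.5 -> 1.0, 32.5 -> 0.311, so x <= 22.5 is chosen.
```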

fig 5.1: Decision Tree Example

In all the above examples, I have used small custom datasets (from multiple sources) so that a short decision tree could be built and the calculations stay easy to follow. Here's an example of a decision tree and its classification shown in 2-dimensional space.

When To Stop Building the Tree

  • When all the leaf nodes are pure (each leaf node holds data belonging to a single class)
  • When a certain criterion is met, such as the height of the tree exceeding a set limit; in this case the leaf nodes might not be pure and may output class probabilities instead of a hard class label (see the sketch after this list)
  • When all the features have been used.
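
As an aside, libraries usually expose these stopping rules as hyperparameters. A minimal sketch with scikit-learn (not used elsewhere in this post; the tiny dataset here is hypothetical):

```python
from sklearn.tree import DecisionTreeClassifier

# criterion="entropy" reproduces the information-gain splitting described above;
# max_depth caps the height of the tree, the second stopping rule in the list.
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3)

# Tiny hypothetical dataset: three binary features, one binary label.
X = [[0, 0, 1], [1, 0, 0], [0, 1, 1], [1, 1, 0]]
y = [0, 0, 1, 1]
clf.fit(X, y)

# predict_proba shows how a leaf that is not pure would output class
# probabilities instead of a hard label.
print(clf.predict([[1, 1, 1]]))
print(clf.predict_proba([[1, 1, 1]]))
```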

Conclusion

In this blog, we saw what a decision tree is, some basic terminology, how to build one from a given dataset, and when to stop building the tree. Even though decision trees are powerful, they are prone to over-fitting, which is a separate topic in itself.
The decision tree is a powerful classification algorithm that also handles data with non-linear relationships. It is an algorithm from which drawing inferences about the relationships between the variables is simpler than with many other algorithms. Because it uses information gain, a decision tree also performs a kind of automatic feature selection (provided some other conditions are satisfied). Overall, it's a great algorithm.
