Decision Trees | Classification Intuition

Rohan Kumawat · Published in Geek Culture · Jun 16, 2021 · 4 min read
Decision Tree Algorithm

Let’s learn more about another supervised learning algorithm today. We have already seen Linear Regression, which is all about predicting a continuous numerical variable, and Logistic Regression, which helps us classify between two or more categories. Today we’ll learn about Decision Trees, which can perform both types of tasks.

A Decision Tree is a tree-like model of decisions and their possible consequences. It is a Supervised Learning algorithm used for both Regression and Classification tasks and is easy to implement. This model/algorithm helps you structure decision making.

A Decision Tree typically starts with a single node, which branches into possible outcomes. Each of those outcomes leads to additional nodes, which branch into further possibilities, giving the model its tree-like shape. A decision tree is a flowchart-like structure (a tiny example follows the list) where:

  • Internal node (Non-leaf node) denotes a test on an attribute.
  • The branch represents the outcome of the test.
  • Leaf Node holds a class label.
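
To make the structure concrete, here is a minimal sketch of such a tree represented in Python as nested dicts, using a made-up "play tennis" style example (the attribute names and labels are purely illustrative, not from this post):

```python
# Internal nodes test an attribute, branches are attribute values,
# and leaves hold a class label.
toy_tree = {
    "Outlook": {                 # internal node: test on the "Outlook" attribute
        "Sunny": {
            "Humidity": {        # another internal node
                "High": "No",    # leaf: class label
                "Normal": "Yes",
            }
        },
        "Overcast": "Yes",       # leaf
        "Rainy": {
            "Wind": {
                "Strong": "No",
                "Weak": "Yes",
            }
        },
    }
}
```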

Terminologies

  • Root Node: This node is the highest in the tree and has no parent.
  • Splitting: The process of dividing a node into two or more sub-nodes is known as Splitting.
  • Parent and Child Node: A node that gets divided into sub-nodes is known as the parent node of those sub-nodes, whereas the sub-nodes are the children of the parent node.
  • Branch/Sub-Tree: A subsection of the entire tree is known as a branch or sub-tree.
  • Decision Node: A sub-node that splits into further sub-nodes is known as a Decision Node.
  • Terminal Node: Nodes that do not split are known as Leaf/Terminal/End Nodes.

Intuition

A Decision Tree can be constructed with different algorithms; in this blog, we’ll talk about the ID3 algorithm. The aim is to reach a leaf node as quickly as possible, so ID3 insists that we select the best attribute to split on at each step. To choose that first attribute, Entropy comes into the picture: Entropy (H) measures the impurity of a split, with an entropy of 0 meaning the subset is completely pure. Splitting keeps going until we get a pure subset (either purely yes or purely no), and more depth means more time consumption. For a binary classification, the value of Entropy ranges between 0 and 1. The higher the Entropy, the harder it is to draw any conclusions from that information. A branch with an entropy of zero is a leaf node, and a branch with Entropy greater than zero needs further splitting.

Entropy
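
The standard definition is H(S) = − Σᵢ pᵢ log₂(pᵢ), where pᵢ is the proportion of examples in S that belong to class i. As a quick sketch (in Python, not from the original post), it could be computed like this:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

# A pure subset has entropy 0; a 50/50 split has entropy 1.
print(entropy(["yes", "yes", "yes"]))        # 0.0
print(entropy(["yes", "yes", "no", "no"]))   # 1.0
```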
Information Gain

How do we know that a split is actually reducing Entropy? Here comes the topic of Information Gain (IG). It is a statistical property that measures how well a given attribute separates the training examples according to their target classification. It is the decrease in Entropy achieved by a split: the parent node’s Entropy minus the weighted average Entropy of the resulting subsets. While splitting the Decision Tree, the algorithm picks the split with the highest Information Gain.
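
In symbols, IG(S, A) = H(S) − Σᵥ (|Sᵥ| / |S|) · H(Sᵥ), where Sᵥ is the subset of S for which attribute A takes value v. A minimal Python sketch, building on the entropy helper above (the data layout as a list of dicts is just an illustrative assumption):

```python
def information_gain(rows, attribute, target):
    """IG of splitting `rows` (a list of dicts) on `attribute`, w.r.t. `target`."""
    parent_labels = [r[target] for r in rows]
    parent_entropy = entropy(parent_labels)

    # Weighted average entropy of the subsets produced by the split.
    children_entropy = 0.0
    for v in set(r[attribute] for r in rows):
        subset = [r[target] for r in rows if r[attribute] == v]
        children_entropy += (len(subset) / len(rows)) * entropy(subset)

    return parent_entropy - children_entropy
```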

So the ID3 algorithm is as follows (a minimal code sketch follows the list):

  1. It begins with the original set S as the root node, i.e. takes the complete training dataset as its root node.
  2. On each iteration, the algorithm goes through every unused attribute of the set S and calculates the Entropy (H) and Information Gain (IG) of that attribute.
  3. It then selects the attribute with the largest Information Gain (equivalently, the smallest resulting Entropy).
  4. The selected attribute then splits the set S into subsets of the data.
  5. The algorithm continues to recurse on each subset until every branch ends in a leaf node.
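
Putting these steps together, here is a hedged sketch of the ID3 recursion in Python. It reuses the entropy and information_gain helpers above, produces the same nested-dict tree format as the toy example earlier, and all names are illustrative:

```python
from collections import Counter

def id3(rows, attributes, target):
    """Recursively build a decision tree as nested dicts."""
    labels = [r[target] for r in rows]

    # Pure subset: return the single class label as a leaf.
    if len(set(labels)) == 1:
        return labels[0]

    # No attributes left to test: return the majority class as a leaf.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]

    # Steps 2-3: pick the attribute with the largest information gain.
    best = max(attributes, key=lambda a: information_gain(rows, a, target))

    # Steps 4-5: split on it and recurse on each subset.
    tree = {best: {}}
    remaining = [a for a in attributes if a != best]
    for v in set(r[best] for r in rows):
        subset = [r for r in rows if r[best] == v]
        tree[best][v] = id3(subset, remaining, target)
    return tree
```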

Assumptions

  • In the beginning, the whole training set is considered as the root.
  • Feature values are preferred to be categorical. If the values are continuous, they are discretized before building the model (a short example follows this list).
  • Records are distributed recursively based on attribute values.
  • The order in which attributes are placed as the root or internal nodes of the tree is decided using a statistical approach (here, Information Gain).
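
As an illustration of that discretization step (not specific to this post), one common approach is to bin a continuous feature into labeled categories, for instance with pandas:

```python
import pandas as pd

# Hypothetical continuous feature: temperatures in °C.
temps = pd.Series([12.0, 18.5, 23.0, 27.5, 31.0])

# Discretize into three labeled bins before building the tree.
temp_bins = pd.cut(temps, bins=[-float("inf"), 15, 25, float("inf")],
                   labels=["cool", "mild", "hot"])
print(temp_bins.tolist())  # ['cool', 'mild', 'mild', 'hot', 'hot']
```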

Advantages

  • Easy to understand and implement.
  • Less data cleaning required.
  • It can handle both Regression and Classification problems.

Disadvantages

  • Overfitting: A fully grown tree can fit the training data almost perfectly, but it may not perform well when new, unseen test data arrives. Limiting the tree’s depth or pruning it helps (a short example follows).
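
As a hedged illustration using scikit-learn (not covered in this post), one common mitigation is to cap the depth of the tree:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# An unconstrained tree can overfit; limiting max_depth keeps it simpler.
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=42)
clf.fit(X_train, y_train)
print(clf.score(X_train, y_train), clf.score(X_test, y_test))
```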

This blog talked about the intuition behind Entropy and Information Gain in Decision Trees. In the upcoming blogs, we’ll learn about Gini Impurity (an alternative to Entropy that is cheaper to compute), Decision Tree Pruning, the regression approach, and the effect of dirty data on Decision Trees.
