Machine Learning 101

Part 6: Decision Tree

Bzubeda
4 min read · Jan 4, 2024

In the previous part — Part 5: Logistic Regression, we learned what Logistic Regression is, how it works, and the different types of Logistic Regression, using examples.

Let us understand what a Decision Tree is

We can compare it to a normal tree with roots, branches, and leaves. A Decision Tree is a Supervised Machine Learning algorithm used for both regression and classification problems, most commonly classification. It is called a Decision Tree because it makes predictions through if-else decision-making.

Image Source — Decision Tree

Let’s understand the Decision Tree algorithm using an example —

Suppose we want to predict whether a person is going to accept a job offer or not. We have feature columns such as whether the office is near home, whether the salary falls within a specific range (e.g., 50,000 to 80,000 USD), and whether a travel facility (e.g., a cab) is provided.

Image Source — Job offer prediction
  • A node represents a feature column or a category.
  • The topmost node, the salary feature, is the root node that represents the entire dataset.
  • The nodes where the tree is split and decisions are made are called decision nodes (the office nearby and travel facility features).
  • The nodes where the decision ends and we get an output are called the leaf nodes.
  • When a node is split, a small three-node tree is formed, called a subtree.
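
To make this concrete, here is a minimal sketch (the feature values and accept/reject labels below are made up for illustration, not taken from the figure above) that fits a small Decision Tree with scikit-learn and prints its learned if-else structure:

```python
# A minimal sketch: fit a small Decision Tree on made-up job-offer data
# and print the if-else structure it learns.
from sklearn.tree import DecisionTreeClassifier, export_text

# Each row: [salary_in_range, office_near_home, travel_facility] (1 = yes, 0 = no)
X = [
    [1, 1, 1], [1, 1, 0], [1, 0, 1], [1, 0, 0],
    [0, 1, 1], [0, 1, 0], [0, 0, 1], [0, 0, 0],
]
# Target: 1 = accepted the offer, 0 = rejected it (toy labels)
y = [1, 1, 1, 0, 1, 0, 0, 0]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X, y)

# Print the learned structure: the root, decision nodes, and leaves
feature_names = ["salary_in_range", "office_near_home", "travel_facility"]
print(export_text(tree, feature_names=feature_names))
```

Each indented level in the printout corresponds to a split, with the root node at the top and the leaf nodes giving the final accept/reject output.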

Now the question arises, on what basis do we select a feature as a root node or a decision node?

There are metrics such as Entropy and Information Gain which play a major role in selecting these nodes.

Entropy is a measure of impurity or randomness in the data. Suppose a person decides whether to accept the job offer based on the "near to home" feature, and we have data for 10 people who made their decision based on this feature. If 8 out of 10 people accepted the offer and 2 rejected it, we can fairly confidently predict that a new person will accept the job offer. But if 6 out of 10 people accepted the offer and 4 did not, the pattern is unclear and it is hard to predict the person's decision.

The clearer the decision, the lower the Entropy value. The node with the lowest Entropy is considered the purest node.

Entropy mathematical equation —

Image Source — Entropy
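
In words, Entropy is E(S) = −Σ pᵢ log₂(pᵢ), where pᵢ is the proportion of each class (here, accept or reject) in the node. As a quick sanity check, here is a small sketch that plugs the two splits from the example above (8 out of 10 and 6 out of 10 acceptances) into this formula:

```python
# A quick check of the entropy formula E = -sum(p * log2(p)) on the
# two example splits described above (binary accept/reject decisions).
from math import log2

def entropy(p_accept: float) -> float:
    """Entropy of a binary node where p_accept is the fraction who accepted."""
    probs = [p_accept, 1 - p_accept]
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy(0.8))  # 8 of 10 accepted -> ~0.72, fairly pure, clear decision
print(entropy(0.6))  # 6 of 10 accepted -> ~0.97, close to 1, hard to predict
```

A 5-out-of-10 split would give the maximum Entropy of 1.0, the most impure node possible.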

Information Gain follows an iterative process. At first, it selects as the root node the feature that gives the most information (the highest Information Gain value) about the entire dataset. At every subsequent iteration, it selects the feature with the next highest value as a decision node for further splitting, until we arrive at an output.

Information Gain calculates the change in Entropy. It measures the reduction in impurity and randomness when we select a feature as a root node or a decision node.

Information Gain mathematical equation —

Image source — Information gain

Here, E represents Entropy, Y represents the full dataset, and X represents the feature node.
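
Put differently, IG(Y, X) = E(Y) − E(Y|X), where E(Y|X) is the weighted average Entropy of the child nodes created by splitting on feature X. Here is a minimal sketch (the split counts are made-up numbers, chosen only to illustrate the calculation):

```python
# Information Gain: IG(Y, X) = E(Y) - weighted average entropy of children.
# The counts below are invented for illustration.
from math import log2

def entropy(p: float) -> float:
    return -sum(q * log2(q) for q in (p, 1 - p) if q > 0)

# Parent node: 10 people, 5 accepted -> E(Y) = 1.0 (maximum impurity)
parent = entropy(5 / 10)

# Split on "office near home": yes -> 6 people (5 accepted), no -> 4 (0 accepted)
child_yes = entropy(5 / 6)
child_no = entropy(0 / 4)

# Weight each child's entropy by the fraction of samples it receives
weighted = (6 / 10) * child_yes + (4 / 10) * child_no

print(round(parent - weighted, 3))  # ~0.61 -> a large reduction in impurity
```

The feature with the highest such value wins the spot as the root node (or as the next decision node).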

Major Problem with Decision Tree

In the real world, there can be huge datasets with many features, which may result in large decision trees. Larger decision trees tend to overfit: they give great performance when making predictions on the training dataset but may perform poorly on new data.

Methods that can help prevent this problem —

1) Pruning is a method that removes the nodes with less significance, i.e., nodes that contribute little to decision-making (see the sketch after this list).

2) Hyperparameters are the controlling knobs that are configured before training the model. Hyperparameter tuning is the process of adjusting the hyperparameters to get the best model performance. The Decision Tree algorithm has hyperparameters that can help prevent overfitting, such as max_depth, which limits the depth of the Decision Tree, and max_features, which limits the number of features considered when looking for the best split.

3) Ensemble Learning aggregates the decisions from multiple base models (we will talk about it in detail in the upcoming part).
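
For points 1) and 2), here is a minimal sketch using scikit-learn. The synthetic dataset and the specific values of max_depth, max_features, and ccp_alpha are illustrative assumptions, not recommendations; pruning is done here via cost-complexity pruning, which scikit-learn exposes through the ccp_alpha hyperparameter:

```python
# A minimal sketch comparing an unconstrained Decision Tree with a
# hyperparameter-constrained one and a pruned one on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "unconstrained": DecisionTreeClassifier(random_state=0),
    # 2) hyperparameters: cap the depth and the features tried per split
    "constrained": DecisionTreeClassifier(max_depth=3, max_features=2,
                                          random_state=0),
    # 1) pruning: cost-complexity pruning removes low-significance nodes
    "pruned": DecisionTreeClassifier(ccp_alpha=0.02, random_state=0),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.get_n_leaves(), "leaves,",
          "test accuracy:", round(model.score(X_test, y_test), 3))
```

The constrained and pruned trees typically end up with far fewer leaves than the unconstrained one, trading a little training accuracy for better generalization to new data.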

Stay tuned! In the next part, we will understand what Ensemble Learning and Random Forest are and how they work. Please share your views, thoughts, and comments below. Feel free to ask any queries.
