Decision Trees in Machine Learning: Splits, Pruning, and Mathematical Intuition

Sangeeth Pogula
8 min read · Sep 17, 2024

In this article, we will explore the fundamental concepts behind Decision Trees, a popular and versatile machine learning algorithm used for both classification and regression tasks. We will begin by introducing the structure of Decision Trees and their working principle. Next, we’ll dive into the mathematical intuition of Decision Tree classifiers, exploring key concepts like Entropy, Gini Impurity, and Information Gain, which play a crucial role in deciding the best splits.

We’ll then cover how Decision Trees handle numerical features during the splitting process and the important concepts of pre-pruning and post-pruning, which help optimize tree growth and prevent overfitting. Finally, we will explore the mathematical intuition behind Decision Tree regressors, which apply similar principles for continuous target variables.

Table of Contents:

  1. Introduction to Decision Trees
  2. Mathematical Intuition of Decision Tree Classifier
  3. Mathematical Intuition of Decision Tree Regressor
  4. Conclusion

Decision Trees

Decision Trees are a crucial machine learning algorithm, forming the foundation for many ensemble models such as Random Forest and XGBoost. They are highly versatile, capable of solving both classification and regression problems.

There are two main types of decision trees:

  1. ID3 (Iterative Dichotomiser 3): Builds a tree using Entropy and Information Gain; a node can branch into more than two children, one per category of the chosen feature.
  2. CART (Classification and Regression Trees): Widely used in practice, particularly in libraries like Scikit-learn, CART creates a binary tree, meaning every decision node has at most two child nodes.

The decision tree algorithm operates similarly to a sequence of “if-else” conditions, systematically splitting the dataset to make predictions or classifications.
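
To make the if-else analogy concrete, here is a minimal sketch (not from the original article) that fits a CART classifier with Scikit-learn on the Iris dataset and prints the learned splits as nested threshold rules:

```python
# Minimal sketch: a CART tree learned by Scikit-learn, displayed as if/else-style rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(data.data, data.target)

# export_text renders each decision node as a threshold condition,
# which reads like a chain of nested "if-else" statements.
print(export_text(clf, feature_names=data.feature_names))
```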

Mathematical Intuition of Decision Tree Classifier

Consider a play tennis dataset, where independent features like Outlook, Temperature, Humidity, and Wind are provided, and the goal is to predict whether a person will play tennis (Yes or No) based on new, unseen values of these independent features.

This is a binary classification problem, and we can solve it using a decision tree classifier.

For example, let’s consider Outlook as the root node, with its categories—Sunny, Overcast, and Rainy—as the child nodes:

  • For Sunny, there are 2 Yes outcomes and 3 No outcomes.
  • For Overcast, there are 4 Yes and 0 No outcomes.
  • For Rainy, there are 3 Yes and 2 No outcomes.

Since Overcast has only Yes outcomes, it's a pure split, so it becomes a leaf node. However, Sunny and Rainy are impure splits, meaning they will be further split until reaching leaf nodes (pure outcomes).

The decision between pure splits and impure splits is determined mathematically using concepts like Entropy and Gini Impurity. To decide which feature to split on, we use Information Gain, which helps identify the feature that results in the most significant reduction in impurity.

Entropy and Gini Impurity

In decision trees, to determine whether a split is pure or impure, we measure the impurity using Entropy and Gini Impurity. These metrics help us decide how “good” a split is and which feature to split on during the construction of the tree.

1. Entropy

Entropy measures the randomness or impurity in a dataset. The formula for Entropy is:

H(S) = −Σᵢ pᵢ log₂(pᵢ)

where pᵢ is the proportion of samples in the node that belong to class i.

For binary classification, the range of entropy is:

  • 0 means the split is pure (all data belongs to a single class).
  • 1 means the split is maximally impure (equal distribution between classes).

The relationship between probabilities and entropy can be visualized as a curve. Entropy reaches its maximum value when the class distribution is even (i.e., 50% positive and 50% negative outcomes).
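
To make these numbers concrete, the sketch below (illustrative, reusing the node counts from the Outlook example above) computes entropy for a few class distributions:

```python
import numpy as np

def entropy(counts):
    """Entropy in bits for a list of class counts, e.g. [2, 3] for 2 Yes / 3 No."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]  # 0 * log2(0) is treated as 0, so drop zero-probability classes
    return float(-np.sum(p * np.log2(p)))

print(entropy([2, 3]))  # Sunny: 2 Yes, 3 No      -> ~0.971 (impure)
print(entropy([4, 0]))  # Overcast: 4 Yes, 0 No   -> 0.0    (pure split)
print(entropy([5, 5]))  # even 50/50 distribution -> 1.0    (maximum impurity)
```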

2. Gini Impurity

Gini Impurity is another metric used to measure how “mixed” the classes are in a node. The formula for Gini Impurity is:

G(S) = 1 − Σᵢ pᵢ²

For binary classification, where p is the proportion of the positive class, it simplifies to:

G(S) = 1 − p² − (1 − p)² = 2p(1 − p)

  • 0 indicates a pure split.
  • 0.5 indicates an impure split, meaning both classes are equally mixed.

Gini Impurity is often preferred for larger datasets because it avoids computing logarithms and is therefore slightly cheaper than entropy; entropy is typically reserved for smaller datasets where this computational cost is not a concern. In practice, the two criteria usually lead to very similar trees.
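
A similar sketch for Gini Impurity, again using the Outlook node counts as illustrative inputs:

```python
import numpy as np

def gini(counts):
    """Gini Impurity for a list of class counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return float(1.0 - np.sum(p ** 2))

print(gini([2, 3]))  # Sunny: 1 - (0.4^2 + 0.6^2) = 0.48
print(gini([4, 0]))  # Overcast: pure node        -> 0.0
print(gini([5, 5]))  # evenly mixed               -> 0.5 (maximum for two classes)
```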

Information Gain

Once we calculate the impurity for a potential split, we need to choose the feature that gives us the best split. This is done using Information Gain, which measures the reduction in impurity after a split.

For example, at the root of the Play Tennis dataset we calculate the Information Gain for each candidate feature (Outlook, Temperature, Humidity, and Wind) and choose the one that results in the largest reduction in entropy.

The formula for Information Gain is:

IG(S, f) = H(S) − Σᵥ (|Sᵥ| / |S|) · H(Sᵥ)

where S is the set of samples at the parent node and Sᵥ is the subset of S for which feature f takes the value v.

The feature with the highest information gain is selected for the split.
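
The sketch below works this out for the Outlook split, assuming the standard 14-row Play Tennis dataset with 9 Yes and 5 No outcomes overall (an assumption, since the full table is not reproduced here):

```python
import numpy as np

def entropy(counts):
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

parent = [9, 5]                          # full dataset: 9 Yes, 5 No
children = [[2, 3], [4, 0], [3, 2]]      # Sunny, Overcast, Rainy after splitting on Outlook

n = sum(parent)
weighted_child_entropy = sum(sum(c) / n * entropy(c) for c in children)
information_gain = entropy(parent) - weighted_child_entropy
print(round(information_gain, 3))        # ~0.247
```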

Decision Tree Split for Numerical Features

In Decision Trees, splitting on numerical features follows a different approach compared to categorical features. The goal is to determine a threshold that best separates the target variable (e.g., Yes/No). The process is as follows:

  1. Sort the Numerical Feature:
    First, we sort the values of the numerical feature, say f1.
  2. Generate Candidate Splits:
    For every unique value in f1, we create a potential split (threshold) where all values less than or equal to the threshold are assigned to the left node, and the others are assigned to the right node.
  3. Calculate Information Gain:
    For each threshold, we calculate the Information Gain or Gini Impurity for the resulting split.
  4. Select the Optimal Threshold:
    The threshold that provides the maximum information gain (i.e., the best split) is selected, and the data is split accordingly.

However, this process can be computationally expensive for large datasets (e.g., millions of records) because we need to evaluate all possible thresholds.
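
A rough sketch of this threshold search (toy data, weighted Gini Impurity as the split criterion, and midpoints between consecutive distinct values as candidate thresholds) might look like this:

```python
import numpy as np

def gini(y):
    """Gini Impurity of a label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_threshold(f1, y):
    """Return the threshold on f1 with the lowest weighted child impurity."""
    best_t, best_impurity = None, np.inf
    values = np.unique(f1)
    for v_low, v_high in zip(values[:-1], values[1:]):
        t = (v_low + v_high) / 2.0            # candidate threshold between two distinct values
        left, right = y[f1 <= t], y[f1 > t]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if weighted < best_impurity:
            best_t, best_impurity = t, weighted
    return best_t, best_impurity

# Toy data: the best threshold (37.5) separates the two classes perfectly.
f1 = np.array([25.0, 30.0, 35.0, 40.0, 45.0, 50.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_threshold(f1, y))  # -> (37.5, 0.0)
```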

Post-pruning and Pre-pruning

When a decision tree is built to its full depth, it often overfits to the training data. To combat overfitting, two techniques are used: post-pruning and pre-pruning.

1. Post-pruning (Cost Complexity Pruning)

Post-pruning involves first allowing the decision tree to grow fully, and then removing parts of the tree that do not improve its performance.

How it works:

  1. Grow the Tree Fully:
    The decision tree is initially constructed without any constraints, allowing it to overfit on the training data.
  2. Evaluate Node Importance:
    The tree is then evaluated to identify nodes and subtrees that do not contribute significantly to the accuracy of the model.
  3. Prune Subtrees:
    Nodes below a certain depth or that do not add significant value to the model are pruned, turning them into leaf nodes. For example, if a node has mostly positive outcomes (e.g., 90% “Yes” and 10% “No”), further splitting may not improve accuracy enough to justify the added complexity. In such cases, the subtree is pruned, leaving the current node as a leaf.
  4. Simplify the Tree:
    This reduces the complexity of the tree and lowers the risk of overfitting while maintaining a good balance of accuracy.

This method is especially useful for smaller datasets where overfitting is more common.
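
In Scikit-learn, post-pruning is exposed as cost complexity pruning via the ccp_alpha parameter and the cost_complexity_pruning_path method. A minimal sketch (the dataset and the alpha value are illustrative choices, not recommendations):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 1. Grow the tree fully (no constraints), then inspect the pruning path.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
print("candidate alphas:", path.ccp_alphas[:5])

# 2. Refit with a non-zero ccp_alpha: larger values prune more aggressively.
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_train, y_train)
print("depth (full vs pruned):", full_tree.get_depth(), pruned.get_depth())
print("test accuracy (pruned):", pruned.score(X_test, y_test))
```

In practice, ccp_alpha is usually chosen by cross-validating over the candidate alphas returned by the pruning path.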

2. Pre-pruning

In pre-pruning, hyperparameters such as max_depth and max_features are set before the tree is fully constructed to limit its growth.

  • max_depth: Limits the maximum depth of the tree.
  • max_features: Restricts the number of features considered for splitting at each node.

This technique reduces the risk of overfitting by preventing the tree from growing too deep and capturing noise in the data.
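
For example, a pre-pruned classifier might be configured like this (the dataset and hyperparameter values are illustrative, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Pre-pruning: constrain the tree while it is being built.
clf = DecisionTreeClassifier(
    max_depth=4,          # never grow deeper than 4 levels
    max_features="sqrt",  # consider only a random subset of features at each split
    min_samples_leaf=5,   # every leaf must keep at least 5 training samples
    random_state=0,
).fit(X, y)

print("depth:", clf.get_depth(), "leaves:", clf.get_n_leaves())
```

Other common pre-pruning controls in Scikit-learn include min_samples_split and min_impurity_decrease.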

Decision Tree Regressor

A decision tree regressor works much like the numerical-feature splitting described above for classification, but its goal is to predict continuous values of the dependent variable y for unseen values of independent features such as f1 and f2.

How it works:

  1. Splitting the Data:
    Each value of f1 is considered as a candidate threshold. All the values of y corresponding to f1 values less than or equal to the threshold are placed in the left node, while those greater are placed in the right node.
  2. Testing Different Splits:
    For various threshold values of f1, different splits are created. However, unlike classification, information gain cannot be used for selecting the optimal split in regression.

Variance Reduction

To choose the best split in a decision tree regressor, we use variance reduction as a criterion to minimize the variance in the target variable y after splitting.

  • Variance:

    Var(S) = (1/n) Σᵢ (yᵢ − ȳ)²

    where ȳ is the mean of the target values yᵢ in the node and n is the number of samples in that node.

  • Variance Reduction:

    Variance Reduction = Var(parent) − Σ (n_child / n_parent) · Var(child)

The split with the maximum variance reduction is chosen to ensure the model makes more accurate predictions by minimizing the variance within each node.
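
A small sketch of this criterion with toy numbers (not from the article):

```python
import numpy as np

def variance_reduction(y_parent, y_left, y_right):
    """Variance of the parent node minus the weighted variance of its children."""
    n = len(y_parent)
    weighted_child_var = (len(y_left) / n) * np.var(y_left) + (len(y_right) / n) * np.var(y_right)
    return np.var(y_parent) - weighted_child_var

# Toy continuous target: a split that separates the low and high y values well
# removes almost all of the variance.
y = np.array([10.0, 12.0, 11.0, 30.0, 32.0, 31.0])
print(variance_reduction(y, y[:3], y[3:]))  # -> 100.0
```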

Predicting Outcomes

For unseen values of f1 and f2, the decision tree regressor outputs the average of the y_i values at the leaf node where the new data point ends up after the splits.

This mechanism helps the model predict continuous variables effectively by reducing the variance at each step and simplifying the overall structure.
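
A brief sketch with Scikit-learn's DecisionTreeRegressor (toy data) showing that each prediction is the average of the training targets in the corresponding leaf:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy training data: one feature f1 and a continuous target y.
f1 = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([5.0, 6.0, 7.0, 50.0, 52.0, 54.0])

reg = DecisionTreeRegressor(max_depth=1, random_state=0).fit(f1, y)

# A depth-1 tree has two leaves; each prediction is the mean of the training
# y values in that leaf: (5 + 6 + 7) / 3 = 6 and (50 + 52 + 54) / 3 = 52.
print(reg.predict([[2.5], [11.5]]))  # -> [ 6. 52.]
```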

Conclusion

Decision Trees are a versatile and powerful machine learning algorithm capable of solving both classification and regression problems. By using a tree-like structure, decision trees systematically split data based on certain criteria, such as Entropy, Gini Impurity, or Variance Reduction, to make decisions. For classification tasks, the tree helps classify unseen data points based on learned splits, while for regression, it predicts continuous values by averaging the outcomes in the leaf nodes.

However, decision trees can often overfit, particularly when grown to maximum depth, making pruning techniques like post-pruning and pre-pruning essential for maintaining model generalization. Moreover, the algorithm has certain limitations in handling large datasets efficiently, which can be mitigated through advanced techniques or variants like Random Forests.

Understanding the inner workings of decision trees, from the mathematical intuition behind splits to the concepts of impurity and pruning, equips us with the knowledge needed to effectively apply this algorithm across a range of machine learning tasks.
