ML Part 13: Classification — Decision Tree

Avicsebooks
14 min read · Jun 19, 2024

Decision trees are a popular and intuitive method for both classification and regression tasks in machine learning. They model decisions and their possible consequences as a tree-like structure of nodes and branches.

Decision tree

What is a Decision Tree?

A decision tree is a flowchart-like structure where:

Decision tree and its parts
  • Internal nodes represent features (or attributes).
  • Branches represent decision rules.
  • Leaf nodes represent the outcome (or class labels).

In a classification context, the goal is to split the data into subsets that contain instances with similar values (homogeneity) for the target variable.

Components of a Decision Tree

  1. Root Node: The top node that represents the entire dataset. This node is split into two or more homogeneous sets.
  2. Internal Nodes: These nodes represent the features of the dataset and are used to make decisions.
  3. Leaf Nodes: These are the terminal nodes that represent the final class label or outcome.
  4. Branches: These connect nodes and represent the decision rules that split the data.

How Decision Trees Work

  1. Selecting the Best Feature to Split: The tree-building process involves choosing the best feature to split the data at each step. Common criteria include:
  • Gini Impurity: Measures how often a randomly chosen element of the node would be misclassified if it were labeled according to the node's class distribution.
  • Entropy (Information Gain): Measures the impurity of an arbitrary collection of examples.
  • Variance Reduction (for regression tasks).

2. Splitting the Data: The chosen feature divides the dataset into subsets. The goal is to maximize the homogeneity of the resulting subsets.

3. Recursive Splitting: This process is repeated recursively for each subset. At each step, the best feature is selected, and the data is split again.

4. Stopping Criteria: The recursion stops when one of the following conditions is met:

  • All instances in a subset belong to the same class.
  • There are no remaining features to split.
  • A predefined depth limit is reached.

5. Assigning Class Labels: Once the tree is fully built, each leaf node is assigned a class label based on the majority class of instances within that node.
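
To make the procedure concrete, here is a minimal, self-contained Python sketch of the recursive algorithm described above, using Gini impurity and simple threshold splits on numeric features. The function names (gini, best_split, build_tree) are my own, not from any library, and the sketch omits many practical details (categorical features, pruning, tie-breaking).

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y):
    """Return the (feature_index, threshold) pair with the lowest weighted Gini."""
    best, best_score, n = None, float("inf"), len(y)
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [y[i] for i in range(n) if X[i][j] <= t]
            right = [y[i] for i in range(n) if X[i][j] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (j, t), score
    return best

def build_tree(X, y, depth=0, max_depth=3):
    """Recursively split until a node is pure, a depth limit is hit,
    or no valid split remains; leaves hold the majority class."""
    split = best_split(X, y)
    if len(set(y)) == 1 or depth == max_depth or split is None:
        return {"leaf": Counter(y).most_common(1)[0][0]}
    j, t = split
    left = [i for i in range(len(y)) if X[i][j] <= t]
    right = [i for i in range(len(y)) if X[i][j] > t]
    return {
        "feature": j, "threshold": t,
        "left": build_tree([X[i] for i in left], [y[i] for i in left], depth + 1, max_depth),
        "right": build_tree([X[i] for i in right], [y[i] for i in right], depth + 1, max_depth),
    }

# Tiny illustrative run: one threshold cleanly separates the two classes.
X = [[2.0], [2.5], [3.0], [7.0], [7.5], [8.0]]
y = ["No", "No", "No", "Yes", "Yes", "Yes"]
print(build_tree(X, y))
```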

Example

Consider a simple dataset with two features: “Outlook” and “Temperature”, and a binary target variable: “Play Tennis” (Yes/No).

Example Data

Step-by-Step Construction:

  1. Root Node Selection: Calculate the best feature to split the data (e.g., using Gini impurity or entropy).
  2. First Split: Suppose “Outlook” is chosen. The data is split into three branches: “Sunny”, “Overcast”, and “Rainy”.
  3. Second Split: For each subset (Sunny, Overcast, Rainy), choose the next best feature to split.
  4. Stopping and Label Assignment: Continue until leaf nodes are pure (all instances have the same class) or another stopping criterion is met.
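
The same walk-through can be reproduced with scikit-learn. The rows below are hypothetical stand-in data (the article's table is not reproduced here), so the splits the tree actually learns may differ from the hand-worked order above.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical stand-in data with the two features from the example.
data = pd.DataFrame({
    "Outlook":     ["Sunny", "Sunny", "Overcast", "Rainy", "Rainy", "Overcast", "Sunny", "Rainy"],
    "Temperature": ["Hot",   "Mild",  "Hot",      "Mild",  "Cool",  "Cool",     "Cool",  "Hot"],
    "PlayTennis":  ["No",    "No",    "Yes",      "Yes",   "Yes",   "Yes",      "Yes",   "No"],
})

# One-hot encode the categorical features; scikit-learn trees expect numeric input.
X = pd.get_dummies(data[["Outlook", "Temperature"]])
y = data["PlayTennis"]

clf = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
print(export_text(clf, feature_names=list(X.columns)))  # textual view of the learned splits
```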

Advantages of Decision Trees

  1. Interpretability: Easy to understand and visualize; the decision-making process is transparent and can be followed step by step through the tree.
  2. No Feature Scaling Required: Decision trees do not require normalization or standardization of features.
  3. Handle Both Numerical and Categorical Data: Can work with both types of data without much preprocessing.
  4. Non-Parametric: They do not assume a particular distribution of data, making them versatile.

Disadvantages of Decision Trees

  1. Overfitting: Decision trees can easily overfit, especially if they are too deep. This happens when the tree captures noise in the data rather than the underlying pattern.
  2. Unstable: Small changes in the data can result in a completely different tree structure.
  3. Bias Towards Features with More Levels: Features with many levels might dominate the splits.
  4. Complexity: Large trees can become very complex and difficult to interpret.

Mitigating Disadvantages

  1. Pruning: Reducing the size of the tree by removing nodes that provide little power in predicting the target variable can help combat overfitting.
  2. Ensemble Methods: Techniques like Random Forests and Gradient Boosting Trees combine multiple decision trees to improve performance and reduce overfitting.
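
As a concrete illustration of pruning, the sketch below uses scikit-learn's cost-complexity pruning (the ccp_alpha parameter) and keeps the pruning strength that scores best on held-out data. The dataset and the search strategy are illustrative choices, not the only way to prune.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Candidate pruning strengths computed from the fully grown training tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    alpha = max(alpha, 0.0)  # guard against tiny negative values from floating point
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = tree.score(X_test, y_test)  # held-out accuracy
    if score > best_score:
        best_alpha, best_score = alpha, score

print(f"best ccp_alpha = {best_alpha:.4f}, held-out accuracy = {best_score:.3f}")
```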

Finding the best split of data for a decision tree

Finding the best split of data for a decision tree is crucial as it determines how well the tree can classify or predict outcomes. Different methods for finding the best split focus on maximizing the homogeneity of resulting nodes. Here are the most common methods, along with their pros, cons, and appropriate use cases:

1. Gini Impurity

Description: Measures the impurity of a node based on the probability of misclassifying a randomly chosen element. It ranges from 0 (perfect purity) to 0.5 (maximum impurity for a binary classification).

Gini(t) = 1 − ∑ᵢ pᵢ²

Pros:

  • Computationally efficient, as it does not involve logarithms.
  • Suitable for binary and multi-class classification problems.

Cons:

  • Less interpretable compared to entropy-based methods.
  • Can be biased towards splits that create more branches.

When to Use:

  • When you need an efficient and straightforward measure for binary or multi-class classification.
  • Commonly used in implementations like the CART (Classification and Regression Trees) algorithm.
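
A minimal NumPy sketch of the Gini computation (the helper name is my own):

```python
import numpy as np

def gini_impurity(labels):
    """Gini(t) = 1 - sum_i p_i^2, computed from the class proportions in a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity(["A"] * 50 + ["B"] * 50))  # 0.5 — maximum impurity for two classes
print(gini_impurity(["A"] * 100))              # 0.0 — a pure node
```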

2. Entropy (Information Gain)

Description: Measures the impurity of a node using the concept of entropy from information theory. Information Gain is the reduction in entropy achieved by partitioning the data according to a given attribute.

Entropy(t) = −∑ᵢ pᵢ log₂(pᵢ)

Information Gain = Entropy(parent) − ∑ₖ (|tₖ| / |t|) · Entropy(tₖ)

Pros:

  • Provides a clear measure of purity.
  • Effective at identifying informative splits, since it accounts for the full class distribution.

Cons:

  • Computationally more expensive due to logarithmic calculations.
  • Can sometimes be less effective on noisy data.

When to Use:

  • When you need a theoretically sound measure of impurity and have computational resources.
  • Commonly used in algorithms like ID3 and C4.5.
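
A corresponding sketch for entropy and information gain, again with helper names of my own; the 50/50 node and the 40/10 + 10/40 split reappear in the worked example later in this article.

```python
import numpy as np

def entropy(labels):
    """Entropy(t) = -sum_i p_i log2(p_i)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

parent = ["Yes"] * 50 + ["No"] * 50
children = [["Yes"] * 40 + ["No"] * 10, ["Yes"] * 10 + ["No"] * 40]
print(information_gain(parent, children))  # ≈ 0.28
```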

3. Gain Ratio

Description: Addresses the bias of Information Gain towards attributes with many values by normalizing the Information Gain with a Split Information value.

Gain Ratio = Information Gain / Split Information, where Split Information = −∑ₖ (|tₖ| / |t|) log₂(|tₖ| / |t|)

Pros:

  • Reduces the bias towards multi-valued attributes.
  • Provides a more balanced criterion for split selection.

Cons:

  • Can still be computationally intensive.
  • May overcompensate, preferring attributes whose splits have low split information (very uneven branch sizes) over larger, purer splits.

When to Use:

  • When dealing with attributes with many distinct values.
  • Commonly used in C4.5 algorithm.
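
A sketch of the gain ratio under the definition above (information gain divided by the split information, i.e. the entropy of the branch sizes); names are my own.

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_ratio(parent, children):
    n = len(parent)
    weights = np.array([len(c) / n for c in children])
    info_gain = entropy(parent) - np.sum([w * entropy(c) for w, c in zip(weights, children)])
    split_info = -np.sum(weights * np.log2(weights))  # penalizes many-way splits
    return info_gain / split_info

parent = ["Yes"] * 50 + ["No"] * 50
children = [["Yes"] * 40 + ["No"] * 10, ["Yes"] * 10 + ["No"] * 40]
print(gain_ratio(parent, children))  # ≈ 0.28 here, since the split information of two equal halves is 1
```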

4. Chi-Square (Chi-Squared Test)

Description: Uses statistical chi-squared tests to determine if the observed frequency distribution of the data deviates from the expected distribution under the null hypothesis.

χ² = ∑ (O − E)² / E, where O and E are the observed and expected class frequencies in each branch

Pros:

  • Good for categorical data.
  • Provides a statistical measure for split effectiveness.

Cons:

  • Not suitable for continuous attributes without discretization.
  • Can be computationally expensive.

When to Use:

  • When dealing with categorical data and you want a statistical test for splits.
  • Commonly used in CHAID (Chi-squared Automatic Interaction Detection) algorithm.
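
A sketch of the chi-squared check on a candidate categorical split, using scipy.stats.chi2_contingency; the contingency table below is hypothetical.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: branches of the candidate split; columns: class counts (Yes, No).
contingency = np.array([
    [40, 10],  # branch 1
    [10, 40],  # branch 2
])

chi2, p_value, dof, expected = chi2_contingency(contingency)
print(f"chi2 = {chi2:.2f}, p-value = {p_value:.4f}")  # a small p-value suggests an informative split
```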

5. Mean Squared Error (for Regression Trees)

Description: Measures the variance of the target variable within the nodes. The goal is to minimize the mean squared error (MSE) of the target variable within each node.

MSE(t) = (1 / Nₜ) ∑ᵢ (yᵢ − ȳₜ)², where ȳₜ is the mean target value in node t

Pros:

  • Directly measures the variance, suitable for regression tasks.
  • Simple and interpretable.

Cons:

  • Sensitive to outliers.
  • Not suitable for classification problems.

When to Use:

  • Specifically for regression tasks where the goal is to predict continuous outcomes.
  • Commonly used in CART for regression trees.
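
A sketch of scoring a candidate regression split by the size-weighted MSE of the two child nodes (lower is better); names are my own.

```python
import numpy as np

def node_mse(y):
    """Mean squared error around the node mean — i.e. the node's variance."""
    y = np.asarray(y, dtype=float)
    return float(np.mean((y - y.mean()) ** 2))

def weighted_split_mse(left_y, right_y):
    n = len(left_y) + len(right_y)
    return (len(left_y) * node_mse(left_y) + len(right_y) * node_mse(right_y)) / n

print(weighted_split_mse([1.0, 1.2, 0.9], [5.1, 4.8, 5.3]))  # small value: the split separates the targets well
```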

6. Reduction in Variance (for Regression Trees)

Description: Similar to MSE, but focuses on the reduction of variance after a split.

Variance Reduction = Var(parent) − ∑ₖ (Nₖ / N) · Var(childₖ)

Pros:

  • Direct measure of improvement after splitting.
  • Suitable for regression tasks.

Cons:

  • Also sensitive to outliers.
  • Less effective for classification tasks.

When to Use:

  • When you want to measure how much a split reduces overall variance in regression problems.
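
And a matching sketch of reduction in variance: the parent node's variance minus the size-weighted variance of its children (function name is my own).

```python
import numpy as np

def variance_reduction(parent_y, left_y, right_y):
    parent_y, left_y, right_y = (np.asarray(a, dtype=float) for a in (parent_y, left_y, right_y))
    weighted_children = (len(left_y) * left_y.var() + len(right_y) * right_y.var()) / len(parent_y)
    return parent_y.var() - weighted_children

y = np.array([1.0, 1.2, 0.9, 5.1, 4.8, 5.3])
print(variance_reduction(y, y[:3], y[3:]))  # a large reduction indicates a good split
```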

Choosing the Best Method

  • Gini Impurity and Entropy: Use when dealing with classification tasks. Gini is computationally simpler, while entropy provides a more theoretically sound measure.
  • Gain Ratio: Use when you have attributes with many distinct values to reduce bias.
  • Chi-Square: Use for categorical data in classification tasks, especially when a statistical basis for splits is desired.
  • Mean Squared Error and Reduction in Variance: Use for regression tasks to measure the quality of splits based on the reduction of variance or error.

Stopping criteria for decision trees

Stopping criteria for decision trees are essential to prevent overfitting, manage model complexity, and improve performance.

1. Maximum Depth

Description: Limit the depth of the tree to a maximum number of levels.

Pros:

  • Simple to implement.
  • Controls overfitting by limiting the complexity of the model.

Cons:

  • Choosing the optimal depth can be tricky and might require experimentation.
  • Too shallow a tree may underfit the data.

When to Use:

  • When you need to ensure the model is not overly complex.
  • When dealing with large datasets to reduce computational cost.
  • In practical applications, using cross-validation to find the optimal depth is recommended.

2. Minimum Samples per Leaf

Description: Set a minimum number of samples that a node must have to be considered for further splitting.

Pros:

  • Prevents splits that result in very small leaf nodes, reducing overfitting.
  • Helps in creating more balanced trees.

Cons:

  • Setting too high a threshold can lead to underfitting.
  • Needs tuning to find an appropriate balance.

When to Use:

  • When the dataset is noisy or when there are outliers.
  • Suitable for datasets with imbalanced classes.

3. Minimum Samples per Split

Description: Set a minimum number of samples that a node must have before it can be split.

Pros:

  • Similar to minimum samples per leaf but focuses on nodes eligible for splitting.
  • Reduces the risk of creating too many small nodes.

Cons:

  • May require parameter tuning to find the optimal value.
  • Can lead to underfitting if set too high.

When to Use:

  • When the dataset is large, and you want to ensure splits are meaningful.
  • When you want to control the growth of the tree more finely than with leaf size alone.

4. Minimum Information Gain / Minimum Decrease in Impurity

Description: Splitting only occurs if it results in a certain minimum improvement in the chosen impurity metric (e.g., Gini impurity, entropy).

Pros:

  • Directly controls the quality of splits.
  • Helps in ensuring only splits that significantly improve the model are made.

Cons:

  • Requires setting a threshold, which can be non-intuitive.
  • May require experimentation to find the best threshold value.

When to Use:

  • When you want to ensure that each split significantly contributes to the model’s performance.
  • Useful in reducing unnecessary splits, especially in noisy data.

5. Maximum Number of Nodes

Description: Limit the total number of nodes in the tree.

Pros:

  • Directly controls the size of the tree.
  • Ensures that the tree remains manageable and interpretable.

Cons:

  • Can be difficult to determine the optimal number of nodes.
  • May lead to suboptimal splits if not set appropriately.

When to Use:

  • When interpretability is crucial, and you need a simple tree structure.
  • When you need to control model size due to memory or processing constraints.

6. Early Stopping with Validation Set

Description: Use a validation set to monitor performance and stop splitting when performance on the validation set no longer improves.

Pros:

  • Data-driven method that adapts to the specific dataset.
  • Helps in avoiding overfitting by stopping at the optimal point.

Cons:

  • Requires holding out a validation set, which reduces the training data size.
  • May be computationally expensive due to repeated evaluations.

When to Use:

  • When you have enough data to spare a validation set.
  • In scenarios where model performance monitoring is feasible and desired.

Summary of When to Use Each Strategy

  • Maximum Depth: Use when you need a simple, interpretable model, and have limited computational resources.
  • Minimum Samples per Leaf: Use in noisy datasets to ensure each leaf has enough data for robust predictions.
  • Minimum Samples per Split: Use when you want to prevent splits that result in small child nodes and ensure meaningful splits.
  • Minimum Information Gain: Use when you want to focus on the quality of splits and ensure each split significantly improves the model.
  • Maximum Number of Nodes: Use when you need to explicitly control the size and complexity of the tree.
  • Early Stopping with Validation Set: Use when you have ample data and want a data-driven method to prevent overfitting.
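
These stopping criteria map naturally onto scikit-learn's DecisionTreeClassifier hyperparameters. The sketch below sets the first five on an illustrative built-in dataset and then uses a validation set to pick max_depth (criterion 6); the parameter values are arbitrary examples, not recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

constrained = DecisionTreeClassifier(
    max_depth=5,                 # 1. maximum depth
    min_samples_leaf=10,         # 2. minimum samples per leaf
    min_samples_split=20,        # 3. minimum samples per split
    min_impurity_decrease=0.01,  # 4. minimum decrease in impurity
    max_leaf_nodes=16,           # 5. cap on leaf nodes (closest analogue to a node limit)
    random_state=0,
).fit(X_train, y_train)
print("constrained tree depth:", constrained.get_depth())

# 6. Early stopping with a validation set: grow trees of increasing depth
# and keep the depth that scores best on held-out data.
best_depth = max(
    range(1, 11),
    key=lambda d: DecisionTreeClassifier(max_depth=d, random_state=0)
    .fit(X_train, y_train)
    .score(X_val, y_val),
)
print("best max_depth on the validation set:", best_depth)
```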

Quality of a split at each node

In decision trees, various criteria can be used to evaluate the quality of a split at each node. Three common measures are classification error, entropy, and information gain. Here’s an explanation of each, along with their differences, pros, cons, and when to use them.

Classification Error

Description: Classification error measures the misclassification rate, i.e., the proportion of incorrectly classified instances in a node.

Error(t) = 1 − maxᵢ(pᵢ)

where pᵢ is the proportion of instances in class i at node t.

Pros:

  • Simple and intuitive.
  • Easy to calculate and interpret.

Cons:

  • Less sensitive to changes in the node’s class distribution compared to entropy and Gini impurity.
  • Tends to be less effective in choosing informative splits.

When to Use:

  • In simple, intuitive models where interpretability is more important than fine-tuning performance.
  • For quick and rough estimates of node purity.

Entropy (Information Entropy)

Description: Entropy measures the impurity of a node by quantifying the unpredictability or disorder. It is calculated as:

Entropy(t) = −∑ᵢ pᵢ log₂(pᵢ)

where pᵢ is the proportion of instances in class i at node t.

Information Gain: Information gain uses entropy to measure the effectiveness of a split. It is defined as the reduction in entropy before and after a split:

Information Gain = Entropy(parent) − ∑ₖ (|tₖ| / |t|) · Entropy(tₖ)

Pros:

  • Theoretically sound and widely used in information theory.
  • Effective in creating balanced trees by choosing informative splits.

Cons:

  • Computationally more expensive due to logarithmic calculations.
  • May require more computation for large datasets.

When to Use:

  • When a theoretically robust measure of impurity is needed.
  • Commonly used in algorithms like ID3 and C4.5.

Gini Impurity

Description: Gini impurity measures the impurity of a node by calculating the probability of misclassifying a randomly chosen element. It ranges from 0 (perfect purity) to 0.5 (maximum impurity for a binary classification).

Gini(t) = 1 − ∑ᵢ pᵢ²

where pᵢ is the proportion of instances in class i at node t.

Pros:

  • Computationally efficient, as it does not involve logarithms.
  • Suitable for binary and multi-class classification problems.
  • Often leads to similar results as entropy but with less computational cost.

Cons:

  • Less interpretable compared to entropy-based methods.
  • Can be biased towards splits that create more branches.

When to Use:

  • When you need an efficient and straightforward measure for binary or multi-class classification.
  • Commonly used in implementations like the CART (Classification and Regression Trees) algorithm.

Comparison: Classification Error vs. Entropy vs. Information Gain

Sensitivity to Node Distribution:

  • Classification Error is less sensitive to the distribution of classes within a node. It only considers the most frequent class and ignores the rest.
  • Entropy considers all classes and their proportions, making it more sensitive to changes in the distribution of classes within a node.

Calculation Complexity:

  • Classification Error is simple and quick to compute.
  • Entropy requires logarithmic calculations, making it more computationally intensive.

Effectiveness in Splitting:

  • Classification Error may lead to less effective splits as it does not capture the detailed class distribution.
  • Entropy (and the derived Information Gain) tends to be more effective in finding informative splits, leading to better-performing trees.

Example Comparison

Consider a node with the following class distribution:

  • Class A: 50 instances
  • Class B: 50 instances

Classification Error: Error = 1 − max(0.5, 0.5) = 1 − 0.5 = 0.5

Entropy: Entropy = −(0.5 log₂ 0.5 + 0.5 log₂ 0.5) = −(0.5 × −1 + 0.5 × −1) = 1

If we split the node such that we now have:

Node 1:

  • Class A: 40 instances
  • Class B: 10 instances

Node 2:

  • Class A: 10 instances
  • Class B: 40 instances

Information Gain: Calculate the entropy before the split: Entropy(before) = 1

Calculate the entropy after the split: Entropy(Node 1) = −(0.8 log₂ 0.8 + 0.2 log₂ 0.2) ≈ 0.72

Entropy(Node 2) = −(0.2 log₂ 0.2 + 0.8 log₂ 0.8) ≈ 0.72

Weighted average entropy after the split: Entropy(after) = (50/100) × 0.72 + (50/100) × 0.72 = 0.72

Information Gain = 1 − 0.72 = 0.28

Gini Impurity: Calculate Gini impurity before the split:

Gini(before)=0.5

Calculate Gini impurity after the split:

Gini(Node 1) = 1 − (0.8² + 0.2²) = 1 − 0.64 − 0.04 = 0.32

Gini(Node 2) = 1 − (0.2² + 0.8²) = 1 − 0.04 − 0.64 = 0.32

Weighted average Gini impurity after the split: Gini(after) = (50/100) × 0.32 + (50/100) × 0.32 = 0.32

Reduction in Gini impurity:

Reduction in Gini = 0.5 − 0.32 = 0.18
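
The worked example above can be checked with a few lines of plain Python (standard library only):

```python
from math import log2

def entropy(p):
    return -sum(pi * log2(pi) for pi in p if pi > 0)

def gini(p):
    return 1 - sum(pi ** 2 for pi in p)

parent = [0.5, 0.5]  # 50 of class A, 50 of class B
child = [0.8, 0.2]   # both children are 40/10 splits, just mirrored

print(1 - max(parent))                              # classification error = 0.5
print(entropy(parent))                              # entropy before the split = 1.0
# Both children have the same entropy and the same size, so the weighted
# average after the split equals a single child's value.
print(round(entropy(child), 2))                     # 0.72
print(round(entropy(parent) - entropy(child), 2))   # information gain ≈ 0.28
print(round(gini(parent) - gini(child), 2))         # reduction in Gini = 0.18
```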

Conclusion

Classification Error:

  • Pros: Simple, fast.
  • Cons: Less sensitive to class distribution, less effective in creating informative splits.

Entropy (Information Gain):

  • Pros: Sensitive to class distribution, effective in choosing informative splits.
  • Cons: Computationally more intensive, can be slower on large datasets.

Gini Impurity:

  • Pros: Efficient, computationally less expensive, effective for binary and multi-class classification.
  • Cons: Less interpretable than entropy, can be biased towards splits creating more branches.

When to Use:

  • Use classification error for quick, rough models where computational efficiency is critical.
  • Use entropy (information gain) when accuracy and the quality of splits are more important, and computational resources are available.
  • Use Gini impurity when you want an efficient and effective measure, especially when computational resources are limited and you need a straightforward criterion.

Pros and Cons of Decision Tree

Pros of Decision Trees

  1. Easy to Understand and Interpret:
  • Intuitive: The structure of decision trees is straightforward to understand and visualize, even for non-experts.
  • Transparency: The logic behind the decisions made by the tree can be easily followed and explained.

2. Non-linear Relationships:

  • Captures Complex Interactions: Decision trees can capture non-linear relationships between features and the target variable without requiring transformation of the data.

3. No Need for Data Normalization:

  • Preprocessing: Decision trees do not require data normalization or scaling, simplifying the preprocessing steps.

4. Feature Selection:

  • Intrinsic: Decision trees perform implicit feature selection, as they only split on the most informative features.

5. Handles Both Numerical and Categorical Data:

  • Versatile: Decision trees can handle both types of data, making them versatile for various types of datasets.

6. Resistant to Outliers:

  • Robustness: Decision trees are relatively resistant to outliers since splits are based on the majority of the data rather than extreme values.

7. Missing Values Handling:

  • Flexibility: Some implementations can handle missing values by assigning them to the most frequent class or using surrogate splits.

Cons of Decision Trees

  1. Overfitting:
  • High Variance: Decision trees can easily overfit the training data, leading to poor generalization on unseen data. This is especially true for deep trees.

2. Instability:

  • Sensitivity to Data: Small changes in the data can result in completely different splits, leading to different tree structures.

3. Bias Towards Dominant Classes:

  • Imbalance: Decision trees can be biased towards classes that dominate the dataset. This can be mitigated using balanced criteria for splits.

4. Prone to Outliers:

  • Over-Sensitive: While relatively resistant, in some cases, decision trees can still be affected by outliers in the data.

5. Requires Pruning:

  • Complexity Control: To avoid overfitting, decision trees often require pruning, which involves removing parts of the tree that do not provide additional power in predicting target variables.

6. Limited to Axis-Parallel Splits:

  • Split Criteria: Decision trees split data based on axis-parallel splits (i.e., splits that are perpendicular to feature axes), which may not be optimal for some datasets with complex decision boundaries.

7. Computational Complexity:

  • Training Time: Training a decision tree can be computationally expensive, especially when dealing with a large number of features and instances.

8. No Guarantee of Global Optimal Solution:

  • Local Optima: The greedy algorithm used in decision trees to find splits does not guarantee a globally optimal solution. It only ensures a locally optimal split at each node.

9. Scalability Issues:

  • Large Datasets: Decision trees can become inefficient and slow to train on very large datasets.

Mitigating Cons

To mitigate some of the cons, ensemble methods like Random Forests or Gradient Boosting Trees are often used:

  • Random Forests: Mitigate overfitting and instability by averaging multiple decision trees.
  • Gradient Boosting Trees: Improve accuracy by sequentially building trees that correct the errors of previous trees.
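
A brief, hedged sketch of these two remedies in scikit-learn, compared against a single unpruned tree on an illustrative built-in dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    "single decision tree": DecisionTreeClassifier(random_state=0),
    "random forest (100 trees)": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validated accuracy
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```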
