Complete Math Intuition Behind Decision Tree

Anju Reddy K
9 min read · Jul 2, 2024


Hello guys, hope all of you are doing well. In this blog I am going to explain the decision tree classifier and the decision tree regressor. If you read it and understand the math behind it, you should be able to answer almost any question raised about decision trees. So, stay with me until the end of the blog…

Pre-requisites
Statistics and Machine Learning

Topics Covered

  1. Decision Tree Classifier
  2. Decision Tree Regressor

1. Decision Tree Classifier:

A decision tree is a popular and intuitive machine learning algorithm used to solve both classification and regression problems. It operates by recursively partitioning the data into subsets based on the values of different features. Each split in the tree is designed to maximize the purity of the resulting subsets, making it a powerful tool for both interpretability and predictive accuracy.

A decision tree classifier is a hierarchical structure in which each internal node represents a decision or feature test, each branch represents an outcome of that test, and each leaf node represents a class label (for classification) or a continuous value (for regression).

i) Root Node:

  • The topmost node in the decision tree.
  • Represents the entire dataset before any split.
  • It performs the first split, based on the feature that provides the highest information gain (or the lowest Gini impurity).

ii) Decision Node (or Internal Node):

  • A node in the decision tree that performs a split on a feature.
  • Each decision node tests a specific feature and directs the flow of data to the child nodes based on the outcome of that test.
  • Internal nodes might have two or more child nodes.

iii) Leaf Node (or Terminal Node):

  • The endpoints of a decision tree.
  • Represents a final decision or prediction.
  • In a classification tree, each leaf node assigns a class label based on the majority class of the training samples that reach that node.
  • In a regression tree, each leaf node assigns a continuous value (e.g., the mean or median) based on the target values of the training samples that reach that node.

Example:

Imagine you want to decide whether to play outside based on the weather. A decision tree helps you make this decision by asking a series of questions about the weather conditions.

i) Starting Point: You have data about different days.

  • Is it sunny?
  • Is it rainy?
  • Is it Windy?

ii) Decision-making process: You start with the first question: is it sunny?

  • If yes, you might decide to play outside.
  • If no, you move to the next question.

Is it rainy?

  • If yes, you decide not to play outside.
  • If no, you might ask if it is windy.
  • Depending on the answer, you decide to play outside or not.

This sequence of questions forms a decision tree. Each question creates a new split that divides the data into subsets, leading to a final decision (or prediction).

Classification table example:

Let’s create a simple table based on the weather data:

Here, each row represents a day's weather conditions (sunny, rainy, windy), and the last column records whether to play outside on that day.

Identify Pure and Impure Splits:

  • Pure Split: When all examples (or rows) at a split belong to the same category (e.g., all “Yes” or all “No” for playing outside).
  • Impure Split: When the examples (or rows) at a split belong to different categories (e.g., a mix of “Yes” and “No” for playing outside).

Choosing features to split:

To build a decision tree, we need a way to decide which feature to split on at each node. The main criteria are:

  • Entropy: Measures randomness or impurity in the data. Lower entropy means more organised data.
  • Information gain: Measures how much a feature (question) reduces entropy. Higher gain means the feature is good for splitting.
  • Gini Impurity: Measures impurity of data. Lower Gini indicates more purity.

Mathematical Intuition:

  • Entropy: Imagine you have a bag of balls of different colours. Low entropy means most of the balls are one colour (little need to split the tree further).
  • Information gain: How much knowing the answer to a question reduces our uncertainty (entropy).
  • Gini Index: Measures how often a randomly chosen element would be incorrectly labeled if it were labeled randomly according to the distribution of labels in the set (see the short Python sketch right after this list).
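If you want to check these measures in code, here is a small Python sketch. The helper functions below are purely illustrative (my own naming, no particular library assumed) and use the same label counts as the worked example that follows.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1 - sum((c / total) ** 2 for c in Counter(labels).values())

# The label distribution used in the worked example below: 2 "Yes", 1 "No", 1 "Maybe".
labels = ["Yes", "Yes", "No", "Maybe"]
print(entropy(labels))  # 1.5
print(gini(labels))     # 0.625
```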

Step-by-step calculation:

i) Calculate entropy of entire data (S):

Entropy measures the impurity or disorder in a set of data.

  • Number of “Yes” play outside = 2
  • Number of “No” play outside = 1
  • Number of “Maybe” = 1
  • Total examples = 4
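The entropy formula appears only as an image in the original post; written out, the standard definition that the symbols below refer to is (two-class form first, then the general multi-class form actually used in this three-class example):

```latex
\text{Entropy}(S) = -p_{+}\log_2(p_{+}) - p_{-}\log_2(p_{-})
\qquad\text{or, in general,}\qquad
\text{Entropy}(S) = -\sum_{i} p_i \log_2(p_i)
```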

Where:

  • p+ is the proportion of positive examples (e.g., the “Yes” class) in set S.
  • p- is the proportion of negative examples (e.g., the “No” class) in set S.
  • log2 denotes the logarithm base 2.
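Substituting the counts above (2 “Yes”, 1 “No”, 1 “Maybe” out of 4 examples):

```latex
\text{Entropy}(S)
= -\tfrac{2}{4}\log_2\tfrac{2}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4}
= 0.5 + 0.5 + 0.5 = 1.5
```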

So, the entropy of the entire dataset S is 1.5.

ii) Calculate information gain for each feature:
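The information gain formula also appears as an image in the original post; the standard definition that matches the symbols listed below is:

```latex
\text{Gain}(S, \text{feature})
= \text{Entropy}(S)
- \sum_{v \in \text{values}(\text{feature})} \frac{|S_v|}{|S|}\,\text{Entropy}(S_v)
```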

Where:

  • S is the current set of data.
  • feature is the feature (question) being considered for splitting.
  • values(feature) is the set of possible values of the feature.
  • Sv is the subset of S where the feature has value v.
  • |S| and |Sv| denote the number of examples in sets S and Sv.

Feature: Sunny

Feature: Rainy

Feature: Windy

Calculate information gain:

Now, calculate information gain for each feature.

Information Gain (S, Sunny)

Information Gain (S, Rainy)

Information Gain (S, Windy)
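The per-feature calculations above were shown as images in the original post and depend on the full weather table. As a stand-in, here is a small illustrative Python helper (my own naming, not from the article or any library) that computes the information gain of any feature given its values and the target labels:

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Information gain from splitting `labels` by `feature_values` (one value per example)."""
    total = len(labels)
    # Group the target labels by the value the feature takes on each example.
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    # Parent entropy minus the weighted entropy of the child subsets.
    weighted_child_entropy = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - weighted_child_entropy
```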

Conclusion:

Based on the calculations above, the information gain is highest for the feature “windy” (1.5), indicating that windy is the best feature to split on first for this dataset. This process demonstrates how a decision tree uses entropy and information gain to determine the optimal features for making decisions about the data.
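To see the whole procedure end to end, scikit-learn’s DecisionTreeClassifier grows such a tree automatically. A minimal sketch, assuming scikit-learn is installed; the tiny 0/1-encoded weather rows below are made up purely for illustration and are not the article’s table:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up toy data: columns = [sunny, rainy, windy], encoded as 0/1.
X = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
]
y = ["Yes", "No", "No", "Yes"]  # whether to play outside

# criterion="entropy" makes the tree split on information gain, as described above.
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X, y)

print(export_text(clf, feature_names=["sunny", "rainy", "windy"]))
print(clf.predict([[1, 0, 0]]))  # prediction for a sunny, calm day
```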

2. Decision Tree Regressor:

A decision tree regressor works similarly to a classifier, but instead of predicting classes it predicts a continuous value. The goal is to create a model that predicts the value of a target variable (e.g., temperature, house price) based on the input features.

Example Dataset:

Let’s consider a simple example where we need to predict the price of a house based on its size (in square feet) and the number of bedrooms it has.

Step-by-step explanation

Step 1: Build the decision tree:

i) Splitting criteria for regression

MSE (Mean Squared Error): Measures the average squared difference between the actual values and the predicted value.
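The MSE formula appears as an image in the original post; the standard definition that matches the symbols below is:

```latex
\text{MSE}(S) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \bar{y}_S \right)^2
```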

Where:

  • S is a set of examples at a particular node.
  • n is the number of examples in S.
  • yi is the actual value of the target variable at example i.
  • ȳS is the mean of the target values in S.

ii) Choosing splits:

  • At each node, the algorithm considers splitting the data on the different features (e.g., size, number of bedrooms).
  • It selects the split that reduces the MSE the most (i.e., the one with the lowest weighted MSE across the resulting child nodes), as the sketch after this list illustrates.
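Here is a small illustrative Python helper (my own naming, not from the article) that scores a numeric threshold split by the weighted MSE of the two child nodes; the candidate with the lowest score is the split a regression tree would choose:

```python
def mse(values):
    """MSE of a leaf that predicts the mean of `values`."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def split_score(xs, ys, threshold):
    """Weighted child MSE for the split x <= threshold vs. x > threshold."""
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    if not left or not right:  # degenerate split: one side is empty
        return float("inf")
    n = len(ys)
    return len(left) / n * mse(left) + len(right) / n * mse(right)

# Hypothetical sizes and prices (in thousands), just to show the call.
sizes = [1000, 1200, 1600, 2000]
prices = [200, 220, 320, 400]
print(split_score(sizes, prices, 1500))  # score for the "size <= 1500" split
```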

Step 2: Example Calculation

Calculating MSE for entire data (Root Node)

Here, yi denotes the individual house prices in the dataset.

Step 3: Choosing the splits

Splitting on “size”

  • Size ≤ 1500
  • Size > 1500

Step 4: Recursive Splitting

Continue this process recursively, selecting splits that minimize the MSE, until a stopping criterion is met (e.g., the maximum depth of the tree or the minimum number of samples per leaf).
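In scikit-learn these stopping criteria map directly onto constructor arguments of DecisionTreeRegressor. A minimal sketch, with made-up house data used only for illustration:

```python
from sklearn.tree import DecisionTreeRegressor, export_text

# Made-up data: [size in sq ft, number of bedrooms] -> price.
X = [[1000, 2], [1200, 2], [1600, 3], [2000, 4]]
y = [200_000, 220_000, 320_000, 400_000]

reg = DecisionTreeRegressor(
    max_depth=3,         # stopping criterion: maximum depth of the tree
    min_samples_leaf=1,  # stopping criterion: minimum number of samples per leaf
    random_state=0,
)  # the default criterion minimizes squared error (MSE) at each split
reg.fit(X, y)

print(export_text(reg, feature_names=["size", "bedrooms"]))
print(reg.predict([[1500, 3]]))
```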

Conclusion:

Decision tree regressors work by recursively partitioning the data based on features, aiming to minimize the MSE at each step. They are effective for capturing non-linear relationships in the data and can be visualized to understand how decisions are made based on input features.

Important points about decision trees for data science interviews:

  1. Basic Concepts:
  • Decision trees are a type of supervised learning algorithm used for both regression and classification tasks.
  • They work by splitting the data into subsets based on the values of the features, creating a tree-like model of decisions.

2. Splitting Criteria:

  • For Classification: Use Gini or entropy to determine the splits.
  • For Regression: Use Mean Squared Error or Mean Absolute Error to determine the split.

3. Entropy and Information Gain:

  • Entropy measures impurity in a dataset.
  • Information gain measures the reduction in entropy when a dataset is split on a feature.

4. Gini Impurity:

  • Measures how often a randomly chosen element would be incorrectly classified if it were labeled randomly according to the distribution of labels in the data; its formula is written out below.
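Written out (with p_i being the proportion of class i among the samples at a node), the Gini impurity is:

```latex
\text{Gini}(S) = 1 - \sum_{i=1}^{k} p_i^{2}
```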

5. Pruning:

  • Pruning is used to reduce the size of the tree and prevent overfitting.
  • Can be done using pre-pruning (setting conditions that stop splitting early) or post-pruning (removing branches after the tree has been grown); a short scikit-learn sketch of both follows.
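In scikit-learn, pre-pruning corresponds to arguments such as max_depth or min_samples_leaf, while post-pruning is done via cost-complexity pruning (ccp_alpha). A minimal sketch on a built-in dataset, assuming a recent scikit-learn version:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: stop growth early with depth / leaf-size limits.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X, y)

# Post-pruning: compute the cost-complexity pruning path, then refit with a chosen alpha.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
post_pruned = DecisionTreeClassifier(ccp_alpha=path.ccp_alphas[-2], random_state=0).fit(X, y)

print(pre_pruned.get_depth(), post_pruned.get_depth())
```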

6. Handling Missing Values:

  • Decision trees can handle missing values by either ignoring them or by using techniques like surrogate split.

7. Feature Importance:

  • Decision trees can be used to rank the importance of features based on how much they reduce impurity (see the snippet below).
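In scikit-learn this ranking is exposed through the fitted tree’s feature_importances_ attribute. A short sketch on the built-in iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(data.data, data.target)

# Impurity-based importance of each feature; the values sum to 1.
for name, importance in zip(data.feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.3f}")
```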

8. Advantages:

  • Easy to interpret and visualize.
  • Non-parametric, thus they don’t assume how the data is distributed.
  • Can handle both numerical and categorical data.

9. Disadvantages:

  • Prone to overfitting, especially with noisy data.
  • Sensitive to small changes in the data, which can result in different splits.

10. Ensemble Methods:

  • Bagging (e.g., Random Forest): Reduces variance by averaging multiple decision trees.
  • Boosting (e.g., XGBoost, AdaBoost): Reduces bias and variance by combining weak learners into a strong learner. Short examples of both follow.
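Both families are available in scikit-learn’s ensemble module (XGBoost itself lives in the separate xgboost package). A minimal sketch comparing them with cross-validation on a built-in dataset:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

bagging = RandomForestClassifier(n_estimators=100, random_state=0)   # averages many trees
boosting = AdaBoostClassifier(n_estimators=100, random_state=0)      # combines weak learners

print(cross_val_score(bagging, X, y, cv=5).mean())
print(cross_val_score(boosting, X, y, cv=5).mean())
```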

Unique points to remember about decision trees:

  1. Interpretability:
  • Decision trees can be converted into a set of rules, making it easier to understand the model’s decision process.
  • They can also be visualized graphically, providing a clear representation of the decision made at each node.

2. Handling Non-Linear relationship:

  • Decision trees can capture non-linear relationships between features and the target variable, unlike linear models.

3. Scaling and Normalization:

  • Decision trees do not require feature scaling or normalization, which simplifies preprocessing.

4. Versatility in Loss Functions:

  • In regression tasks, decision trees can optimize for different loss functions (e.g., least absolute deviation instead of mean squared error) depending on the application.

5. Use in Feature Engineering:

  • Decision trees can be used to create new features based on the splits that provide the most information gain, potentially improving the performance of other models.

6. Treatment of Outliers:

  • Decision trees are relatively robust to outliers in the input space, as splits are based on feature values that partition the data.

7. Surrogate Splits:

  • When handling missing data, decision trees can use surrogate splits. These are alternative splits that can mimic the behavior of the primary split, allowing for the continued use of instances with missing values.

8. Interaction Detection:

  • Decision trees naturally detect and represent interactions between features. If two features interact to influence the target variable, the tree can capture this by splitting on one feature and then the other.

9. Tree Depth and Interpretability Trade-off:

  • There’s a trade-off between tree depth and interpretability. Deeper trees can capture more detail but risk overfitting, while shallower trees are simpler but might miss some nuances in the data.

10. Path Length and Decision Confidence:

  • The length of the path from the root to a leaf node can be interpreted as a measure of the confidence in the decision. Shorter paths usually indicate more confident decisions based on more certain splits.

Why is reading better than watching YouTube videos?

  • Reading lets your imagination run wild. You can create pictures in your mind, and that’s like having a personal movie playing in your head.
  • When you are reading you are focused on the words, but while watching videos you can be distracted by flashy visuals and ads.
  • Reading encourages you to think. You pause, reflect, and understand things deeply. It’s like having a conversation with the author.
  • Reading is a patient game. It’s not a race. You learn to enjoy the journey, and that helps you in many areas of life.
