Decision Trees

parm-aatma
Feb 25, 2023

Random forests are among the best-performing algorithms for regression and classification tasks. They are known for handling large, high-dimensional datasets, coping with missing data, and capturing complex relationships between variables.

A random forest is a powerful ensemble learning model, which means it is made up of multiple models that together produce a better result than any single model. In the case of random forests, these “multiple models” are nothing but Decision Trees.

Introduction

As the name suggests, this model takes its inspiration from a tree: not for how it works, but for how it looks. There are nodes and branches.

Top-down approach

The node where the tree starts is called the root node. The intermediate nodes, which originate from another node and in turn give rise to further nodes, are called internal nodes. Finally, the terminal nodes, from which no further branches grow, are called leaf nodes.

That explains the word “tree” in the model’s name. The word “Decision” holds the same meaning as it does in English dictionaries: “a conclusion or resolution reached after consideration”. The considerations here are based on the data.

You can imagine the root node as one of the features asking a certain question of the incoming data and then sending it down to one of its child nodes according to the data’s answer. The same thing happens at each internal node: it represents another feature, asks another question, and directs the data to the next node. This continues until a node achieves some sort of “purity” (we will discuss this later).

Along with a question, each node also carries a class or target, assigned according to the questions asked so far (the features used up to that node). Each node is making a decision, and each decision contributes to the eventual decision at the end.

How do we build a Decision Tree?

We will be using the following table to explore the different concepts here: a small example dataset of 10 samples with the features Genre, #pages, and rating, and a Yes/No target (4 Yes and 6 No).

The first step is to select the one feature that results in the best possible classification. Let’s consider all the candidates and try each of them as the root node.

There is no clear winner here. None of the questions provides a clean classification of the data: the resulting child nodes all contain a mix of Yes and No. But we have to choose one of them to start, so we need a way to quantify the quality of a split. For this, we will be using Gini impurity or entropy.

Gini impurity

Gini impurity is a measure of the impurity, or heterogeneity, of a set. It ranges from 0 to 1 − 1/C (0.5 for a two-class problem), where 0 represents an entirely pure set (all elements belong to the same class) and the maximum is reached when all classes are equally likely.

To calculate the Gini Impurity, given C total classes and the probability p(i) of selecting a data point with class i, you can use the following formula:

The formula for Gini impurity: Gini = 1 − ( p(1)² + p(2)² + … + p(C)² )

The idea here is that every new node must reduce the Gini impurity. First, we calculate the impurity for the dataset.

Probability(Yes)=4/10; Probability(No)=6/10

Gini = 1-(0.4²+0.6²) = 0.48

Now, for the new child nodes, we calculate the Gini impurity of each node individually and then combine them as a weighted sum, where each node’s weight is the fraction of samples that reach it. Let us consider the first tree (Genre).

Gini for the left node = 1-(0.5²+0.5²) = 1-0.50 = 0.50

Gini for the right node = 1-(0.25²+0.75²) = 0.38

Weighted sum = 0.6*0.5+0.4*0.38 = 0.45
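
As a quick sanity check, here is a minimal Python sketch (standard library only) that reproduces these numbers. The class counts come from the example above: 4 Yes / 6 No overall, and the Genre split is assumed to send 3 Yes / 3 No to the left node and 1 Yes / 3 No to the right node, consistent with the 0.5/0.5 and 0.25/0.75 probabilities.

```python
def gini(counts):
    """Gini impurity of a node, given its class counts, e.g. [yes, no]."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def weighted_gini(children):
    """Weighted Gini impurity of a split, given class counts per child node."""
    total = sum(sum(child) for child in children)
    return sum(sum(child) / total * gini(child) for child in children)

print(round(gini([4, 6]), 2))                     # 0.48 -> whole dataset
print(round(gini([3, 3]), 2))                     # 0.5  -> left (Genre) node
print(round(gini([1, 3]), 2))                     # 0.38 -> right (Genre) node
print(round(weighted_gini([[3, 3], [1, 3]]), 2))  # 0.45 -> weighted impurity of the Genre split
```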

Hence the impurity for the first tree (Genre) is 0.45, which is lower than the 0.48 impurity of the dataset. Similarly, we calculate the impurity for the next two trees.

Gini impurity for the second tree(#pages) = 0.47

Gini impurity for the third tree(rating) = 0.40

Here the third tree has the lowest impurity of the three. Hence, we will create the root node from the rating feature.

Gini impurity can even be used to identify the most important feature in a dataset: the feature whose split results in the lowest weighted Gini value (i.e., the largest impurity decrease) is the most important one.

Entropy

Entropy is another measure of impurity or randomness in a dataset and can be used as a substitute for Gini; in practice there isn’t much difference between the two from a performance perspective. The branching decision is made the same way as with Gini: choose the split with the lowest weighted entropy.

The formula for entropy: Entropy = −( p(1)·log₂ p(1) + p(2)·log₂ p(2) + … + p(C)·log₂ p(C) )
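
For comparison, a short sketch computing the entropy of the same 4 Yes / 6 No dataset, plus the two extremes:

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a node, given its class counts."""
    total = sum(counts)
    return sum(-c / total * log2(c / total) for c in counts if c > 0)

print(round(entropy([4, 6]), 3))   # 0.971 -> the whole dataset
print(round(entropy([5, 5]), 3))   # 1.0   -> maximally impure (50/50)
print(round(entropy([10, 0]), 3))  # 0.0   -> perfectly pure node
```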

Similarly, we keep calculating the Gini/entropy for every new node until we reach a node that satisfies a stopping criterion. Stopping criteria are conditions used to halt the construction of a decision tree. Common examples include maximum tree depth, minimum number of samples per leaf, minimum impurity decrease, maximum number of leaf nodes, a time limit, and early stopping. These criteria can be used in isolation or combined to create decision trees that are more accurate, easier to understand, and more efficient. The choice of stopping criterion depends on the specific problem and the properties of the data being examined.

This process has a defined set of instructions specifying a sequence of operations to be carried out in order to solve a specific problem. That sounds like an algorithm, and indeed it is: what we have described (binary splits chosen to minimize Gini impurity) is essentially the CART algorithm. There are a few other algorithms that are also used to create decision trees.
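
In practice you rarely implement this loop by hand. scikit-learn’s DecisionTreeClassifier is a CART-style implementation; here is a minimal sketch on a made-up stand-in dataset (the column names and values below are invented for illustration, not the actual table from this post):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical stand-in data, loosely mirroring the Genre / #pages / rating table
df = pd.DataFrame({
    "is_fiction": [1, 0, 1, 1, 0, 0, 1, 0, 0, 1],
    "pages":      [320, 150, 500, 210, 90, 400, 275, 610, 120, 180],
    "rating":     [4.5, 3.0, 4.8, 3.9, 2.5, 4.1, 4.7, 3.2, 2.8, 4.0],
    "liked":      [1, 0, 1, 0, 0, 1, 1, 0, 0, 0],
})
X, y = df[["is_fiction", "pages", "rating"]], df["liked"]

# criterion can be "gini" (the default) or "entropy"
clf = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)

print(export_text(clf, feature_names=list(X.columns)))  # the learned questions/splits
print(clf.feature_importances_)  # impurity-based feature importances
```

The feature_importances_ attribute is exactly the impurity-decrease ranking mentioned in the Gini section above.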

Some of these algorithms allow multiway splits, in which a node branches into more than two child nodes (CART always splits into exactly two). The following are a few notable examples:

  1. CHAID (Chi-squared Automatic Interaction Detector): CHAID is a decision tree algorithm that can handle categorical predictor variables with more than two categories. It uses chi-squared statistical tests to determine the most significant variable interactions.
  2. CRUISE (Classification Rule with Unbiased Interaction Selection and Estimation): CRUISE is a decision tree algorithm that uses statistical tests to identify significant interactions between variables. It can handle both continuous and categorical predictor variables.
  3. M5: M5 is a decision tree algorithm that can handle both continuous and categorical predictor variables. It uses a combination of linear regression and decision trees to build models that can handle complex datasets.
  4. QUEST (Quick, Unbiased, and Efficient Statistical Tree): QUEST is a decision tree algorithm that uses statistical tests to select the best splitting variable. It can handle both continuous and categorical predictor variables and can identify interactions between variables.

Decision trees for regression

Decision trees can be used not only for classification but also for regression problems. While a decision tree for classification is fairly intuitive, regression with decision trees might need some discussion.

In decision tree regression, the target variable is continuous rather than categorical. The goal is to predict the value of the target variable based on the values of the predictor variables. Just like in classification, the dataset is split into smaller subsets based on the values of the predictors, but instead of Gini/entropy, mean squared error (MSE) or mean absolute error (MAE) is used as the splitting criterion. Once the tree is built, every leaf node holds a set of target values from the training data, and a prediction is made by averaging those target values.
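
A small sketch of this with scikit-learn’s DecisionTreeRegressor, fitted on synthetic data (noisy samples of a sine curve, invented purely for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, 80)).reshape(-1, 1)  # one continuous predictor
y = np.sin(X).ravel() + rng.normal(0, 0.1, 80)     # continuous target with noise

# "squared_error" splits on MSE (called "mse" in older scikit-learn versions);
# "absolute_error" would use MAE instead
reg = DecisionTreeRegressor(criterion="squared_error", max_depth=3).fit(X, y)

# The prediction is the mean target value of the leaf that x = 2.5 falls into
print(reg.predict([[2.5]]))
```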

However, when a dataset is large and a continuous predictor has many distinct values, evaluating every possible split becomes computationally expensive and may encourage overfitting. To tackle this and to get better accuracy, we can use discretization. Discretization divides the data into intervals, which reduces the number of possible split points and simplifies the decision tree. It can also help improve the accuracy of decision tree regression models by reducing the impact of outliers and noise: by dividing the predictor variable into intervals, extreme values and outliers are grouped together with similar values, reducing their impact on the overall model. Discretization can be performed using several methods, including the following (a short sketch follows the list):

  1. Equal width binning: This method divides the range of values into a fixed number of intervals of equal width. For example, if we want to divide a variable into five intervals, we would divide the range of values into five equal-width intervals.
  2. Equal frequency binning: This method divides the range of values into intervals such that each interval contains an equal number of observations. This method can be useful for handling skewed distributions.
  3. Clustering-based methods: These group similar values into the same interval based on some distance or similarity measure.
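
These three strategies map directly onto scikit-learn’s KBinsDiscretizer (“uniform” for equal width, “quantile” for equal frequency, “kmeans” for clustering-based binning); a quick sketch on a made-up array containing an outlier:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Invented values; note the outlier at 45.0
X = np.array([[1.0], [1.2], [2.5], [3.1], [3.3], [9.7], [10.2], [45.0]])

for strategy in ("uniform", "quantile", "kmeans"):
    binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    print(strategy, binner.fit_transform(X).ravel())
```

Running this shows how the equal-width bins get stretched by the outlier, while the equal-frequency bins keep roughly the same number of points per interval.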

Overfitting-prone Decision Trees

Decision Trees tend to overfit as they can fit intricate decision boundaries to the training data that may not generalize well to new, unseen data. Decision Trees exhibit high flexibility, enabling them to split the data in various ways, generating numerous decision rules, and resulting in a model with high variance.

Decision Trees can be particularly susceptible to overfitting in certain situations. For example, if the tree is allowed to grow too deep, it can create decision boundaries that are overly specific to the training data and may not apply well to new data. Similarly, if the tree is too complex, with too many features or decision rules, it can also overfit the training data. Another potential cause of overfitting is when the training data is noisy or contains irrelevant features, which can lead the DT to create decision rules that are not useful for detecting underlying patterns in the data.

To prevent overfitting in DTs, various techniques can be employed. One approach is truncation, which involves limiting the depth of the tree or the number of samples required to split a node. By restricting the tree’s depth or complexity, truncation helps prevent the model from becoming too complex and overfitting the training data. This technique is useful for creating simpler, more interpretable models that still capture the essential features of the data.
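
In scikit-learn, truncation (often called pre-pruning) is just a matter of passing growth limits to the tree constructor; a sketch on a synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = DecisionTreeClassifier(
    max_depth=4,                 # never grow deeper than 4 levels
    min_samples_leaf=5,          # every leaf must contain at least 5 samples
    min_impurity_decrease=0.01,  # ignore splits that barely reduce impurity
).fit(X, y)

print(clf.get_depth(), clf.get_n_leaves())  # size of the truncated tree
```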

Another approach is pruning, which involves building the full decision tree and then removing branches that do not improve the model’s performance. By removing unnecessary branches, pruning creates a more compact and interpretable tree that still captures the essential features of the data. Pruning can help reduce overfitting by removing decision rules that are specific to the training data and may not generalize well to new, unseen data.

Regularization is another technique that can be used to prevent overfitting in DTs. This method involves adding a penalty term to the objective function that balances the complexity of the tree with its accuracy on the training set. By adding a regularization term, the model is encouraged to produce simpler trees that generalize better to new data.
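
scikit-learn exposes this idea as cost-complexity pruning: the ccp_alpha parameter penalizes the tree in proportion to its number of leaves, and cost_complexity_pruning_path lists the alphas at which subtrees would be pruned away. A sketch, again on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# The candidate penalty strengths for this dataset
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

full = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=path.ccp_alphas[-3]).fit(X, y)

# A larger ccp_alpha means a stronger penalty per leaf, hence a smaller tree
print(full.get_n_leaves(), "->", pruned.get_n_leaves())
```

In practice you would pick ccp_alpha by cross-validation rather than grabbing one from the path like this.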

Finally, ensemble methods can be used to reduce the variance of DT models and improve their generalization performance. Bagging, boosting, and random forests are examples of ensemble methods that can be used with DTs. These techniques involve combining multiple DTs, each trained on different subsets of the training data, to create a more robust and accurate model. By combining multiple models, ensemble methods can help reduce the risk of overfitting and improve the model’s ability to generalize to new data.
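
Which brings us back to where this post started: a random forest is exactly such an ensemble, a collection of decision trees each trained on a bootstrap sample of the data with a random subset of features considered at each split. A quick comparison sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0)

# Averaging many decorrelated trees usually reduces variance and improves accuracy
print("single tree:", cross_val_score(tree, X, y, cv=5).mean())
print("forest:     ", cross_val_score(forest, X, y, cv=5).mean())
```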

Summary

In this blog, we have explored various aspects of decision trees, including the building process, algorithms, and criteria used to make decisions in the tree, such as Gini impurity and entropy. We have also discussed discretization techniques and the use of decision trees for regression. Additionally, we examined the risk of overfitting in decision trees and methods to prevent it, including truncation, pruning, regularization, and ensemble methods.
