What is a Decision Tree? Classification and Regression Trees | Let’s build Decision Trees | Python Code — Part 1

Preethi Thakur
Oct 19, 2022


Decision Tree is one of the most powerful Supervised Learning algorithms, used for both Classification and Regression.

Intuition

This algorithm assumes that the data follows a set of rules, and these rules are identified using various mathematical techniques.

Decision Trees cut the feature space with hyperplanes that are parallel to one of the axes. In the figure, each axis represents a feature, the dotted lines are the hyperplanes, and o and x are the data points.

Decision Tree builds classification or regression models in the form of a tree structure.

How is a Tree Structure Built?

A Decision Tree uses a binary tree graph (each node has two children) or a non-binary one (a node can have more than two children) to assign a target value to each data sample. The target values are stored in the tree’s leaves. To reach a leaf, a sample is propagated through the nodes, starting at the root node. In each node a decision is made about which descendant node the sample should go to; these nodes are called decision nodes. The decision is made based on the value of the selected feature (feature attribute) of the sample.

Decision Tree learning is a process of finding the optimal rules in each decision node according to the selected metric.

Tree structure

Let’s say we have a dataset. The dataset is broken down into smaller and smaller subsets (subtrees) while, at the same time, an associated decision tree is incrementally developed (look at the right side of the figure: the tree is growing). By the way, a Decision Tree can handle both categorical and numerical data.

Doesn’t it look like a giant structure of nested if-else conditions?
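It does; a fitted tree can literally be read as nested conditionals. Here is a minimal Python sketch of what such a rule set might look like (the Humidity split and its threshold are hypothetical, chosen only to illustrate the shape):

```python
# A fitted decision tree is essentially a set of learned if/else rules.
# The Humidity split and its threshold are hypothetical, for illustration only.
def predict_play_golf(outlook: str, humidity: float, windy: bool) -> str:
    if outlook == "Overcast":
        return "Yes"
    elif outlook == "Sunny":
        return "Yes" if humidity <= 75 else "No"
    else:  # Rainy
        return "No" if windy else "Yes"

print(predict_play_golf("Sunny", humidity=70, windy=False))  # -> Yes
```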

Decision Tree Construction Process

The Decision Tree construction is a step by step process:

  • Feature selection: Select one of the features of the training data as the split criterion for the current node. (Different criteria give rise to different decision tree algorithms.)
  • Decision tree generation: Generate the internal nodes from the top down based on the selected features, and stop when the dataset can no longer be split.
  • Pruning: A decision tree can easily overfit unless the necessary pruning (pre-pruning and/or post-pruning) is performed to reduce the tree size and optimize its node structure.

Algorithms to build Decision Tree

There are two popular algorithms widely used to construct Decision Trees:

  1. CART (Classification and Regression Trees) — uses the Gini Index (for Classification) and Standard Deviation Reduction (for Regression) as metrics.
  2. ID3 (Iterative Dichotomiser 3) — uses the Entropy function and Information Gain as metrics.

Key Points of Decision Tree Construction

Before we begin constructing a Decision Tree, we need to ask the right questions, like,

  • How to decide which feature should be considered as Root Node?
  • How to select Decision Nodes and descendant Decision Nodes?
  • How to decide splitting criteria for features?

Decision Tree for Classification

The key step in constructing a decision tree is to try splitting the data on each attribute (feature/column), compare the resulting subsets in terms of ‘purity’, and select the attribute that yields the highest ‘purity’ as the split for the dataset. We have two commonly used splitting-criterion metrics to quantify purity:

  1. Entropy and Information Gain
  2. Gini Index

Entropy

Entropy is a measure of disorder: it measures the unpredictability or impurity in the system. Below we can see that when the red and green dots are separated in an orderly way, the entropy is low; when they are randomly mixed, the entropy is high.

  1. For a dataset with only 1 class, the samples are homogeneous (all samples belong to the same class) and the Entropy is 0.
  2. For a dataset with 2 classes, the Entropy is 0 when there is no impurity (all instances belong to the same class) and 1 when impurity is highest, i.e. the instances of the two classes are equally distributed (50% each). Below is a graph of entropy against the probability of class “+”, for a dataset with two classes “+” and “-”.
  3. For a dataset with more than 2 classes, the Entropy can be greater than 1 when impurity is high, and it is still 0 when the node is pure.

The mathematical equation to measure Entropy is
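$$H_i = -\sum_{\substack{k=1 \\ p_{i,k} \neq 0}}^{n} p_{i,k}\,\log_2\!\left(p_{i,k}\right)$$

where n is the number of classes.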

p_{i,k} is the ratio of class k instances among the training instances in the i-th node.

Entropy for continuous variables

If you look back, so far we have discussed Entropy for categorical data. But we said earlier that Decision Trees are also used for Regression tasks. So how do we measure entropy when the data is continuous (numerical)? Let’s see!

KDEplot

Look at this kdeplot. A wide, roughly uniform distribution, where we are about equally unsure of each outcome, has more uncertainty and hence higher entropy. Conversely, a distribution with a very narrow peak indicates that the outcomes lie in a narrow range and we are much more certain about them, so it has low entropy.

A kdeplot is a Kernel Density Estimate plot, which depicts the probability density function of a continuous or non-parametric data variable.

To build a decision tree, we need to calculate two types of entropy using frequency tables as follows:

Step 1: Entropy using the frequency table of one attribute:

Here is the frequency table of one attribute (the target variable); computing its entropy gives Entropy = 0.94.
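As a quick sanity check, the same value can be computed in a few lines of Python. The 9 “Yes” / 5 “No” class counts are the usual ones for this play-golf table and are assumed here, since the table itself is an image:

```python
import math

def entropy(counts):
    """Entropy of a class distribution, given raw class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Frequency table of the target variable (Play Golf): assumed 9 "Yes" / 5 "No".
print(round(entropy([9, 5]), 2))  # 0.94
```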

Step 2: Entropy using the frequency table of two attributes:

Information Gain (IG)

Information Gain is the metric used to train Decision Trees. It measures the quality of a split and is based on the decrease in entropy after the dataset is split on an attribute. We need to find the attribute that returns the highest Information Gain (i.e. the most homogeneous branches). The Decision Tree algorithm follows these steps to get the highest IG.

Information Gain IG

Step 1: Calculate the entropy of the parent before splitting. From the frequency table of the target variable (Play Golf), the entropy is calculated first: Entropy = 0.94.

Step 2: Calculate the entropy of each child. The dataset is split on the different values (Sunny, Overcast, Rain) of a feature (Outlook). Then the entropy of each branch is calculated.

Step 3: Calculate the weighted entropy of the children. For the feature (Outlook), each child’s entropy (calculated in Step 2) is multiplied by the probability of that child, and the products are summed. The result is the weighted entropy of the children: Weighted entropy = 0.693.
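The same weighted sum can be reproduced in code. The per-branch (Yes, No) counts below are the usual ones for this dataset and are assumed here, since the frequency table is an image:

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Class counts (Yes, No) for each value of Outlook; assumed from the usual
# play-golf table: Sunny 2/3, Overcast 4/0, Rain 3/2.
branches = {"Sunny": [2, 3], "Overcast": [4, 0], "Rain": [3, 2]}
n_total = sum(sum(counts) for counts in branches.values())  # 14 samples

# Weighted entropy = sum over children of P(child) * Entropy(child).
weighted_entropy = sum(
    (sum(counts) / n_total) * entropy(counts) for counts in branches.values()
)
print(round(weighted_entropy, 3))  # 0.694 (the 0.693 above comes from rounding intermediate values)
```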

Step 4: Calculate the Information Gain. We have the equation of Information Gain; all we need to do is substitute the values obtained in the previous steps into this equation.
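Substituting the parent entropy from Step 1 and the weighted entropy of the children from Step 3:

$$IG(\text{Play Golf},\ \text{Outlook}) = 0.94 - 0.693 = 0.247$$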

Step 5: Calculate Information Gain IG for all features. The algorithm calculates IG for all features and chooses the feature with the largest IG as the decision node.

Information Gain for all features

Looking at the pictures, we can see that the highest Information Gain is for the Outlook feature, so the Decision Tree chooses Outlook as the root node.

In the next step, the Decision Tree divides the dataset along its branches and repeats the same process on every branch. Let’s see how!

Step 6: Find the Information Gain recursively. The Decision Tree applies a recursive, greedy search in a top-down fashion to find the IG at every level of the tree. Splitting stops once we reach a branch with entropy 0; that branch becomes a leaf node.
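Here is a minimal sketch of that recursive, greedy procedure for purely categorical features (a toy ID3-style implementation; Scikit-Learn’s CART works differently, always using binary splits):

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, feature, target):
    """Entropy of the parent minus the weighted entropy of the children."""
    parent = entropy([row[target] for row in rows])
    weighted = 0.0
    for value in {row[feature] for row in rows}:
        subset = [row[target] for row in rows if row[feature] == value]
        weighted += len(subset) / len(rows) * entropy(subset)
    return parent - weighted

def build_tree(rows, features, target):
    labels = [row[target] for row in rows]
    # Leaf: the branch is pure (entropy 0) or there is nothing left to split on.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: split on the feature with the highest information gain.
    best = max(features, key=lambda f: information_gain(rows, f, target))
    tree = {best: {}}
    for value in {row[best] for row in rows}:
        subset = [row for row in rows if row[best] == value]
        remaining = [f for f in features if f != best]
        tree[best][value] = build_tree(subset, remaining, target)
    return tree
```

Run on the play-golf rows (dictionaries with Outlook, Temperature, Humidity, Windy and Play Golf keys), it would pick Outlook at the root, matching the worked example above.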

Demonstration of next splits

The final decision tree would look like this,

Gini Impurity

Gini Impurity, or Gini Index, measures the probability of a randomly chosen instance being wrongly classified if it were labeled randomly according to the class distribution in the node. It is used to determine how the features of a dataset should split nodes to form the tree. The formula of Gini Impurity is,
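$$G_i = 1 - \sum_{k=1}^{n} p_{i,k}^{2}$$

where p_{i,k} is, as before, the ratio of class k instances among the training instances in the i-th node.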

The Gini index varies from 0 to 1,

  • Where 0 means all the elements belong to a single class (the node is pure).
  • A value of 0.5 means the elements are equally distributed between two classes, which is the maximum impurity for a binary problem.
  • Values approaching 1 occur only when the elements are spread randomly across many classes.

In practice, there is not much difference between the Decision Trees we get from Entropy and Gini Impurity. Gini is computationally slightly faster than Entropy, which helps on large datasets. However, when they do differ, Gini Impurity tends to isolate the most frequent class in its own branch of the tree, while Entropy tends to produce slightly more balanced trees.

Regularization Hyperparameters

In Decision Trees, the tree structure adapts itself to the training data, which leads the model to overfit the training data and fail to generalize to the test set. To avoid this problem, the Decision Tree’s degrees of freedom have to be restrained during training. This process is called Regularization. The regularization hyperparameters depend on the algorithm, but in Decision Trees we can at least restrain the depth of the tree to prevent the model from overfitting. The depth of the tree is a measure of how many successive splits a Decision Tree can make.

In Scikit-Learn, the DecisionTreeClassifier class is used to build the decision tree. Some important hyperparameters that restrict the freedom/shape of the Decision Tree are:

  • max_depth (the maximum depth of the tree; if None, nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples). A very low max_depth leads to underfitting, while a very high value leads to overfitting.
  • min_samples_split (the minimum number of samples a node must have before it can be split)
  • min_samples_leaf (the minimum number of samples a leaf node must have)
  • min_weight_fraction_leaf (same as min_samples_leaf but expressed as a fraction of the total number of weighted instances)
  • max_leaf_nodes (maximum number of leaf nodes)
  • max_features (maximum number of features that are evaluated for splitting at each node).

Increasing min_* hyperparameters or reducing max_* hyperparameters will regularize the model.

Here are two Decision Trees trained on the same dataset. On the left, the Decision Tree is trained with the default hyperparameters (no restrictions), and on the right it is trained with min_samples_leaf=4. We can see that the left model is overfitting, while the right model generalizes better.
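A minimal Scikit-Learn sketch of that comparison; the moons dataset used here is an assumption (the exact data behind the figure is not shown), the point is only the effect of min_samples_leaf:

```python
# Compare an unrestricted tree with one regularized by min_samples_leaf.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=300, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

unrestricted = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
regularized = DecisionTreeClassifier(min_samples_leaf=4, random_state=42).fit(X_train, y_train)

# The unrestricted tree typically scores ~1.0 on the training set but worse on
# the test set; the regularized tree usually generalizes better.
print("unrestricted:", unrestricted.score(X_train, y_train), unrestricted.score(X_test, y_test))
print("regularized: ", regularized.score(X_train, y_train), regularized.score(X_test, y_test))
```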

Decision Tree for Regression

Decision Trees are also capable of performing regression tasks. A regression tree looks very similar to a classification tree. Contrary to classification, in a regression problem the target variable to be estimated is a continuous value, not a categorical one.

In the following figure we can see that the decision tree made four splits and divided the data points into five groups.

Now the algorithm takes the average target value of each group, and based on those values it builds the decision tree for the dataset. The resulting tree would look like this,
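In Scikit-Learn this behaviour comes from the DecisionTreeRegressor class. A quick sketch on synthetic data (the quadratic data and the max_leaf_nodes=5 setting are assumptions, chosen to mirror the “four splits, five groups” picture):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression data, assumed purely for illustration.
rng = np.random.RandomState(42)
X = np.sort(rng.uniform(0, 1, 80)).reshape(-1, 1)
y = (X.ravel() - 0.5) ** 2 + rng.normal(0, 0.02, 80)

# max_leaf_nodes=5 allows at most four splits / five groups; each leaf then
# predicts the average target value of the training samples in its group.
reg = DecisionTreeRegressor(max_leaf_nodes=5, random_state=42)
reg.fit(X, y)
print(reg.predict([[0.1], [0.5], [0.9]]))  # piecewise-constant predictions (group averages)
```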

How Does a Decision Tree Work for Regression?

An ID3-style algorithm (Iterative Dichotomiser 3) can be used to construct a Decision Tree for Regression by replacing the Gini Impurity or Entropy/Information Gain metric with Standard Deviation Reduction (SDR).

Standard Deviation Reduction is based on the decrease in the standard deviation of the target variable after the dataset is split on an attribute. The tree is split on the attribute with the highest SDR. The standard deviation is used to measure the homogeneity (similarity of values) of the numerical target variable. The mathematical formula of Standard Deviation Reduction is,

Standard Deviation Reduction
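$$SDR(Y, X) = S(Y) - S(Y, X)$$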
  • Y is target variable
  • X is a specific attribute of the dataset
  • SDR(Y,X) is the reduction of the standard deviation of the target variable Y given the information of X
  • S(Y) is the standard deviation of the target variable
  • S(Y,X) is the standard deviation of the target variable given the information of X

The node (independent variable) that gives the largest reduction in standard deviation is declared the decision node. In the first iteration this node is the root node.

If the numerical sample is completely homogeneous with respect to independent variable (we check for each independent variable separately), then its standard deviation is zero.

The calculation of S(Y,X) can be broken down with the following formula,
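$$S(Y, X) = \sum_{c \in X} P(c)\, S(c)$$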

S(Y,X) is the sum, over the values of the independent variable, of the probability of each value multiplied by the standard deviation of the target variable restricted to that value. Here,

  • S(c) is the standard deviation of the target variable over the rows where the independent variable takes the value c.
  • P(c) is the proportion of rows in which the independent variable takes the value c (its count divided by the total number of rows).

The feature with the lower value of S(Y,X), and therefore the higher SDR(Y,X), is the feature with more capability to explain the variation in the target variable, so it is the feature the decision tree algorithm uses to split the tree.

Reduction in Variance or Standard Deviation Reduction (SDR)

Variance tells us the average squared distance of the data points from the mean. Standard deviation is the square root of the variance: since the variance is calculated in squared units, we take its square root to get a value in the same units as the data points. The mathematical formulas we need are,
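$$\mathrm{Variance} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2, \qquad S = \sqrt{\mathrm{Variance}}, \qquad CV = \frac{S}{\bar{y}} \times 100\%$$

Here the y_i are the values of the target variable, ȳ is their mean, and CV is the coefficient of variation, which is used below as a stopping criterion.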

The Standard Deviation is the square root of the Mean Squared Error (MSE) obtained when every point is predicted by the mean.

Let’s see the step-by-step process of building a Decision Tree for Regression. Here, we have added Hours Played as the target variable to the weather dataset.

Step 1: Calculate the total standard deviation of the target variable, before any split.

The coefficient of variation (CV) is used as the stopping criterion for further splits.

Step 2: Calculate SDR with respect to independent variables and decide on root node, decision nodes and leaf nodes.

SDR for Outlook:
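Here is a small sketch of this calculation. The Hours Played values per Outlook group are the commonly used values for this weather example and are assumed here, since the table itself is shown as an image:

```python
import numpy as np

# Hours Played values grouped by Outlook (assumed values for this example).
hours = {
    "Sunny":    np.array([25, 30, 35, 38, 48]),
    "Overcast": np.array([46, 43, 52, 44]),
    "Rainy":    np.array([45, 52, 23, 46, 30]),
}

all_hours = np.concatenate(list(hours.values()))
s_total = all_hours.std()  # S(Hours), the total standard deviation (about 9.32)

# Weighted standard deviation of the children: S(Hours, Outlook).
s_children = sum(len(v) / len(all_hours) * v.std() for v in hours.values())

print(round(s_total - s_children, 2))  # SDR(Hours, Outlook) ≈ 1.66
```

The same calculation, repeated for Temperature, Humidity and Windy, gives the values listed below.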

Similarly, when we calculate SDR for all attributes, we get,

  1. SDR (Hours, Outlook) = 1.66
  2. SDR (Hours, Temperature) = 0.39
  3. SDR (Hours, Humidity) = 0.09
  4. SDR (Hours, Windy) = 0.39

Step 3: Deciding the root node

Outlook gives the largest reduction in standard deviation, hence Outlook will be the root node.

Step 4: The dataset is divided based on the values of the selected attribute. This process runs recursively on the non-leaf branches until all the data is processed.

For the next nodes we need termination criteria that tell us when to stop splitting. Hence, we use the coefficient of variation (CV) as one of the termination criteria and, for example, set CV = 10% as the threshold: if a branch has CV > 10%, further splitting is required; otherwise we stop there. We can also set a threshold on the number of data points (rows): if a branch has 3 or fewer rows, no further split is required.

The prediction at a leaf node is the average of that branch. Here the Sunny and Rainy nodes need further splitting, as both have CV > 10%. However, Overcast does not need splitting, as its CV is <= 10%.

The average value of the Overcast branch, 46.25, will be the prediction at the Overcast leaf.

Now we will repeat the same SDR check for data filtered for Sunny and Rainy, to get other decision nodes and subsequent leaf nodes.

SDR check for Sunny data rows

Look below: the “Sunny” branch CV (22.10%) is above the 10% threshold, so it needs further splitting. We select “Temp” as the best node after “Outlook” because it has the largest SDR.

Because the number of rows for each of Hot, Mild and Cool is 3 or fewer, no further split is required.

SDR check for Rainy Data rows

The “Rainy” branch also has a CV (22%) above the 10% threshold, so this branch needs further splitting. We select “Windy” as the best node because it has the largest SDR.

Because the number of data points in both branches (FALSE and TRUE) is 3 or fewer, we stop further branching and assign the average of each branch to the related leaf node.

The final Decision Tree for Regression is here,

Decision Tree Interpretation:

  1. If Outlook is Sunny and Temperature is Mild, then the predicted number of hours played is 41.5, irrespective of the other conditions.
  2. If Outlook is Overcast, then irrespective of the other conditions, the prediction is 46.25 hours.
  3. If Outlook is Rainy, then the prediction is 26.5 hours if it is windy and 47.6 hours if it is not.

Advantages and Disadvantages of Decision Trees

Advantages

  • Less Data Preprocessing — Unlike other machine learning algorithms, a decision tree does not require well-preprocessed data.
  • No Normalization — Decision tree does not require normalized data.
  • No Scaling — The data doesn’t need to be scaled before feeding it to the decision tree model.
  • Robust to Missing Values — Unlike k-Nearest Neighbors and other distance-based algorithms, a decision tree is relatively unaffected by missing values.
  • Easy and Intuitive — A decision tree is intuitive, and its underlying logic is fairly easy to understand and explain.

Disadvantages

  • Highly Sensitive — A small change in the data can cause high instability in a decision tree model.
  • Complex Calculation — A decision tree uses more complex computation compared to some other models.
  • High Training Time — It takes more time to train a decision tree model.
  • Costly — Decision-tree-based models require more space and time, so they are computationally expensive.
  • Weak Learner — A single tree often cannot learn enough from the data, so you won’t get a good predictor from a single decision tree; you need to ensemble many trees, as in a Random Forest, to get better prediction accuracy.

Conclusion

Decision Trees are among the most powerful and popular Machine Learning algorithms. Because they are simple to understand, they are often preferred over other approaches to the same problems. For example, one can use decision trees to perform basic customer segmentation and build a separate predictive model on each segment.

In the next blog, let’s see how to implement the Decision Tree algorithm for both Classification and Regression tasks using Scikit-Learn.
