S31: Decision Tree Analysis

Bhavyaa Bansal
4 min read · Jan 16, 2024


A decision tree is a flowchart-like tree structure in which each internal node denotes a test on a feature, each branch denotes a decision rule, and each leaf node denotes the outcome. It is a versatile supervised machine-learning algorithm, used for both classification and regression problems.

Decision Tree Analysis is also called recursive partitioning: we partition the dependent variable based on the independent variables. The partitioning is done to obtain homogeneous groups; each partition increases information and reduces misclassification. It is recursive in the sense that only one partition is applied at a time.

What is Gini Impurity?

Gini Impurity is a measurement used when building decision trees to determine how the features of a dataset should split nodes to form the tree. More precisely, the Gini impurity of a dataset is a number between 0 and 0.5 (for a two-class problem), which indicates the likelihood of a new, random observation being misclassified if it were given a random class label according to the class distribution in the dataset.

How to Determine Whether a Split Maximizes Homogeneity

To determine the particular split that maximizes homogeneity, the algorithm uses a purity measure: Gini impurity. If a set is very mixed, it is "impure" and its Gini impurity is high. As the proportion of re-payers to defaulters increases and a region becomes more pure, the Gini impurity decreases. The difference in Gini impurity between, for example, the original data set and the contents of the two partitions R1 and R2 is the information gain. Information gain and Gini impurity are inversely related.

The Gini impurity calculation is based on expected error. If we know the class probabilities of defaulters and repayers from our sample and choose an applicant at random, we can calculate the probability of misclassifying that applicant. For any cut point on either the credit score or income axis we can calculate the Gini impurity (or misclassification error):

Gini impurity = P(picking a repayer) × (1 − P(picking a repayer)) + P(picking a defaulter) × (1 − P(picking a defaulter))
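As a minimal sketch of this formula (not code from the original case), the two-class Gini impurity can be written as a small Python function; the function name and the repayer/defaulter counts below are illustrative:

```python
def gini_impurity(n_repayers, n_defaulters):
    """Two-class Gini impurity: sum of p * (1 - p) over both classes."""
    total = n_repayers + n_defaulters
    p_repay = n_repayers / total
    p_default = n_defaulters / total
    return p_repay * (1 - p_repay) + p_default * (1 - p_default)

# The baseline pool from the example: 12 repayers and 12 defaulters.
print(gini_impurity(12, 12))  # 0.5, the maximum for two classes
```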

The process for calculating Gini impurity is as follows.

1. Establish the baseline Gini impurity. This calculation is based on the class proportions in the starting population, and the impurity is at its highest at this point. In this example, the Gini impurity equals P(repaid) × (1 − P(repaid)) + P(default) × (1 − P(default)). (However, imagine a third category of borrower, one who is late on payments but has not yet defaulted. In that case the sum would run over three classes: repaid, late, and default.) In the current example the 12 defaulters and 12 repayers are all in the same pool, so the Gini impurity is 0.5 × 0.5 + 0.5 × 0.5 = 0.5, which is in fact the maximum value of the measure given only two classes.

2. Compare new Gini impurities for every possible cut. With the baseline established, a software program can test each possible placement of a cut (on either the credit score or income axis), calculate the new Gini impurity for each resulting partition, R1 and R2, calculate the difference between the baseline and the new value, and then place the cut that most reduces the Gini impurity. At a credit score of 681, for example, the impurities of the two partitions are combined as follows:

Gini impurity of the split = (proportion of observations in R1) × (Gini impurity of R1) + (proportion of observations in R2) × (Gini impurity of R2)

The Gini impurity is reduced from 0.5 to 0.3286 as a result of this segmentation.
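As a rough check of this arithmetic (the per-region class counts are not given in the post, so the counts below are assumptions chosen to reproduce the quoted 0.3286), the weighted Gini of a split and the resulting information gain can be computed like this:

```python
def gini(counts):
    """Gini impurity of a node given its class counts (equals the sum of p*(1-p))."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def split_gini(r1_counts, r2_counts):
    """Gini impurity of a split: proportion-weighted sum of the child impurities."""
    n1, n2 = sum(r1_counts), sum(r2_counts)
    n = n1 + n2
    return (n1 / n) * gini(r1_counts) + (n2 / n) * gini(r2_counts)

# Hypothetical (defaulters, repayers) counts for the cut at credit score 681;
# they are not stated in the article and were chosen so the numbers match.
r1 = (10, 3)   # R1: 10 defaulters, 3 repayers
r2 = (2, 9)    # R2: 2 defaulters, 9 repayers

baseline = gini((12, 12))                 # 0.5
after = split_gini(r1, r2)                # ~0.3287
print(round(after, 4), round(baseline - after, 4))  # split impurity and information gain
```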

We iteratively split the data in this way and obtain the corresponding decision tree.

Some of the branches can be interpreted as follows (a runnable sketch follows the list):

  1. If 681 < credit score < 768 and income < $56,500, then the prediction is 1 (default).
  2. If credit score < 681 and income < $84,500, then the prediction is 1 (default).
  3. If 704.5 < credit score < 712 and income > $56,500, then the prediction is 1 (default).
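The full pipeline can be sketched with scikit-learn. The data below is synthetic (the case's actual credit-score/income table is not reproduced in this post), so the learned thresholds will not match the branches above exactly:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)

# Synthetic stand-in for the case data: 24 applicants with a credit score,
# an income, and a label (1 = default, 0 = repaid). The values are made up.
credit_score = rng.integers(550, 800, size=24)
income = rng.integers(30_000, 120_000, size=24)
default = ((credit_score < 680) & (income < 85_000)).astype(int)

X = np.column_stack([credit_score, income])
tree = DecisionTreeClassifier(criterion="gini", random_state=0)
tree.fit(X, default)

# Print the learned splits as nested if/else rules, analogous to the branches above.
print(export_text(tree, feature_names=["credit_score", "income"]))
```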

By default, in order to keep reducing Gini impurity, a decision tree will also grow dedicated branches for outlier cases, i.e., leaves that contain only a single observation (e.g., R12). The tree becomes complex and the model starts to over-fit: the complexity of the tree is directly proportional to over-fitting. Pruning is done to reduce the complexity and keep the tree manageable. Please refer to: https://www.kdnuggets.com/2022/09/decision-tree-pruning-hows-whys.html

Pruning reduces the size of a decision tree by removing parts of the tree that do not add power to classify instances. Decision trees are among the machine learning algorithms most susceptible to overfitting, and effective pruning reduces this likelihood.

The complexity of the tree is measured by the complexity parameter, CP. We tune this parameter to find the threshold amount of information a branch must add for it to be kept. This lets us eliminate branches that contribute little to information gain but increase complexity (and hence over-fitting). Identifying the optimal complexity of the tree in this way is what parameter tuning means here.
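CP is the complexity parameter from R's rpart; scikit-learn exposes the analogous knob as ccp_alpha (minimal cost-complexity pruning). A hedged sketch of tuning it on synthetic data, picking the alpha with the best cross-validated accuracy:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data standing in for the credit example.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# The pruning path gives the effective alphas (CP-like thresholds) at which
# subtrees of the fully grown tree are pruned away.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

best_alpha, best_score = 0.0, -np.inf
for alpha in path.ccp_alphas:
    scores = cross_val_score(
        DecisionTreeClassifier(random_state=0, ccp_alpha=alpha), X, y, cv=5
    )
    if scores.mean() > best_score:
        best_alpha, best_score = alpha, scores.mean()

print(f"best ccp_alpha: {best_alpha:.5f}, CV accuracy: {best_score:.3f}")

# Refit the pruned tree with the chosen alpha.
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X, y)
print("leaves in pruned tree:", pruned.get_n_leaves())
```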

Sometimes not all independent variables appear in the final decision tree, because some of them are unable to partition the data in a way that meaningfully reduces impurity.
