How does a Decision Tree Classifier work in Scikit-Learn?

Pratap R Jujjavarapu
Published in Analytics Vidhya · Sep 22, 2020 · 4 min read

What is a Decision Tree?

Based on the available dataset, a decision tree learns a hierarchy of if/else questions that ultimately leads to a decision. Decision Trees are widely used models for classification and regression tasks in Machine Learning. The classification can range from a binary classifier to multi-class classification, and the regression variant can predict values for test data (or new instances) after being trained on the training dataset.

The algorithm discussed here is CART (Classification and Regression Trees), which builds a decision tree for making classifications and predictions using Scikit-Learn's (sklearn) DecisionTreeClassifier and DecisionTreeRegressor classes located in sklearn.tree.

How does a DecisionTreeClassifier work?

A DecisionTreeClassifier takes the parameters below:

DecisionTreeClassifier(criterion='gini', max_depth=None, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False, random_state=42, splitter='best')
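For anyone on a recent scikit-learn release, the snippet below is a minimal sketch of constructing the classifier with those defaults spelled out. Note that presort and min_impurity_split are deprecated or removed in newer versions, so they are left out of the sketch.

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    criterion="gini",              # impurity measure used to rate candidate splits
    splitter="best",               # always pick the best split rather than a random one
    max_depth=None,                # grow until every leaf is pure (no pre-pruning)
    min_samples_split=2,           # a node needs at least this many samples to be split
    min_samples_leaf=1,            # a leaf must keep at least this many samples
    min_weight_fraction_leaf=0.0,
    max_features=None,             # consider every feature at each split
    max_leaf_nodes=None,
    min_impurity_decrease=0.0,
    random_state=42,               # makes tie-breaking between equal splits reproducible
)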

We focus mainly on the parameter criterion='gini'. Gini is the criterion the decision tree uses while splitting on the available features in the dataset (it measures the quality of a split) in order to reach a decision.

Consider the table below to decide which feature (Gender or Occupation) will get the first split in the decision tree. The quantitative measure of the quality of a split is given by the Gini impurity: what is the probability that we classify a randomly chosen data point incorrectly?

Step 1: Calculate the Gini impurity before splitting for the target column (Buying Apparel)

Gini overall = 1 - (probability of No)² - (probability of Yes)²

This gives Gini overall as 1 - ((4/8)² + (4/8)²) = 0.5
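A couple of lines of Python make the arithmetic explicit (gini_impurity is just an illustrative helper written for this walkthrough, not a scikit-learn function):

def gini_impurity(counts):
    # Gini impurity of a node, given the class counts inside that node
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Target column "Buying Apparel": 4 No and 4 Yes out of 8 rows
print(gini_impurity([4, 4]))   # 0.5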

Step 2: Select a feature (in this case Gender) and calculate its Gini split (the amount of impurity for a particular split).

For M, 1 - ((2/5)² + (3/5)²) = 0.48

For F, 1 - ((2/3)² + (1/3)²) ≈ 0.44

For Gender (weighted average), (5/8)*0.48 + (3/8)*0.44 ≈ 0.47. Hence the Gini split of the Gender column is about 0.47.
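The same calculation in plain Python, with the branch counts taken from the table:

gini_m = 1 - ((2/5)**2 + (3/5)**2)   # 0.48 for the 5 rows with Gender = M
gini_f = 1 - ((2/3)**2 + (1/3)**2)   # ~0.44 for the 3 rows with Gender = F

# Weight each branch by the fraction of rows it receives
gini_split_gender = (5/8) * gini_m + (3/8) * gini_f
print(round(gini_split_gender, 2))   # ~0.47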

Step 3: Calculate the Gini gain (the amount of impurity removed by splitting on a particular feature)

Gini Gain = Overall Gini - Gini Split = 0.5 - 0.47 = 0.03

Similarly, the Gini gain for Occupation = 0
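Putting Steps 1 to 3 together in code:

gini_overall = 1 - ((4/8)**2 + (4/8)**2)          # 0.5, from Step 1
gini_split_gender = (5/8) * 0.48 + (3/8) * (4/9)  # ~0.47, from Step 2
gini_gain_gender = gini_overall - gini_split_gender
print(round(gini_gain_gender, 2))                 # ~0.03

# A split whose branches keep the same 50/50 class mix (as Occupation does here)
# leaves the weighted impurity at 0.5, so its Gini gain is 0.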

The first split of the tree therefore happens on the Gender column, since its Gini gain is higher (the feature with the highest Gini gain is split first). The recursive partitioning of the data is repeated until each region of the partition (each leaf of the decision tree) contains only a single target value (a single class or a single regression value). A leaf whose data points all share the same target value is called pure.
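As a small illustration of growing until every leaf is pure, the sketch below fits an unpruned tree on a tiny made-up dataset and prints its structure with sklearn.tree.export_text; the feature names and values are invented for the example.

from sklearn.tree import DecisionTreeClassifier, export_text

X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]]   # made-up encoded features
y = [0, 0, 0, 1, 1, 1]                                  # made-up target labels

tree = DecisionTreeClassifier(criterion="gini", random_state=42).fit(X, y)
print(export_text(tree, feature_names=["gender_enc", "occupation_enc"]))
# Every printed leaf holds a single class, i.e. every leaf is pure.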

We can also inspect which features drive the splits by using the feature_importances_ attribute of a fitted DecisionTreeClassifier. DecisionTreeClassifier.feature_importances_ yields a 1-D array with one score per feature (in the original column order, summing to 1); features with higher scores account for more of the impurity reduction and therefore tend to be chosen for splits closer to the root.
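A minimal sketch of reading those scores, using the iris dataset purely for illustration:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
clf = DecisionTreeClassifier(random_state=42).fit(data.data, data.target)

# One importance score per feature, in the original column order; sort only for display
for idx in np.argsort(clf.feature_importances_)[::-1]:
    print(data.feature_names[idx], round(clf.feature_importances_[idx], 3))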

What is Pruning?

In order to avoid over-fitting to the training dataset, a tree should have its unnecessary branches pruned (cut back) rather than being grown until every node hits min_samples_split = 2 and min_samples_leaf = 1. There are two approaches: pre-pruning (limiting the tree before the splits happen) and post-pruning (growing the full tree and then cutting it back). Pre-pruning in scikit-learn is controlled mainly by three parameters: max_depth (limits the number of levels of the tree), min_samples_leaf (the minimum number of samples that must remain in each leaf after a split) and min_samples_split (the least number of samples a node must hold for a split to happen); newer releases also offer cost-complexity post-pruning via the ccp_alpha parameter. By tuning these parameters (by trial and error) we can balance train and test accuracy without over-fitting the model to a given dataset.

DecisionTreeClassifier for the breast_cancer data available in sklearn.datasets
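The code behind that run is not reproduced here, but a minimal sketch of such an experiment might look like the following (max_depth=4 is an arbitrary illustrative choice):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Fully grown tree: every leaf is pure, so the train accuracy is 1.0
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("unpruned train/test:", full.score(X_train, y_train), full.score(X_test, y_test))

# Pre-pruned tree: limiting the depth usually narrows the train/test gap
pruned = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_train, y_train)
print("max_depth=4 train/test:", pruned.score(X_train, y_train), pruned.score(X_test, y_test))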

Splitting in a decision tree is not affected by normalization or standardization of the features; the algorithm works well even when features are on completely different scales, or when they are a mix of discrete (binary) and continuous variables.
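One quick way to check the scale-invariance claim is to fit the same tree on raw and standardized versions of the features; the sketch below (an illustration, not code from the original article) typically prints the same test accuracy twice.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

raw = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

scaler = StandardScaler().fit(X_train)
scaled = DecisionTreeClassifier(random_state=42).fit(scaler.transform(X_train), y_train)

print(raw.score(X_test, y_test))                          # accuracy on raw features
print(scaled.score(scaler.transform(X_test), y_test))     # typically identical after scaling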

The main downside of decision trees is that, even with the use of pre-pruning, they tend to overfit and provide poor generalization performance. Therefore, in most applications, ensemble methods are used in place of a single decision tree.
