Tree Algorithms: Decision Trees

Bhanu Kiran
6 min read · Apr 28, 2022


https://media.giphy.com/media/U1kzXpYpIrXZ6/giphy.gif

One of the most popular models out there is the decision tree, or just "tree", used for both classification and regression. But what is a tree model? Breaking down the name itself, a decision means choosing between an array of options, and a tree, well, is a big green plant with branches. Sounds ridiculous, of course!

In this blog, I shall talk about the "behind-the-scenes" working of the decision tree algorithm; the last bit of the blog shows how to run a decision tree on a dataset in Python.

To put it into context, a decision tree model is, in simple terms, a set of if-else statements. Although decision trees are not the most powerful models, they are robust and a good choice in many contexts.

The algorithm underlying decision trees is known as Recursive Partitioning, and it is quite simple to understand.

Recursive Partitioning

This intuitive algorithm is what constructs a decision tree behind the scenes. When presented with data containing different labels, recursive partitioning repeatedly splits the data on predictor values so that each partition becomes as homogeneous as possible. In other words, the data is divided into partitions such that, ideally, each partition contains only one label.
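To make the idea concrete, here is a minimal sketch of recursive partitioning on true/false predictors. The data, feature names, and helper functions below are made up purely for illustration; real implementations also choose which feature to split on at each step rather than taking them in a fixed order.

# a minimal sketch of recursive partitioning on true/false predictors
def is_pure(records):
    # a partition is homogeneous when it holds a single label
    return len({r["label"] for r in records}) <= 1

def partition(records, features):
    # stop when the partition is pure or there is nothing left to split on
    if is_pure(records) or not features:
        return {"leaf": sorted({r["label"] for r in records})}
    feature = features[0]  # for brevity, split on the features in the given order
    yes = [r for r in records if r[feature]]
    no = [r for r in records if not r[feature]]
    return {"split_on": feature,
            "yes": partition(yes, features[1:]),
            "no": partition(no, features[1:])}

data = [{"A": True, "B": True, "label": "class 1"},
        {"A": True, "B": False, "label": "class 2"},
        {"A": False, "B": False, "label": "class 3"},
        {"A": False, "B": False, "label": "class 3"}]
print(partition(data, ["A", "B"]))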

To get a better idea, let’s try to partition animals! Consider the scatter plot below to be all the animals available in a list.

First, let’s partition them by checking if they have feathers or not.

From this partition, one can assume that the points in green and grey are birds, while the points in white are other animals. We can go further and partition to see if the birds can fly or not.

The birds in grey cannot fly while the birds in green can. The same structure can be visualized in the form of a tree, with the root node as animals.
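To tie this back to the idea of if-else statements, the same animal tree can be written out directly as code. This is only an illustrative sketch; the function name and labels simply mirror the example.

# the animal tree written as plain if-else logic
def classify_animal(has_feathers, can_fly):
    if has_feathers:
        if can_fly:
            return "bird that can fly"
        return "bird that cannot fly"
    return "not a bird"

print(classify_animal(has_feathers=True, can_fly=False))  # bird that cannot fly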

In general, recursive partitioning creates a decision tree, more like an upside-down tree that starts with the root node at the top and branches down to its leaves. The aim is to correctly classify the members of a sample based on predictor variables; in this example, each split asks a question whose answer takes one of only two possible values.

The way the decisions are made at each node is through majority votes, and these are based on class proportions:

p = P / Ptotal

where P is the number of points of a given class in the partition and Ptotal is the total number of points in that partition.

If we observe the class proportions, we see that there are 15 green points, 8 grey points, and 12 white points. The root node starts with all 35 points, so each proportion is taken out of the whole.

To understand the majority vote, let's start from the root node, which checks whether an animal has feathers. Since we know that the green and grey points are birds, 14 of the 15 green points and 7 of the 8 grey points have feathers, so they land in the "has feathers" partition and are classified as birds. The remaining 1/15 green, 1/8 grey, and 12/12 white points land in the other partition; since the white points are the majority there, that partition is classified as animals that are not birds. The process repeats until a target is reached, in this case narrowing a bird down to, say, a pigeon.
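As a quick sketch of how the class proportions and the majority vote of a partition can be computed (the labels below simply mirror the 15 green, 8 grey, and 12 white points of the example):

from collections import Counter

# class proportions and the majority vote for one partition
points = ["green"] * 15 + ["grey"] * 8 + ["white"] * 12
counts = Counter(points)
total = sum(counts.values())
proportions = {label: count / total for label, count in counts.items()}
majority_label = counts.most_common(1)[0][0]
print(proportions)     # roughly {'green': 0.43, 'grey': 0.23, 'white': 0.34}
print(majority_label)  # green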

Of course, you can see that among the animals classified as not birds, there is still one green point and one grey point in the partition. These points are known as impurities. In order to minimize the classification error, we have to increase the degree of purity. This can be done using the two common measures of impurity: Gini impurity and entropy.
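As a small sketch, Gini impurity is one minus the sum of the squared class proportions; the helper name gini_impurity and the sample labels below are my own, chosen to mirror the "not birds" partition above.

from collections import Counter

# Gini impurity: 1 minus the sum of squared class proportions
def gini_impurity(labels):
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

print(gini_impurity(["not bird"] * 12))                         # 0.0, a pure partition
print(gini_impurity(["not bird"] * 12 + ["green"] + ["grey"]))  # about 0.26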

We have to decrease the impurities so that each leaf node ends up with only one class label. One measure used to do this comes from information theory and is known as Shannon's entropy.

Shannon’s Entropy

The idea, in a nutshell, is that it gives you a way to measure the amount of information (or uncertainty) in a set of labels, and it's quite simple:

H = - Σ (i = 1 to c) Pi * log2(Pi)

where Pi is the probability, in this case the fraction, of each class, and c is the number of class labels.
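As a sketch, the formula translates almost directly into Python; the helper name entropy is my own.

import math
from collections import Counter

# Shannon's entropy of a list of class labels, following the formula above
def entropy(labels):
    total = len(labels)
    result = 0.0
    for count in Counter(labels).values():
        p = count / total            # Pi, the fraction of this class
        result -= p * math.log2(p)
    return result

print(entropy(["bird"] * 10))                 # 0.0, a perfectly pure set
print(entropy(["bird"] * 5 + ["other"] * 5))  # 1.0, an even 50/50 split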

To increase purity, our goal is to minimize Shannon's entropy at each split. For a candidate split, we calculate the entropy of the left and right partitions and add them up, weighting each side by its share of the points.

When making decision trees by hand, we can write the if-else logic without thinking about entropy at all. But when constructing a tree automatically from the partitions, we need a way to decide whether a node should be split into leaves, and that decision is driven by entropy. In our case, the tree splits on true or false values. If the data at a node is completely homogeneous, its entropy is 0; if it is split equally between two classes, the entropy is 1. This means the entropy of a binary split lies in the range 0 to 1. Therefore, we calculate the entropy at each node and keep splitting while the value is far from 0; the closer the entropy is to 0, the purer the classification.
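As a sketch of that splitting decision, the entropy of a candidate split is the weighted sum of the child entropies. The helper names are my own, and the numbers mirror the "has feathers" split of the animal example.

import math
from collections import Counter

# entropy helper from the previous sketch, repeated here so the snippet runs on its own
def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

# entropy of a candidate split: the weighted sum of the child entropies
def split_entropy(left_labels, right_labels):
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * entropy(left_labels) \
         + (len(right_labels) / n) * entropy(right_labels)

has_feathers = ["bird"] * 21                     # the birds that have feathers
no_feathers = ["not bird"] * 12 + ["bird"] * 2   # the rest, plus two impurities
print(split_entropy(has_feathers, no_feathers))  # roughly 0.24, a good split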

The Tree Keeps Growing!

As the tree grows bigger and bigger, the splitting rules become more detailed, which shifts the tree from identifying broad rules to identifying rules that reflect only noise. Thus, a fully grown tree has only pure leaves and scores 100% accuracy on the training data; in other words, the tree tends to overfit. Fitting the noise in the training data is not what we want, since the goal is to generalize to new data.

One way to stop this from happening is to combine cross-validation with either systematically changing the model parameters or modifying the tree through pruning.
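As a sketch of what that can look like with scikit-learn, one option is to cross-validate over the parameters that limit tree growth (max_depth, min_samples_leaf) or prune it after training (ccp_alpha, available in recent scikit-learn versions). The iris dataset is used here only as a convenient stand-in.

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# cross-validated search over parameters that keep the tree from overfitting
X, y = load_iris(return_X_y=True)
param_grid = {
    "max_depth": [2, 3, 5, None],
    "min_samples_leaf": [1, 5, 10],
    "ccp_alpha": [0.0, 0.01, 0.05],  # cost-complexity pruning strength
}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
print(search.best_score_)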

To sum it all up, a basic decision tree tends to overfit, becomes more complex as the number of features and classes grows, and can end up suboptimal without tuning.

Decision Tree in Python

The dataset used in this code is the Red Wine Quality dataset found on Kaggle.

Python

#import libraries
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
# import the dataset in a Kaggle notebook
data = pd.read_csv('../input/red-wine-quality-cortez-et-al-2009/winequality-red.csv')
# split the data into target and features
target = data.iloc[:, -1]
features = data.drop(['quality'], axis = 1)
# set X as features and y as target for a 70/30 train/test split
X = features
y = target
# test train split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# model the decision tree
clf = DecisionTreeClassifier()
# Train Decision Tree Classifier
clf = clf.fit(X_train,y_train)
#Predict the response for test dataset
y_pred = clf.predict(X_test)
# print accuracy of prediction
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

So that’s the decision tree and what happens in the background when you’re modeling your algorithm. To read about these, you can follow these books which I have used for reference —

Practical Statistics for Data Scientists: 50+ Essential Concepts Using R and Python

An Introduction to Statistical Learning: with Applications in R

You can also read about linear regression and hierarchical clustering in my other blogs:

Clustering Analysis: Hierarchical Clustering

Linear Regression, an in-depth view
