
Decision trees in Python

Alessandro · Published in CodeX · 5 min read · Aug 27, 2021

Decision trees are a supervised machine learning model used for both classification and regression tasks (CART). They are easy to implement and explain, and they are among the most widely used machine learning tools.

Trees are non-parametric, nonlinear models: their function can take as many parameters as needed (unlike a linear regression, which is fixed to an intercept a and a slope b) and they generate a nonlinear mapping of X to y.

Classification and regression trees (CART)

Decision trees can be split into 2 categories:

  1. Regression trees: decision trees where the target variable can take continuous values (real numbers).
  2. Classification trees: tree models where the target variable takes a discrete set of values (classes).

All trees are composed of nodes and branches. The first node is also called the root node, the final nodes (predictions) are called leaves and the nodes in between are the interior or decision nodes. Nodes are connected by branches.

Fig 1 — Decision tree

Main idea: Information Gain (IG)

A decision tree will split each parent node into two child nodes based on the feature that achieves the maximum Information Gain at that stage. This process starts at the root node (first split) and continues until all child nodes are pure, or until the information gain is 0 (unless we manually set a “max_depth” parameter).

The information gain measures how much the entropy (a measure of the impurity, or randomness, of a dataset) is reduced by splitting the data on a given feature. The tree grows its branches by consecutively splitting on features, from the most predictive to the least.

Minimizing the entropy is equivalent to maximizing the information gain and that’s exactly what the tree aims to achieve at each split.
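
To make this concrete, here is a minimal sketch (on toy labels of my own, not part of the Iris example below) of how the entropy and the information gain of one candidate split could be computed:

import numpy as np

def entropy(labels):
    # Entropy (in bits) of a set of class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Toy parent node with two classes, split into two pure children
parent = np.array(["A", "A", "A", "B", "B", "B"])
left = np.array(["A", "A", "A"])
right = np.array(["B", "B", "B"])

children_entropy = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
information_gain = entropy(parent) - children_entropy
print(information_gain) -> Out: 1.0, the split removes all impurity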

The two main criteria to measure the impurity I of a node, in a dataset with n different classes where p_i is the proportion of samples belonging to class i, are the Gini index:

I_Gini = 1 - Σ p_i²

and the Entropy:

I_Entropy = - Σ p_i log₂(p_i)

The Gini index tends to be computationally more efficient, hence it is the criterion most often implemented in decision tree algorithms. To anyone interested in knowing more about it I recommend this video, which helped me visualize the idea behind the Gini index and ultimately better understand how a tree model works.
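
In Scikit-Learn the impurity measure is chosen with the criterion parameter of DecisionTreeClassifier: "gini" is the default, and "entropy" can be used if you prefer to split on information gain.

from sklearn.tree import DecisionTreeClassifier

gini_tree = DecisionTreeClassifier(criterion="gini", random_state=0)  # default criterion
entropy_tree = DecisionTreeClassifier(criterion="entropy", random_state=0)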

Implementing a Classification Tree with Scikit-Learn

In this section, I will write some code to demonstrate a simple application of a classification tree on the well-known Iris dataset. This is easily accessed through Scikit-Learn and is ready to be worked on:

from sklearn import datasets
iris = datasets.load_iris()

…but I’ll import it as a .csv to make this application more general.
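
If you don’t already have the file, one way to build a similar Iris.csv from the built-in dataset is sketched below; note that the column and class names here are my own choice, picked to match the ones used in the rest of this post, and the path is the same one read later on:

from sklearn import datasets
import pandas as pd

iris = datasets.load_iris()
# Column names chosen to match the CSV used in the rest of the post
df = pd.DataFrame(iris.data, columns=["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"])
df["Species"] = ["Iris-" + iris.target_names[t] for t in iris.target]
df.to_csv("Desktop/Iris.csv", index=False)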

Let’s start importing all the modules:

import pandas as pd
import numpy as np
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

Import and read the dataset:

df = pd.read_csv("Desktop/Iris.csv")
print(df)
Fig 2 — Iris dataset

Our features X are SepalLengthCm, SepalWidthCm, PetalLengthCm and PetalWidthCm, and our target variable y is the Species.

To use our features we need to create a matrix X where each row represents an instance and each column a feature. We can achieve this by building a NumPy array from the columns of interest and then transposing it. We also need to create a column vector y holding our target variable (the flower species):

X = np.array([df.SepalLengthCm, df.SepalWidthCm, df.PetalLengthCm, df.PetalWidthCm]).T
y = np.array(df.Species).reshape(-1,1)

print(X.shape) -> Out: (150, 4)
print(y.shape) -> Out: (150, 1)

(Note: Trees don’t require feature scaling so we don’t worry about that.)

Once we have our data well arranged we can go ahead and create our decision tree classifier:

dtc = DecisionTreeClassifier(max_depth = 4, random_state=0)

We then split the dataset into training and test-set:

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
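
By default train_test_split holds out 25% of the data as the test set (38 of the 150 flowers here). If you want a different split you can pass test_size explicitly, for example (I’ll keep the default for the rest of the post):

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)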

Fit the classifier to the training set:

dtc = dtc.fit(X_train, y_train)

And finally, score the classifier on the test-set to check the model accuracy:

score = dtc.score(X_test, y_test)
print(score) -> Out: 0.9736842105263158

Our model is complete: 97% accuracy, not bad at all for such a simple classifier!
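
Accuracy alone doesn’t show which classes get confused with each other, so it can help to look at a per-class breakdown. Here is a small sketch using Scikit-Learn’s metrics on the same test set:

from sklearn.metrics import confusion_matrix, classification_report

y_pred = dtc.predict(X_test)
print(confusion_matrix(y_test.ravel(), y_pred))
print(classification_report(y_test.ravel(), y_pred))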

This is what our tree looks like:

fn = ['sepal length', 'sepal width', 'petal length', 'petal width']
cn = ['Setosa', 'Versicolor', 'Virginica']
plt.figure(figsize=(10,10))
tree.plot_tree(dtc, feature_names=fn, class_names=cn, filled=True)
plt.show()
Fig 3 — Our classification tree
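
If you prefer a plain-text view of the same splits (handy outside a notebook), Scikit-Learn also provides export_text:

from sklearn.tree import export_text

print(export_text(dtc, feature_names=fn))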

Using our model to make predictions

Given a new flower (instance), we can predict its species (class) just like so:

new_flower = np.array([5.5, 2.8, 5, 2]).reshape(1,-1)
prediction = dtc.predict(new_flower)
print(prediction) -> Out: ['Iris-virginica']
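
If you also want the class-membership probabilities behind that prediction, predict_proba returns them in the order given by dtc.classes_:

print(dtc.classes_)
print(dtc.predict_proba(new_flower))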

Our classifier seems to be working well and we’ve done a pretty good job for today!

Decision trees can be tuned (their hyperparameters), boosted and also combined to create random forests. It’s not over…
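
As a teaser, here is a minimal sketch (assuming the same X_train, y_train, X_test and y_test from above) of what tuning the hyperparameters with a grid search and fitting a random forest could look like:

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Try a small grid of tree hyperparameters with 5-fold cross-validation
param_grid = {"max_depth": [2, 3, 4, 5], "min_samples_leaf": [1, 2, 5]}
grid = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
grid.fit(X_train, y_train.ravel())
print(grid.best_params_, grid.best_score_)

# A random forest: an ensemble of many decision trees on the same data
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train.ravel())
print(rf.score(X_test, y_test.ravel()))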

Coding a basic decision tree

Putting it all together, the code to implement this classification tree is:

import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
df = pd.read_csv("Desktop/Iris.csv")

X = np.array([df.SepalLengthCm, df.SepalWidthCm, df.PetalLengthCm, df.PetalWidthCm]).T
y = np.array(df.Species).reshape(-1,1)

dtc = DecisionTreeClassifier(max_depth = 4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
dtc = dtc.fit(X_train, y_train)
score = dtc.score(X_test, y_test)
print(score)
