Simple Decision Tree Classifier using Python | Daily Python #23

Ajinkya Sonawane
Published in Daily Python
6 min read · Jan 29, 2020

This article is a tutorial on how to implement a decision tree classifier using Python.

This article is part of the Daily Python challenge that I have taken up for myself. I will be writing short Python articles daily.

Requirements:

  1. Python 3.0
  2. Pip

Install the following packages:

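  1. pandas: BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools.
  2. sklearn: provides dozens of built-in machine learning algorithms and models. The package is installed with pip under the name scikit-learn and imported as sklearn.
  3. graphviz: facilitates the creation and rendering of graph descriptions in the DOT language. Rendering also requires the Graphviz system binaries to be installed.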
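
Before going further, it can help to confirm that the three packages are importable. The snippet below is a minimal sketch of such a check; the helper name check_packages is my own and not part of any library.

```python
from importlib.util import find_spec


def check_packages(names):
    """Map each top-level module name to True if Python can locate it."""
    return {name: find_spec(name) is not None for name in names}


# Note: the scikit-learn distribution is imported under the name "sklearn".
for name, found in check_packages(["pandas", "sklearn", "graphviz"]).items():
    print(f"{name}: {'installed' if found else 'missing'}")
```

Any package reported as missing can be installed with pip before running the rest of the code.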

What is a Classifier?

According to Wikipedia — An algorithm that implements classification, especially in a concrete implementation, is known as a classifier. The term “classifier” sometimes also refers to the mathematical function, implemented by a classification algorithm, that maps input data to a category.

Then, what is a Decision Tree?

A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains conditional control statements.

A decision tree consists of three types of nodes:

  1. Decision nodes — typically represented by squares
  2. Chance nodes — typically represented by circles
  3. End nodes — typically represented by triangles

A Decision Tree Classifier assigns each input to a class by following the tree that was built from the training data.
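
To make the idea of "an algorithm that only contains conditional control statements" concrete, here is a hand-written tree as a minimal sketch. The thresholds and the class labels "Member" and "Casual" are invented for illustration; a real classifier learns its splits from the training data.

```python
def classify_trip(duration_seconds, start_station, end_station):
    """Toy decision tree written by hand; every threshold here is made up."""
    if duration_seconds <= 1800:
        # Short trips: a round trip to the same station looks like a casual ride
        if start_station == end_station:
            return "Casual"
        return "Member"
    # Long trips are treated as casual rides in this toy tree
    return "Casual"


print(classify_trip(600, 31000, 31001))
print(classify_trip(3600, 31000, 31000))
```

Each if statement plays the role of a decision node, and each return statement plays the role of an end node.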

Advantages of decision trees

Among decision support tools, decision trees (and influence diagrams) have several advantages. Decision trees:

  • Are simple to understand and interpret. People are able to understand decision tree models after a brief explanation.
  • Have value even with little hard data. Important insights can be generated based on experts describing a situation (its alternatives, probabilities, and costs) and their preferences for outcomes.
  • Help determine worst, best and expected values for different scenarios.
  • Use a white-box model. If a model produces a given result, the conditions that led to it can be read directly from the tree.
  • Can be combined with other decision techniques.

Disadvantages of decision trees:

  • They are unstable, meaning that a small change in the data can lead to a large change in the structure of the optimal decision tree.
  • They are often relatively inaccurate. Many other predictors perform better with similar data. This can be remedied by replacing a single decision tree with a random forest of decision trees, but a random forest is not as easy to interpret as a single decision tree.
  • For data including categorical variables with a different number of levels, information gain in decision trees is biased in favor of those attributes with more levels.
  • Calculations can get very complex, particularly if many values are uncertain and/or if many outcomes are linked.
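
The bias of information gain toward attributes with many levels can be seen in a small calculation. The sketch below assumes a classical multiway split (one branch per attribute value, as in ID3), which is not how scikit-learn splits; the data and attribute names are made up.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Entropy reduction from splitting the labels on each distinct attribute value."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


labels = ["Member", "Member", "Casual", "Casual", "Member", "Casual"]
weekend = [False, False, True, True, False, False]   # attribute with 2 levels
bike_id = ["b1", "b2", "b3", "b4", "b5", "b6"]       # attribute with 6 levels

gain_weekend = information_gain(weekend, labels)
gain_bike_id = information_gain(bike_id, labels)
print("weekend:", round(gain_weekend, 3))
print("bike_id:", round(gain_bike_id, 3))
```

The bike ID puts every row in its own branch, so it gets the highest possible gain even though it would be useless for predicting new trips.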

Refer to the Wikipedia article quoted above for a detailed explanation of the Decision Tree Classifier and the various methods that can be used to build a decision tree.

What is scikit-learn?

Scikit-learn (formerly scikits.learn and also known as sklearn) is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.

Data that we will be using for classification

Capital Bikeshare is metro DC’s bike-share service, with 4,300 bikes and 500+ stations across 7 jurisdictions: Washington, DC; Arlington, VA; Alexandria, VA; Montgomery, MD; Prince George’s County, MD; Fairfax County, VA; and the City of Falls Church, VA. Designed for quick trips with convenience in mind, it’s a fun and affordable way to get around.

Trip History Data

Each quarter, we publish downloadable files of Capital Bikeshare trip data. The data includes:

  • Duration — Duration of trip
  • Start Date — Includes start date and time
  • End Date — Includes end date and time
  • Start Station — Includes starting station name and number
  • End Station — Includes ending station name and number
  • Bike Number — Includes ID number of bike used for the trip
  • Member Type — Indicates whether user was a “registered” member (Annual Member, 30-Day Member or Day Key Member) or a “casual” rider (Single Trip, 24-Hour Pass, 3-Day Pass or 5-Day Pass)

This data has been processed to remove trips that are taken by staff as they service and inspect the system, trips that are taken to/from any of our “test” stations at our warehouses and any trips lasting less than 60 seconds (potentially false starts or users trying to re-dock a bike to ensure it’s secure).
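
As a rough sketch of what the trip data looks like once loaded, the snippet below reads a few made-up rows with pandas and keeps the four columns used later in this article. The rows and station numbers are invented, and the real files contain more columns than shown here.

```python
import io

import pandas as pd

# Invented sample rows; only the column names follow the code in this article.
sample_csv = io.StringIO(
    "Duration,Start station number,End station number,Bike number,Member type\n"
    "552,31203,31200,W21713,Member\n"
    "1282,31321,31114,W21096,Casual\n"
    "230,31104,31323,W21895,Member\n"
    "3650,31281,31281,W23320,Casual\n"
)

trips = pd.read_csv(sample_csv)
# Three numeric features followed by the target column
trips = trips[["Duration", "Start station number", "End station number", "Member type"]]

print(trips.dtypes)
print(trips["Member type"].value_counts())
```

Keeping the target as the last column matters because the later functions select it by position.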

Figure: Snip of the data being used

Import the required libraries

# Importing the required packages 
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
import graphviz

Function to import the data from the CSV file

# Function to import the dataset
def importdata():
    balance_data = pd.read_csv('capitalshare.csv')
    # Keep the three feature columns and the target column (last)
    balance_data = balance_data[['Duration', 'Start station number',
                                 'End station number', 'Member type']]

    # Printing the dataset shape
    print("Dataset Length: ", len(balance_data))
    print("Dataset Shape: ", balance_data.shape)

    # Printing the first few observations
    print("Dataset: ", balance_data.head())
    return balance_data

Function to split the data into training and test data

# Function to split the dataset
def splitdataset(balance_data):
    # Separating the features (all but the last column) from the target
    X = balance_data.values[:, :-1]
    Y = balance_data.values[:, -1]

    # Splitting the dataset into 70% training and 30% test data
    X_train, X_test, y_train, y_test = train_test_split(
        X, Y, test_size=0.3, random_state=100)

    return X_train, X_test, y_train, y_test
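
To see what the 70/30 split produces, here is a small sketch on made-up data. The stratify argument is an optional addition that is not used in the function above; it asks train_test_split to keep the class proportions similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten invented samples with two features each, and an imbalanced target
X = np.arange(20).reshape(10, 2)
y = np.array(["Member"] * 6 + ["Casual"] * 4)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=100, stratify=y)

print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)
```

Fixing random_state makes the split reproducible, so repeated runs train and evaluate on the same rows.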

Function to visualize the developed tree model

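# Function to visualize the tree
def visualize_tree(data, clf, clf_name):
    # Every column except the last (the target) is a feature
    features = list(data.columns[:-1])
    # Take the class names from the fitted classifier so that their
    # order matches the order the tree uses internally
    class_names = [str(c) for c in clf.classes_]
    dot_data = tree.export_graphviz(
        clf, out_file=None,
        feature_names=features, class_names=class_names,
        filled=True, rounded=True, special_characters=True)
    graph = graphviz.Source(dot_data)
    # Renders the tree to a file (PDF by default) and opens it in a viewer
    graph.render('dtree_render_' + clf_name, view=True)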
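
If Graphviz is not available, scikit-learn can also print a fitted tree as plain text with export_text. The sketch below uses a tiny invented dataset, so the split it finds says nothing about the real trip data.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented rows: [Duration, Start station number, End station number]
X = [[300, 31000, 31001],
     [420, 31002, 31003],
     [3600, 31000, 31000],
     [5400, 31005, 31005]]
y = ["Member", "Member", "Casual", "Casual"]
feature_names = ["Duration", "Start station number", "End station number"]

clf = DecisionTreeClassifier(criterion="gini", random_state=100, max_depth=3)
clf.fit(X, y)

# One line per node: the split condition, or the predicted class at a leaf
rules = export_text(clf, feature_names=feature_names)
print(rules)
```

The text form is handy for a quick look at a small tree, while the Graphviz rendering is easier to read for deeper trees.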

Function to train the decision tree using Gini Index

# Function to perform training with the Gini index
def train_using_gini(X_train, X_test, y_train, data):
    # Creating the classifier object
    clf_gini = DecisionTreeClassifier(
        criterion="gini", random_state=100,
        max_depth=3, min_samples_leaf=5)
    # Performing training
    clf_gini.fit(X_train, y_train)
    visualize_tree(data, clf_gini, 'gini')
    print('\nFeature Importance : ', clf_gini.feature_importances_)
    return clf_gini
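
The Gini criterion measures how mixed the classes are in a node. The sketch below computes it by hand on made-up labels; it illustrates the formula and is not how scikit-learn implements it internally.

```python
from collections import Counter


def gini_impurity(labels):
    """1 minus the sum of squared class proportions; 0 means a pure node."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def weighted_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    total = len(left) + len(right)
    return (len(left) / total) * gini_impurity(left) \
        + (len(right) / total) * gini_impurity(right)


parent = ["Member"] * 3 + ["Casual"] * 3
left = ["Member", "Member", "Member", "Casual"]
right = ["Casual", "Casual"]

print("Parent impurity:", gini_impurity(parent))  # 0.5
print("Impurity after split:", round(weighted_gini(left, right), 3))
```

A split is useful when the weighted impurity of the children is lower than the impurity of the parent, as it is here.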

Function to train the decision tree using Entropy

# Function to perform training with entropy
def tarin_using_entropy(X_train, X_test, y_train, data):
    # Decision tree with entropy
    clf_entropy = DecisionTreeClassifier(
        criterion="entropy", random_state=100,
        max_depth=3, min_samples_leaf=5)
    # Performing training
    clf_entropy.fit(X_train, y_train)
    visualize_tree(data, clf_entropy, 'entropy')
    print('\nFeature Importance : ', clf_entropy.feature_importances_)
    return clf_entropy

Function to predict the class after training

# Function to make predictions
def prediction(X_test, clf_object):
    # Prediction on the test data with the given classifier
    y_pred = clf_object.predict(X_test)
    print("Predicted values:")
    print(y_pred)
    return y_pred

Function to calculate accuracy

# Function to calculate accuracy
def cal_accuracy(y_test, y_pred):
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
    print("Accuracy: ", accuracy_score(y_test, y_pred) * 100)
    print("Report:\n", classification_report(y_test, y_pred))
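
To show what the confusion matrix and accuracy mean, the sketch below computes both by hand on made-up labels. It follows the convention that rows are true classes and columns are predicted classes, with classes in sorted order; the helper names are my own.

```python
def confusion_counts(y_true, y_pred):
    """Return the sorted labels and a matrix of counts[true][predicted]."""
    labels = sorted(set(y_true) | set(y_pred))
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return labels, matrix


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


y_true = ["Member", "Member", "Casual", "Member", "Casual"]
y_pred = ["Member", "Casual", "Casual", "Member", "Member"]

labels, matrix = confusion_counts(y_true, y_pred)
print(labels)
print(matrix)
print("Accuracy:", accuracy(y_true, y_pred) * 100)
```

The diagonal of the matrix counts the correct predictions, so accuracy is the diagonal sum divided by the number of samples.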

Main Process

# Main process
def main():
    # Building Phase
    data = importdata()
    X_train, X_test, y_train, y_test = splitdataset(data)
    clf_gini = train_using_gini(X_train, X_test, y_train, data)
    clf_entropy = tarin_using_entropy(X_train, X_test, y_train, data)

    # Operational Phase
    print("Results Using Gini Index:")
    # Prediction using gini
    y_pred_gini = prediction(X_test, clf_gini)
    cal_accuracy(y_test, y_pred_gini)

    print("Results Using Entropy:")
    # Prediction using entropy
    y_pred_entropy = prediction(X_test, clf_entropy)
    cal_accuracy(y_test, y_pred_entropy)


# Calling main function
if __name__ == "__main__":
    main()

Figure: Snip of the output of the above code
Figure: Decision Tree built using Gini Index
Figure: Decision Tree built using Entropy
