Simple Decision Tree Classifier using Python | Daily Python #23

Published in Daily Python · 6 min read · Jan 29, 2020

This article is a tutorial on how to implement a decision tree classifier using Python.

This article is part of the Daily Python challenge that I have taken up for myself. I will be writing short Python articles daily.

Requirements:

Python 3.0
Pip

Install the following packages:

pandas — BSD-licensed library providing high-performance, easy-to-use data structures, and data analysis tools.
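sklearn — the scikit-learn library (installed from pip as scikit-learn, imported as sklearn); provides dozens of built-in machine learning algorithms and models
graphviz — facilitates the creation and rendering of graph descriptions in the DOT language; rendering also requires the Graphviz system software to be installed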
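Before running the code in this article, it can help to confirm that the three packages are importable. The sketch below uses only the standard library; the helper name `missing_packages` is made up for this illustration and is not part of any of the packages.

```python
from importlib.util import find_spec


def missing_packages(module_names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]


# Import names used in this article (scikit-learn is imported as "sklearn").
required = ["pandas", "sklearn", "graphviz"]
missing = missing_packages(required)

if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

This only checks the Python packages. The Graphviz system software that renders the tree image has to be installed separately.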

What is a Classifier?

According to Wikipedia — An algorithm that implements classification, especially in a concrete implementation, is known as a classifier. The term “classifier” sometimes also refers to the mathematical function, implemented by a classification algorithm, that maps input data to a category.

Then, what is a Decision Tree?

A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains conditional control statements.

A decision tree consists of three types of nodes:

Decision nodes — typically represented by squares
Chance nodes — typically represented by circles
End nodes — typically represented by triangles

A Decision Tree Classifier assigns each data point to a class by following the tree that was built from the training data.
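Since a decision tree is just a set of nested conditional statements, a tiny one can be written by hand. The sketch below is purely illustrative: the function name `classify_trip`, the 1800-second threshold, and the rules themselves are invented for this example and are not learned from the bike-share data.

```python
def classify_trip(duration_seconds, start_station, end_station):
    """Toy decision tree: every internal node is a single if/else test."""
    if duration_seconds <= 1800:             # root node: is it a short trip?
        if start_station == end_station:     # second node: is it a round trip?
            return "casual"                  # leaf
        return "registered"                  # leaf
    return "casual"                          # leaf: long trips


print(classify_trip(600, 31000, 31005))    # short one-way trip
print(classify_trip(600, 31000, 31000))    # short round trip
print(classify_trip(4000, 31000, 31005))   # long trip
```

A trained classifier does the same thing, except that the tests and thresholds are chosen automatically from the training data.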

Advantages of decision trees

Among decision support tools, decision trees (and influence diagrams) have several advantages. Decision trees:

Are simple to understand and interpret. People are able to understand decision tree models after a brief explanation.
Have value even with little hard data. Important insights can be generated based on experts describing a situation (its alternatives, probabilities, and costs) and their preferences for outcomes.
Help determine worst, best and expected values for different scenarios.
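Use a white-box model. If a given result is provided by a model, the explanation for that result can be traced through the conditions along the path from the root to the leaf.
Can be combined with other decision techniques.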

Disadvantages of decision trees:

They are unstable, meaning that a small change in the data can lead to a large change in the structure of the optimal decision tree.
They are often relatively inaccurate. Many other predictors perform better with similar data. This can be remedied by replacing a single decision tree with a random forest of decision trees, but a random forest is not as easy to interpret as a single decision tree.
For data including categorical variables with a different number of levels, information gain in decision trees is biased in favor of those attributes with more levels.
Calculations can get very complex, particularly if many values are uncertain and/or if many outcomes are linked.
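The random forest remedy mentioned in the list above can be tried with scikit-learn's `RandomForestClassifier`. The sketch below is only an illustration on synthetic data from `make_classification`, not on the bike-share data; the sample size, depth, and number of trees are arbitrary assumptions, and the scores will differ on real data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data, used here only as a stand-in dataset.
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# One shallow tree versus a forest of 100 shallow trees.
tree_clf = DecisionTreeClassifier(max_depth=3, random_state=0)
tree_clf.fit(X_train, y_train)
forest_clf = RandomForestClassifier(
    n_estimators=100, max_depth=3, random_state=0)
forest_clf.fit(X_train, y_train)

# score() returns the mean accuracy on the given test data.
tree_acc = tree_clf.score(X_test, y_test)
forest_acc = forest_clf.score(X_test, y_test)
print("Single tree accuracy  :", tree_acc)
print("Random forest accuracy:", forest_acc)
```

The trade-off described above still applies: the forest is an ensemble of 100 trees, so it cannot be read off as a single diagram.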

Chapter 3 : Decision Tree Classifier — Theory

medium.com

Go through the above article for a detailed explanation of the Decision Tree Classifier and the various methods which can be used to build a decision tree.

What is Scikit-learn?

Scikit-learn (formerly scikits.learn and also known as sklearn) is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.

Data that we will be using for classification

Capital Bikeshare is metro DC’s bike-share service, with 4,300 bikes and 500+ stations across 7 jurisdictions: Washington, DC; Arlington, VA; Alexandria, VA; Montgomery, MD; Prince George’s County, MD; Fairfax County, VA; and the City of Falls Church, VA. Designed for quick trips with convenience in mind, it’s a fun and affordable way to get around.

Trip History Data

Each quarter, Capital Bikeshare publishes downloadable files of its trip data. The data includes:

Duration — Duration of trip
Start Date — Includes start date and time
End Date — Includes end date and time
Start Station — Includes starting station name and number
End Station — Includes ending station name and number
Bike Number — Includes ID number of bike used for the trip
Member Type — Indicates whether user was a “registered” member (Annual Member, 30-Day Member or Day Key Member) or a “casual” rider (Single Trip, 24-Hour Pass, 3-Day Pass or 5-Day Pass)

This data has been processed to remove trips taken by staff as they service and inspect the system, trips taken to/from any of Capital Bikeshare’s “test” stations at its warehouses, and any trips lasting less than 60 seconds (potentially false starts or users trying to re-dock a bike to ensure it’s secure).
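To get a feel for the columns used later in this article, the sketch below loads a few rows into pandas and checks the column types and the class balance. The three rows and the label values `Member` and `Casual` are made up for illustration; the real file has many more rows and additional columns such as the start and end dates.

```python
import io

import pandas as pd

# Made-up rows that mimic the four columns used for classification.
sample_csv = io.StringIO(
    "Duration,Start station number,End station number,Member type\n"
    "552,31104,31200,Member\n"
    "1282,31230,31620,Casual\n"
    "330,31104,31104,Member\n"
)
trips = pd.read_csv(sample_csv)

# Numeric feature columns and a text target column.
print(trips.dtypes)

# How many trips fall into each class.
member_counts = trips["Member type"].value_counts()
print(member_counts)
```

Checking the class balance first is useful because accuracy alone can be misleading when one class is much more common than the other.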

Import the required libraries

# Importing the required packages 
import numpy as np 
import pandas as pd 
from sklearn.metrics import confusion_matrix 
from sklearn.model_selection import train_test_split 
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.metrics import accuracy_score 
from sklearn.metrics import classification_report 
import graphviz

Function to import the data from the CSV file

# Function importing Dataset
def importdata():
    balance_data = pd.read_csv('capitalshare.csv')
    # Keep only the three feature columns and the target column (last)
    balance_data = balance_data[['Duration', 'Start station number',
                                 'End station number', 'Member type']]

    # Printing the dataset shape
    print("Dataset Length: ", len(balance_data))
    print("Dataset Shape: ", balance_data.shape)

    # Printing the first dataset observations
    print("Dataset: ", balance_data.head())
    return balance_data

Function to split the data into training and test data

# Function to split the dataset
def splitdataset(balance_data):
    # Separating the target variable (last column) from the features
    X = balance_data.values[:, :-1]
    Y = balance_data.values[:, -1]

    # Splitting the dataset into train (70%) and test (30%)
    X_train, X_test, y_train, y_test = train_test_split(
        X, Y, test_size=0.3, random_state=100)

    return X_train, X_test, y_train, y_test
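If one member type is much rarer than the other, a plain random split can leave the two sets with different class proportions. One possible variation, not used in this article's code, is to pass `stratify` to `train_test_split`. The sketch below uses made-up labels (80 `Member`, 20 `Casual`) to show the effect.

```python
from sklearn.model_selection import train_test_split

# Made-up, imbalanced data: 80 "Member" trips and 20 "Casual" trips.
X = [[duration] for duration in range(100)]
y = ["Member"] * 80 + ["Casual"] * 20

# stratify=y keeps the class proportions the same in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=100, stratify=y)

casual_share_train = list(y_train).count("Casual") / len(y_train)
casual_share_test = list(y_test).count("Casual") / len(y_test)
print("Casual share in train:", casual_share_train)
print("Casual share in test :", casual_share_test)
```

Both shares come out at 20 percent here, matching the full dataset.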

Function to visualize the developed tree model

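# Function to visualize tree
def visualize_tree(data, clf, clf_name):
    # All columns except the last one are features
    features = list(data.columns[:-1])
    # Use the classifier's own (sorted) class order so that the labels
    # in the diagram match the classes the tree actually predicts
    class_names = [str(c) for c in clf.classes_]
    dot_data = tree.export_graphviz(clf, out_file=None,
        feature_names=features, class_names=class_names,
        filled=True, rounded=True, special_characters=True)
    graph = graphviz.Source(dot_data)
    # Writes the DOT source and a rendered file (PDF by default), then opens it
    graph.render('dtree_render_' + clf_name, view=True)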
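When the Graphviz software is not available, scikit-learn's `export_text` offers a plain-text view of the same tree. This is an alternative to the function above, not part of the original code. The six training rows below are made up so that the example is self-contained.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up trips: [Duration, Start station number, End station number]
X_train = [
    [300, 31100, 31200], [450, 31101, 31230], [600, 31200, 31100],
    [3600, 31230, 31230], [4200, 31101, 31101], [5400, 31200, 31200],
]
y_train = ["Member", "Member", "Member", "Casual", "Casual", "Casual"]
feature_names = ["Duration", "Start station number", "End station number"]

clf = DecisionTreeClassifier(criterion="gini", random_state=100)
clf.fit(X_train, y_train)

# export_text returns the learned rules as an indented string.
rules = export_text(clf, feature_names=feature_names)
print(rules)
```

In this toy data only the duration separates the two classes, so the printed rules consist of a single test on `Duration`.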

Function to train the decision tree using Gini Index

# Function to perform training with the Gini index
def train_using_gini(X_train, X_test, y_train, data):
    # Creating the classifier object
    clf_gini = DecisionTreeClassifier(criterion="gini",
        random_state=100, max_depth=3, min_samples_leaf=5)

    # Performing training
    clf_gini.fit(X_train, y_train)
    visualize_tree(data, clf_gini, 'gini')
    print('\nFeature Importance : ', clf_gini.feature_importances_)
    return clf_gini
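The `"gini"` criterion scores a node by its Gini impurity, which is 1 minus the sum of the squared class proportions: 0 for a pure node, and larger as the classes become more mixed. The sketch below computes it by hand for small label lists; the helper name `gini_impurity` is made up, and scikit-learn performs this calculation internally.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 - sum(p_k ** 2) over the class proportions p_k."""
    total = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


print(gini_impurity(["Member", "Member", "Member", "Member"]))  # 0.0
print(gini_impurity(["Member", "Casual"]))                      # 0.5
print(gini_impurity(["Member", "Member", "Member", "Casual"]))  # 0.375
```

When building the tree, the split that gives the lowest weighted impurity in the child nodes is preferred.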

Function to train the decision tree using Entropy

# Function to perform training with entropy
def tarin_using_entropy(X_train, X_test, y_train, data):
    # Decision tree with entropy
    clf_entropy = DecisionTreeClassifier(
        criterion="entropy", random_state=100,
        max_depth=3, min_samples_leaf=5)

    # Performing training
    clf_entropy.fit(X_train, y_train)
    visualize_tree(data, clf_entropy, 'entropy')
    print('\nFeature Importance : ', clf_entropy.feature_importances_)
    return clf_entropy
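With the `"entropy"` criterion, a split is judged by how much it reduces entropy, a quantity usually called information gain. The sketch below works this out by hand for a made-up node of four trips; the helper names `entropy` and `information_gain` are illustrative and not scikit-learn functions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits: sum(p_k * log2(1 / p_k))."""
    total = len(labels)
    return sum((n / total) * math.log2(total / n)
               for n in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy of the parent minus the weighted entropy of the children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


parent = ["Member", "Member", "Casual", "Casual"]
print(entropy(parent))                                   # 1.0
# A split that separates the two classes perfectly:
print(information_gain(parent, ["Member", "Member"],
                       ["Casual", "Casual"]))            # 1.0
# A split that leaves both children as mixed as the parent:
print(information_gain(parent, ["Member", "Casual"],
                       ["Member", "Casual"]))            # 0.0
```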

Function to predict the class after training

# Function to make predictions
def prediction(X_test, clf_object):
    # Prediction on the test set with the given classifier
    y_pred = clf_object.predict(X_test)
    print("Predicted values:")
    print(y_pred)
    return y_pred
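The same `predict` call also works for a single new trip, as long as the input is two-dimensional (a list containing one row). The sketch below trains on six made-up trips and classifies one new trip; `predict_proba` additionally reports the class proportions in the leaf that the trip lands in.

```python
from sklearn.tree import DecisionTreeClassifier

# Made-up trips: [Duration, Start station number, End station number]
X_train = [
    [300, 31100, 31200], [450, 31101, 31230], [600, 31200, 31100],
    [3600, 31230, 31230], [4200, 31101, 31101], [5400, 31200, 31200],
]
y_train = ["Member", "Member", "Member", "Casual", "Casual", "Casual"]

clf = DecisionTreeClassifier(criterion="gini", random_state=100)
clf.fit(X_train, y_train)

# One new trip, wrapped in an outer list so the input has shape (1, 3).
new_trip = [[500, 31100, 31200]]
predicted_class = clf.predict(new_trip)[0]

# classes_ gives the column order of the predict_proba output.
class_probabilities = dict(zip(clf.classes_, clf.predict_proba(new_trip)[0]))
print("Predicted class:", predicted_class)
print("Class probabilities:", class_probabilities)
```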

Function to calculate accuracy

# Function to calculate accuracy
def cal_accuracy(y_test, y_pred):

    print("Confusion Matrix: ",
          confusion_matrix(y_test, y_pred))

    print("Accuracy : ",
          accuracy_score(y_test, y_pred) * 100)

    print("Report : ",
          classification_report(y_test, y_pred))
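To make the numbers printed by this function easier to read, the sketch below recomputes a confusion matrix, the accuracy, and the precision and recall for one class by hand. The labels are made up, and the helper name `confusion_counts` is hypothetical; scikit-learn's functions do this work in the code above.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))


y_true = ["Member", "Member", "Casual", "Casual", "Member"]
y_pred = ["Member", "Casual", "Casual", "Member", "Member"]
counts = confusion_counts(y_true, y_pred)

# Accuracy: share of trips where the prediction equals the true label.
correct = sum(n for (true, pred), n in counts.items() if true == pred)
accuracy = correct / len(y_true)

# Precision and recall for the "Member" class.
true_positive = counts[("Member", "Member")]
predicted_member = sum(n for (_, pred), n in counts.items() if pred == "Member")
actual_member = sum(n for (true, _), n in counts.items() if true == "Member")
precision = true_positive / predicted_member
recall = true_positive / actual_member

print("Accuracy :", accuracy)   # 0.6
print("Precision:", precision)
print("Recall   :", recall)
```

These are the same quantities that `accuracy_score` and `classification_report` summarize, with the report listing precision and recall for every class.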

Main Process

# Main process
def main():

    # Building Phase
    data = importdata()
    X_train, X_test, y_train, y_test = splitdataset(data)
    clf_gini = train_using_gini(X_train, X_test, y_train, data)
    clf_entropy = tarin_using_entropy(X_train, X_test, y_train, data)

    # Operational Phase
    print("Results Using Gini Index:")

    # Prediction using gini
    y_pred_gini = prediction(X_test, clf_gini)
    cal_accuracy(y_test, y_pred_gini)

    print("Results Using Entropy:")
    # Prediction using entropy
    y_pred_entropy = prediction(X_test, clf_entropy)
    cal_accuracy(y_test, y_pred_entropy)


# Calling main function
if __name__ == "__main__":
    main()