Simple Decision Tree Classifier using Python | Daily Python #23
This article is a tutorial on how to implement a decision tree classifier using Python.
This article is part of the Daily Python challenge that I have taken up for myself. I will be writing short Python articles daily.
- Python 3.0
Install the following packages:
- pandas: a BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools.
- sklearn (installed from PyPI as scikit-learn): provides dozens of built-in machine learning algorithms and models.
- graphviz: facilitates the creation and rendering of graph descriptions in the DOT language.
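Before going further, it can help to check that these packages are importable. The snippet below is a minimal sketch using only the standard library. The helper name `missing_packages` is my own, and the mapping reflects the fact that the `sklearn` module is distributed on PyPI under the name `scikit-learn`.

```python
import importlib.util

# Import name -> name to pass to pip (sklearn is published as scikit-learn).
REQUIRED = {"pandas": "pandas", "sklearn": "scikit-learn", "graphviz": "graphviz"}

def missing_packages(required=REQUIRED):
    """Return the pip names of required packages that cannot be imported."""
    return [pip_name for import_name, pip_name in required.items()
            if importlib.util.find_spec(import_name) is None]

if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Install with: pip install " + " ".join(missing))
    else:
        print("All required packages are available.")
```

Note that the graphviz Python package only wraps the Graphviz tools, so rendering an image also needs the Graphviz binaries installed on the system.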
What is a Classifier?
According to Wikipedia — An algorithm that implements classification, especially in a concrete implementation, is known as a classifier. The term “classifier” sometimes also refers to the mathematical function, implemented by a classification algorithm, that maps input data to a category.
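To make the "function that maps input data to a category" idea concrete, here is a tiny hand-written classifier. It is only an illustration: the function name `classify_trip` and the 1800-second cutoff are made up, and nothing is learned from data.

```python
def classify_trip(duration_seconds):
    """Map one input (a trip duration) to a category using a fixed threshold."""
    # The 1800-second cutoff is an arbitrary value chosen for illustration.
    return "long" if duration_seconds >= 1800 else "short"

labels = [classify_trip(d) for d in (300, 1800, 4000)]
print(labels)  # ['short', 'long', 'long']
```

A learned classifier does the same kind of mapping, except that the rule is derived from training data instead of being written by hand.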
Then, what is a Decision Tree?
A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains conditional control statements.
A decision tree consists of three types of nodes:
- Decision nodes — typically represented by squares
- Chance nodes — typically represented by circles
- End nodes — typically represented by triangles
A Decision Tree Classifier assigns given data to one of several classes by following the tree that was built from the training data.
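As a rough sketch of how such a tree classifies a sample, the code below stores a small tree as nested dictionaries and walks it from the root to a leaf. The features, thresholds and labels are invented for illustration and are not taken from a trained model.

```python
# Internal nodes test one feature against a threshold; leaves hold a class label.
# The features, thresholds and labels here are invented for illustration.
TREE = {
    "feature": "duration", "threshold": 1200,
    "left": {"label": "Member"},             # taken when duration <= 1200
    "right": {                               # taken when duration > 1200
        "feature": "start_station", "threshold": 31200,
        "left": {"label": "Member"},
        "right": {"label": "Casual"},
    },
}

def classify(node, sample):
    """Walk from the root to a leaf, going left when value <= threshold."""
    while "label" not in node:
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]

print(classify(TREE, {"duration": 600, "start_station": 31000}))   # Member
print(classify(TREE, {"duration": 3000, "start_station": 31500}))  # Casual
```

Each prediction is just a chain of conditional checks, which is why a trained tree is easy to read.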
Advantages of decision trees
Among decision support tools, decision trees (and influence diagrams) have several advantages. Decision trees:
- Are simple to understand and interpret. People are able to understand decision tree models after a brief explanation.
- Have value even with little hard data. Important insights can be generated based on experts describing a situation (its alternatives, probabilities, and costs) and their preferences for outcomes.
- Help determine worst, best and expected values for different scenarios.
- Use a white-box model. The reasoning behind a given result can be read directly from the tree.
- Can be combined with other decision techniques.
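The "worst, best and expected values" point can be sketched in a few lines. Here each alternative leads to a chance node, written as a list of (probability, payoff) pairs. The alternatives, probabilities and payoffs are made-up numbers used only to show the arithmetic.

```python
# Each alternative leads to a chance node: a list of (probability, payoff) outcomes.
# All names, probabilities and payoffs below are made-up numbers.
alternatives = {
    "expand": [(0.6, 100), (0.4, -40)],
    "hold": [(1.0, 20)],
}

def summarize(outcomes):
    """Return (worst, best, expected) payoff for one chance node."""
    payoffs = [payoff for _, payoff in outcomes]
    expected = sum(p * payoff for p, payoff in outcomes)
    return min(payoffs), max(payoffs), expected

summary = {name: summarize(outcomes) for name, outcomes in alternatives.items()}
best_choice = max(summary, key=lambda name: summary[name][2])

for name, (worst, best, expected) in summary.items():
    print(name, "worst:", worst, "best:", best, "expected:", round(expected, 2))
print("Highest expected value:", best_choice)
```

With these numbers, "expand" has the higher expected value (about 44 against 20) but also the worse worst case, which is exactly the trade-off a decision tree makes visible.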
Disadvantages of decision trees:
- They are unstable, meaning that a small change in the data can lead to a large change in the structure of the optimal decision tree.
- They are often relatively inaccurate. Many other predictors perform better with similar data. This can be remedied by replacing a single decision tree with a random forest of decision trees, but a random forest is not as easy to interpret as a single decision tree.
- For data including categorical variables with a different number of levels, information gain in decision trees is biased in favor of those attributes with more levels.
- Calculations can get very complex, particularly if many values are uncertain and/or if many outcomes are linked.
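The bias of information gain toward attributes with many levels is easy to see on a toy example. In this sketch, an attribute with one distinct value per row (like an ID) gets the maximum possible gain even though it would be useless on new data. The rows and the attribute names `weekday` and `bike_id` are invented for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(values, labels):
    """Entropy of labels minus the weighted entropy after splitting on values."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Invented rows: 4 "Member" and 4 "Casual" trips.
labels = ["Member", "Casual", "Member", "Casual", "Member", "Casual", "Member", "Casual"]
weekday = ["yes", "yes", "yes", "no", "no", "no", "yes", "no"]  # 2 levels
bike_id = ["b1", "b2", "b3", "b4", "b5", "b6", "b7", "b8"]      # 8 levels, one per row

gain_weekday = information_gain(weekday, labels)
gain_bike_id = information_gain(bike_id, labels)
print("weekday gain:", round(gain_weekday, 3))
print("bike_id gain:", round(gain_bike_id, 3))
```

The ID-like attribute wins simply because every group it creates contains a single row, so each group looks perfectly pure.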
Chapter 3 : Decision Tree Classifier — Theory
Welcome to third basic classification algorithm of supervised learning. Decision Trees. Like previous chapters (Chapter…
Go through the above article for a detailed explanation of the Decision Tree Classifier and the various methods which can be used to build a decision tree.
What is Scikit-learn?
Scikit-learn (formerly scikits.learn and also known as sklearn) is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.
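Scikit-learn estimators share a common pattern: create the object, call `fit` on training data, then call `predict` on new data. Here is a minimal sketch of that pattern with a decision tree. The toy data is invented for illustration and has nothing to do with the bike-share dataset used later.

```python
from sklearn.tree import DecisionTreeClassifier

# Toy data invented for illustration: one feature, two classes.
X = [[1], [2], [3], [10], [11], [12]]
y = ["short", "short", "short", "long", "long", "long"]

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)                                  # learn the tree from the toy data
predictions = list(clf.predict([[2], [11]]))   # classify two new samples
print(predictions)
```

The two classes are cleanly separated here, so the tree needs only a single split to classify both new samples correctly.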
Data that we will be using for classification
Capital Bikeshare is metro DC’s bike-share service, with 4,300 bikes and 500+ stations across 7 jurisdictions: Washington, DC; Arlington, VA; Alexandria, VA; Montgomery, MD; Prince George’s County, MD; Fairfax County, VA; and the City of Falls Church, VA. Designed for quick trips with convenience in mind, it’s a fun and affordable way to get around.
Trip History Data
Each quarter, we publish downloadable files of Capital Bikeshare trip data. The data includes:
- Duration — Duration of trip
- Start Date — Includes start date and time
- End Date — Includes end date and time
- Start Station — Includes starting station name and number
- End Station — Includes ending station name and number
- Bike Number — Includes ID number of bike used for the trip
- Member Type — Indicates whether user was a “registered” member (Annual Member, 30-Day Member or Day Key Member) or a “casual” rider (Single Trip, 24-Hour Pass, 3-Day Pass or 5-Day Pass)
This data has been processed to remove trips that are taken by staff as they service and inspect the system, trips that are taken to/from any of our “test” stations at our warehouses and any trips lasting less than 60 seconds (potentially false starts or users trying to re-dock a bike to ensure it’s secure).
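To give a feel for the shape of this data, here is a small sketch that parses a few rows and applies the same kind of "shorter than 60 seconds" filter described above. The rows are synthetic values I made up, only the four columns used later in this article are included, and the helper name `load_trips` is hypothetical.

```python
import csv
import io

# Synthetic rows in the column layout used later in this article; the values are invented.
RAW = """Duration,Start station number,End station number,Member type
45,31000,31000,Member
620,31000,31002,Member
1900,31005,31010,Casual
"""

def load_trips(text, min_seconds=60):
    """Parse CSV text and drop trips shorter than min_seconds."""
    rows = csv.DictReader(io.StringIO(text))
    return [row for row in rows if int(row["Duration"]) >= min_seconds]

trips = load_trips(RAW)
print(len(trips), [t["Member type"] for t in trips])  # 2 ['Member', 'Casual']
```

The published files already have such trips removed, so this filter is only there to show what that cleaning step looks like.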
Import the required libraries
# Importing the required packages
import graphviz
import numpy as np
import pandas as pd
from sklearn import tree
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
Function to import the data from the CSV file
# Function importing the dataset
def importdata():
    balance_data = pd.read_csv('capitalshare.csv')
    # Keep the three feature columns and the target column (last)
    balance_data = balance_data[['Duration', 'Start station number', 'End station number', 'Member type']]

    # Printing the dataset shape
    print("Dataset Length: ", len(balance_data))
    print("Dataset Shape: ", balance_data.shape)

    # Printing the first dataset observations
    print("Dataset:\n", balance_data.head())
    return balance_data
Function to split the data into training and test data
# Function to split the dataset
def splitdataset(balance_data):
    # Separating the target variable (last column) from the features
    X = balance_data.values[:, :-1]
    Y = balance_data.values[:, -1]

    # Splitting the dataset into train (70%) and test (30%)
    X_train, X_test, y_train, y_test = train_test_split(
        X, Y, test_size=0.3, random_state=100)
    return X_train, X_test, y_train, y_test
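If you are curious what a train/test split does conceptually, the idea is just "shuffle, then cut". The sketch below is my own simplified version with a hypothetical name `simple_split`. It is not how scikit-learn implements `train_test_split`, and it will not reproduce the same rows for the same seed.

```python
import random

def simple_split(rows, test_size=0.3, seed=100):
    """Shuffle a copy of rows and cut it into (train, test) parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # a fixed seed makes the split repeatable
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

train, test = simple_split(range(10))
print(len(train), len(test))  # 7 3
```

Fixing the seed plays the same role as `random_state = 100` above: running the script twice gives the same split, so results are comparable between runs.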
Function to visualize the developed tree model
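# Function to visualize the tree
# (the function name and signature are assumed; they are not shown in this listing)
def visualize_tree(data, clf):
    # Feature names are all columns except the last (target) column
    features = list(data.columns[:-1])
    # Class names must be in ascending order to match clf.classes_
    class_names = sorted(set(data.iloc[:, -1]))
    dot_data = tree.export_graphviz(clf, out_file=None,
        feature_names=features, class_names=class_names,
        filled=True, rounded=True, special_characters=True)
    graph = graphviz.Source(dot_data)
    return graph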
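With `out_file=None`, `export_graphviz` returns the tree as a string in the DOT language, and `graphviz.Source` wraps that string so it can be rendered. To show roughly what DOT text looks like, here is a hand-built sketch for a tiny made-up tree. The helper `to_dot` is hypothetical and its output is much simpler than what scikit-learn generates.

```python
# Hypothetical tiny tree: (node_id, label) pairs and (parent, child) edges.
nodes = [(0, "Duration <= 1200"), (1, "Member"), (2, "Casual")]
edges = [(0, 1), (0, 2)]

def to_dot(nodes, edges):
    """Build a DOT digraph description as plain text."""
    lines = ["digraph Tree {", "    node [shape=box];"]
    lines += [f'    {node_id} [label="{label}"];' for node_id, label in nodes]
    lines += [f"    {parent} -> {child};" for parent, child in edges]
    lines.append("}")
    return "\n".join(lines)

dot_text = to_dot(nodes, edges)
print(dot_text)
```

Text like this can be saved to a `.dot` file and rendered with the Graphviz tools.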
Function to train the decision tree using Gini Index
# Function to perform training with the Gini index
def train_using_gini(X_train, X_test, y_train, data):
    # Creating the classifier object
    clf_gini = DecisionTreeClassifier(criterion="gini",
        random_state=100, max_depth=3, min_samples_leaf=5)

    # Performing training
    clf_gini.fit(X_train, y_train)
    print('\nFeature Importance : ', clf_gini.feature_importances_)
    return clf_gini
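For intuition about what the "gini" criterion measures, Gini impurity is 1 minus the sum of the squared class proportions in a node. Here is a minimal sketch of that formula. It is for illustration only and is not scikit-learn's internal code.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

print(gini_impurity(["Member"] * 4))            # 0.0 (pure node)
print(gini_impurity(["Member", "Casual"] * 2))  # 0.5 (evenly mixed, two classes)
```

A split is considered good when the child nodes it creates have lower impurity than the parent.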
Function to train the decision tree using Entropy
# Function to perform training with entropy
def tarin_using_entropy(X_train, X_test, y_train, data):
    # Decision tree with entropy
    clf_entropy = DecisionTreeClassifier(
        criterion="entropy", random_state=100,
        max_depth=3, min_samples_leaf=5)

    # Performing training
    clf_entropy.fit(X_train, y_train)
    print('\nFeature Importance : ', clf_entropy.feature_importances_)
    return clf_entropy
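The "entropy" criterion scores a node by the Shannon entropy of its class labels: the sum over classes of -p * log2(p), where p is the proportion of that class. A small sketch of the calculation, again for illustration only:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())

print(entropy(["Member"] * 4))            # 0.0 (pure node)
print(entropy(["Member", "Casual"] * 2))  # 1.0 (evenly mixed, two classes)
```

Like Gini impurity, entropy is zero for a pure node and largest when the classes are evenly mixed, so the two criteria often lead to similar trees.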
Function to predict the class after training
# Function to make predictions
def prediction(X_test, clf_object):
    # Prediction on the test set with the given trained classifier
    y_pred = clf_object.predict(X_test)
    return y_pred
Function to calculate accuracy
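# Function to calculate accuracy
def cal_accuracy(y_test, y_pred):
    print("Confusion Matrix: ",
        confusion_matrix(y_test, y_pred))
    print("Accuracy : ",
        accuracy_score(y_test, y_pred) * 100)
    print("Report : ",
        classification_report(y_test, y_pred))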
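To see what these metrics are counting, here is a plain-Python sketch of a confusion count and accuracy on a handful of invented labels. The helper names are my own; in the article itself scikit-learn's `confusion_matrix` and `accuracy_score` do this work.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(y_true, y_pred))

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the actual labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Invented labels for illustration.
y_true = ["Member", "Member", "Casual", "Casual", "Member"]
y_pred = ["Member", "Casual", "Casual", "Member", "Member"]

counts = confusion_counts(y_true, y_pred)
print(counts[("Member", "Member")], counts[("Member", "Casual")])  # 2 1
print(accuracy(y_true, y_pred))  # 0.6
```

The confusion counts show which classes get mixed up with each other, which a single accuracy number cannot tell you.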
# Main process
def main():
    # Building Phase
    data = importdata()
    X_train, X_test, y_train, y_test = splitdataset(data)
    clf_gini = train_using_gini(X_train, X_test, y_train, data)
    clf_entropy = tarin_using_entropy(X_train, X_test, y_train, data)

    # Operational Phase
    print("Results Using Gini Index:")
    # Prediction using gini
    y_pred_gini = prediction(X_test, clf_gini)
    cal_accuracy(y_test, y_pred_gini)

    print("Results Using Entropy:")
    # Prediction using entropy
    y_pred_entropy = prediction(X_test, clf_entropy)
    cal_accuracy(y_test, y_pred_entropy)

# Calling main function
if __name__ == "__main__":
    main()
I hope this article was helpful. Do leave some claps if you liked it.
Follow the Daily Python Challenge here: