Decision Trees 2/2
Part 2/2 covers the code for the Decision Tree Classifier algorithm.
Here is a quick recap of what we learned in Decision Trees 1/2.
A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, along with probability estimates and costs, to predict the outcome of a decision. Decision tree classifiers are widely used in many areas, such as credit risk assessment, medical diagnosis, and image classification.
The decision tree classifier works by splitting the training data into smaller and smaller subsets based on the values of the features. The tree is built by making decisions based on the features of the data, and each internal node in the tree corresponds to a decision about the value of a particular feature. The leaves of the tree represent the final classification of the data.
To classify a new sample, the decision tree classifier starts at the root node and follows the path down the tree based on the values of the features in the sample. When it reaches a leaf node, it returns the class label associated with that leaf.
One advantage of decision tree classifiers is that they are easy to interpret and explain, as the decision-making process is explicitly represented in the tree structure. However, they can also be prone to overfitting, especially if the tree is allowed to grow too deep.
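To make the traversal idea concrete, here is a toy, hand-written sketch of what a fitted iris tree effectively does: each if/else plays the role of an internal node and each return is a leaf. The thresholds and feature choices below are illustrative assumptions, not the model we train later in this article.
# Hypothetical hand-written "tree": each condition is an internal node,
# each return statement is a leaf. Thresholds are illustrative only.
def classify_iris(petal_length_cm, petal_width_cm):
    if petal_length_cm <= 2.45:       # root node decision
        return "setosa"               # leaf
    elif petal_width_cm <= 1.75:      # second-level decision
        return "versicolor"           # leaf
    else:
        return "virginica"            # leaf

print(classify_iris(1.4, 0.2))  # -> setosa
print(classify_iris(5.1, 2.0))  # -> virginica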
About Dataset
The Iris dataset was used in R.A. Fisher’s classic 1936 paper, The Use of Multiple Measurements in Taxonomic Problems, and can also be found in the UCI Machine Learning Repository.
It includes three iris species with 50 samples each as well as some properties of each flower. One flower species is linearly separable from the other two, but the other two are not linearly separable from each other.
The columns in this dataset are:
- SepalLengthCm
- SepalWidthCm
- PetalLengthCm
- PetalWidthCm
- Target
What’s the point if we don’t show the proof? Enough said.
Let’s Rock and Roll 👇🏻
# Importing all the necessary libraries
import pandas as pd # to analyze data
import numpy as np # to perform a wide variety of mathematical operations on arrays
import seaborn as sns # for statistical data visualization
import matplotlib.pyplot as plt # to perform data visualization and graphical plotting
%matplotlib inline
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree
Load and Read Dataset
# Load and read the built-in 'iris' dataset
iris = load_iris()
data_f = pd.DataFrame(data=np.c_[iris["data"], iris["target"]], columns=iris.feature_names + ["target"])
# print first 5 records
data_f.head()
# print last 5 records
data_f.tail()
# the target values map to species: 0 = setosa, 1 = versicolor, 2 = virginica
print(np.unique(iris.target))
iris.target_names
# define a helper that maps each numeric target to its species name
def names(a):
    if(a==0): return "Iris-setosa"
    elif(a==1): return "Iris-versicolor"
    else: return "Iris-virginica"
data_f["target_name"] = data_f["target"].apply(names)
data_f.head()
# let's check for null values
data_f.isnull().sum()
# print the shape of the dataset
data_f.shape # 150 rows and 6 columns
(150, 6)
# let's separate the independent variables and the dependent variable
data = data_f.values
x = data[:,0:4]
y = data[:,4]
y_data = np.array([np.average(x[:,i][y==j]) for i in range(x.shape[1]) for j in (np.unique(y))])
y_data = y_data.reshape(4,3)
y_data = np.swapaxes(y_data, 0,1)
x_axis = np.arange(len(data_f.columns)-2)
width = 0.2
Visual Representation
# let's visualize the length and width features
plt.figure(figsize =(12,8))
plt.bar(x_axis, y_data[0], width, label = "Setosa")
plt.bar(x_axis+width, y_data[1], width, label = "Versicolour")
plt.bar(x_axis+width*2, y_data[2], width, label = "Virginica")
plt.xticks(x_axis, ["Sepal length", "Sepal width", "Petal length", "Petal width"])
plt.xlabel("Features")
plt.ylabel("Value(cm).")
plt.legend(bbox_to_anchor=(1,1))
plt.show()
Train Test Split
# let's split the data into train and test
x = data_f.drop(["target", "target_name"], axis = "columns")
y = data_f["target"]
from sklearn.model_selection import train_test_split
x_train, x_test, y_train , y_test = train_test_split(x, y, test_size = .2, random_state = 420)
Decision Tree Classifier Model
model_dt = DecisionTreeClassifier()
# let's fit the model
model_dt.fit(x_train, y_train)
DecisionTreeClassifier()
# let's generate predictions on the test set
model_dt.predict(x_test)
# let's check the model score
model_dt.score(x_test, y_test)
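Beyond the single accuracy number, a per-class breakdown can be useful. The following is a minimal sketch using scikit-learn's metrics module (this step is an addition, not part of the original notebook); it reuses model_dt, x_test, y_test and iris from above.
# Optional: per-class view of the test-set predictions
from sklearn.metrics import confusion_matrix, classification_report
y_pred = model_dt.predict(x_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=iris.target_names))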
Cross Validation Score Function
from sklearn.model_selection import cross_val_score, ShuffleSplit
cv = ShuffleSplit(n_splits=6, test_size = .2)
arr = cross_val_score(DecisionTreeClassifier(), x, y, cv= cv)
print(list(arr))
print("Average Score:", np.mean(arr))
Decision Tree Visualization
fig = plt.figure(figsize=(25,20))
plot_tree(model_dt, feature_names = iris.feature_names, class_names = iris.target_names,
          rounded = True,
          filled = True)
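If the plotted tree is hard to read, scikit-learn can also print the same fitted rules as plain text. A short sketch (an optional extra, not in the original walkthrough):
# Optional: print the fitted tree's rules as text
from sklearn.tree import export_text
print(export_text(model_dt, feature_names=iris.feature_names))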
Interpretation
The lower the Gini score at a node, the purer the split. From the decision tree visualization we can see that petal width separates the classes more cleanly than the other variables, which is why the tree splits on petal width near the top.
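This reading can be checked numerically: the fitted model exposes impurity-based feature importances, which for iris typically rank the petal measurements highest. A quick sketch (an added check, reusing model_dt and iris from above):
# Sanity check of the interpretation: impurity-based feature importances
for name, importance in zip(iris.feature_names, model_dt.feature_importances_):
    print(f"{name}: {importance:.3f}")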
GitHub Repository: https://github.com/KVishwas98/TSF-Task6-Prediciton-using-Decision-Tree-Algorithm
Conclusion
If you have any difficulty following the code, mention it in the comments section.
Thank you for reading! Let me know in a comment or on LinkedIn if you felt like this did or didn’t help. I’ve got a few more articles that I’m writing and will be posting them every couple of weeks. They are mostly accounts from my project experience. If there are any other questions or anything else you’d like to hear about, please don’t hesitate to put in a request.