Decision Trees 2/2
Part 2/2 covers the code for the Decision Tree Classifier algorithm.
Here is a quick recap of what we learned in Decision Trees 1/2.
A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, along with probability estimates and costs, to predict the outcome of a decision. Decision tree classifiers are widely used in many areas, such as credit risk assessment, medical diagnosis, and image classification.
The decision tree classifier works by splitting the training data into smaller and smaller subsets based on the values of the features. The tree is built by making decisions based on the features of the data, and each internal node in the tree corresponds to a decision about the value of a particular feature. The leaves of the tree represent the final classification of the data.
To classify a new sample, the decision tree classifier starts at the root node and follows the path down the tree based on the values of the features in the sample. When it reaches a leaf node, it returns the class label associated with that leaf.
One advantage of decision tree classifiers is that they are easy to interpret and explain, as the decision-making process is explicitly represented in the tree structure. However, they can also be prone to overfitting, especially if the tree is allowed to grow too deep.
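To make the traversal idea concrete, here is a toy, hand-written sketch of what a fitted iris tree effectively does: each if/else plays the role of an internal node and each return is a leaf. The thresholds and feature choices below are illustrative assumptions, not the model we train later in this article.
# Hypothetical hand-written "tree": each condition is an internal node,
# each return statement is a leaf. Thresholds are illustrative only.
def classify_iris(petal_length_cm, petal_width_cm):
    if petal_length_cm <= 2.45:       # root node decision
        return "setosa"               # leaf
    elif petal_width_cm <= 1.75:      # second-level decision
        return "versicolor"           # leaf
    else:
        return "virginica"            # leaf

print(classify_iris(1.4, 0.2))  # -> setosa
print(classify_iris(5.1, 2.0))  # -> virginica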
About Dataset
The Iris dataset was used in R.A. Fisher’s classic 1936 paper, The Use of Multiple Measurements in Taxonomic Problems, and can also be found in the UCI Machine Learning Repository.
It includes three iris species with 50 samples each as well as some properties of each flower. One flower species is linearly separable from the other two, but the other two are not linearly separable from each other.
The columns in this dataset are:
- SepalLengthCm
- SepalWidthCm
- PetalLengthCm
- PetalWidthCm
- Target
What’s the point if we don’t show the proof? Enough said.
Let’s Rock and Roll 👇🏻
# Importing all the necessary libraries
import pandas as pd # to analyze data
import numpy as np # to perform a wide variety of mathematical operations on arrays
import seaborn as sns # for statistical data visualization
import matplotlib.pyplot as plt # to perform data visualization and graphical plotting
%matplotlib inline
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree
Load and Read Dataset
# Load and read the built-in 'iris' dataset
iris = load_iris()
data_f = pd.DataFrame(data=np.c_[iris["data"], iris["target"]], columns=iris.feature_names + ["target"])
# print first 5 records
data_f.head()
# print last 5 records
data_f.tail()
# the target values map to species: 0 = setosa, 1 = versicolor, 2 = virginica
print(np.unique(iris.target))
iris.target_names
# define a helper that maps each numeric target to its species name
def names(a):
    if(a==0): return "Iris-setosa"
    elif(a==1): return "Iris-versicolor"
    else: return "Iris-virginica"
data_f["target_name"] = data_f["target"].apply(names)
data_f.head()
# let's check for null values
data_f.isnull().sum()
# print the shape of the dataset
data_f.shape # 150 rows and 6 columns
(150, 6)
# let's separate the independent variables and the dependent variable
data = data_f.values
x = data[:,0:4]
y = data[:,4]
y_data = np.array([np.average(x[:,i][y==j]) for i in range(x.shape[1]) for j in (np.unique(y))])
y_data = y_data.reshape(4,3)
y_data = np.swapaxes(y_data, 0,1)
x_axis = np.arange(len(data_f.columns)-2)
width = 0.2
Visual Representation
# let's visualize the length and width features
plt.figure(figsize =(12,8))
plt.bar(x_axis, y_data[0], width, label = "Setosa")
plt.bar(x_axis+width, y_data[1], width, label = "Versicolour")
plt.bar(x_axis+width*2, y_data[2], width, label = "Virginica")
plt.xticks(x_axis, ["Sepal length", "Sepal width", "Petal length", "Petal width"])
plt.xlabel("Features")
plt.ylabel("Value(cm).")
plt.legend(bbox_to_anchor=(1,1))
plt.show()
Train Test Split
# let's split the data into train and test
x = data_f.drop(["target", "target_name"], axis = "columns")
y = data_f["target"]
from sklearn.model_selection import train_test_split
x_train, x_test, y_train , y_test = train_test_split(x, y, test_size = .2, random_state = 420)
Decision Tree Classifier Model
model_dt = DecisionTreeClassifier()
# let's fit the model
model_dt.fit(x_train, y_train)
DecisionTreeClassifier()
# let's generate predictions on the test set
model_dt.predict(x_test)
# let's check the model score
model_dt.score(x_test, y_test)
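Beyond the single accuracy number, a per-class breakdown can be useful. The following is a minimal sketch using scikit-learn's metrics module (this step is an addition, not part of the original notebook); it reuses model_dt, x_test, y_test and iris from above.
# Optional: per-class view of the test-set predictions
from sklearn.metrics import confusion_matrix, classification_report
y_pred = model_dt.predict(x_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=iris.target_names))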
Cross Validation Score Function
from sklearn.model_selection import cross_val_score, ShuffleSplit
cv = ShuffleSplit(n_splits=6, test_size = .2)
arr = cross_val_score(DecisionTreeClassifier(), x, y, cv= cv)
print(list(arr))
print("Average Score:", np.mean(arr))
Decision Tree Visualization
fig = plt.figure(figsize=(25,20))
plot_tree(model_dt, feature_names = iris.feature_names, class_names = iris.target_names,
          rounded = True,
          filled = True)
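If the plotted tree is hard to read, scikit-learn can also print the same fitted rules as plain text. A short sketch (an optional extra, not in the original walkthrough):
# Optional: print the fitted tree's rules as text
from sklearn.tree import export_text
print(export_text(model_dt, feature_names=iris.feature_names))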
Interpretation
The lower the Gini score at a node, the purer the split. From the decision tree visualization we can see that petal width separates the classes more cleanly than the other variables, which is why the tree splits on petal width near the top.
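This reading can be checked numerically: the fitted model exposes impurity-based feature importances, which for iris typically rank the petal measurements highest. A quick sketch (an added check, reusing model_dt and iris from above):
# Sanity check of the interpretation: impurity-based feature importances
for name, importance in zip(iris.feature_names, model_dt.feature_importances_):
    print(f"{name}: {importance:.3f}")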
GitHub Repository: https://github.com/KVishwas98/TSF-Task6-Prediciton-using-Decision-Tree-Algorithm
Conclusion
If you have any difficulty following the code, mention it in the comments section.
Thank you for reading! Let me know in a comment or on LinkedIn if you felt like this did or didn’t help. I’ve got a few more articles that I’m writing and will be posting them every couple of weeks. They are mostly accounts from my project experience. If there are any other questions or anything else you’d like to hear about, please don’t hesitate to put in a request.