Iris Flower Classification and Decision Boundary Plotting using Logistic Regression
In this new article, I am going to explain how to solve a classification problem using the scikit-learn library and the logistic regression algorithm.
Without further ado, let’s dive in!
Problem:
Build a machine learning model to distinguish between the three categories (or classes) of the Iris flower.
Context:
The Iris flower data set or Fisher’s Iris data set is a famous dataset in machine learning.
The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.
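The numbers quoted above can be checked directly against the copy of the dataset that ships with scikit-learn. This is only a quick sketch; the variable name class_counts is mine and is not used elsewhere in the article.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# 150 samples in total, each described by four measurements
print(iris["data"].shape)
print(iris["feature_names"])

# Count how many samples belong to each species
labels, counts = np.unique(iris["target"], return_counts=True)
class_counts = {str(name): int(n) for name, n in zip(iris["target_names"], counts)}
print(class_counts)
```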
Importing Libraries and Dataset:
Libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import seaborn as sn
from sklearn import metrics
Dataset
iris = datasets.load_iris()
df_iris = pd.DataFrame(iris['data'], columns=iris['feature_names']).assign(target=iris['target'])
Building Logistic Regression Model
iris_model = LogisticRegression(max_iter=1000)
Selecting the Features X and the Target Y
X = df_iris[iris['feature_names']]
y = df_iris['target']
Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
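The split above is random, so the reported accuracy changes from run to run. If you want reproducible numbers, one option (my suggestion, not part of the original code) is to fix random_state and to pass stratify so that each species keeps roughly the same share in both sets. The seed value 42 is arbitrary.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)

# random_state fixes the shuffle; stratify=y keeps the class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

print(X_train.shape, X_test.shape)
print(np.bincount(y_test))  # number of test samples per class
```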
Fitting the model
iris_model.fit(X_train, y_train)
Note that you can get the intercept values and the weights through these two attributes:
print(iris_model.intercept_)
print(iris_model.coef_)
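To see what these two attributes hold, the sketch below fits the same kind of model on the full dataset (to keep it short) and recomputes the linear score z by hand. The names model and z are my own.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

X, y = datasets.load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

print(model.intercept_.shape)  # one intercept per class
print(model.coef_.shape)       # one row per class, one column per feature

# Linear score for every sample and every class, computed by hand
z = X @ model.coef_.T + model.intercept_
same = np.allclose(z, model.decision_function(X))
print(same)
```

The hand-computed scores match what decision_function returns, which is the quantity the decision boundaries later in the article are built from.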
Making Predictions
y_pred = iris_model.predict(X_test)
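Behind predict, the model computes one probability per class and returns the class with the highest one. A minimal sketch of that relationship, with a fixed seed that I chose only for reproducibility:

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# One row per sample, one column per class; each row sums to 1
proba = model.predict_proba(X_test)

# Picking the most probable class by hand gives the same labels as predict
by_hand = model.classes_[proba.argmax(axis=1)]
print(np.array_equal(by_hand, model.predict(X_test)))
```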
Compute accuracy
iris_model.score(X_test, y_test)
Confusion Matrix
#Confusion Matrix
confusion_matrix = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'])
sn.heatmap(confusion_matrix, annot=True)
print('Accuracy: ',metrics.accuracy_score(y_test, y_pred))
plt.show()
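If you prefer to stay within scikit-learn, the same table can be built with metrics.confusion_matrix, and the accuracy can be read off its diagonal. This is an alternative sketch, not the code used above; the seed and the stratified split are my assumptions.

```python
import numpy as np
from sklearn import datasets, metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris["data"], iris["target"], test_size=0.25, random_state=0,
    stratify=iris["target"])
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are actual classes, columns are predicted classes
cm = metrics.confusion_matrix(y_test, y_pred)
print(cm)

# Correct predictions sit on the diagonal
accuracy = np.trace(cm) / cm.sum()
print(accuracy)

# Per-class precision, recall and F1
print(metrics.classification_report(
    y_test, y_pred, target_names=list(iris["target_names"])))
```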
Plotting the Decision Boundary
The decision boundary is the line that separates the regions where the model predicts y = 0, y = 1, and y = 2. It is determined by our hypothesis function.
Decision Boundary for Binary Classification
In this example, I am going to consider only two classes, 0 and 1 (setosa and versicolor), and only the first two features, sepal length (cm) and sepal width (cm).
We follow the same process as above to compute the intercept and the weights. Once we have them, we set our hypothesis function:
h(x) = g(z) = 1 / (1 + e^(-z))
To get our discrete 0 or 1 classification, we can translate the output of the hypothesis function as follows:
h(x) >= 0.5 -> y = 1
h(x) < 0.5 -> y = 0
The way our logistic function g behaves is that when its input is greater than or equal to zero, its output is greater than or equal to 0.5:
g(z) >= 0.5
where z >= 0
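This behaviour of g is easy to check numerically. The sketch below defines the logistic function with NumPy and evaluates it on three sample inputs that I picked for illustration.

```python
import numpy as np

def g(z):
    """Logistic (sigmoid) function."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 2.0])
print(g(z))         # values below, at, and above 0.5
print(g(z) >= 0.5)  # True exactly where z >= 0
```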
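In our case, z is defined as follows:
z = w0 + w1 * sepal_L + w2 * sepal_W
where w0 is the intercept and w1 and w2 are the weights of sepal length and sepal width.
To draw the decision boundary as a straight line, we set z equal to zero and solve for sepal_L:
sepal_L = (-w0 - w2 * sepal_W) / w1
You can then select two different sepal_W values, compute the matching sepal_L values, and draw a straight line through the two points: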
g = sn.scatterplot(x="sepal length (cm)", y="sepal width (cm)",
                   hue="target",
                   data=df_iris, palette=['green', 'red'])
# weights = [intercept, weight of sepal length, weight of sepal width]
x1 = (-weights[0] - weights[2]*df_iris.iloc[0,1])/weights[1]
x2 = (-weights[0] - weights[2]*df_iris.iloc[4,1])/weights[1]
plt.axline((x1, df_iris.iloc[0,1]), (x2, df_iris.iloc[4,1]), color="green")
where df_iris.iloc[0,1] and df_iris.iloc[4,1] are two sepal width values from the dataset.
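For completeness, here is one way the weights list used above could be built, followed by a check that the two computed points really lie on the boundary. This is a sketch under my own assumptions: the names df_binary, binary_model, and points are hypothetical, and the model is fitted on all samples of classes 0 and 1 rather than on a training split.

```python
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()
df = pd.DataFrame(iris["data"], columns=iris["feature_names"])
df["target"] = iris["target"]

# Keep classes 0 and 1 and the two sepal features only
features = ["sepal length (cm)", "sepal width (cm)"]
df_binary = df[df["target"] < 2]
binary_model = LogisticRegression(max_iter=1000)
binary_model.fit(df_binary[features], df_binary["target"])

# weights = [intercept, weight of sepal length, weight of sepal width]
weights = np.append(binary_model.intercept_, binary_model.coef_[0])

# Two sepal-width values and the sepal lengths that put them on the boundary
w_a, w_b = df_binary.iloc[0, 1], df_binary.iloc[4, 1]
x1 = (-weights[0] - weights[2] * w_a) / weights[1]
x2 = (-weights[0] - weights[2] * w_b) / weights[1]

# On the boundary z = 0, so both classes get a probability of 0.5
points = pd.DataFrame([[x1, w_a], [x2, w_b]], columns=features)
print(binary_model.predict_proba(points))
```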
Decision Boundary for Multiclass Classification
Now, let’s consider all the classes (0, 1, and 2), and let’s draw the decision boundaries for these three classes.
To visualize the decision boundary for these three classes, I am going to consider only two features (sepal length and sepal width).
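After you run the logistic regression as we did in the first section, you can print the intercepts and the coefficients. The first array (intercept_) contains one intercept for each of the three classes (0, 1, and 2), and the second (coef_) contains three arrays, one per class, each holding the coefficients of our two selected features. We can then draw one line per class as we did for the binary classification boundary, with weights0, weights1, and weights2 each holding the intercept followed by the two coefficients of class 0, 1, and 2. Each line marks where that class's linear score z is zero. Note that with the multinomial (softmax) formulation that recent scikit-learn versions use for three or more classes, the predicted class changes where two class scores are equal, so these lines need not coincide with the borders of the predicted regions.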
g = sn.scatterplot(x="sepal length (cm)", y="sepal width (cm)",
                   hue="target",
                   data=df_iris, palette=['green', 'red', 'blue'])
x1 = (-weights0[0] - weights0[2]*df_iris.iloc[0,1])/weights0[1]
x2 = (-weights0[0] - weights0[2]*df_iris.iloc[4,1])/weights0[1]
plt.axline((x1, df_iris.iloc[0,1]), (x2, df_iris.iloc[4,1]), color="green")
x1 = (-weights1[0] - weights1[2]*df_iris.iloc[0,1])/weights1[1]
x2 = (-weights1[0] - weights1[2]*df_iris.iloc[4,1])/weights1[1]
plt.axline((x1, df_iris.iloc[0,1]), (x2, df_iris.iloc[4,1]), color="red")
x1 = (-weights2[0] - weights2[2]*df_iris.iloc[0,1])/weights2[1]
x2 = (-weights2[0] - weights2[2]*df_iris.iloc[4,1])/weights2[1]
plt.axline((x1, df_iris.iloc[0,1]), (x2, df_iris.iloc[4,1]), color="blue")
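As a complement to the per-class lines above, you may also want the line along which two specific classes are equally likely. Under the softmax formulation, that is where the two class scores are equal, that is, where (w_i - w_j) . x + (b_i - b_j) = 0. The sketch below is a different technique from the one in the article; the helper pairwise_boundary_points and the sepal-width values 2.0 and 4.5 are hypothetical choices of mine, and the model is fitted on the full dataset.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

X, y = datasets.load_iris(return_X_y=True)
X2 = X[:, :2]  # sepal length, sepal width
model = LogisticRegression(max_iter=1000).fit(X2, y)
W, b = model.coef_, model.intercept_

def pairwise_boundary_points(i, j, widths=(2.0, 4.5)):
    """Two (sepal length, sepal width) points where classes i and j score equally."""
    dw, db = W[i] - W[j], b[i] - b[j]
    # Solve dw[0] * length + dw[1] * width + db = 0 for the sepal length
    return [((-db - dw[1] * w) / dw[0], w) for w in widths]

# Points on the line between class 1 (versicolor) and class 2 (virginica)
pts = np.array(pairwise_boundary_points(1, 2))
proba = model.predict_proba(pts)
print(proba[:, 1] - proba[:, 2])  # differences are zero up to rounding
```

The two points can then be passed to plt.axline in the same way as in the plots above, once for each pair of classes.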
That is all for now. I hope you liked the article. If you have any questions or need any clarification, do not hesitate to contact me. Thank you!