Iris Flower Classification and Decision Boundary Plotting using Logistic Regression

Idriss Jairi
Sep 4, 2021

In this new article, I am going to explain how to solve a classification problem using the scikit-learn library and the logistic regression algorithm.

Without further ado, let’s dive in!

Problem:

Build a machine learning model that distinguishes between the three categories (or classes) of the Iris flower.

Context:

The Iris flower data set or Fisher’s Iris data set is a famous dataset in machine learning.
The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.

The three types of the IRIS flower.
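As a quick sanity check of the description above, here is a minimal sketch that loads the copy of the dataset bundled with scikit-learn and counts the samples, the features, and the samples per class. The variable names are illustrative and not part of the original walkthrough.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# iris.data is a (samples, features) array; iris.target holds class ids 0, 1, 2.
n_samples, n_features = iris.data.shape
class_counts = np.bincount(iris.target)

print(n_samples, n_features)
print(list(iris.target_names), class_counts)
```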

Importing Libraries and Dataset:

Libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import seaborn as sn
from sklearn import metrics

Dataset

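# as_frame=True (scikit-learn >= 0.23) returns pandas objects;
# iris.frame holds the four feature columns plus a 'target' column
iris = datasets.load_iris(as_frame=True)
df_iris = iris.frame

Figure: Iris Dataset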

Building Logistic Regression Model

iris_model = LogisticRegression(max_iter=1000)

Selecting the Features X and the Target Y

X = df_iris[iris['feature_names']]
y = df_iris['target']

Splitting the dataset into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
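The call above draws a different random split on every run, so the accuracy you see will vary slightly. Here is a minimal sketch, assuming you want a reproducible split that also keeps the class proportions similar in both sets. `random_state` and `stratify` are existing parameters of `train_test_split`; the seed value 42 is an arbitrary choice.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)

# random_state fixes the shuffle; stratify=y keeps the three classes balanced.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
print(np.bincount(y_train), np.bincount(y_test))
```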

Fitting the model

iris_model.fit(X_train, y_train)

Note that you can read the intercepts and the weights from these two attributes of the fitted model:

print(iris_model.intercept_)
print(iris_model.coef_)
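To make the layout of these two attributes concrete, here is a small self-contained sketch that fits the same kind of model on all four features and prints the array shapes. It fits on the full dataset only to keep the example short.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# One intercept per class, and one row of feature weights per class.
print(model.intercept_.shape)
print(model.coef_.shape)
```

With three classes and four features, `intercept_` has three entries and `coef_` has three rows of four weights each.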

Making Predictions

y_pred = iris_model.predict(X_test)
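Behind `predict`, the model assigns a probability to each class and returns the most likely one. Here is a minimal sketch of that relationship using `predict_proba`; the split seed is an arbitrary assumption.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

proba = model.predict_proba(X_test)   # one row per sample, one column per class
y_pred = model.predict(X_test)

# The predicted label is the class with the highest probability.
most_likely = model.classes_[np.argmax(proba, axis=1)]
print(proba[:3].round(3))
print(y_pred[:3], most_likely[:3])
```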

Compute accuracy

iris_model.score(X_test, y_test)

Confusion Matrix

# Confusion matrix
confusion_matrix = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'])
sn.heatmap(confusion_matrix, annot=True)
print('Accuracy: ', metrics.accuracy_score(y_test, y_pred))
plt.show()

Figure: IRIS Confusion Matrix
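The accuracy printed above is just the share of the confusion matrix that sits on the diagonal. Here is a minimal sketch of that link, using `sklearn.metrics.confusion_matrix` instead of `pd.crosstab`. This is a swapped-in alternative rather than the approach used above, and the split seed is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])

# Correct predictions lie on the diagonal.
accuracy_from_cm = np.trace(cm) / cm.sum()
print(cm)
print(accuracy_from_cm, accuracy_score(y_test, y_pred))
```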

Plotting the Decision Boundary

A decision boundary is the line that separates the regions where the model predicts y = 0, y = 1, and y = 2. It is determined by our hypothesis function.

Decision Boundary for Binary Classification

In this example, I am going to consider only two classes, 0 and 1 (setosa and versicolor), and only the first two features: sepal length (cm) and sepal width (cm).

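We follow the same process as above to compute the intercept and the weights. Once we have them, we set our hypothesis function:

h(x) = g(z) = 1 / (1 + e^(-z))

Here g is the logistic (sigmoid) function and z is a linear combination of the input features.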

To get our discrete 0 or 1 classification, we can translate the output of the hypothesis function as follows:

h(x) >= 0.5  ->  y = 1
h(x) < 0.5   ->  y = 0

The way our logistic function g behaves is that when its input is greater than or equal to zero, its output is greater than or equal to 0.5:

g(z) >= 0.5
where z >= 0
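The threshold rule can be sketched in a few lines of plain Python. The function names `sigmoid` and `classify` are illustrative, and the sketch is not hardened against overflow for very large negative inputs.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def classify(z: float, threshold: float = 0.5) -> int:
    """Return 1 when g(z) reaches the threshold, otherwise 0."""
    return 1 if sigmoid(z) >= threshold else 0


for z in (-2.0, 0.0, 2.0):
    print(z, round(sigmoid(z), 3), classify(z))
```

Because g(0) is exactly 0.5 and g is increasing, comparing g(z) with 0.5 is the same as checking the sign of z.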

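In our case, z is defined as follows:

z = weights[0] + weights[1] * sepal_L + weights[2] * sepal_W

where weights[0] is the intercept, and weights[1] and weights[2] are the coefficients of sepal length and sepal width. To draw the decision boundary as a straight line, we set z equal to zero and solve for sepal_L:

sepal_L = (-weights[0] - weights[2] * sepal_W) / weights[1]

You can then select two different sepal_W values, compute the matching sepal_L values, and draw a straight line through the two points: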

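Figure: Decision Boundary for Binary Classification

# Assumes df_iris holds only classes 0 and 1 and iris_model was refit on the two sepal features
# weights = [intercept, sepal-length weight, sepal-width weight]
weights = np.r_[iris_model.intercept_, iris_model.coef_[0]]
g = sn.scatterplot(x="sepal length (cm)", y="sepal width (cm)",
                   hue="target",
                   data=df_iris, palette=['green', 'red'])
x1 = (-weights[0] - weights[2]*df_iris.iloc[0,1])/weights[1]
x2 = (-weights[0] - weights[2]*df_iris.iloc[4,1])/weights[1]
plt.axline((x1, df_iris.iloc[0,1]), (x2, df_iris.iloc[4,1]), color="green")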

where df_iris.iloc[0,1] and df_iris.iloc[4,1] are two sepal width values from the dataset.
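To check the algebra end to end, here is a self-contained sketch that fits the binary model on the two sepal features, builds the `weights` array, and verifies that points computed from the boundary formula give z = 0. The helper `sepal_length_on_boundary` and the two sepal widths are illustrative assumptions, not part of the original code.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
mask = iris.target < 2            # keep classes 0 (setosa) and 1 (versicolor)
X = iris.data[mask][:, :2]        # sepal length, sepal width
y = iris.target[mask]

model = LogisticRegression(max_iter=1000).fit(X, y)

# weights[0] is the intercept; weights[1] and weights[2] multiply
# sepal length and sepal width respectively.
weights = np.r_[model.intercept_, model.coef_[0]]


def sepal_length_on_boundary(sepal_width):
    """Solve weights[0] + weights[1]*sepal_L + weights[2]*sepal_W = 0 for sepal_L."""
    return (-weights[0] - weights[2] * sepal_width) / weights[1]


widths = np.array([2.0, 4.5])     # two arbitrary sepal widths
points = np.column_stack([sepal_length_on_boundary(widths), widths])

print(points)
print(model.decision_function(points))  # z at both points, approximately 0
```

Passing the two rows of `points` to `plt.axline` draws the same kind of line as the snippet above.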

Decision Boundary for Multiclass Classification

Now, let’s consider all the classes (0, 1, and 2), and let’s draw the decision boundaries for these three classes.

To visualize the decision boundary for these three classes, I am going to consider only two features (sepal length and sepal width).

After you run the logistic regression as we did in the first section, you can print the values of the coefficients, including the intercepts. The output looks something like this:

Figure: Our Model Coefficients

The first array contains the intercepts for the three classes (0, 1, and 2). The second contains three arrays, one per class, each holding the coefficients for our two selected features. We can then draw one line per class, where that class's linear score equals zero, the same way we drew the binary classification boundary:

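# Assumes iris_model was refit on the two sepal features with all three classes
# weightsK = [intercept, sepal-length weight, sepal-width weight] for class K
weights0 = np.r_[iris_model.intercept_[0], iris_model.coef_[0]]
weights1 = np.r_[iris_model.intercept_[1], iris_model.coef_[1]]
weights2 = np.r_[iris_model.intercept_[2], iris_model.coef_[2]]

g = sn.scatterplot(x="sepal length (cm)", y="sepal width (cm)",
                   hue="target",
                   data=df_iris, palette=['green', 'red', 'blue'])

# Class 0 line
x1 = (-weights0[0] - weights0[2]*df_iris.iloc[0,1])/weights0[1]
x2 = (-weights0[0] - weights0[2]*df_iris.iloc[4,1])/weights0[1]
plt.axline((x1, df_iris.iloc[0,1]), (x2, df_iris.iloc[4,1]), color="green")

# Class 1 line
x1 = (-weights1[0] - weights1[2]*df_iris.iloc[0,1])/weights1[1]
x2 = (-weights1[0] - weights1[2]*df_iris.iloc[4,1])/weights1[1]
plt.axline((x1, df_iris.iloc[0,1]), (x2, df_iris.iloc[4,1]), color="red")

# Class 2 line
x1 = (-weights2[0] - weights2[2]*df_iris.iloc[0,1])/weights2[1]
x2 = (-weights2[0] - weights2[2]*df_iris.iloc[4,1])/weights2[1]
plt.axline((x1, df_iris.iloc[0,1]), (x2, df_iris.iloc[4,1]), color="blue")

Figure: Multiclass Classification Decision Boundaries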
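One caveat on the three lines above: each marks where a single class's linear score is zero. With three classes, `predict` returns the class with the highest score, so the predicted label actually changes where two classes' scores are equal. Here is a sketch of two complementary views, under that reading: predicting over a grid of points to see the real regions, and computing the line where two classes tie. The helpers `predict_grid`, `pairwise_boundary`, and `plot_regions` are hypothetical names introduced for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris.data[:, :2]              # sepal length, sepal width
y = iris.target
model = LogisticRegression(max_iter=1000).fit(X, y)


def predict_grid(model, X, step=0.02, margin=0.5):
    """Predict a class for every point of a grid covering the data."""
    x_min, x_max = X[:, 0].min() - margin, X[:, 0].max() + margin
    y_min, y_max = X[:, 1].min() - margin, X[:, 1].max() + margin
    xx, yy = np.meshgrid(np.arange(x_min, x_max, step),
                         np.arange(y_min, y_max, step))
    zz = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    return xx, yy, zz


def pairwise_boundary(model, i, j):
    """Return (intercept, w_length, w_width) of the line where classes i and j tie."""
    b = model.intercept_[i] - model.intercept_[j]
    w = model.coef_[i] - model.coef_[j]
    return b, w[0], w[1]


def plot_regions(model, X, y):
    """Shade the predicted regions and overlay the samples."""
    import matplotlib.pyplot as plt  # imported here so the numeric helpers run without it
    xx, yy, zz = predict_grid(model, X)
    plt.contourf(xx, yy, zz, alpha=0.3)
    plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor="k")
    plt.xlabel("sepal length (cm)")
    plt.ylabel("sepal width (cm)")
    plt.show()
```

Calling `plot_regions(model, X, y)` shades the three predicted regions. The tuple from `pairwise_boundary(model, 0, 1)` can be plugged into the same solve-for-sepal_L formula used in the binary case to draw the line between classes 0 and 1.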

That is all for now. I hope you liked the article. If you have any questions or need any clarification, do not hesitate to contact me. Thank you!
