Logistic Regression on IRIS Dataset

Vijay Gautam
5 min read · Apr 18, 2020

An implementation of Logistic Regression on the IRIS dataset using the scikit-learn library.

Logistic Regression is a supervised classification algorithm: despite the "regression" in its name, it is used for classification. It models the relationship between one or more independent variables (X) and a categorical dependent variable (Y) by estimating class-membership probabilities with a logistic (sigmoid) function. The term "Regression" survives in the name because the model is, in effect, regressing for the probability of a categorical outcome.

Why is Linear Regression not used for classification, even though it too can regress the probability of a categorical outcome?

The main reason is that classification needs the probability of belonging to a class, which must lie between 0 and 1, while linear regression places no such bound on its output: a linear model can predict any value from negative to positive infinity. This is where the logistic (sigmoid) function comes into play: it squashes any real-valued input into the range between 0 and 1.

Figure: the sigmoid function, σ(z) = 1 / (1 + e^(-z)), which maps any real input to a value between 0 and 1.
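
To make that squashing behavior concrete, here is a minimal sketch of the sigmoid in NumPy (my addition, not part of the original post's code):

# A minimal sketch of the sigmoid function (illustration only)
import numpy as np

def sigmoid(z):
    # Maps any real-valued input into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-10.0, -1.0, 0.0, 1.0, 10.0])))
# Values far below zero map close to 0, zero maps to exactly 0.5,
# and values far above zero map close to 1.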

Logistic Regression on IRIS:

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Loading the dataset:

# Importing the dataset
dataset = pd.read_csv('iris.csv')
dataset.describe()
Output----------------------------------
sepal_length sepal_width petal_length petal_width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.054000 3.758667 1.198667
std 0.828066 0.433594 1.764420 0.763161
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
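
describe() only summarizes the numeric feature columns. As a quick extra check (my addition), the label column can be inspected by position, without assuming its name, to confirm that the standard IRIS data has 50 samples of each of the three species:

# Class distribution of the label column (the 5th column)
print(dataset.iloc[:, 4].value_counts())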

Splitting the dataset into the Training set and Test set:

# Splitting the dataset into the Training set and Test set
X = dataset.iloc[:, [0, 1, 2, 3]].values  # the four measurement columns
y = dataset.iloc[:, 4].values             # the species label
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
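
One small, optional refinement (my addition, not in the original code): train_test_split accepts a stratify argument that keeps the class proportions identical in the training and test splits, which matters more on imbalanced data. The results shown below use the original, non-stratified split.

# Same split, but with class proportions preserved in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)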

Feature Scaling:

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)  # fit the scaler on the training set only
X_test = sc.transform(X_test)        # reuse the training set's statistics
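
As a quick sanity check (my addition), the scaled training features should now have mean 0 and standard deviation 1 in every column, while the test features are only approximately standardized, since they were transformed with the training set's statistics:

# Verify the effect of standardization
print(X_train.mean(axis=0))  # ~0 for every feature
print(X_train.std(axis=0))   # ~1 for every feature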

Fitting Logistic Regression to the Training set:

# Fitting Logistic Regression to the Training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0, solver='lbfgs', multi_class='auto')
classifier.fit(X_train, y_train)
Output----------------------------------
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='auto',
n_jobs=None, penalty='l2', random_state=0, solver='lbfgs',
tol=0.0001, verbose=0, warm_start=False)
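
Once fitted, the model's learned parameters can be inspected directly (this snippet is my addition). With three classes and this configuration, scikit-learn fits a multinomial model, so coef_ holds one row of four weights per class:

# Inspect the fitted model
print(classifier.classes_)    # the three species labels
print(classifier.coef_)       # shape (3, 4): one weight per class and feature
print(classifier.intercept_)  # one bias term per class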

Predicting the Test set results:

# Predicting the Test set results
y_pred = classifier.predict(X_test)
# Predict class probabilities for the test set
probs_y = classifier.predict_proba(X_test)
# Print the results as a table
probs_y = np.round(probs_y, 2)
res = "{:<10} | {:<10} | {:<10} | {:<13} | {:<5}".format("y_test", "y_pred", "Setosa(%)", "versicolor(%)", "virginica(%)\n")
res += "-"*65+"\n"
res += "\n".join("{:<10} | {:<10} | {:<10} | {:<13} | {:<10}".format(x, y, a, b, c) for x, y, a, b, c in zip(y_test, y_pred, probs_y[:,0], probs_y[:,1], probs_y[:,2]))
res += "\n"+"-"*65+"\n"
print(res)
Output----------------------------------
y_test | y_pred | Setosa(%) | versicolor(%) | virginica(%)
-----------------------------------------------------------------
virginica | virginica | 0.0 | 0.03 | 0.97
versicolor | versicolor | 0.01 | 0.95 | 0.04
setosa | setosa | 1.0 | 0.0 | 0.0
virginica | virginica | 0.0 | 0.08 | 0.92
setosa | setosa | 0.98 | 0.02 | 0.0
virginica | virginica | 0.0 | 0.01 | 0.99
setosa | setosa | 0.98 | 0.02 | 0.0
versicolor | versicolor | 0.01 | 0.71 | 0.28
versicolor | versicolor | 0.0 | 0.73 | 0.27
versicolor | versicolor | 0.02 | 0.89 | 0.08
virginica | virginica | 0.0 | 0.44 | 0.56
versicolor | versicolor | 0.02 | 0.76 | 0.22
versicolor | versicolor | 0.01 | 0.85 | 0.13
versicolor | versicolor | 0.0 | 0.69 | 0.3
versicolor | versicolor | 0.01 | 0.75 | 0.24
setosa | setosa | 0.95 | 0.05 | 0.0
versicolor | versicolor | 0.02 | 0.72 | 0.26
versicolor | versicolor | 0.03 | 0.86 | 0.11
setosa | setosa | 0.94 | 0.06 | 0.0
setosa | setosa | 0.99 | 0.01 | 0.0
virginica | virginica | 0.0 | 0.17 | 0.83
versicolor | versicolor | 0.04 | 0.71 | 0.25
setosa | setosa | 0.98 | 0.02 | 0.0
setosa | setosa | 0.96 | 0.04 | 0.0
virginica | virginica | 0.0 | 0.35 | 0.65
setosa | setosa | 1.0 | 0.0 | 0.0
setosa | setosa | 0.99 | 0.01 | 0.0
versicolor | versicolor | 0.02 | 0.87 | 0.11
versicolor | versicolor | 0.09 | 0.9 | 0.02
setosa | setosa | 0.97 | 0.03 | 0.0
virginica | virginica | 0.0 | 0.21 | 0.79
versicolor | versicolor | 0.06 | 0.69 | 0.25
setosa | setosa | 0.98 | 0.02 | 0.0
virginica | virginica | 0.0 | 0.35 | 0.65
virginica | virginica | 0.0 | 0.04 | 0.96
versicolor | versicolor | 0.07 | 0.81 | 0.11
setosa | setosa | 0.97 | 0.03 | 0.0
versicolor | virginica | 0.0 | 0.42 | 0.58
-----------------------------------------------------------------

Note: scikit-learn uses a default threshold of 0.5 for binary classification. For a multi-class problem like this one, predict() simply returns the class with the highest predicted probability.
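
That behavior is easy to verify (this check is my addition): taking the argmax of each row of predict_proba and mapping it through classes_ reproduces predict exactly.

# Verify: predict() picks the class with the highest probability
probs = classifier.predict_proba(X_test)  # unrounded probabilities
print((classifier.classes_[probs.argmax(axis=1)] == y_pred).all())  # True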

Making the Confusion Matrix:

from sklearn.metrics import confusion_matrix
# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)
Output----------------------------------
[[13  0  0]
 [ 0 15  1]
 [ 0  0  9]]
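
The matrix shows that only one of the 38 test samples is misclassified: a versicolor predicted as virginica, the last row of the probability table above. A convenient one-number summary (my addition) is scikit-learn's accuracy_score:

# Overall test accuracy
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))  # 37 of 38 correct, ~0.97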

Plotting the confusion matrix:

# Plot confusion matrix
import seaborn as sns
# Label the heatmap axes with the class names
df_cm = pd.DataFrame(cm, index=classifier.classes_, columns=classifier.classes_)
ax = plt.axes()
sns.heatmap(df_cm, annot=True, annot_kws={"size": 30}, fmt='d', cmap="Blues", ax=ax)
ax.set_title('Confusion Matrix')
plt.show()
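
As a closing design note (my addition, not from the original post), scikit-learn's Pipeline can bundle the scaler and the classifier into a single estimator. This makes it impossible to accidentally fit the scaler on test data and keeps the whole workflow reusable:

# A sketch of the same workflow as a single Pipeline
# (reuses X, y and the imports from the earlier cells)
from sklearn.pipeline import make_pipeline
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression(random_state=0))
model.fit(X_tr, y_tr)               # the scaler is fit on the training split only
print(model.score(X_te, y_te))      # mean accuracy on the test split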

The code is available on my GitHub.

References

  1. scikit-learn LogisticRegression documentation: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
  2. Logistic Regression (Wikipedia): https://en.wikipedia.org/wiki/Logistic_regression
  3. Logistic Regression Detailed Overview (Towards Data Science): https://towardsdatascience.com/logistic-regression-detailed-overview-46c4da4303bc

Thanks for reading. :)
