Regularization in Logistic Regression

Saurav Agrawal
3 min read · Jun 15, 2023


Photo by Francesco Ungaro on Pexels

Logistic Regression is a linear classification algorithm. Its hyperparameter C is the inverse of the regularization strength: a higher C value means weaker regularization, and a smaller C value means stronger regularization.
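To see what C does before the full walkthrough, here is a minimal sketch (an addition using toy data, not the article's dataset) showing that a smaller C shrinks the learned coefficients harder under scikit-learn's default L2 penalty:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy data purely for illustration
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

for C in (0.01, 1.0, 100.0):
    model = LogisticRegression(C=C, max_iter=1000).fit(X_toy, y_toy)
    # Smaller C -> stronger L2 penalty -> smaller coefficient magnitudes
    print(f"C={C:>6}: sum of |coefficients| = {np.abs(model.coef_).sum():.3f}")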

1. Importing libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

2. Reading the PIMA Indians Diabetes CSV Dataset.

df = pd.read_csv("/kaggle/input/pima-indians-diabetes-database/diabetes.csv")
print(df.head())
print("Shape of the Dataset: {}".format(df.shape))
Dataset Head and Shape

3. Preprocessing and separating the train and test data.

# Segregating the Feature and Target
X = df.drop("Outcome", axis=1).values
y = df["Outcome"].values

# Train Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42, stratify=y)

print("Shape of Train Features: {}".format(X_train.shape))
print("Shape of Test Features: {}".format(X_test.shape))
print("Shape of Train Target: {}".format(y_train.shape))
print("Shape of Test Target: {}".format(y_test.shape))
Shape of Train and Test Dataset

4. Logistic Regression with varying C values.

# Lists to record the accuracy
training_accuracy = []
testing_accuracy = []

# Lists to record the error
training_error = []
testing_error = []

# C hyperparameter values to sweep
C_param = np.linspace(0.001, 100, 100)

for C_value in C_param:
    # Logistic Regression with the current C (max_iter raised so the solver converges)
    lr = LogisticRegression(C=C_value, max_iter=1000)
    lr.fit(X_train, y_train)

    # Compute each score once and reuse it
    train_score = lr.score(X_train, y_train)
    test_score = lr.score(X_test, y_test)

    # Appending the accuracy score
    training_accuracy.append(train_score)
    testing_accuracy.append(test_score)

    # Appending the error score (error = 1 - accuracy)
    training_error.append(1 - train_score)
    testing_error.append(1 - test_score)
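A side note not in the original walkthrough: the PIMA features sit on very different scales, and the L2 penalty is scale-sensitive, so standardizing the features often helps both convergence and the C sweep. A minimal sketch, assuming the same X_train/X_test split as above:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize the features, then fit logistic regression, inside one pipeline
pipe = make_pipeline(StandardScaler(), LogisticRegression(C=1.0, max_iter=1000))
pipe.fit(X_train, y_train)
print("Test accuracy with scaling: {:.3f}".format(pipe.score(X_test, y_test)))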

5. Making the plot function.

def plot_log(X_values, train_values, test_values, xlabel, ylabel, title):
    # Plot the train and test curves on a log-scaled x axis
    plt.semilogx(X_values, train_values, X_values, test_values)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    plt.legend(("Train", "Test"))
    plt.grid()
    plt.show()

6. Plotting the accuracy.

plot_log(C_param, training_accuracy, testing_accuracy, "C Values", "Accuracy", "Logistic Regression Accuracy with varying C values")
Accuracy Plot

7. Plotting the error.

plot_log(C_param, training_error, testing_error, "C Values", "Error", "Logistic Regression Errors with varying C values")
Error Plot

Less regularization (larger C) leads to higher accuracy on the training set but lower accuracy on the test set.

More regularization (smaller C) leads to lower accuracy on the training set but higher accuracy on the test set.

Regularization combats overfitting. A large C applies less regularization; a small C applies more.
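Rather than reading the best C off the plots, cross-validation can select it automatically. A minimal sketch (an addition to the article, reusing the X_train/X_test split from above):

from sklearn.linear_model import LogisticRegressionCV

# Search a log-spaced grid of C values with 5-fold cross-validation
lr_cv = LogisticRegressionCV(Cs=np.logspace(-3, 2, 20), cv=5, max_iter=1000)
lr_cv.fit(X_train, y_train)
print("Best C: {:.4f}".format(lr_cv.C_[0]))
print("Test accuracy: {:.3f}".format(lr_cv.score(X_test, y_test)))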

In the next article, we will discuss the types of regularization and their impact on the model.

I appreciate you and the time you took out of your day to read this!

Linkedin: https://www.linkedin.com/in/saurav-agrawal-137500214/

StackOverFlow: https://stackoverflow.com/users/11842006/saurav-agrawal

Email: agrawalsam1997@gmail.com

