Regularization in Logistic Regression
Logistic regression is a linear classification algorithm. In scikit-learn, its hyperparameter C is the inverse of the regularization strength: a higher C means weaker regularization, and a smaller C means stronger regularization.
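To see why C acts as an inverse, it helps to write the objective down. The sketch below assumes an L2 penalty and labels coded as -1/+1, and uses the common form 0.5 * ||w||^2 + C * sum(log(1 + exp(-y_i * (x_i . w + b)))). The function name `l2_logistic_objective` and the toy data are made up for illustration; this is not scikit-learn's internal code.

```python
import math

def l2_logistic_objective(w, b, X, y, C):
    """Return (penalty, data_loss, total) for labels y in {-1, +1}."""
    # L2 penalty on the weights; the intercept b is not penalized here
    penalty = 0.5 * sum(wj * wj for wj in w)
    data_loss = 0.0
    for xi, yi in zip(X, y):
        margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        # log(1 + exp(-margin)), computed in a numerically stable way
        data_loss += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    # C multiplies the data term, so a large C shrinks the penalty's relative weight
    return penalty, data_loss, penalty + C * data_loss

# Toy data, purely illustrative (not the diabetes dataset)
X_toy = [[1.0, 2.0], [-1.0, -1.5], [2.0, 0.5], [-2.0, -0.5]]
y_toy = [1, -1, 1, -1]

for C in (0.01, 1.0, 100.0):
    penalty, data_loss, total = l2_logistic_objective([0.5, 0.5], 0.0, X_toy, y_toy, C)
    print(C, penalty / total)  # share of the objective that comes from the penalty
```

Because C multiplies the data term rather than the penalty, increasing C makes fitting the training data matter more relative to keeping the weights small.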
1. Importing libraries.
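import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
# train_test_split is needed for the train/test split in step 3
from sklearn.model_selection import train_test_split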
2. Reading the PIMA Indians Diabetes CSV Dataset.
df = pd.read_csv("/kaggle/input/pima-indians-diabetes-database/diabetes.csv")
print(df.head())
print("Shape of the Dataset: {}".format(df.shape))
3. Preprocessing and separating the train and test data.
# Segregating the Feature and Target
X = df.drop("Outcome", axis=1).values
y = df["Outcome"].values
# Train Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42, stratify=y)
print("Shape of Train Features: {}".format(X_train.shape))
print("Shape of Test Features: {}".format(X_test.shape))
print("Shape of Train Target: {}".format(y_train.shape))
print("Shape of Test Target: {}".format(y_test.shape))
4. Logistic Regression with varying C values.
# list to record the accuracy
training_accuracy = []
testing_accuracy = []
# list to record the error
training_error = []
testing_error = []
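# C Hyperparameter: 100 values spaced evenly on a log scale from 0.001 to 100,
# which matches the logarithmic x-axis used in the plots
C_param = np.logspace(-3, 2, 100)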
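for C_value in C_param:
    # Logistic Regression; max_iter is raised so the solver has room to converge on unscaled features
    lr = LogisticRegression(C=C_value, max_iter=1000)
    lr.fit(X_train, y_train)
    # Score each split once and reuse the value for accuracy and error
    train_score = lr.score(X_train, y_train)
    test_score = lr.score(X_test, y_test)
    # Appending the Accuracy Score
    training_accuracy.append(train_score)
    testing_accuracy.append(test_score)
    # Appending the Error Score (error = 1 - accuracy)
    training_error.append(1 - train_score)
    testing_error.append(1 - test_score)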
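The loop above records accuracy only. To see the regularization itself at work, one could also record the size of the learned coefficients for a few C values. The sketch below is a minimal illustration on synthetic data from `make_classification` (an assumption, so that it runs without the Kaggle file); the names `X_demo`, `y_demo`, and `coef_norms` are made up for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption; NOT the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

coef_norms = []
for C_value in [0.001, 0.1, 10.0]:
    # Default L2 penalty; max_iter raised to give the solver room to converge
    model = LogisticRegression(C=C_value, max_iter=1000)
    model.fit(X_demo, y_demo)
    # Euclidean norm of the weight vector: smaller means more shrinkage
    coef_norms.append(float(np.linalg.norm(model.coef_)))

print(coef_norms)
```

With stronger regularization (small C) the weights are pulled toward zero, so the norm should grow as C increases.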
5. Making the plot function.
def plot_log(X_values, train_values, test_values, Xlabel, ylabel, title):
    # Train and test curves share a logarithmic x-axis
    plt.semilogx(X_values, train_values, X_values, test_values)
    plt.xlabel(Xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    plt.legend(("Train", "Test"))
    plt.grid()
    plt.show()
6. Plotting the accuracy.
plot_log(C_param, training_accuracy, testing_accuracy, "C Values", "Accuracy", "Logistic Regression Accuracy with varying C values")
7. Plotting the error.
plot_log(C_param, training_error, testing_error, "C Values", "Error", "Logistic Regression Errors with varying C values")
Less regularization (large C) tends to give higher accuracy on the training dataset but can lower accuracy on the testing dataset.
More regularization (small C) tends to give lower accuracy on the training dataset but can improve accuracy on the testing dataset, up to the point where the model starts to underfit.
Regularization combats overfitting. Large C means less regularization, and small C means more regularization.
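To read this trade-off from recorded scores rather than from the plot alone, a small helper along these lines could be used. `summarize_curves` is a hypothetical name, and the numbers in the usage are made up for illustration; they are not results from the diabetes dataset.

```python
def summarize_curves(C_values, train_scores, test_scores):
    """Return the C with the best test score and the train/test gap at each C."""
    # A large positive gap (train much better than test) hints at overfitting
    gaps = [tr - te for tr, te in zip(train_scores, test_scores)]
    # Index of the highest test score (the first one if there are ties)
    best_idx = max(range(len(test_scores)), key=test_scores.__getitem__)
    return C_values[best_idx], gaps

# Made-up illustrative numbers, NOT results from the diabetes dataset
toy_C = [0.001, 0.1, 10.0]
toy_train = [0.70, 0.78, 0.79]
toy_test = [0.69, 0.76, 0.75]

best_C, gaps = summarize_curves(toy_C, toy_train, toy_test)
print(best_C, [round(g, 2) for g in gaps])
```

In practice, picking C by its test score lets the test set leak into model selection, so a separate validation split or cross-validation is usually the safer choice.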
In the next article, we will discuss the types of regularization and their impact on the model.
I appreciate you and the time you took out of your day to read this!
LinkedIn: https://www.linkedin.com/in/saurav-agrawal-137500214/
Stack Overflow: https://stackoverflow.com/users/11842006/saurav-agrawal
Email: agrawalsam1997@gmail.com