Regularization in Logistic Regression

Vincent Favilla
8 min read · May 30, 2023


View the accompanying Colab notebook.

In my previous article, we explored the basics of logistic regression. Now, we’ll focus on regularization, a technique used to prevent overfitting and improve model generalization in logistic regression.

In this second part of the series, we’ll discuss L1 and L2 regularization, explain the concept of convexity in the context of regularization, and provide guidance on choosing the appropriate regularization technique. We’ll also provide examples, illustrations, and visual aids to help you better understand these concepts.

Regularization: A Brief Overview

Regularization is a technique that adds a penalty term to the cost function, which measures how well the model is performing. This penalty term helps control the size of the coefficients (also called weights) in the model. Smaller coefficients usually result in a simpler model that is less likely to overfit. Regularization helps to balance the trade-off between model complexity and model fit, preventing the model from relying too heavily on any single predictor variable.
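
In pseudocode form (mirroring the penalty notation used later in this article), the regularized cost is simply the original cost plus a penalty term:

regularized_cost = original_cost + penalty_term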

There are different types of regularization terms, such as L1 and L2. These terms are calculated using p-norms, which have different effects on the model coefficients.

P-Norms and Regularization

P-norms are a family of mathematical functions used to calculate the “size” or “magnitude” of a vector. In the context of regularization, p-norms are used to calculate the penalty term based on the coefficients of the model. The most common p-norms used in regularization are the L1-norm (p=1) and the L2-norm (p=2).

The L1-norm is calculated as the sum of the absolute values of the coefficients, while the L2-norm is calculated as the square root of the sum of the squares of the coefficients. Different p-norms have different effects on the model coefficients, leading to different regularization techniques.
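
To make this concrete, here's a quick sketch in NumPy (with a made-up coefficient vector, purely for illustration) showing how each norm is computed:

import numpy as np

# Hypothetical coefficient vector, just for illustration
coefficients = np.array([0.5, -1.2, 0.0, 3.4])

l1_norm = np.sum(np.abs(coefficients))        # sum of absolute values -> 5.1
l2_norm = np.sqrt(np.sum(coefficients ** 2))  # square root of sum of squares -> ~3.64

# np.linalg.norm gives the same results
assert np.isclose(l1_norm, np.linalg.norm(coefficients, ord=1))
assert np.isclose(l2_norm, np.linalg.norm(coefficients, ord=2))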

L1, L2, and Elastic Net Regularization

In regularization techniques, the penalty term is controlled by a parameter called lambda (λ). Lambda determines the strength of the regularization effect, with larger values resulting in stronger regularization and smaller values resulting in weaker regularization. By adjusting lambda, we can control the balance between model complexity and model fit. In this section, we will discuss L1, L2, and Elastic Net regularization techniques and their relationship with lambda.

L1 Regularization (Lasso)

L1 regularization, also known as Lasso, uses the L1-norm as the penalty term. The L1-norm is calculated as the sum of the absolute values of the coefficients. Mathematically, the L1 penalty term is defined as:

L1_penalty = lambda * sum(abs(coefficients))

L1 regularization tends to create sparser models, meaning some coefficients will be exactly zero, effectively removing certain features from the model. This can be useful for feature selection, especially when there are many irrelevant features or when interpretability is important.

L2 Regularization (Ridge)

L2 regularization, also known as Ridge, uses the L2-norm as the penalty term. The L2-norm is the square root of the sum of the squared coefficients; in practice, the penalty uses the squared L2-norm (dropping the square root keeps the math simpler and doesn't change which coefficients are penalized most). Mathematically, the L2 penalty term is defined as:

L2_penalty = lambda * sum(coefficients ** 2)

L2 regularization tends to create models with smaller coefficients but does not force them to be exactly zero. This results in a more balanced model that still considers all features but with reduced importance.

Elastic Net Regularization

Elastic Net regularization is a combination of L1 and L2 regularization. It uses both the L1-norm and the L2-norm as penalty terms, with a mixing parameter (alpha) to control the balance between the two. Mathematically, the Elastic Net penalty term is defined as:

ElasticNet_penalty = alpha * L1_penalty + (1 - alpha) * L2_penalty

Elastic Net regularization can provide a balance between the sparsity of L1 regularization and the smoothness of L2 regularization, making it a useful option when it’s unclear which regularization technique to use. By adjusting the values of lambda and alpha, we can fine-tune the regularization effect to achieve the desired balance between model complexity and fit.
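
Putting the three formulas together, here's a minimal sketch in plain NumPy (lam and alpha are illustrative parameter names for this sketch, not a library API) of how the penalties could be computed:

import numpy as np

def elastic_net_penalty(coefficients, lam=1.0, alpha=0.5):
    # L1 and L2 penalties as defined above
    l1_penalty = lam * np.sum(np.abs(coefficients))
    l2_penalty = lam * np.sum(coefficients ** 2)
    # Mix the two with the alpha parameter
    return alpha * l1_penalty + (1 - alpha) * l2_penalty

coefficients = np.array([0.5, -1.2, 0.0, 3.4])
print(elastic_net_penalty(coefficients, lam=1.0, alpha=0.5))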

Intuition Behind L1 and L2 Effects on Coefficients

L1 regularization penalizes the model in proportion to the absolute size of each coefficient (assuming the predictor variables have been scaled to comparable ranges). Because the penalty pushes on a coefficient at the same rate no matter how small it already is, weak predictors get driven all the way to zero. This encourages the model to rely on a smaller set of predictor variables, leading to sparser models with some coefficients being exactly zero.

L2 regularization, on the other hand, adds a cost that grows with the square of each coefficient. Because that penalty's pull weakens as a coefficient approaches zero, L2 shrinks coefficients without forcing them all the way to zero. This encourages the model to distribute the importance of predictor variables more evenly across all features, resulting in smaller coefficients but not necessarily zero.
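
To see this difference in practice, here's a small sketch (using a synthetic dataset and arbitrary parameter values, just for illustration) that fits L1- and L2-regularized logistic regression and counts how many coefficients end up exactly zero:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 features, only 5 of which are informative
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=42)
X = StandardScaler().fit_transform(X)

for penalty in ['l1', 'l2']:
    model = LogisticRegression(penalty=penalty, C=0.1, solver='liblinear')
    model.fit(X, y)
    n_zero = np.sum(model.coef_ == 0)
    print(f"{penalty}: {n_zero} of {model.coef_.size} coefficients are exactly zero")

With a reasonably strong penalty (small C), the L1 model typically zeroes out many of the uninformative features, while the L2 model keeps all of them with shrunken weights.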

Convexity and Regularization

As we know, the goal of logistic regression is to minimize the cost function. This means we want to find the best combination of the model’s coefficients that results in the lowest cost. Doing this helps train the model to make accurate predictions on new, unseen data.

Convexity is a mathematical property that describes the shape of the cost function. A cost function is convex if it has no local minima other than the global minimum (picture a bowl shape), so any minimum the optimizer settles into is the best one achievable. Convexity is an important property because it guarantees that the optimization algorithm used to minimize the cost function will converge to the global minimum, regardless of the coefficients the algorithm starts with. The logistic regression cost function is already convex, even before any regularization is added.

L2 regularization preserves this convexity and improves on it: the quadratic penalty term makes the optimization problem strictly convex and smooth, which guarantees a unique solution and makes the cost function easier to optimize. L1 regularization also keeps the cost function convex, but its penalty is not differentiable wherever a coefficient equals zero; that kink is exactly what allows coefficients to be pinned at zero and is the source of its sparse solutions.

Challenges with Non-Smooth Cost Functions

Optimizing non-smooth cost functions, such as those resulting from L1 regularization, can be more challenging because the penalty has no well-defined gradient at zero, so plain gradient-based methods can struggle right where the interesting behavior happens. Solvers designed for the L1 penalty handle this (in scikit-learn, liblinear and saga), though optimization may take more iterations to converge, which is a fancy way of saying training can be slower.

In practice, however, L1 regularization often leads to sparse solutions that are still useful for feature selection and model interpretation.

Choosing the Appropriate Regularization Technique

The choice of regularization technique depends on the specific problem and the desired properties of the model, such as sparsity or simplicity. Here are some guidelines to help you choose the appropriate regularization technique:

  • If you have many irrelevant features or need a more interpretable model, consider using L1 regularization, as it creates sparser models with some coefficients being exactly zero.
  • If you want a more balanced model that considers all features but with reduced importance, consider using L2 regularization, as it creates models with smaller coefficients but not necessarily zero.
  • If you’re unsure which technique to use, consider trying Elastic Net regularization, which combines L1 and L2 regularization and allows you to control the balance between the two.
  • Compare the performance of different regularization techniques using cross-validation to choose the best one for your specific problem.

The Role of the Regularization Strength Parameter (Lambda)

The regularization strength parameter (lambda) controls the balance between model complexity and fit. A larger lambda value results in a stronger regularization effect, leading to smaller coefficients and a simpler model. Conversely, a smaller lambda value results in a weaker regularization effect, allowing the model to fit the data more closely. Tuning the lambda parameter is crucial for finding the optimal balance between model complexity and fit.

Implementing Regularization in Scikit-learn

Scikit-learn is a popular machine learning library in Python that provides built-in support for logistic regression with regularization. The LogisticRegression class allows you to specify the type of regularization (L1, L2, or Elastic Net) and the regularization strength C, which is the inverse of lambda, so smaller values of C mean stronger regularization. Here's an example of how to create a logistic regression model with L1 regularization and a specific regularization strength:

from sklearn.linear_model import LogisticRegression

# Create a logistic regression model with L1 regularization
# and a specific regularization strength.
model = LogisticRegression(penalty='l1', C=1.0, solver='liblinear')
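
Building on this, here's a rough sketch (with a synthetic dataset and arbitrary C values, purely for illustration) of how the coefficients shrink as the regularization gets stronger; remember that C is the inverse of lambda, so smaller C means stronger regularization:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data, scaled so the penalty treats all features comparably
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X = StandardScaler().fit_transform(X)

for C in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(penalty='l1', C=C, solver='liblinear').fit(X, y)
    print(f"C={C}: sum of |coefficients| = {np.sum(np.abs(model.coef_)):.3f}")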

Cross-validation is a model evaluation technique used to assess the performance of a machine learning model on unseen data. It involves dividing the dataset into multiple subsets, or “folds,” and iteratively training the model on different combinations of these folds while testing its performance on the remaining fold.

This process is repeated for each fold, and the model’s performance is averaged across all iterations. Cross-validation helps to estimate the model’s generalization ability and can be used to compare the performance of different regularization techniques. By selecting the regularization technique that yields the best cross-validation performance, you can ensure that your model is less likely to overfit and will perform well on new, unseen data.

The Python code for cross-validation typically looks something like this:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

# Load the Iris dataset
data = load_iris()
X, y = data.data, data.target

# When using regularization, you should scale the features first.
# Here's how:
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Create a logistic regression model with L1 regularization
# and a specific regularization strength.
model = LogisticRegression(penalty='l1', C=1.0, solver='liblinear')

# Perform 5-fold cross-validation
cv_scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')

# Calculate the average cross-validation score
mean_cv_score = cv_scores.mean()

print("Average 5-fold cross-validation score:", mean_cv_score)

You can test different settings using this code, but it’s more efficient to run a grid search, which tries different combinations of hyperparameters to determine the best model:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
import warnings

# Some parameter combinations below (e.g., l1_ratio with an 'l1' or 'l2' penalty)
# trigger warnings about ignored parameters; silence them to keep the output readable.
warnings.filterwarnings('ignore')

# Load the Iris dataset
data = load_iris()
X, y = data.data, data.target

# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Create a logistic regression model
model = LogisticRegression(solver='saga', max_iter=5000)

# Create a list of regularization strengths and penalties to test
params = {
    'C': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0],
    'penalty': [None, 'l1', 'l2', 'elasticnet'],
    'l1_ratio': [.25, .50, .75]  # this is 'alpha' as described earlier
}

# Perform 5-fold cross-validation with different regularization strengths and penalties
clf = GridSearchCV(model, params, cv=5, scoring='accuracy')
clf.fit(X_scaled, y)

# Print the best regularization strength and penalty
print("Best regularization strength:", clf.best_params_['C'])
print("Best penalty:", clf.best_params_['penalty'])

if clf.best_params_['penalty'] == 'elasticnet':
    print("Best alpha:", clf.best_params_['l1_ratio'])

Conclusion

In this second part of my logistic regression series, we delved into regularization, covering L1, L2, and Elastic Net regularization and exploring the role of convexity in optimizing the cost function.

Regularization has numerous real-world applications, and you’ll find it in just about every domain where machine learning and deep learning models are used. This includes computer vision, natural language processing, speech recognition, and many more areas. By using regularization techniques, we can create more robust and generalizable models that perform well on new, unseen data.

In the next part of this series, we’ll explore the practical aspects of implementing a logistic regression, including handling nonlinear relationships, addressing multicollinearity, and diving deeper into the use of grid search to create optimal models.
