Demystifying L1 and L2 Regularization in Machine Learning

Taming the Overfitting Beast

Dagang Wei
4 min read · Jan 29, 2024
Source: https://jashrathod.github.io/2021-09-30-underfitting-overfitting-and-regularization/

This article is part of the series Demystifying Machine Learning.

Introduction

In machine learning, regularization is a key technique for improving how well predictive models generalize. Among the various regularization techniques, L1 and L2 regularization are the most prominent. This blog post aims to demystify them: what they are, why and when to use them, and when they might not be necessary. We’ll also dive into a Python example comparing models with and without regularization.

What are L1 and L2 Regularization?

L1 Regularization (Lasso)

L1 regularization, also known as Lasso regression, adds a penalty proportional to the sum of the absolute values of the coefficients. Because this penalty can drive some coefficients exactly to zero, L1 also acts as a form of automatic feature selection.
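As a quick illustration (not from the original article), here is a minimal scikit-learn sketch on synthetic data where only two of ten features are informative; the hypothetical setup below shows Lasso pushing the coefficients of the irrelevant features exactly to zero:

import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: 10 features, but the target depends only on features 0 and 1
rng = np.random.RandomState(42)
X = rng.randn(200, 10)
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.randn(200) * 0.1

lasso = Lasso(alpha=0.1)   # alpha plays the role of lambda
lasso.fit(X, y)
print(lasso.coef_)         # coefficients of the 8 irrelevant features come out exactly 0.0

Because the absolute-value penalty is non-smooth at zero, the optimizer can park coefficients exactly at zero, which is what makes L1 useful for feature selection.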

L2 Regularization (Ridge)

L2 regularization, or Ridge regression, adds a penalty proportional to the sum of the squared coefficients. Unlike L1, it does not force coefficients to zero; it shrinks them toward zero, reducing the influence of any single feature.
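For comparison, here is the same kind of sketch (again, not from the original article) with Ridge; every coefficient is shrunk, but none becomes exactly zero:

import numpy as np
from sklearn.linear_model import Ridge

# Same synthetic setup as in the Lasso sketch above
rng = np.random.RandomState(42)
X = rng.randn(200, 10)
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.randn(200) * 0.1

ridge = Ridge(alpha=1.0)   # alpha again controls the penalty strength
ridge.fit(X, y)
print(ridge.coef_)         # coefficients are shrunk, but none is exactly 0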

Why Use Regularization?

Regularization is used to prevent overfitting, a common issue in machine learning where a model performs well on training data but poorly on unseen data. By penalizing large coefficients, regularization constrains the model’s complexity, helping it generalize beyond the training data.

How Do They Work?

Both L1 and L2 regularization work by adding a penalty term to the loss function:

- L1: Loss = Original Loss + λ * Σ|coefficients|
- L2: Loss = Original Loss + λ * Σ(coefficients)²

Here, λ (lambda) is a regularization parameter that controls the strength of the penalty.
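Translated directly into NumPy (the helper names below are just for illustration and not part of any library), the two penalized losses look like this:

import numpy as np

def l1_penalized_loss(original_loss, coefficients, lam):
    # Loss = Original Loss + lambda * sum(|coefficients|)
    return original_loss + lam * np.sum(np.abs(coefficients))

def l2_penalized_loss(original_loss, coefficients, lam):
    # Loss = Original Loss + lambda * sum(coefficients^2)
    return original_loss + lam * np.sum(coefficients ** 2)

# A larger lambda adds a larger penalty for the same coefficients
w = np.array([0.5, -2.0, 3.0])
print(l1_penalized_loss(1.0, w, lam=0.1))  # 1.0 + 0.1 * 5.5   = 1.55
print(l2_penalized_loss(1.0, w, lam=0.1))  # 1.0 + 0.1 * 13.25 = 2.325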

When to Use Them and When Not To

Use L1 Regularization when:

  • You have a high number of features.
  • Feature selection is important.

Use L2 Regularization when:

  • You have multicollinearity in your data.
  • You want to prevent overfitting but keep all features.

Avoid regularization when:

  • The model shows no signs of overfitting (training and validation errors are already close).
  • The model is already simple, or you have domain knowledge confirming all features are important.
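Whichever penalty you choose, the strength λ (called alpha in scikit-learn) is usually tuned by cross-validation rather than set by hand. A minimal sketch, assuming the same synthetic data as above and scikit-learn’s built-in LassoCV and RidgeCV helpers:

import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

rng = np.random.RandomState(42)
X = rng.randn(200, 10)
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.randn(200) * 0.1

# Try a grid of penalty strengths and keep the one with the best cross-validation score
lasso_cv = LassoCV(alphas=[0.001, 0.01, 0.1, 1.0], cv=5).fit(X, y)
ridge_cv = RidgeCV(alphas=[0.001, 0.01, 0.1, 1.0], cv=5).fit(X, y)
print("Best alpha (L1):", lasso_cv.alpha_)
print("Best alpha (L2):", ridge_cv.alpha_)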

Why Sometimes the Model Performs Better Without Regularization

In some cases a model performs better without regularization, especially when the data is simple or the model is not prone to overfitting in the first place. Regularization deliberately introduces bias in exchange for lower variance; if the model was not overfitting to begin with, that added bias simply hurts accuracy, as the Python example below demonstrates.

Python Example

Let’s walk through a Python example that trains a small neural network three ways (no regularization, L1, and L2) on a noisy sin(πx) regression task and compares their test errors.

The code is also available as a Colab notebook.

import autograd.numpy as np
from autograd import grad
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Generate synthetic data
np.random.seed(0)
X = np.random.rand(100, 1) * 2 - 1 # Input data in the range [-1, 1]
y = np.sin(np.pi * X).squeeze() + np.random.randn(100) * 0.1 # sin(pi * x) with noise

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Initialize weights
def initialize_weights(input_size, hidden_size, output_size):
    W1 = np.random.randn(input_size, hidden_size)
    b1 = np.zeros(hidden_size)
    W2 = np.random.randn(hidden_size, output_size)
    b2 = np.zeros(output_size)
    return W1, b1, W2, b2

# Neural network forward pass
def forward(X, W1, b1, W2, b2):
    hidden = np.tanh(np.dot(X, W1) + b1)
    output = np.dot(hidden, W2) + b2
    return output

# Loss function with optional L1 and L2 regularization
def loss_function(W1, b1, W2, b2, X, y, l1_penalty=0, l2_penalty=0):
    y_pred = forward(X, W1, b1, W2, b2)
    mse = np.mean((y - y_pred.squeeze()) ** 2)
    l1_reg = l1_penalty * (np.sum(np.abs(W1)) + np.sum(np.abs(W2)))
    l2_reg = l2_penalty * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
    return mse + l1_reg + l2_reg

# Train the model
def fit(X, y, lr=0.01, epochs=500, hidden_size=10, l1_penalty=0, l2_penalty=0):
    input_size, output_size = X.shape[1], 1
    W1, b1, W2, b2 = initialize_weights(input_size, hidden_size, output_size)
    gradient_function = grad(loss_function, argnum=[0, 1, 2, 3])  # Compute gradients

    for epoch in range(epochs):
        grad_W1, grad_b1, grad_W2, grad_b2 = gradient_function(W1, b1, W2, b2, X, y, l1_penalty, l2_penalty)
        W1 -= lr * grad_W1
        b1 -= lr * grad_b1
        W2 -= lr * grad_W2
        b2 -= lr * grad_b2

    return W1, b1, W2, b2

# Train models
W1_no_reg, b1_no_reg, W2_no_reg, b2_no_reg = fit(X_train, y_train, hidden_size=10)
W1_l1, b1_l1, W2_l1, b2_l1 = fit(X_train, y_train, hidden_size=10, l1_penalty=0.01)
W1_l2, b1_l2, W2_l2, b2_l2 = fit(X_train, y_train, hidden_size=10, l2_penalty=0.01)

# Prediction function
def predict(X, W1, b1, W2, b2):
    return forward(X, W1, b1, W2, b2).squeeze()

# Predictions
y_pred_no_reg = predict(X_test, W1_no_reg, b1_no_reg, W2_no_reg, b2_no_reg)
y_pred_l1 = predict(X_test, W1_l1, b1_l1, W2_l1, b2_l1)
y_pred_l2 = predict(X_test, W1_l2, b1_l2, W2_l2, b2_l2)

# Calculate MSE for each model
mse_no_reg = np.mean((y_test - y_pred_no_reg) ** 2)
mse_l1 = np.mean((y_test - y_pred_l1) ** 2)
mse_l2 = np.mean((y_test - y_pred_l2) ** 2)

print("MSE without regularization:", mse_no_reg)
print("MSE with L1 regularization:", mse_l1)
print("MSE with L2 regularization:", mse_l2)

# Visualization
plt.scatter(X_test, y_test, color='black', label='Data')

# Sort X_test and corresponding predictions for proper line plotting
sorted_indices = np.argsort(X_test[:, 0])
X_test_sorted = X_test[sorted_indices]

# Plot predictions with lines
plt.plot(X_test_sorted, y_pred_no_reg[sorted_indices], color='green', linestyle='-', label='No Regularization')
plt.plot(X_test_sorted, y_pred_l1[sorted_indices], color='blue', linestyle='-', label='L1 Regularization')
plt.plot(X_test_sorted, y_pred_l2[sorted_indices], color='red', linestyle='-', label='L2 Regularization')

plt.legend()
plt.title("Comparison of Regularization Techniques in Neural Network")
plt.show()

Output:

MSE without regularization: 0.040118093684423727
MSE with L1 regularization: 0.10542759258095236
MSE with L2 regularization: 0.09386316314562934

Conclusion

In this example the unregularized network actually achieved the lowest test MSE: the target function is simple and the network is small, so the bias introduced by the penalties outweighed any reduction in variance, exactly the situation described earlier. Regularization is a powerful tool for fighting overfitting, but it’s not a magic wand. Experiment with different strengths of L1 and L2, and don’t forget other techniques like feature engineering and data augmentation. With the right approach, you can tame the overfitting beast and build models that generalize well beyond the training data.

References

https://jashrathod.github.io/2021-09-30-underfitting-overfitting-and-regularization/
