Understanding L1 and L2 Regularization: The Guardians Against Overfitting

Introduction

Machine learning models often face the problem of overfitting, where a model performs well on training data but poorly on unseen data. Regularization techniques, such as L1 and L2 regularization, are essential tools to prevent overfitting by penalizing large coefficients in the model. This comprehensive guide will delve into the mathematical formulations of L1 and L2 regularization, their differences, and practical implementation in Python.

What is Regularization?

Regularization is a technique used to enhance the generalization ability of machine learning models by adding a penalty term to the loss function. This penalty discourages the model from fitting the training data too closely, thus improving performance on new, unseen data.
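
To make this concrete, here is a minimal sketch (with made-up numbers, not tied to any particular model or library) of what "adding a penalty term to the loss" looks like:

import numpy as np

# Purely illustrative: a regularized loss is the ordinary data-fit term
# plus lambda times a penalty on the coefficient vector theta.
def regularized_loss(residuals, theta, lam, penalty):
    data_loss = np.sum(residuals ** 2)       # how well the model fits the training data
    return data_loss + lam * penalty(theta)  # penalty discourages extreme coefficients

theta = np.array([0.5, -2.0, 0.0, 3.0])      # hypothetical coefficients
residuals = np.array([0.1, -0.4, 0.2])       # hypothetical training errors
print(regularized_loss(residuals, theta, lam=0.1, penalty=lambda t: np.sum(np.abs(t))))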

L1 Regularization (Lasso)

Mathematical Formulation

L1 regularization, also known as Lasso (Least Absolute Shrinkage and Selection Operator), adds a penalty equal to the sum of the absolute values of the coefficients. The objective function for L1 regularization can be written as:

J(θ) = Σᵢ (yᵢ − Σⱼ θⱼxᵢⱼ)² + λ Σⱼ |θⱼ|

where the outer sum runs over the n training examples, the inner sums run over the p coefficients, and:

  • J(θ) is the regularized cost function.
  • yᵢ is the actual output for the i-th training example.
  • xᵢⱼ is the j-th input feature of the i-th example.
  • θⱼ are the model parameters (coefficients).
  • λ is the regularization parameter (hyperparameter) controlling the amount of regularization.
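
To make the formula concrete, here is a tiny NumPy sketch (with made-up numbers) that evaluates J(θ) directly. For reference, scikit-learn's Lasso minimizes the same kind of objective but scales the squared-error term by 1/(2·n_samples).

import numpy as np

# Toy evaluation of the Lasso objective J(theta) written above (values are made up)
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])        # 3 examples, 2 features
y = np.array([1.0, 2.0, 3.0])     # targets
theta = np.array([0.3, 0.1])      # hypothetical coefficients
lam = 0.5                         # regularization strength (lambda)

residuals = y - X @ theta
J_lasso = np.sum(residuals ** 2) + lam * np.sum(np.abs(theta))
print("Lasso objective J(theta):", J_lasso)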

Key Characteristics

  1. Sparse Solutions: L1 regularization tends to produce sparse models, meaning many coefficients are exactly zero, leading to feature selection.
  2. Feature Selection: It is beneficial when we have a large number of features, as it can automatically select the most relevant ones.

L2 Regularization (Ridge)

Mathematical Formulation

L2 regularization, also known as Ridge regression, adds a penalty equal to the sum of the squares of the coefficients. The objective function for L2 regularization can be written as:

J(θ) = Σᵢ (yᵢ − Σⱼ θⱼxᵢⱼ)² + λ Σⱼ θⱼ²

where:

  • The terms are the same as in L1 regularization, except that the penalty is the sum of squared coefficients instead of the sum of their absolute values.
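
A matching sketch (same made-up data as the Lasso example) differs only in the penalty term. For reference, scikit-learn's Ridge minimizes ||y − Xw||² + α·||w||².

import numpy as np

# Toy evaluation of the Ridge objective: only the penalty term changes
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
theta = np.array([0.3, 0.1])
lam = 0.5

J_ridge = np.sum((y - X @ theta) ** 2) + lam * np.sum(theta ** 2)
print("Ridge objective J(theta):", J_ridge)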

Key Characteristics

  1. Non-Sparse Solutions: L2 regularization tends to produce models where all coefficients are small but non-zero.
  2. Smooth Solutions: Because the penalty grows quadratically, it punishes large coefficients much more heavily than small ones, shrinking all weights toward zero without eliminating any and yielding smoother, more stable models.
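
A quick way to see the second point is to compare how the two penalties grow with the size of a coefficient (toy values only):

import numpy as np

# The L1 penalty grows linearly with a coefficient, the L2 penalty quadratically,
# so L2 punishes large coefficients disproportionately.
coefs = np.array([0.1, 1.0, 5.0, 10.0])   # hypothetical coefficient magnitudes
print("L1 penalty per coefficient (|theta|): ", np.abs(coefs))
print("L2 penalty per coefficient (theta**2):", coefs ** 2)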

Practical Implementation in Python

We will use the scikit-learn library in Python to implement both L1 and L2 regularization.

L1 Regularization (Lasso)

First, let’s implement L1 regularization using the Lasso regression model from scikit-learn.

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Generating synthetic data
np.random.seed(0)
X = np.random.randn(100, 10)
y = X.dot(np.random.randn(10)) + np.random.randn(100)

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Creating and fitting the Lasso regression model
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)

# Predicting and evaluating the model
y_pred = lasso.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Lasso Regression Mean Squared Error: {mse}")

# Displaying the coefficients
print(f"Lasso Regression Coefficients: {lasso.coef_}")

L2 Regularization (Ridge)

Now, let’s implement L2 regularization using the Ridge regression model from scikit-learn.

from sklearn.linear_model import Ridge

# Creating and fitting the Ridge regression model
ridge = Ridge(alpha=0.1)
ridge.fit(X_train, y_train)

# Predicting and evaluating the model
y_pred = ridge.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Ridge Regression Mean Squared Error: {mse}")

# Displaying the coefficients
print(f"Ridge Regression Coefficients: {ridge.coef_}")

Tuning Regularization Parameters

Choosing the right value for the regularization parameter λ is crucial: too small a value and the model can still overfit, too large and it underfits. A common way to choose it is cross-validation. (In scikit-learn, λ is exposed as the alpha argument of Lasso and Ridge.)

Example: Cross-Validation for Lasso

from sklearn.model_selection import GridSearchCV

# Defining the parameter grid
param_grid = {'alpha': [0.01, 0.1, 1, 10, 100]}

# Setting up GridSearchCV for Lasso
grid_search = GridSearchCV(Lasso(), param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

# Best parameter and score
print(f"Best alpha: {grid_search.best_params_['alpha']}")
print(f"Best cross-validated score: {-grid_search.best_score_}")

Example: Cross-Validation for Ridge

# Setting up GridSearchCV for Ridge
grid_search = GridSearchCV(Ridge(), param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

# Best parameter and score
print(f"Best alpha: {grid_search.best_params_['alpha']}")
print(f"Best cross-validated score: {-grid_search.best_score_}")

Conclusion

Regularization is a powerful technique to prevent overfitting in machine learning models. L1 regularization (Lasso) is useful for feature selection by producing sparse models, while L2 regularization (Ridge) produces smoother models with non-zero coefficients. Understanding and implementing these techniques can significantly improve the performance of your machine learning models.

Further Reading

  1. “An Introduction to Statistical Learning” by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani.
  2. “Pattern Recognition and Machine Learning” by Christopher Bishop.
  3. Scikit-learn documentation: Lasso, Ridge.

By understanding and applying L1 and L2 regularization, you can make your machine learning models more robust and generalizable, ultimately leading to better performance on unseen data.

Happy coding!
