Explaining L1 and L2 regularization in machine learning

Fernando Jean Dijkinga, M.Sc.
10 min read · Jan 2, 2024


Introduction to Regularization

In the realm of machine learning, the term ‘regularization’ refers to a set of techniques designed to prevent a common problem known as overfitting. Overfitting occurs when a model becomes too closely attuned to the training data, capturing not only the underlying patterns but also the noise and anomalies specific to that dataset. While such a model performs exceedingly well on the training data, its ability to generalize to new, unseen data is significantly compromised. This is where regularization steps in, serving as a counterbalance to overfitting, ensuring that the model remains versatile and performs consistently across various datasets.

Regularization achieves this by imposing constraints on the model during the training process. These constraints prevent the model from becoming overly complex and intricate, which is often the root cause of overfitting. By simplifying the model in a controlled manner, regularization ensures that it captures the true patterns and relationships inherent in the data, enhancing its ability to generalize.

Understanding L1 Regularization (Lasso)

Mathematical Formulation

At the core of L1 regularization, also known as Lasso (Least Absolute Shrinkage and Selection Operator), is a simple yet powerful modification to the loss function used in a machine learning model. The standard loss function, which measures the difference between the predicted and actual values, is augmented by adding a penalty term. This penalty term is defined as the sum of the absolute values of the model coefficients, mathematically represented as:
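Loss = Lossₒᵣᵢ + λ ∑ᵢ |wᵢ|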

Here, Lossₒᵣᵢ represents the original loss function (like mean squared error in linear regression), λ is a hyperparameter that determines the strength of the regularization, and wᵢ denotes the coefficients of the model. The inclusion of the absolute value of the coefficients as a penalty encourages the model to keep these coefficients as small as possible.
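A minimal sketch of this formulation, with dummy values for the predictions and coefficients (all names below are illustrative and not part of the model built later in this article):

import tensorflow as tf

y_true = tf.constant([1.0, 2.0, 3.0])            # actual values (dummy data)
y_pred = tf.constant([1.1, 1.9, 3.2])            # model predictions (dummy data)
w = tf.constant([0.5, -1.2, 0.0, 3.0])           # model coefficients (dummy values)

lam = 0.001                                      # regularization strength λ
mse = tf.reduce_mean(tf.square(y_true - y_pred)) # original loss, Lossₒᵣᵢ
loss = mse + lam * tf.reduce_sum(tf.abs(w))      # L1-augmented loss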

Effect on Model

The primary consequence of L1 regularization is its tendency to drive some of the model coefficients to zero, effectively excluding certain features from the model. This phenomenon is particularly beneficial when dealing with high-dimensional data, where some features might be irrelevant or redundant. By pushing coefficients to zero, L1 regularization performs feature selection, simplifying the model and enhancing its interpretability.

This aspect of L1 regularization is crucial in scenarios where model simplicity and feature selection are as important as prediction accuracy. It helps in avoiding overfitting by constructing a model that maintains a balance between complexity and performance.
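As a rough illustration of this feature-selection effect, the sketch below fits scikit-learn's Lasso (used here only for illustration; the article's own examples use TensorFlow) to synthetic data in which only two of ten features matter:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                   # 10 features, only the first 2 are informative
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1)                         # alpha plays the role of λ
lasso.fit(X, y)
print(lasso.coef_)                               # coefficients of the irrelevant features come out as exactly 0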

Geometric Interpretation

Geometrically, L1 regularization can be visualized as imposing a constraint on the coefficient space of the model. Imagine the coefficients of a model plotted in a multi-dimensional space. The penalty term in L1 regularization confines these coefficients to lie within a diamond-shaped region (a square rotated 45 degrees in two dimensions, a cross-polytope in higher dimensions) centered at the origin.

The corners of this diamond-shaped constraint region lie on the axes of the coefficient space, and it is at these corners and edges that coefficients become exactly zero. As the regularization strength (the value of λ) increases, the size of this diamond shrinks, pulling more coefficients towards zero and thus promoting sparsity in the model. This geometric representation provides an intuitive understanding of how L1 regularization encourages sparsity and why it is effective for feature selection.

Understanding L2 Regularization (Ridge)

Mathematical Formulation

L2 regularization, commonly known as Ridge regression, introduces a different type of penalty to the loss function of a machine learning model compared to L1 regularization. In L2 regularization, the penalty term is the sum of the squares of the model coefficients. This is mathematically represented as:
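Loss = Lossₒᵣᵢ + λ ∑ᵢ wᵢ²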

In this formula, Lossₒᵣᵢ refers to the original loss function (such as mean squared error in linear regression), λ is the regularization parameter that controls the strength of the regularization effect, and wᵢ represents the coefficients of the model. The key aspect of the L2 penalty is the squaring of the coefficients, which tends to reduce their magnitude.
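The earlier L1 sketch carries over with a single change: the penalty term sums the squared coefficients instead of their absolute values (again with dummy values, purely for illustration):

import tensorflow as tf

w = tf.constant([0.5, -1.2, 0.0, 3.0])           # model coefficients (dummy values)
lam = 0.001                                      # regularization strength λ
mse = tf.constant(0.02)                          # stand-in for the original loss, Lossₒᵣᵢ
loss = mse + lam * tf.reduce_sum(tf.square(w))   # L2-augmented loss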

Effect on Model

The primary effect of L2 regularization is to shrink the coefficients towards zero, but unlike L1 regularization, it does not set them to zero. This shrinkage helps in reducing model complexity and preventing overfitting, particularly in situations where the dataset has highly correlated features. By penalizing the magnitude of the coefficients, L2 regularization ensures that the model does not become overly reliant on any single feature, thereby maintaining a balance in the contribution of all features.

This characteristic of L2 regularization is especially valuable when dealing with multicollinearity, as it helps in distributing the effect of correlated variables across multiple features, enhancing the model’s stability and performance.
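A small synthetic sketch of this behaviour, using scikit-learn's Ridge on two almost perfectly correlated features (illustrative only):

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.01, size=300)       # nearly a copy of x1 (multicollinearity)
X = np.column_stack([x1, x2])
y = 2.0 * x1 + rng.normal(scale=0.1, size=300)

ridge = Ridge(alpha=1.0)                         # alpha plays the role of λ
ridge.fit(X, y)
print(ridge.coef_)                               # both coefficients stay nonzero and roughly share the effect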

Geometric Interpretation

From a geometric standpoint, L2 regularization can be visualized as imposing a spherical (or circular in two dimensions) constraint on the coefficient space of the model: the coefficients are confined to a ball centered at the origin, with its radius determined by the strength of the regularization parameter λ.

Within this spherical constraint, the coefficients are allowed to vary. As the value of λ increases, the sphere’s radius decreases, pulling the coefficients closer to zero. However, due to the spherical nature of the constraint, the coefficients are shrunk uniformly towards zero and are not driven to zero entirely. This uniform shrinkage is the hallmark of L2 regularization and is key to its ability to handle multicollinearity and enhance model stability.

Comparing L1 and L2 Regularization

Sparsity

One of the most significant differences between L1 and L2 regularization lies in their impact on model sparsity. Sparsity refers to the number of feature coefficients that are reduced to zero, effectively removing them from the model. L1 regularization, with its absolute value penalty term, inherently promotes sparsity. This property is particularly useful in high-dimensional datasets where feature selection is crucial. By driving certain coefficients to zero, L1 regularization simplifies the model and enhances interpretability by identifying the most relevant features.

In contrast, L2 regularization does not inherently lead to sparsity. Due to its squared penalty term, L2 regularization shrinks the coefficients towards zero but typically does not set them to zero. This results in a model where all features are retained, albeit with reduced influence. L2 regularization is more about controlling model complexity and preventing overfitting through a balanced contribution of all features, rather than feature selection.

Solution Paths

The solution paths of L1 and L2 regularization, which represent the changes in coefficient values as the strength of regularization varies, also differ significantly. In L1 regularization, the solution path is piecewise linear, with coefficients hitting zero at certain values of the regularization parameter λ. This is because the L1 penalty forces coefficients to zero as λ increases.

In the case of L2 regularization, the solution path is smooth and continuous. As λ increases, the coefficients gradually shrink towards zero but do not abruptly become zero. This continuous shrinkage is due to the quadratic nature of the L2 penalty, which uniformly reduces the magnitude of all coefficients.
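One way to look at the L1 solution path is scikit-learn's lasso_path, sketched here on synthetic data (illustrative only): as the regularization strength grows, more coefficients sit at exactly zero.

import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

alphas, coefs, _ = lasso_path(X, y)              # coefs has shape (n_features, n_alphas)
print(np.sum(coefs == 0, axis=0))                # number of exactly-zero coefficients at each alpha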

Bias-Variance Tradeoff

Both L1 and L2 regularization techniques navigate the bias-variance tradeoff, but they do so differently. The bias-variance tradeoff is a fundamental concept in machine learning that describes the tradeoff between a model’s complexity (variance) and its accuracy in approximating the underlying data patterns (bias).

L1 regularization, by promoting sparsity, tends to have a higher bias but lower variance. The elimination of certain features (increasing bias) can lead to a simpler model that generalizes better to unseen data (lower variance).

L2 regularization, conversely, typically results in a lower bias but higher variance compared to L1. By shrinking coefficients uniformly and retaining all features, L2 regularization maintains a more detailed model (lower bias) but can lead to models that are more sensitive to fluctuations in training data (higher variance).

In summary, the choice between L1 and L2 regularization hinges on the specific needs of the dataset and the desired balance between bias and variance. While L1 is preferable for models requiring feature selection and simplicity, L2 is more suitable for models where retaining all features is important, and a balanced contribution from all variables is desired.

Implementing L1 and L2 Regularization

Implementing L1 and L2 regularization in TensorFlow involves adding the regularization terms to the loss function manually or using built-in functionalities. Here’s a basic guide:

TensorFlow Implementation

TensorFlow, through its Keras API, provides a straightforward way to add L1, L2, or a combination of both (l1_l2) regularization to layers.

L1 regularization:

import tensorflow as tf
from tensorflow.keras import regularizers
from tensorflow.keras.layers import Dense, PReLU
from tensorflow.keras.models import Sequential

def ArtificialNeuralNetwork(x, y, x_val, y_val):
    epochs = 100
    verbose = 1
    batch_size = 32

    adam = tf.keras.optimizers.legacy.Adam(
        learning_rate=0.01,
        beta_1=0.9,
        beta_2=0.999,
        epsilon=1e-07,
        amsgrad=False,
        name='Adam')

    model = Sequential()
    model.add(Dense(units=x.shape[1], activation='linear', input_shape=[x.shape[1]]))
    model.add(Dense(units=8, activation='swish', kernel_regularizer=regularizers.l1(0.001)))
    # PReLU is a layer rather than a string activation, so it is added as its own layer
    model.add(Dense(units=8, kernel_regularizer=regularizers.l1(0.001)))
    model.add(PReLU())
    model.add(Dense(units=1, activation='linear'))

    model.compile(loss='mse', optimizer=adam, metrics=['mae'])
    # validation data is passed in explicitly so the function is self-contained
    hist = model.fit(x, y, epochs=epochs, verbose=verbose, batch_size=batch_size,
                     validation_data=(x_val, y_val))

    return model, hist

Here, within the Sequential model, when adding the Dense layers, we can include L1 regularization and set its coefficient through the kernel_regularizer=regularizers.l1() argument.

L2 regularization:

import tensorflow as tf
from tensorflow.keras import regularizers
from tensorflow.keras.layers import Dense, PReLU
from tensorflow.keras.models import Sequential

def ArtificialNeuralNetwork(x, y, x_val, y_val):
    epochs = 100
    verbose = 1
    batch_size = 32

    adam = tf.keras.optimizers.legacy.Adam(
        learning_rate=0.01,
        beta_1=0.9,
        beta_2=0.999,
        epsilon=1e-07,
        amsgrad=False,
        name='Adam')

    model = Sequential()
    model.add(Dense(units=x.shape[1], activation='linear', input_shape=[x.shape[1]]))
    model.add(Dense(units=8, activation='swish', kernel_regularizer=regularizers.l2(0.001)))
    # PReLU is a layer rather than a string activation, so it is added as its own layer
    model.add(Dense(units=8, kernel_regularizer=regularizers.l2(0.001)))
    model.add(PReLU())
    model.add(Dense(units=1, activation='linear'))

    model.compile(loss='mse', optimizer=adam, metrics=['mae'])
    # validation data is passed in explicitly so the function is self-contained
    hist = model.fit(x, y, epochs=epochs, verbose=verbose, batch_size=batch_size,
                     validation_data=(x_val, y_val))

    return model, hist

The same is done for L2 regularization; the only change is using kernel_regularizer=regularizers.l2() instead.

Advanced Topics

Elastic Net Regularization

Elastic Net regularization is a sophisticated approach that combines the properties of both L1 (Lasso) and L2 (Ridge) regularization. It is particularly useful in situations where you want to harness the advantages of both L1 and L2. The Elastic Net penalty is a linear combination of the L1 and L2 penalties and is defined as:
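Loss = Lossₒᵣᵢ + λ ( α ∑ᵢ |wᵢ| + (1 − α) ∑ᵢ wᵢ² )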

Here, λ is the overall regularization strength, and α is a mixing parameter between L1 and L2 (with α = 1 being L1, and α = 0 being L2).

Elastic Net is particularly beneficial in scenarios where there are multiple correlated features. L1 alone might arbitrarily select one feature among the correlated ones, while L2 would give all of them moderate weights. Elastic Net, however, can maintain the group effect by either selecting all correlated features or none, offering a balanced approach.
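In Keras, a comparable combined penalty is available through regularizers.l1_l2. Note that it takes separate l1 and l2 strengths rather than a single λ with a mixing parameter α, so the mapping to the formula above holds only up to reparameterization. A minimal sketch of a single layer:

from tensorflow.keras import regularizers
from tensorflow.keras.layers import Dense

layer = Dense(
    units=8,
    activation='swish',
    kernel_regularizer=regularizers.l1_l2(l1=0.001, l2=0.001),  # both penalties applied to the layer's weights
)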

Numerical Stability and Optimization

Regularization techniques also have implications for numerical stability and optimization of machine learning algorithms. Regularization can improve the conditioning of the problem, making optimization algorithms more stable and efficient.

  • L1 Regularization can lead to sparse solutions, which can be beneficial in terms of computation and memory usage, especially in high-dimensional data. However, the non-differentiability at zero can pose challenges for gradient-based optimization methods.
  • L2 Regularization adds quadratic terms to the loss function, which tends to make the optimization landscape smoother. This can lead to more stable convergence in gradient descent algorithms.

Advanced optimization techniques, such as quasi-Newton methods or stochastic gradient descent, often incorporate modifications to handle the peculiarities introduced by L1 and L2 penalties.
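For instance, proximal methods such as ISTA handle the L1 penalty's non-differentiability at zero with a soft-thresholding step. A minimal NumPy sketch of that operator (illustrative only, not tied to the TensorFlow code above):

import numpy as np

def soft_threshold(w, threshold):
    # Proximal operator of the L1 penalty: shrink towards zero and clip small values to exactly 0
    return np.sign(w) * np.maximum(np.abs(w) - threshold, 0.0)

print(soft_threshold(np.array([0.8, -0.05, 0.2]), 0.1))  # -> [ 0.7 -0.   0.1]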

Robustness to Outliers

The robustness of a model to outliers is an important consideration in machine learning. Regularization can influence this aspect to a certain extent.

  • L1 Regularization is generally more robust to outliers compared to L2. Since L1 regularization drives some coefficients to zero, it can inherently diminish the impact of outliers in those dimensions.
  • L2 Regularization, on the other hand, tends to be less robust to outliers. The squaring of coefficients in L2 can amplify the effect of outliers, potentially leading to skewed models.

The choice between L1 and L2 regularization, or a combination like Elastic Net, depends on the nature of your data and the specific requirements of your machine learning task. Understanding these nuances is key to building models that are not only accurate but also reliable and robust in various scenarios.

Practical Considerations

Choosing Between L1 and L2

The choice between L1 and L2 regularization hinges on the specific characteristics of your dataset and the objectives of your machine learning model:

Use L1 Regularization (Lasso) when:

  • You have high-dimensional data with many features, but you suspect only a few are actually important.
  • Feature selection is crucial, as L1 can drive the coefficients of irrelevant features to zero.
  • The model needs to be interpretable, and you want to identify the most significant features.

Use L2 Regularization (Ridge) when:

  • You deal with multicollinearity (high correlation among features) in your dataset.
  • You have a smaller number of features, and you expect all of them to influence the output.
  • You prioritize prediction accuracy over interpretability.

In some cases, Elastic Net, which combines L1 and L2, can be the best choice, especially when you need the benefits of both.

Conclusion

Key Takeaways

Regularization techniques like L1 and L2 play a crucial role in the development of robust machine learning models. They help in preventing overfitting, enabling models to generalize better to unseen data. The choice between L1 and L2 regularization depends on the dataset characteristics and the specific requirements of the problem. L1 regularization is key for feature selection and model interpretability, while L2 regularization is crucial for handling multicollinearity and improving model stability. Elastic Net offers a middle ground, combining the advantages of both L1 and L2.

Future Directions

Looking ahead, the field of regularization in machine learning is evolving, with research focusing on:

  • Adaptive Regularization: Techniques that adjust the regularization strength dynamically during training.
  • Regularization in Unsupervised and Semi-Supervised Learning: Exploring how regularization can be effectively applied in scenarios where labeled data is scarce.
  • Integration with Advanced Machine Learning Techniques: Including regularization in complex models like deep learning and reinforcement learning.
  • Domain-Specific Regularization: Developing regularization methods tailored to specific application domains like computer vision or natural language processing.

The ongoing advancements in regularization techniques will continue to shape the landscape of machine learning, making models more accurate, interpretable, and robust.

Please, don’t forget to tip! =)


Fernando Jean Dijkinga, M.Sc.

Ph.D student in animal breeding and genetics, specialist in data science and artificial intelligence