Regularization techniques in Deep Learning

Pierre-Emmanuel Saint-Mézard
6 min read · May 24, 2024

--

What is regularization? Why use it? How does it work?

Regularization is a technique used in machine learning and statistical modeling to prevent overfitting and improve the generalization ability of models. Overfitting occurs when a model learns the training data too well, including its noise and outliers, which results in poor performance on new, unseen data.

Regularization introduces additional constraints or penalties to the model during the training process, aiming to control the complexity of the model and avoid over-reliance on specific features or patterns in the training data.

This mirrors teaching students to apply what they have learned to both familiar and new problems, rather than simply recalling memorized answers. The goal of regularization is to encourage models to learn the broader patterns in the data rather than memorizing it.

Common Regularization techniques

L1 regularization

L1 regularization, also known as Lasso Regression, is a technique used in deep learning to prevent overfitting by adding a “penalty term” to the loss function. This penalty term is proportional to the sum of the absolute values of the model’s weights. The primary goal of L1 regularization is to encourage sparsity in the model, which means that it drives some of the weights to zero, effectively performing feature selection.

“Sparsity”: the presence of a large number of zero (or near-zero) values among the parameters or weights of a model, which helps reduce its computational complexity and memory requirements. Sparse models can achieve faster inference and training times by skipping computations involving zero-valued weights or activations.

L1 regularization increases bias but reduces variance. This tradeoff can be beneficial in scenarios where a simpler model is preferred, and overfitting needs to be kept in check.
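
As a rough illustration, here is a minimal sketch of adding an L1 penalty to a training loss by hand. It assumes PyTorch, and the model, data, and the coefficient l1_lambda are placeholders chosen only for the example:

```python
import torch
import torch.nn as nn

# Placeholder model and data, just to keep the example self-contained.
model = nn.Linear(20, 1)
criterion = nn.MSELoss()
l1_lambda = 1e-4                      # regularization strength (hyperparameter)

inputs = torch.randn(32, 20)
targets = torch.randn(32, 1)

data_loss = criterion(model(inputs), targets)

# L1 penalty: sum of the absolute values of the parameters
# (in practice you may want to exclude bias terms).
l1_penalty = sum(p.abs().sum() for p in model.parameters())

loss = data_loss + l1_lambda * l1_penalty
loss.backward()                       # gradients now include the sparsity-inducing term
```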

Advantages:

  • Performs feature selection by driving some coefficients to zero
  • Helps when dealing with high-dimensional datasets
  • Can handle irrelevant or less important features

Disadvantages:

  • Can lead to overly sparse models, discarding features that may carry useful information
  • Not effective when there are strong correlations between features
  • Computationally more expensive than L2 regularization

L2 regularization

L2 regularization, also known as Ridge regularization, is a similar technique used to prevent overfitting by adding a penalty term to the loss function. This penalty term is proportional to the sum of the squared weights of the model. Unlike L1 regularization, which encourages sparsity by driving some weights to zero, L2 regularization tends to shrink the weights but does not eliminate them entirely.

Like L1 regularization, L2 regularization increases bias but reduces variance. L2 regularization is also known as weight decay because it penalizes large weights, encouraging the model to distribute the weights more evenly. This helps in preventing a particular weight from dominating the model.
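
Here is a minimal sketch of L2 regularization applied through the optimizer, again assuming PyTorch; the weight_decay value and the placeholder model and data are illustrative only:

```python
import torch
import torch.nn as nn

model = nn.Linear(20, 1)
criterion = nn.MSELoss()

# weight_decay is the L2 coefficient; for plain SGD it adds
# weight_decay * w to each weight's gradient at every step.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

inputs = torch.randn(32, 20)
targets = torch.randn(32, 1)

optimizer.zero_grad()
loss = criterion(model(inputs), targets)
loss.backward()
optimizer.step()    # weights are shrunk toward zero, but not driven exactly to zero

# Explicit alternative: add the squared-weight penalty to the loss yourself, e.g.
# loss = data_loss + l2_lambda * sum(p.pow(2).sum() for p in model.parameters())
```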

Advantages:

  • Retains all features, improving model stability (shrinks coefficients without setting them to zero)
  • Easier to optimize due to differentiability
  • Reduces model complexity and prevents overfitting

Disadvantages:

  • Does not perform feature selection (doesn’t drive weights to zero)
  • More sensitive to outliers since it squares the weights
  • Adds some computational cost compared to an unregularized model

Dropout

Dropout regularization is a technique used to prevent overfitting in deep learning models by randomly “dropping out” a subset of neurons during the training process, i.e. setting their outputs to zero. This method helps the model generalize better to new, unseen data by reducing its reliance on specific neurons and their connections.

[Figure: Dropout regularization]

During training, each neuron in the network is either retained with a probability p or dropped out with a probability 1−p. The probability p is a hyperparameter that can be tuned to achieve the desired level of regularization. Typically, the dropout rate 1−p is set between 0.2 and 0.5.

Training and Testing Phases:

  • Training Phase: During the forward pass, the network computes the output using only the remaining neurons. During the backward pass, gradients are computed only for these neurons, and the weights of the dropped-out neurons are not updated.
  • Testing Phase: Dropout is turned off, and the full network is used to compute the output. To keep the expected value of the output the same during training and testing, the weights are scaled by the retention probability p at test time (modern implementations instead scale the kept activations by 1/p during training, known as inverted dropout; see the sketch below).
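
A minimal sketch, assuming PyTorch: note that nn.Dropout takes the drop probability (1−p in the notation above) and implements inverted dropout, so no rescaling is needed at test time. The layer sizes are placeholders:

```python
import torch
import torch.nn as nn

# Dropout layer between the hidden layer and the output layer;
# nn.Dropout takes the *drop* probability (here 0.5).
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(256, 10),
)

x = torch.randn(8, 784)

model.train()              # training phase: hidden units are randomly zeroed
train_out = model(x)

model.eval()               # testing phase: dropout is disabled, full network is used
with torch.no_grad():
    test_out = model(x)
```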

Advantages of Dropout regularization:

  • Prevents the network from becoming overly reliant on specific neurons and their connections (reduces overfitting).
  • Can be thought of as training an ensemble of several neural networks with different sets of neurons randomly dropped out. This improves the model’s ability to generalize to unseen data.
  • Introduces noise during training, which acts as a form of data perturbation and improves the model’s robustness.

Disadvantages:

  • Increases training duration due to the random dropout of units in hidden layers.
  • When the network is small relative to the dataset, regularization can be unnecessary.
  • Adds extra hyperparameters like dropout probability and learning rate, requiring more testing and tuning.
  • Possible redundancy with Batch Normalization: evaluating model performance with and without dropout when using Batch Norm can help determine whether dropout is necessary.

Data Augmentation

This technique involves creating new training samples by applying various transformations (mirroring, rotation, translation, scaling, adding noise…) to the existing data. This process effectively increases the size and diversity of the training dataset, which helps the model to generalize better to new data.
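
For instance, here is a minimal sketch of an image augmentation pipeline using torchvision transforms; the specific transformations and their parameters are illustrative, not prescriptive:

```python
from torchvision import transforms

# Each transformation is applied randomly on the fly, so every epoch
# sees slightly different versions of the same images.
train_transforms = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                  # mirroring
    transforms.RandomRotation(degrees=15),                   # rotation
    transforms.RandomAffine(degrees=0,
                            translate=(0.1, 0.1),
                            scale=(0.9, 1.1)),               # translation and scaling
    transforms.ColorJitter(brightness=0.2, contrast=0.2),    # mild photometric noise
    transforms.ToTensor(),
])

# Typically passed to a dataset, e.g.:
# dataset = torchvision.datasets.ImageFolder("path/to/train", transform=train_transforms)
```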

[Figure: Data augmentation for image recognition]

Advantages:

  • Artificially increases the variability in the training data.
  • When data is scarce, it allows models to be trained more effectively even with limited data.
  • Reflects plausible variations of real-world objects, increasing the model’s robustness.

Disadvantages:

  • Increases training time and resource requirements.
  • Unrealistic transformations can cause the model to learn from distorted or irrelevant patterns, degrading its performance.
  • Domain-specific challenges: augmentation must be applied cautiously to preserve data integrity, e.g. avoiding transformations that compromise diagnostic accuracy in medical imaging or produce incoherent, contextually irrelevant text in NLP.
  • Higher dependency on initial data quality.

Early Stopping

Early stopping prevents overfitting by halting the training process when the model’s performance on a validation set starts to degrade. This method leverages the fact that overfitting typically occurs after a certain number of training iterations, even if the training error continues to decrease.

During training, the model’s performance is periodically evaluated on a separate validation set. If the validation error stops improving and begins to increase, it indicates that the model is starting to overfit the training data. Training is stopped at that point, and the weights from the epoch with the best validation performance are retained.

Early stopping can be used in conjunction with other regularization techniques such as dropout, L1/L2 regularization, and data augmentation.

[Figure: Early stopping]

Patience: the patience parameter is often used to determine how many epochs to wait for an improvement in validation performance before stopping the training. This helps to avoid stopping too early due to temporary fluctuations in validation error.
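
Here is a minimal sketch of an early-stopping loop with a patience counter. It assumes a PyTorch-style model exposing state_dict()/load_state_dict(), and the train_one_epoch and evaluate functions are placeholders for your own training and validation steps:

```python
import copy

def train_with_early_stopping(model, train_one_epoch, evaluate,
                              max_epochs=100, patience=5):
    best_val_loss = float("inf")
    best_weights = copy.deepcopy(model.state_dict())
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch(model)
        val_loss = evaluate(model)           # loss on the held-out validation set

        if val_loss < best_val_loss:
            best_val_loss = val_loss
            best_weights = copy.deepcopy(model.state_dict())
            epochs_without_improvement = 0   # improvement found: reset the counter
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                        # validation error stopped improving

    # Restore the weights from the epoch with the best validation performance.
    model.load_state_dict(best_weights)
    return model
```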

Advantages:

  • Stopping the training process at the right moment prevents the model from learning noise and details specific to the training data (overfitting), improving generalization.
  • Early stopping can make the training process more efficient by avoiding unnecessary iterations after optimal performance is reached.
  • A relatively simple and non-intrusive method that does not require changes to the model architecture or the loss function.

Disadvantages:

  • Possible underfitting (stopping too early).
  • Finding the right patience parameter can be challenging.
  • Relies on a held-out validation set and is mainly suited to models trained iteratively (such as neural networks), so it may not transfer directly to other types of models.
