When and How to Use Regularization in Deep Learning

Vanshika Bansal
Published in SNU.ai
Aug 2, 2020

Introduction:

The key role of regularization in deep learning models is to reduce overfitting. It simplifies the network so that it generalizes to data points never encountered before, which helps reduce the testing error when the model performs well only on the training set.

Before learning about regularization, let’s briefly look at the different scenarios where it can be helpful.

Identifying causes of errors:

A simple model might fail to perform well even on the training data, while a complex model may succeed in fitting the training points close to the actual function. However, the ultimate goal of any model is to perform well on unseen data. The two main error-causing scenarios are:

Underfitting:

A statistical model or an algorithm is said to have underfitted when it cannot capture the underlying trend of the data. The model may be too simple or too biased to follow the data’s trend.

It usually happens when we try to fit a linear model to non-linear data. In such cases the rules the model learns are too simple, and it will probably make a lot of wrong predictions.

Therefore, Underfitting → High Bias and Low Variance.

Techniques to reduce underfitting:
1. Increase model complexity
2. Increase number of features or perform feature engineering
3. Increase the duration of training
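The first remedy is easy to see on toy data. The sketch below (with made-up quadratic data) shows how an underfitting linear model improves once the model's complexity is increased to match the data's trend:

```python
import numpy as np

# Hypothetical non-linear data: y = x^2 plus mild noise.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
y = x**2 + rng.normal(0, 0.1, size=x.shape)

def train_mse(degree):
    # Fit a polynomial of the given degree and return its training MSE.
    coeffs = np.polyfit(x, y, degree)
    preds = np.polyval(coeffs, x)
    return np.mean((preds - y) ** 2)

mse_linear = train_mse(1)  # a straight line underfits the quadratic trend
mse_quad = train_mse(2)    # a quadratic matches the underlying trend

print(mse_linear > mse_quad)  # the richer model fits the training data far better
```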

Overfitting:

A statistical model or an algorithm is said to have overfitted when it starts learning the noise and inaccuracies in the training data, to the point that even minute details are recorded.

Overfitting is generally caused by non-parametric and non-linear methods: these algorithms have more flexibility in fitting the dataset and can therefore sometimes build unrealistic models. As a result, such models perform poorly on the testing data.

Therefore, Overfitting → Low Bias and High Variance.

Techniques to reduce overfitting:
1. Increase training data or perform data augmentation.
2. Reduce model complexity.
3. Early stopping during the training phase, based on the validation loss.
4. Regularization
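Of these, early stopping is simple enough to sketch directly. Below is a minimal, framework-agnostic version (the `train_one_epoch` and `validation_loss` callables are hypothetical placeholders for a real training loop):

```python
# A minimal early-stopping sketch: stop once the validation loss has not
# improved for `patience` consecutive epochs.
def train_with_early_stopping(train_one_epoch, validation_loss,
                              max_epochs=100, patience=5):
    best_val, stale_epochs = float("inf"), 0
    for _ in range(max_epochs):
        train_one_epoch()
        val = validation_loss()
        if val < best_val:            # validation loss improved
            best_val, stale_epochs = val, 0
        else:                         # no improvement this epoch
            stale_epochs += 1
        if stale_epochs >= patience:  # stop before overfitting worsens
            break
    return best_val

# Simulated validation losses that improve, then degrade (overfitting).
losses = iter([1.0, 0.8, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95])
best = train_with_early_stopping(lambda: None, lambda: next(losses),
                                 max_epochs=8, patience=3)
print(best)   # training stops once the loss stops improving
```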

[Image credits]

The above image shows the conditions under which underfitting, optimal (just-right) fitting, and overfitting occur. The goal is to train a model such that its results fall in the just-right scenario, with bias and variance in balance.

The key to any model-training approach is to inspect the training trends regularly, identifying the different bias-variance scenarios with the help of a validation dataset.

The following table summarizes the intuitions behind this.

The ultimate objective of any model is to make training error small (reduces underfitting) while keeping the testing error close to it (reduces overfitting).

This requires appropriate selection of the algorithms and features to be used, leading us to the Occam’s Razor principle, which states: ‘among competing hypotheses that explain the known observations equally well, select the simplest one.’

In order to make the model better, we tend to over-explore the features, which can cause wrong fits and generally unsatisfying results. Instead, the focus should be on simplicity while exploring features and algorithms.

Since this blog focuses on regularization, if you would like to learn more about the bias-variance tradeoff, I recommend going through this article.

How Regularization reduces Overfitting:

Since deep learning deals with highly complex models, it is easy for it to overfit the training data. Even when the model performs well on training data, the testing error can be quite large resulting in high variance.

[Image credits] The increasing test error indicates overfitting

Consider training a neural network with cost function J denoted as:

J(w, b) = (1/m) Σ_{i=1..m} L(y’(i), y(i))    (cost for a logistic regression example)

where w and b are the weights and bias respectively,

y’ = predicted label,

y = actual label, and

m = number of training samples.

We add a regularization term to this function so that it penalizes the weight matrices of the nodes within the network:

J(w, b) = (1/m) Σ_{i=1..m} L(y’(i), y(i)) + (λ/2m) Σ_l ||w[l]||²

where λ = regularization coefficient.

The gradient-descent update of the weight w for each layer then becomes:

w := w − α(dw + (λ/m)·w) = (1 − αλ/m)·w − α·dw

where α is the learning rate and dw is the gradient of the unregularized cost, so every update additionally shrinks w by the factor (1 − αλ/m).

In this way, the regularization term drives some of the weight matrices toward zero, reducing their impact. As a result, the network becomes much simpler and the chances of overfitting the training data are reduced, since different nodes are suppressed during training. The coefficient λ needs to be optimized according to the performance on the validation set to obtain a well-fitted model.
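This weight-decay effect can be sketched in a few lines of NumPy (a simplified illustration; `alpha`, `lam`, and `m` are assumed hyperparameter values, and `dw` stands for the gradient of the unregularized cost):

```python
import numpy as np

def l2_update(w, dw, alpha=0.1, lam=1.0, m=10):
    # w := w - alpha * (dw + (lam/m) * w)  =  (1 - alpha*lam/m) * w - alpha * dw
    return (1 - alpha * lam / m) * w - alpha * dw

w = np.array([2.0, -3.0])
dw = np.zeros_like(w)       # zero data-gradient isolates the decay effect
for _ in range(100):
    w = l2_update(w, dw)

print(np.abs(w))            # the weights have shrunk toward zero
```

With no data gradient, each step multiplies the weights by (1 − αλ/m) = 0.99, so after 100 steps they have decayed to about 37% of their initial magnitude.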

[Image credits] Reduction in number of nodes makes the network simpler

Another intuition lies in the activation function of the output layer of a network. Since the weights w tend to be smaller because of regularization, the pre-activation z is given by:

z = w·a + b

where a is the activation from the previous layer.

Hence, z also becomes small. Any activation function like sigmoid(z) or tanh(z) then has a better chance of operating within its linear range. This makes the otherwise complex function behave more linearly, reducing overfitting.

An example of tanh(z) function is shown below.

[Image credits] tanh(z) tends to end up in encircled range
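This near-linearity is easy to verify numerically (a small sketch; the sample z values are arbitrary):

```python
import numpy as np

small_z = np.array([-0.1, 0.05, 0.1])  # small pre-activations from small weights
large_z = np.array([-3.0, 3.0])        # large pre-activations

# Near zero, tanh is approximately the identity: tanh(z) ≈ z.
print(np.max(np.abs(np.tanh(small_z) - small_z)))  # tiny deviation from linear
print(np.abs(np.tanh(large_z)))                    # saturated near 1
```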

Common Regularization Techniques:

Now that we know how regularization is helpful to reduce overfitting, let us understand about the most common and effective practices.

1. L1 and L2 Regularization:

When we have a large number of features, the model’s tendency to overfit can increase, along with the computational complexity.

Two powerful techniques called Ridge (performs L2 regularization) and Lasso (performs L1 regularization) regression are performed to bring down the Cost function.

a) Ridge Regression for L2 Regularization: It penalizes variables whose coefficients are too far from zero, decreasing model complexity while keeping all variables in the model.

[Image credits to owner] Ridge Regression

The red points in the above image correspond to the training set. The model represented by the red curve fits these points, but it clearly will not perform very well on the testing data (green points).

So, ridge regression helps find the optimal model, represented by the blue curve, that reduces overfitting on the training set by introducing bias.

This bias is known as the ridge regression penalty = (λ * slope²)

Here slope² stands for the sum of the squared slopes (coefficients) of all the input variables; only the y (output) intercept is excluded from the penalty.

b) Lasso Regression (Least Absolute Shrinkage and Selection Operator) for L1 Regularization: Previously, in ridge regression, the bias was increased in order to decrease the variance using squared slopes.

In lasso regression we add the absolute value of the slope, |slope|, instead of the squared slope to introduce a small amount of bias, represented by the orange curve in the image below. This small bias pays off as reduced variance on unseen data.

Therefore, lasso regression penalty = (λ * |slope|)

[Image credits to owner] Lasso Regression
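The difference between the two penalties is clearest in one dimension, where both regularized problems have well-known closed-form solutions (a sketch; `a` plays the role of the unregularized slope estimate and λ = 1):

```python
import numpy as np

def ridge_solution(a, lam):
    # argmin_w (w - a)^2 / 2 + (lam/2) * w^2  ->  shrinks, never exactly zero
    return a / (1 + lam)

def lasso_solution(a, lam):
    # argmin_w (w - a)^2 / 2 + lam * |w|  ->  soft threshold, exact zeros
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

a = np.array([3.0, 0.3, -0.2])   # hypothetical unregularized slopes
print(ridge_solution(a, 1.0))    # every slope shrunk, all still non-zero
print(lasso_solution(a, 1.0))    # small slopes snapped to exactly zero
```

This is why lasso also performs feature selection: the slopes of weak features are driven exactly to zero, while ridge only shrinks them.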

Equations for implementation in deep learning:

>L2 regularization uses the regularization term as discussed above to penalize the weights of complex models. In simple terms the equation for cost function becomes:

Cost function = Cost(from y and y’) + Regularization term:

J₂(w, b) = Cost(y, y’) + (λ/2m) Σ_{j=1..n} wⱼ²

where the subscript 2 denotes L2,

λ = regularization coefficient,

m = number of training samples, and

n = number of features.

>For L1 regularization the only difference is that the regularization term contains λ/m instead of λ/(2m). Therefore, the cost function for L1 is:

J₁(w, b) = Cost(y, y’) + (λ/m) Σ_{j=1..n} |wⱼ|

where the subscript 1 denotes L1.
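A quick numeric sketch of the two regularization terms (the values of `W`, `lam`, and `m` are arbitrary illustrative choices):

```python
import numpy as np

def l2_term(W, lam, m):
    return (lam / (2 * m)) * np.sum(W ** 2)    # (λ/2m) · Σ w²

def l1_term(W, lam, m):
    return (lam / m) * np.sum(np.abs(W))       # (λ/m) · Σ |w|

W = np.array([[1.0, -2.0], [0.5, 0.0]])        # a toy weight matrix
print(l2_term(W, lam=0.1, m=4))                # penalizes large weights quadratically
print(l1_term(W, lam=0.1, m=4))                # penalizes absolute magnitude
```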

2. Dropout Regularization:

This is the most intuitive regularization technique and is frequently used. At every iteration, some nodes are dropped at random, and the drop is valid only for that particular iteration. A new random set of nodes is dropped for the next iteration.

[Image credits] Nodes Used in Two Different Iterations

So, in this way every iteration uses a different set of nodes, so the network produced is random and does not overfit one complex structure. Because any feature in the network can be dropped at random, the model is never heavily influenced by a particular feature, thus reducing overfitting.

The number of nodes to be eliminated is decided by assigning a keep-probability separately to each hidden layer of the network. Thus, we can control the effect produced by each and every layer on the network.
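The standard way to implement this is "inverted dropout", sketched below for one layer's activations (the array shapes and keep-probability are illustrative):

```python
import numpy as np

def dropout_forward(a, keep_prob, rng):
    mask = rng.random(a.shape) < keep_prob  # keep each node with prob keep_prob
    a = a * mask                            # zero out the dropped nodes
    return a / keep_prob                    # rescale so expected activations are unchanged

rng = np.random.default_rng(0)
a = np.ones((4, 1000))                      # dummy activations for a hidden layer
out = dropout_forward(a, keep_prob=0.8, rng=rng)

print(np.mean(out == 0.0))                  # about 20% of nodes dropped
print(np.mean(out))                         # mean activation stays close to 1.0
```

At test time no nodes are dropped, and the inverted scaling applied during training means no extra correction is needed then.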

In conclusion, regularization is an important technique in deep learning. With sufficient knowledge of overfitting scenarios and regularization implementation, results can improve to a great extent.
