Understanding regularization with PyTorch

Dealing with the issue of overfitting

Pooja Mahajan
Analytics Vidhya
4 min read · Aug 25, 2020


Overfitting describes the scenario in which a trained model mimics the training data very well but does not generalise to unseen data.

Various techniques can be used to deal with overfitting. Let’s explore some of them.

1. Dropout

Dropout refers to dropping out units in a neural network. Dropping a unit out means removing it temporarily from the network. The choice of which units to drop is random: each unit is retained with a fixed probability p, independent of the other units.

This procedure effectively trains a slightly different model, with a different neuron topology, at each iteration. Neurons therefore get less chance to coordinate in the memorisation process that happens during overfitting, so the network generalises better and copes better with overfitting.

Implementation in PyTorch

torch.nn.Dropout(p: float = 0.5, inplace: bool = False): during training, it randomly zeroes elements of the input tensor with probability p (and scales the surviving elements by 1/(1 - p)). The output shape is the same as the input shape.
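A quick illustration of this behaviour (the tensor shape and the value of p here are arbitrary):

import torch

drop = torch.nn.Dropout(p=0.5)
x = torch.ones(2, 4)

drop.train()     # training mode: each element is zeroed with probability p
print(drop(x))   # surviving elements are scaled by 1/(1 - p), i.e. to 2.0; shape stays (2, 4)

drop.eval()      # evaluation mode: dropout is a no-op
print(drop(x))   # output equals the input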

Let’s understand the impact of dropout by using it in a simple convolutional neural network on the MNIST dataset. I created two networks, one without dropout layers and one with dropout layers, and trained each for 20 epochs (a sketch of the dropout variant follows the list below).

  • Model 1- Without Dropout layers
  • Model 2- With Dropout layers
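As a rough idea of what the dropout variant (Model 2) could look like, here is a minimal sketch; the layer sizes and dropout probabilities are assumptions for illustration, not the exact architecture used in the experiment.

import torch.nn as nn
import torch.nn.functional as F

class NetWithDropout(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        self.dropout2d = nn.Dropout2d(p=0.25)  # drops whole feature maps
        self.fc1 = nn.Linear(32 * 7 * 7, 128)
        self.dropout = nn.Dropout(p=0.5)       # drops individual activations
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)  # 28x28 -> 14x14
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)  # 14x14 -> 7x7
        x = self.dropout2d(x)
        x = x.flatten(1)
        x = self.dropout(F.relu(self.fc1(x)))
        return self.fc2(x)                          # 10 logits for the MNIST digits

Model 1 would be the same network with the two dropout layers removed.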

Inference:

The model without dropout reaches a train accuracy of 99.23% and a test accuracy of 98.66%, while the model with dropout reaches 98.86% and 98.87% respectively, making it less overfit than the model without dropout.

2. L1 and L2 Regularization

L1 regularization (Lasso regression) adds the sum of the absolute values of all weights in the model to the cost function. It shrinks the coefficients of less important features to zero, removing some features and hence providing a sparse solution.

L2 regularization (Ridge regression) adds the sum of the squares of all weights in the model to the cost function. It is able to learn complex data patterns and gives non-sparse solutions, unlike L1 regularization.

Both regularization terms are scaled by a (small) factor lambda, which controls the importance of the regularization term and is a hyperparameter.
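Written out, with L(w) denoting the unregularized loss, w_i the model weights and lambda the regularization strength, the two penalized cost functions look roughly like this:

J_{L1}(w) = L(w) + \lambda \sum_i |w_i|

J_{L2}(w) = L(w) + \lambda \sum_i w_i^2

(Some formulations scale the L2 term by an extra factor of 1/2; the idea is the same.)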

Implementation in PyTorch

a) L1 Regularization

l1_penalty = torch.nn.L1Loss(reduction='sum')  # size_average is deprecated; reduction='sum' is the equivalent
reg_loss = 0
for param in model.parameters():
    # L1Loss against a zero tensor gives the sum of absolute values, i.e. the L1 norm
    reg_loss += l1_penalty(param, torch.zeros_like(param))

factor = const_val  # lambda, the regularization strength
loss += factor * reg_loss

b) L2 Regularization

The weight_decay parameter applies L2 regularization when initialising the optimizer. It adds a regularization term to the loss function, which shrinks the parameter estimates, making the model simpler and less likely to overfit.
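For example, a minimal sketch (the learning rate and the lambda value of 1e-4 are assumed here, not taken from the experiment):

import torch

# weight_decay adds an L2 penalty on the parameters at each update step
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)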

3. Other techniques

Apart from dropout and L1 and L2 regularization, discussed in this post, other methods to deal with overfitting are:

  • Add more training data - additional data adds more diversity to the training set and thus reduces the chances of overfitting.
  • Data augmentation - it increases the variety of the training data, thus increasing the breadth of available information. Discussed in my previous blog post.
  • Batch normalisation - it tends to fix the distribution of the hidden-layer values as training progresses. Discussed in my previous blog post.
  • Early stopping - stop training the model early, before it reaches the overfitting stage. Performance metrics (e.g. accuracy, loss) can be monitored on the train and validation sets to implement this; a rough sketch follows below.
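A rough sketch of early stopping based on validation loss; patience, max_epochs, train_one_epoch and validate are hypothetical placeholders, not PyTorch APIs, and the model, optimizer and data loaders are assumed to come from the usual training setup.

import torch

best_val_loss = float("inf")
epochs_without_improvement = 0
patience = 3  # assumed: stop after 3 epochs with no improvement

for epoch in range(max_epochs):
    train_one_epoch(model, train_loader, optimizer)  # hypothetical helper
    val_loss = validate(model, val_loader)           # hypothetical helper

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        epochs_without_improvement = 0
        torch.save(model.state_dict(), "best_model.pt")  # keep the best weights so far
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break  # stop before the model starts to overfit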

You can find the code for the dropout implementation and for L1 and L2 regularization in this repository.
