Effect of Regularization in Neural Net Training

Apurva Pathak
Deep Learning Experiments
May 27, 2020

co-authored with Daryl Chang

Welcome to another installment in our Deep Learning Experiments series, where we run experiments to evaluate commonly-held assumptions about training neural networks. Our goal is to better understand the different design choices that affect model training and evaluation. To do so, we come up with questions about each design choice and then run experiments to answer them.

In this article, we seek to better understand the impact of regularization on training neural networks. In particular, we will cover the following:

  • What is overfitting and why is it bad?
  • What is regularization and how can it prevent overfitting?
  • What are the different regularization techniques, and how do they affect model performance?
  • How do we select the right regularization method for a problem?

What is Model Overfitting?

Overfitting is an important concern when training a machine learning model. It refers to the situation where a model fits the training data so closely that it even learns the noise in the data as concepts. Noise here refers to randomness in the data that is not representative of its true distribution. Overfitting reduces the model's ability to generalize to an unseen or test dataset. Figure 1 shows an example of overfitting, where a classification model (leftmost) has learned a complex decision boundary to fit the noise in the data.

Figure 1: Illustration of Overfitting, taken from Sachin’s medium post

There are several different ways to prevent overfitting, such as getting more training data, regularization, early stopping, ensembling, etc. In this article, we will talk about how different regularization techniques can help with the overfitting problem.

What is Regularization and how does it prevent overfitting?

Regularization refers to the practice of constraining/regularizing the model so that it cannot learn overly complex concepts, thereby reducing the risk of overfitting. This can be done by constraining different attributes of the model: its weights (L1 or L2 regularization), its connections (DropConnect regularization), its architecture and activations (dropout regularization), its loss function (auxiliary losses), and so on.

In this article, we will evaluate the following regularization techniques:

  • Dropout Regularization
  • L2 Regularization
  • L1 Regularization

How are the experiments set up?

We will train a neural net with different regularization techniques and compare their performance. The code for these experiments can be found on Github.

  • Dataset: we use the Cats and Dogs dataset, which consists of 23,262 images of cats and dogs, split about 50/50 between the two classes. Since the images are differently-sized, we resize them all to the same size. We use 20% of the dataset as validation data (dev set) and the rest as training data.
  • Evaluation metric: we use the binary cross-entropy loss and accuracy on the validation data as our primary metrics to measure model performance.
Figure 2: Sample images from Cats and Dogs dataset
  • Base model: we also define a base model that is inspired by VGG16, where we apply (convolution -> max-pool -> activation -> batch-norm) operations repeatedly, using ReLU as the activation function for the convolutions. Then, we flatten the output volume and feed it into two fully-connected layers (dense -> activation -> batch-norm) with 256 units each, again using ReLU as the activation. Finally, we feed the result into a one-neuron layer with a sigmoid activation, resulting in an output between 0 and 1 that tells us whether the model predicts a cat (0) or dog (1). A code sketch of this setup appears after this list.
Figure 3: Base model architecture (created using NN SVG)
  • Training: we use a batch size of 32 and the default weight initialization (Glorot uniform). The optimizer is SGD with a learning rate of 0.01. We train until the validation loss fails to improve over 50 iterations.
  • Hardware: all models are trained on a single Nvidia Tesla P100 GPU on Google cloud.
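To make the setup concrete, here is a minimal Keras sketch of the base model and training loop. The input size, number of convolution blocks, and filter counts are assumptions for illustration; the exact architecture is in the linked repo.

```python
# Sketch of the base model and training setup (illustrative; see the Github repo for the exact architecture).
import tensorflow as tf
from tensorflow.keras import layers, models

def conv_block(x, filters):
    # convolution -> max-pool -> activation -> batch-norm, as described above
    x = layers.Conv2D(filters, 3, padding='same')(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Activation('relu')(x)
    x = layers.BatchNormalization()(x)
    return x

inputs = layers.Input(shape=(128, 128, 3))      # resized image dimensions (assumed)
x = inputs
for filters in [32, 64, 128, 256]:              # number of blocks and filters (assumed)
    x = conv_block(x, filters)
x = layers.Flatten()(x)
for _ in range(2):                              # two FC layers with 256 units each
    x = layers.Dense(256)(x)
    x = layers.Activation('relu')(x)
    x = layers.BatchNormalization()(x)
outputs = layers.Dense(1, activation='sigmoid')(x)  # 0 = cat, 1 = dog

model = models.Model(inputs, outputs)
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss='binary_crossentropy', metrics=['accuracy'])

# Stop training once the validation loss fails to improve for 50 epochs.
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=50,
                                              restore_best_weights=True)
# model.fit(train_images, train_labels, validation_data=(val_images, val_labels),
#           batch_size=32, epochs=500, callbacks=[early_stop])
```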

Dropout

Dropout refers to randomly ignoring some nodes in the network during training. Ignoring means that these nodes are not considered during forward and backward propagation. There are several ways to apply dropout in a neural network; the most commonly used, called inverted dropout, is described below. Dropout prevents overfitting and provides a way of approximately combining exponentially many different neural network architectures efficiently.

  • For every batch during training, every node is either dropped with probability p or kept with probability 1-p. The quantity p is called the dropout rate. Incoming and outgoing edges of a dropped node are also removed. This is illustrated in Figure 4.
  • The activations of all the kept nodes are scaled by 1/(1-p).
  • At test time, we use the full network without dropping any nodes.
Figure 4: Pictorial illustration of dropout from the original dropout paper [1]
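To make the mechanics concrete, here is a minimal NumPy sketch of inverted dropout applied to one layer's activations. In practice the framework handles this internally; this is only an illustration.

```python
import numpy as np

def inverted_dropout(activations, p, training=True):
    """Apply inverted dropout with dropout rate p to a matrix of activations."""
    if not training or p == 0.0:
        return activations                                  # full network at test time
    keep_mask = np.random.rand(*activations.shape) >= p     # drop each node with probability p
    return activations * keep_mask / (1.0 - p)              # scale kept activations by 1/(1-p)

# Example: a batch of 32 examples with 256 activations each, dropout rate 0.5
a = np.random.randn(32, 256)
a_dropped = inverted_dropout(a, p=0.5)
```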

How does dropout rate affect overfitting?

We added a dropout layer after every layer in our original network. We applied different dropout rates in the convolution and dense layers, as suggested in the original dropout paper [1] (a sketch of this configuration follows Table 1). Figure 5 and Table 1 show the effect of different dropout rates (only the important ones are shown) on train and validation loss. The observations are summarized below:

  • The model with no dropout (dropout rate = 0) overfits and generalizes very poorly. After a few epochs, as we keep going through the training data, the training loss decreases while the validation loss increases.
  • The extent of overfitting, i.e. the gap between training and test loss, reduces as we increase the dropout rate.
  • The validation loss reduces from 0.3193 to 0.1953 as we introduce dropout in the convolution layers in addition to the FC layers, suggesting the importance of dropout in all layers. However, large dropout rates in convolution layers lead to under-fitting (not shown in the plot).
  • As we increase the dropout rate, training time increases, i.e. it takes more epochs for the model to converge. According to Srivastava et al. in the original dropout paper [1], a major cause of this increase is that the parameter updates become very noisy as a result of dropout: each training batch effectively trains a different random architecture, so the gradients being computed are not gradients for the final architecture that will be used at test time. An alternative explanation could be that dropout leads to fewer weight updates, since it drops the activations attached to those weights.
Figure 5: Effect of dropout rate on train loss (left) and validation loss (right)
Table 1: Stats for different dropout rates in the architecture
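For reference, here is a sketch of how dropout layers can be configured with separate rates for the convolutional and dense parts of the network. The variable names conv_dr and dense_dr mirror the figures; the exact placement of the Dropout layers in the repo may differ.

```python
from tensorflow.keras import layers

conv_dr, dense_dr = 0.2, 0.6   # the pair highlighted in Figure 6

def conv_block_with_dropout(x, filters):
    x = layers.Conv2D(filters, 3, padding='same')(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Activation('relu')(x)
    x = layers.BatchNormalization()(x)
    return layers.Dropout(conv_dr)(x)      # smaller dropout rate in conv layers

def dense_block_with_dropout(x, units):
    x = layers.Dense(units)(x)
    x = layers.Activation('relu')(x)
    x = layers.BatchNormalization()(x)
    return layers.Dropout(dense_dr)(x)     # larger dropout rate in FC layers
```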

Figure 6 shows that the generalization gap of the model improves significantly as the dropout rate increases. We define the generalization gap as the ratio of validation loss to training loss (the values in Figure 6 are capped at 5 for better visualization). The ideal value is 1, when the training and validation losses are close to each other; the extent of overfitting increases as the generalization gap grows beyond 1. We observe that the generalization gap for the model with conv_dr=0.2, dense_dr=0.6 remains close to 1 even after training for multiple epochs beyond convergence, i.e. there is no overfitting.

Figure 6: Effect of dropout rate on generalization gap (capped at 5)
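The generalization-gap curves in Figures 6, 10 and 14 are just this capped ratio; a one-function sketch:

```python
import numpy as np

def generalization_gap(val_losses, train_losses, cap=5.0):
    """Per-epoch ratio of validation loss to training loss, capped for readability."""
    return np.minimum(np.asarray(val_losses) / np.asarray(train_losses), cap)
```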

How does dropout affect model weights and activations?

In order to build better intuition about how dropout changes model behavior, we visualize the model weights and activations. We will repeat this exercise for all the regularization methods discussed in this article.

Model Weights: In Figure 7 we plot the distribution of weights for the last 6 layers (3 FC and 3 CONV) of the model with and without dropout. On applying dropout, the distribution of weights across all layers changes from a zero-mean uniform distribution to a zero-mean Gaussian distribution. This is similar to the weight-decaying effect of L2 regularization on model weights (see the next section to learn about L2 regularization). Thus, dropout has a weight-regularizing effect, which is probably one of the reasons it works so well as a regularizer. Another interesting observation is that the standard deviation of the Gaussian curve decreases, i.e. the weight-decaying effect becomes more prominent, in the deeper layers. This suggests that dropout lets the model learn complex feature representations in the early layers while regularizing the model's complexity in the later layers to prevent overfitting.

Adding dropout has an effect similar to applying L2 regularization in which the regularization coefficient increases with layer depth

Figure 7: Effect of dropout on the weights of FC and CONV layers
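Such weight histograms can be produced directly from a trained Keras model; here is a sketch, assuming a trained model named model, with the layer selection purely illustrative.

```python
import matplotlib.pyplot as plt

# Plot weight distributions of the last six Conv2D/Dense layers of a trained Keras model.
plot_layers = [l for l in model.layers if l.__class__.__name__ in ('Conv2D', 'Dense')][-6:]
fig, axes = plt.subplots(2, 3, figsize=(12, 6))
for ax, layer in zip(axes.ravel(), plot_layers):
    kernel = layer.get_weights()[0].ravel()   # kernel weights only (biases ignored)
    ax.hist(kernel, bins=100)
    ax.set_title(layer.name)
plt.tight_layout()
plt.show()
```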

Model Activations: Figure 8 shows the number of nodes with non-zero activations in the last FC layer, averaged across a random mini-batch of validation data (32 examples). Increasing the dropout rate induces sparsity, i.e. the number of nodes with non-zero activations decreases. This is consistent with the observations of Srivastava et al. in the original dropout paper [1].
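A sketch of how this sparsity can be measured, assuming a trained Keras model named model, a validation mini-batch val_batch, and a hypothetical layer name for the last FC activation:

```python
import numpy as np
import tensorflow as tf

# Count non-zero (post-ReLU) activations in the last FC layer for one validation mini-batch.
activation_model = tf.keras.Model(inputs=model.input,
                                  outputs=model.get_layer('last_fc_activation').output)  # hypothetical name
acts = activation_model.predict(val_batch)                  # shape: (32, 256)
nonzero_per_example = np.count_nonzero(acts > 1e-6, axis=1)
print('mean non-zero activations:', nonzero_per_example.mean())
```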

But why is sparsity good for the model? Glorot et al. [2] list the following benefits of sparse representations in neural networks; this is another reason why dropout works so well as a regularizer.

Figure 8: Effect of dropout on last FC layer activation
  • Information Disentangling: A dense representation is highly entangled because almost any change in the input modifies most of the entries in the representation vector.
  • Efficient variable-size representation: Varying the number of active neurons allows a model to control the effective dimensionality of the representation for a given input and the required precision.
  • Linear separability: Sparse representations are also more likely to be linearly separable, or more easily separable with less non-linear machinery, simply because the information is represented in a high-dimensional space.

Dropout increases the sparsity of the model's activations, so that only a few nodes in the last layer are used to predict an example

L2 Regularization

L2 regularization (the penalty used in ridge regression) adds the squared magnitude of the model weights as a penalty term to the loss function.

Equation 1: Loss function with L2 regularization
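In its standard form, with L the unregularized (binary cross-entropy) loss, w_i the model weights, and lambda the regularization coefficient (constant factors such as 1/2 vary by convention):

```latex
L_{reg}(w) = L(w) + \lambda \sum_{i} w_i^{2}
```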

Here, if lambda is zero, we get back the original loss function. The penalty term keeps the model weights from growing too large, thereby limiting the model's ability to learn complex concepts with respect to any particular feature. However, if lambda is very large, it pushes all the weights close to 0 and leads to under-fitting.

How does L2 lambda affect overfitting?

We added L2 regularization to every layer of our original network, with different lambdas for the convolution and dense layers (a configuration sketch follows Table 2). Figure 9 and Table 2 show the effect of different lambdas (only the important ones are shown) on train and validation accuracy. Please note that, unlike for dropout, we compare accuracy instead of loss because the loss is a function of lambda. The observations are summarized below:

  • The model with no regularization (lambda=0) overfits and generalizes very poorly. After a few epochs, as we keep going through the training data, the training accuracy increases while the validation accuracy stays flat.
  • The extent of overfitting, i.e. the gap between training and test accuracy, reduces as we increase lambda up to a point. Increasing lambda further (not shown in the plot) restricts the model's learning ability and hurts performance.
  • The validation accuracy increases from 0.797 to 0.888 as we add L2 regularization in the convolution layers in addition to the FC layers, suggesting the importance of applying L2 regularization in both convolution and FC layers. Similar to dropout, the value of lambda should be smaller in the convolution layers than in the FC layers.
Figure 9: Effect of L2 lambda on train accuracy (left) and validation accuracy (right)
Table 2: Stats for different L2 lambda in the architecture
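A sketch of how these per-layer L2 penalties can be attached in Keras, with a smaller lambda in the convolution layers than in the FC layers; the specific lambda values here are illustrative, not the ones used in the experiments.

```python
from tensorflow.keras import layers, regularizers

conv_l2, dense_l2 = 1e-4, 1e-3   # illustrative values; conv lambda kept smaller than FC lambda

conv = layers.Conv2D(64, 3, padding='same',
                     kernel_regularizer=regularizers.l2(conv_l2))
dense = layers.Dense(256,
                     kernel_regularizer=regularizers.l2(dense_l2))
```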

Figure 10 shows that the generalization gap of the model improves significantly as lambda increases.

Figure 10: Effect of L2 lambda on generalization gap (capped at 5)

How does L2 regularization affect model weights and activations?

Model Weights: In Figure 11 we plot the distribution of weights for the last 6 layers (3 FC and 3 CONV) of the model with and without L2 regularization. With L2 regularization, the weights in all layers follow a Gaussian distribution and their norms shrink to small values.

L2 regularization shrinks all the weights to small values, preventing the model from learning a complex concept with respect to any particular node/feature, thereby preventing overfitting.

Figure 11: Effect of L2 regularization on the weights of FC and CONV layers

Model Activations: Unlike dropout, L2 regularization doesn't have any effect on the sparsity of the last layer's activations (plot not shown in this article). However, we observe in Figure 12 that on applying L2 regularization, the distribution of mean activations of the last FC layer shifts towards the left, i.e. the magnitude of the activations becomes smaller. This is probably a side effect of the shrinking weights.

Figure 12: Effect of L2 regularization on the activations of last FC layer

L1 Regularization

L1 regularization (the penalty used in lasso regression, short for Least Absolute Shrinkage and Selection Operator) adds the absolute value of the model weights as a penalty term to the loss function.

Equation 2: Loss function with L1 regularization
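In its standard form, with the same notation as Equation 1:

```latex
L_{reg}(w) = L(w) + \lambda \sum_{i} \lvert w_i \rvert
```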

If lambda is zero, we get back the original loss function, whereas a very large value pushes the weights to zero and causes under-fitting.

How does L1 lambda affect overfitting?

We added L1 regularization to every layer of our original network, with different lambdas for the convolution and dense layers. Figure 13 and Table 3 show the effect of different lambdas (only the important ones are shown) on train and validation accuracy. The observations are summarized below:

  • The model with no regularization (lambda=0) overfits and generalizes very poorly. After a few epochs, as we keep going through the training data, the training accuracy increases while the validation accuracy stays flat.
  • The extent of overfitting, i.e. the gap between training and test accuracy, reduces as we increase lambda up to a point. Increasing lambda further leads to under-fitting (not shown in the plot).
  • The validation accuracy increases from 0.7893 to 0.8785 as we add L1 regularization in the convolution layers in addition to the FC layers. Similar to dropout and L2 regularization, the value of lambda should be smaller in the convolution layers than in the FC layers.
  • Training time increases as we add L1 regularization.
Figure 13: Effect of L1 lambda on train accuracy (left) and validation accuracy (right)
Table 3: Stats for different L1 lambda in the architecture

Figure 14 shows that the generalization gap of the model improves significantly as lambda increases.

Figure 14: Effect of L1 lambda on generalization gap (capped at 5)

How does L1 regularization affect model weights and activations?

Model Weights: In Figure 15 we plot the distribution of weights for the last 6 layers (3 FC and 3 CONV) of the model with and without L1 regularization. Similar to L2 regularization, L1 regularization also shrinks the norm of the weights to very small values. However, the key difference is that L1 regularization pushes most of the weights towards 0 while allowing a few large weights, whereas L2 regularization pushes all the weights to small values. This is evident if we compare the weights of the last FC layer (top-left image in Figures 11 and 15) under L1 and L2 regularization.

L1 regularization pushes the weights corresponding to less important features to 0, effectively removing those features altogether. This, in turn, makes the model simpler and reduces overfitting.
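One way to see this difference numerically is to compare the fraction of near-zero weights in a given layer under the two penalties. A sketch, assuming trained Keras models named model_l1 and model_l2 and a hypothetical layer name:

```python
import numpy as np

def near_zero_fraction(layer, tol=1e-3):
    """Fraction of kernel weights whose magnitude is below tol."""
    kernel = layer.get_weights()[0]
    return np.mean(np.abs(kernel) < tol)

# Compare the last FC layer of the L1- and L2-regularized models (layer name is hypothetical):
# near_zero_fraction(model_l1.get_layer('fc_2')), near_zero_fraction(model_l2.get_layer('fc_2'))
```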

Figure 15: Effect of L1 regularization on the weights of FC and CONV layers
Figure 16: Effect of L1 regularization on last FC layer activation

Model Activations: Similar to dropout, L1 regularization induces sparsity (Figure 16). On applying L1 regularization, the average activation of most of the nodes becomes close to 0, and only a few nodes retain very high average activation values (~3 in the right plot of Figure 17). The smaller activations (~1e-5) lead to smaller weight updates and are probably the root cause of slower training with L1 regularization.

Figure 17: Effect of L1 regularization on the activations of last FC layer

How To Select Regularization Method(s)?

So far, we have seen three different types of regularization techniques, but how do we select which one(s) to use for a problem? Dropout is the most widely used of these techniques in practice, which was also evident in our study, as it produced the best validation accuracy. Does that mean we shouldn't consider the other regularizers at all?

To answer this question, we ran an experiment in which we trained 17 different models, each using a combination of all the regularization methods discussed above with randomly chosen hyper-parameter values. To make the search more efficient, we sampled the dropout rates on a linear scale and the L1 and L2 coefficients on a logarithmic scale, and we imposed the condition that the hyper-parameters for the CONV layers be smaller than those for the FC layers.
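A sketch of this kind of sampling scheme; the exact ranges below are assumptions, not the values used in our experiments.

```python
import numpy as np

def sample_hyperparams(rng=np.random):
    """Sample dropout rates on a linear scale and L1/L2 lambdas on a log scale,
    assigning the smaller value of each pair to the CONV layers."""
    def pair_linear(low, high):
        conv, dense = sorted(rng.uniform(low, high, size=2))
        return conv, dense
    def pair_log(low_exp, high_exp):
        conv, dense = sorted(10.0 ** rng.uniform(low_exp, high_exp, size=2))
        return conv, dense
    conv_dr, dense_dr = pair_linear(0.0, 0.7)   # dropout rates: linear scale (range assumed)
    conv_l2, dense_l2 = pair_log(-6, -2)        # L2 lambdas: log scale (range assumed)
    conv_l1, dense_l1 = pair_log(-6, -2)        # L1 lambdas: log scale (range assumed)
    return dict(conv_dr=conv_dr, dense_dr=dense_dr,
                conv_l1=conv_l1, dense_l1=dense_l1,
                conv_l2=conv_l2, dense_l2=dense_l2)

params = sample_hyperparams()
```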

The mean, max and min validation accuracies of these 17 models were 0.8970, 0.9161 and 0.8596 respectively. Table 4 compares the best models obtained in the sections above against the best model obtained by combining all the regularization methods. We observe that the model that only uses dropout (2nd row) has the best validation accuracy, followed by the best model from the random parameter search. This shows why dropout is so popular among deep learning practitioners. Please note that we can only compare accuracy, not loss, in Table 4, since the loss is a function of the L1/L2 lambdas wherever those regularizers are used.

Table 4: Stats comparing the best models selected by different techniques

Conclusions

So, what does this all mean? What can we take away from these experiments?

Regularization is important for the model to generalize to unseen data. We observed that the validation accuracy of the model increased from 0.7796 to 0.9189 as a result of applying regularization.

Dropout has both a weight-regularizing effect and induces sparsity. We observed that the weights follow a Gaussian distribution when dropout is applied, with the standard deviation of the distribution decreasing with layer depth. We also observed that dropout zeroes out most of the activations in the last layer.

L1 regularization has a tendency to produce sparse weights, whereas L2 regularization produces small weights. Because of this property, L1 regularization also leads to sparse activations.

Regularization hyper-parameters for CONV and FC layers should be tuned separately. In all of our experiments, we saw that regularization is needed in both the CONV and FC layers, and that the best hyper-parameters for the CONV layers are smaller than those for the FC layers.

Dropout performs best among the regularizers we tried. The validation accuracies of the best models with dropout, L2 and L1 regularization were 0.9189, 0.888 and 0.8785 respectively. We also tried a combination of all the regularizers and obtained a best validation accuracy of 0.9161. Please note that this could change if we changed the dataset or performed a more thorough hyper-parameter search for the other regularization methods.

Acknowledgements

Special thanks to Aashu Singh for lots of interesting discussions and suggestions for this article.

References

[1] Srivastava, Nitish, et al. Dropout: a simple way to prevent neural networks from overfitting. https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf

[2] Glorot, Xavier, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. http://proceedings.mlr.press/v15/glorot11a/glorot11a.pdf

[3] Code available on Github.
