HyperParameter Tuning: Fixing Overfitting in Neural Networks

Sanskar Hasija
6 min read · Aug 11, 2021


Quick methods to decrease high variance (overfitting) problems in neural networks.

Hyper-Parameter Tuning

Introduction

In my last blog, I discussed the effect of various hyperparameters on bias and how we can fix high bias (the problem of underfitting).

In this blog, we will go through some methods and techniques to fix the problem of high variance (overfitting) in neural networks. High variance is a common problem faced during the training of a neural network. It arises when the model fits the training data very well but does not perform well on the validation data. This usually indicates that the trained model has learnt the input-output mapping of the training set quite well but is unable to generalize properly to the cross-validation or test set.
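One simple way to put a number on this is to track the gap between training and validation accuracy per epoch. The sketch below is only an illustration: the helper name `generalization_gap` and the accuracy values are made up for this example and are not results from the experiments in this blog.

```python
def generalization_gap(train_acc, val_acc):
    """Return the per-epoch difference between training and validation accuracy."""
    if len(train_acc) != len(val_acc):
        raise ValueError("both curves must cover the same number of epochs")
    return [t - v for t, v in zip(train_acc, val_acc)]


# Made-up accuracy curves for illustration only (not results from this blog).
train_acc = [0.80, 0.90, 0.96, 0.99]
val_acc = [0.78, 0.85, 0.86, 0.85]

gaps = generalization_gap(train_acc, val_acc)
print([round(g, 2) for g in gaps])
```

A gap that keeps growing while training accuracy still improves is the typical signature of high variance.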

We will check the effect of various factors on validation accuracy and validation loss step by step in this blog.

Imports and Preprocessing

We will start by importing the TensorFlow, NumPy, and Matplotlib libraries and initializing some hyperparameters such as the number of epochs, the learning rate, and the optimizer.

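import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

tf.random.set_seed(1)  # make weight initialization reproducible
EPOCHS = 40
LR = 0.0001
# SGD with a momentum of 0.99 (the second positional argument of SGD is the momentum)
OPT = tf.keras.optimizers.SGD(learning_rate=LR, momentum=0.99)
plt.style.use('fivethirtyeight')
plt.rcParams["figure.figsize"] = (8, 5)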

We will be using the well-known MNIST dataset for the demonstration. MNIST contains 70,000 images of handwritten digits, split into 60,000 training images and 10,000 test images. All the images are grayscale and are of shape (28, 28).

(x_train , y_train) , (x_test , y_test ) = tf.keras.datasets.mnist.load_data()
x_train = x_train / 255
x_test = x_test / 255

We can access this dataset directly through the TensorFlow library, and the data is already separated into training and test subsets. In the snippet above, we also normalize the images by dividing the pixel values by 255 so that they lie in the range [0, 1].
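To see what the division by 255 does without downloading MNIST, here is a minimal sketch on a random stand-in array. The array `fake_images` is an assumption made for illustration; it only mimics the uint8 pixel format of the real images.

```python
import numpy as np

# Stand-in for a small batch of MNIST images: uint8 pixels in [0, 255].
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(4, 28, 28), dtype=np.uint8)

# True division promotes the uint8 array to floats and maps pixels into [0, 1].
scaled = fake_images / 255

print(scaled.dtype, scaled.min() >= 0.0, scaled.max() <= 1.0)
```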

MODEL DESIGN

We will start by building a simple neural network with no hidden layers, just an input and an output layer.

model = tf.keras.Sequential(
[tf.keras.layers.Flatten(input_shape = x_train.shape[1:]),
tf.keras.layers.Dense(10,activation = "softmax")])
model.compile(optimizer=OPT,
loss = "sparse_categorical_crossentropy",
metrics = ["accuracy"])

The model is compiled with sparse categorical cross-entropy as the loss and accuracy as the metric.
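The training call itself is not shown in this blog, so here is a rough sketch of how a model like this could be trained while recording validation metrics. The helper `compile_and_fit` and the random stand-in data are assumptions made for this example; with the real data you would pass `x_train`, `y_train` and `x_test`, `y_test` instead. The helper creates a fresh optimizer for every model rather than sharing one instance, since recent Keras versions do not allow an optimizer to be reused across models.

```python
import numpy as np
import tensorflow as tf


def compile_and_fit(model, x, y, x_val, y_val, epochs, lr=1e-4):
    """Compile with a fresh SGD optimizer and train while tracking validation metrics."""
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=lr, momentum=0.99),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    history = model.fit(
        x, y, validation_data=(x_val, y_val), epochs=epochs, verbose=0
    )
    # history.history maps each metric name to a list with one value per epoch.
    return history.history


# Tiny random stand-in data so the sketch runs without downloading MNIST.
rng = np.random.default_rng(0)
x_demo = rng.random((64, 28, 28)).astype("float32")
y_demo = rng.integers(0, 10, size=64)

demo_model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
demo_history = compile_and_fit(
    demo_model, x_demo[:48], y_demo[:48], x_demo[48:], y_demo[48:], epochs=2
)
print(sorted(demo_history))
```

The `val_loss` and `val_accuracy` entries of the returned dictionary are the curves compared in the rest of this blog.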

EFFECT OF INCREASING DATA

We will train the above-defined model twice, each time on a different amount of data. To demonstrate the effect of data on high variance, we will define a new subset containing only 50% of the total training data (the first 30,000 images).

(x_train_partial , y_train_partial) =   (x_train[:30000], y_train[:30000])

The new (x_train_partial, y_train_partial) dataset has 30,000 images, compared to the 60,000 images in the original training set. After training on both datasets, we plot two graphs to check the effect of increasing data: validation accuracy vs. the number of epochs, and validation loss vs. the number of epochs.

Effect of Data

The figure above shows that increasing the amount of training data helps in fixing the problem of high variance.
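Comparison figures like this one can be drawn with a small helper that plots the validation curves of several runs side by side. This is a sketch, not the exact plotting code behind the figures in this blog: the function name `plot_validation_curves` is hypothetical, and the numbers in `demo_histories` are placeholders rather than measured results.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt


def plot_validation_curves(histories):
    """Plot validation accuracy and validation loss per epoch for each labelled run."""
    fig, (ax_acc, ax_loss) = plt.subplots(1, 2, figsize=(12, 5))
    for label, hist in histories.items():
        epochs = range(1, len(hist["val_accuracy"]) + 1)
        ax_acc.plot(epochs, hist["val_accuracy"], label=label)
        ax_loss.plot(epochs, hist["val_loss"], label=label)
    ax_acc.set(xlabel="Epoch", ylabel="Validation accuracy")
    ax_loss.set(xlabel="Epoch", ylabel="Validation loss")
    ax_acc.legend()
    ax_loss.legend()
    return fig


# Placeholder numbers for illustration only, not measurements from this blog.
demo_histories = {
    "partial data": {"val_accuracy": [0.70, 0.78, 0.81], "val_loss": [1.1, 0.9, 0.8]},
    "full data": {"val_accuracy": [0.75, 0.84, 0.88], "val_loss": [1.0, 0.7, 0.6]},
}
fig = plot_validation_curves(demo_histories)
```

Each value in the dictionary has the same shape as the `history` attribute of the object returned by Keras `model.fit`, so real training runs can be passed in directly.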

EFFECT OF INCREASING HIDDEN LAYERS

Now we will increase the number of hidden layers in our network and check its effect on the validation accuracy of our model. We will train four different models, named after 1, 2, 3, and 5 layers respectively. Note that five_layers_model, as written below, has four hidden layers followed by the output layer, that is, five Dense layers in total. The architecture of all four models is as follows:

one_layer_model = tf.keras.Sequential([
tf.keras.layers.Flatten(input_shape = x_train.shape[1:]),
tf.keras.layers.Dense(10 , activation = "relu"),
tf.keras.layers.Dense(10,activation = "softmax")])
two_layers_model = tf.keras.Sequential([
tf.keras.layers.Flatten(input_shape = x_train.shape[1:]),
tf.keras.layers.Dense(10 , activation = "relu"),
tf.keras.layers.Dense(20 , activation = "relu"),
tf.keras.layers.Dense(10,activation = "softmax")])
three_layers_model = tf.keras.Sequential([
tf.keras.layers.Flatten(input_shape = x_train.shape[1:]),
tf.keras.layers.Dense(20 , activation = "relu"),
tf.keras.layers.Dense(40 , activation = "relu"),
tf.keras.layers.Dense(20 , activation = "relu"),
tf.keras.layers.Dense(10,activation = "softmax")])
five_layers_model = tf.keras.Sequential([
tf.keras.layers.Flatten(input_shape = x_train.shape[1:]),
tf.keras.layers.Dense(10 , activation = "relu"),
tf.keras.layers.Dense(20 , activation = "relu"),
tf.keras.layers.Dense(40 , activation = "relu"),
tf.keras.layers.Dense(20 , activation = "relu"),
tf.keras.layers.Dense(10,activation = "softmax")])
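Since these architectures differ only in the widths of their hidden layers, they can also be generated from a list of unit counts. The helper below is a sketch under that assumption; `build_mlp` is a hypothetical name and is not part of the original code for this blog.

```python
import tensorflow as tf


def build_mlp(hidden_units, input_shape=(28, 28), num_classes=10):
    """Build a fully connected classifier with one ReLU layer per entry in hidden_units."""
    layers = [tf.keras.Input(shape=input_shape), tf.keras.layers.Flatten()]
    layers += [tf.keras.layers.Dense(u, activation="relu") for u in hidden_units]
    layers.append(tf.keras.layers.Dense(num_classes, activation="softmax"))
    return tf.keras.Sequential(layers)


# Same layer widths as three_layers_model above.
three_hidden = build_mlp([20, 40, 20])
print(three_hidden.count_params())
```

Written this way, adding or removing a hidden layer is a one-element change to the list, which makes it harder to miscount layers by hand.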

After training all of the above models on the full dataset for 20 epochs, we get the following figure comparing validation accuracy and validation loss:

Effect of hidden layers

The figure shows that increasing the number of hidden layers increases the validation accuracy and decreases the validation loss as training progresses. For the MNIST dataset, a choice of 3 hidden layers seems to give the best results.

We will now use this 3 hidden layer neural network as our reference and check the effect of increasing nodes in different layers in this architecture.

EFFECT OF NUMBER OF UNITS(NODES) IN HIDDEN LAYERS

We will now increase the number of nodes in different layers of the previously trained 3 layer network. A common practice is to set the number of units in different layers in descending order. We will train two different models for this demonstration. The first model will have a small number of units whereas the second model will have a larger number of units.

small_units_model = tf.keras.Sequential([
tf.keras.layers.Flatten(input_shape = x_train.shape[1:]),
tf.keras.layers.Dense(80,activation = "relu"),
tf.keras.layers.Dense(40,activation = "relu"),
tf.keras.layers.Dense(20,activation = "relu"),
tf.keras.layers.Dense(10,activation = "softmax")])
large_units_model = tf.keras.Sequential([
tf.keras.layers.Flatten(input_shape = x_train.shape[1:]),
tf.keras.layers.Dense(512,activation = "relu"),
tf.keras.layers.Dense(128,activation = "relu"),
tf.keras.layers.Dense(64,activation = "relu"),
tf.keras.layers.Dense(10,activation = "softmax")])

We have set the number of units in the second model to powers of 2. This is a common convention when choosing the number of units in a neural network, rather than a strict rule.

After training the above two models on the full dataset for 20 epochs, we get the following figure comparing validation accuracy and validation loss:

Effect of units

The number of units clearly has a large impact on both validation accuracy and validation loss. Across the examples so far, the validation accuracy increased from 92% to more than 98% as we increased the number of layers and the number of units in each layer.
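To get a feel for how much capacity the extra units add, we can count the trainable parameters of both models by hand. A dense layer with n inputs and m units has n × m weights plus m biases. The helper name `dense_param_count` is introduced here for illustration only.

```python
def dense_param_count(input_size, layer_units):
    """Count weights and biases of a stack of fully connected layers."""
    total, fan_in = 0, input_size
    for units in layer_units:
        total += fan_in * units + units  # weight matrix plus bias vector
        fan_in = units
    return total


# 28 * 28 = 784 inputs after flattening; the last entry is the 10-way output layer.
small = dense_param_count(28 * 28, [80, 40, 20, 10])
large = dense_param_count(28 * 28, [512, 128, 64, 10])
print(small, large)  # 67070 476490
```

The larger model has roughly seven times as many parameters, which gives it more room to fit the data and is also why regularization such as dropout becomes more relevant for it.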

EFFECT OF BATCH NORMALIZATION

Next, we will check the effect of adding batch normalization layers on fixing high variance. We will use the previous best model as a reference for verifying the effect of batch normalization.

bn_model = tf.keras.Sequential([
tf.keras.layers.Flatten(input_shape = x_train.shape[1:]),
tf.keras.layers.Dense(512,activation = "relu"),
tf.keras.layers.BatchNormalization(),
tf.keras.layers.Dense(128,activation = "relu"),
tf.keras.layers.BatchNormalization(),
tf.keras.layers.Dense(64,activation = "relu"),
tf.keras.layers.Dense(10,activation = "softmax")])

We have added a pair of BatchNormalization layers between the hidden layers. We will now train this model and compare its validation accuracy and validation loss with our previous best model.

Effect of Batch Normalization

The figure shows that adding batch normalization helps in increasing the validation accuracy and also keeps the validation loss more stable compared to the model without batch normalization.
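For intuition, here is a simplified NumPy sketch of what a batch normalization layer computes on one batch during training: each feature is standardized with the batch mean and variance, then rescaled by the learnable parameters gamma and beta. It deliberately leaves out the moving averages that the Keras layer uses at inference time, and the function name `batch_norm_train` is an assumption made for this example.

```python
import numpy as np


def batch_norm_train(x, gamma, beta, eps=1e-3):
    """Standardize each feature over the batch, then apply scale (gamma) and shift (beta)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta


# Random activations with a large offset and spread, for illustration only.
rng = np.random.default_rng(0)
activations = rng.normal(loc=5.0, scale=3.0, size=(256, 4))

out = batch_norm_train(activations, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))
```

With gamma set to 1 and beta set to 0, every feature comes out with a mean close to 0 and a standard deviation close to 1, whatever the scale of the incoming activations.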

EFFECT OF DROPOUTS

Lastly, we will check the effect of dropout layers in fixing the problem of high variance. We will add two Dropout layers to our previous best model.

dropout_model = tf.keras.Sequential([
tf.keras.layers.Flatten(input_shape = x_train.shape[1:]),
tf.keras.layers.Dense(512,activation = "relu"),
tf.keras.layers.Dropout(0.3),
tf.keras.layers.Dense(128,activation = "relu"),
tf.keras.layers.Dropout(0.2),
tf.keras.layers.Dense(64,activation = "relu"),
tf.keras.layers.Dense(10,activation = "softmax")])

We have added two Dropout layers between the hidden layers with dropout probabilities of 0.3 and 0.2 respectively. We will now train this model and compare its validation accuracy and validation loss with our previous best model.

Effect of Dropout

The figure shows that adding dropout layers between our hidden layers helps in increasing the validation accuracy and also reduces the validation loss more smoothly and quickly than the same network without dropout.
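To make the mechanism concrete, below is a small NumPy sketch of inverted dropout, the variant in which the surviving activations are scaled up by 1 / (1 - rate) during training so that their expected value stays the same. This is an illustration of the idea rather than the Keras implementation, and the name `inverted_dropout` is hypothetical.

```python
import numpy as np


def inverted_dropout(x, rate, rng):
    """Zero out each activation with probability `rate` and rescale the survivors."""
    keep_mask = rng.random(x.shape) >= rate
    return x * keep_mask / (1.0 - rate)


rng = np.random.default_rng(0)
activations = np.ones((1000, 100))  # all-ones input makes the effect easy to read

dropped = inverted_dropout(activations, rate=0.3, rng=rng)
zero_fraction = float((dropped == 0).mean())
print(round(zero_fraction, 3), round(float(dropped.mean()), 3))
```

About 30% of the activations are zeroed while the overall mean stays close to 1. At inference time dropout is switched off and the activations pass through unchanged.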

CONCLUSION

After training the same data on multiple models with different hyperparameters, we can conclude that the following changes can help us in fixing high variance:

  • Increasing the amount of training data.
  • Increasing the number of hidden layers.
  • Increasing the number of hidden units.
  • Adding Batch Normalization.
  • Adding Dropouts.
  • Training for a higher number of epochs.
  • Trying different neural network architectures.

I hope you all enjoyed this quick small blog!!!

The code for all the models and graphs in this blog can be accessed here:

https://github.com/sanskar-hasija/Hyperparameter-Tuning
