Will dropout regularization prevent your model from overfitting?

Diaz Agasatya
9 min read · Dec 4, 2018


One may argue that it’s better to overfit your model and then rein it back in than to go the other way around.

In this project, we will see the difference in accuracy and validation loss after adding dropout regularization to a neural network. We will build a sequential neural network from scratch using the PyTorch library to classify the 10 classes in the Fashion-MNIST dataset, which consists of 28x28 greyscale images of clothing. We will dive into the implementation of dropout and test whether it prevents overfitting.

900 greyscale images of clothing from the Fashion-MNIST dataset

This project is inspired by the Facebook Udacity PyTorch Challenge.

First, we will build a neural network without any regularization. Our hypothesis is that over time the model will perform worse on the validation set: the more we train on the training set, the better the model gets at picking up the specific characteristics of that data, leaving us with a model that generalizes badly at inference time.

Let's import the Fashion-MNIST dataset

Let’s download the dataset using torchvision. Typically we would set aside 20% of the training data as a validation set, but in this case we will simply use the separate test split that torchvision provides.

import torch
from torchvision import datasets, transforms
import helper

# Convert images to tensors and normalize to [-1, 1]; Fashion-MNIST is
# single-channel, so Normalize takes one mean and one std
transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.5,), (0.5,))])

traindataset = datasets.FashionMNIST('~/.pytorch/F_MNIST_data/', download=True,
                                     train=True, transform=transform)
trainloader = torch.utils.data.DataLoader(dataset=traindataset,
                                          batch_size=64, shuffle=True)
testdataset = datasets.FashionMNIST('~/.pytorch/F_MNIST_data/', download=True,
                                    train=False, transform=transform)
testloader = torch.utils.data.DataLoader(dataset=testdataset,
                                         batch_size=64, shuffle=True)

We import torchvision for the datasets and transforms. The transform pipeline converts the images into tensors and normalizes them. It’s conventional to batch your training and validation sets to speed up training, and shuffling the data each epoch adds variety to what the model sees from batch to batch.
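If you did want to carve out that 20% validation split yourself instead of using torchvision’s test set, a minimal sketch with torch.utils.data.random_split (not used in the rest of this post) could look like this:

from torch.utils.data import random_split, DataLoader

# Hypothetical 80/20 split of the 60,000 Fashion-MNIST training images
train_size = int(0.8 * len(traindataset))   # 48,000
val_size = len(traindataset) - train_size   # 12,000
train_subset, val_subset = random_split(traindataset, [train_size, val_size])

trainloader = DataLoader(train_subset, batch_size=64, shuffle=True)
valloader = DataLoader(val_subset, batch_size=64, shuffle=True)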

Define the Neural Network

This model has three hidden layers of 256, 128, and 64 units: the input layer takes the 784 flattened pixels, and the output layer has 10 units since we have 10 classes to classify. We will use a cross-entropy-style loss, whose logarithmic form pushes the output probabilities toward zero or one and heavily penalizes confident mistakes.

from torch import nn
import torch.nn.functional as F

class FashionNeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        # Create layers here
        self.layer_input = nn.Linear(784, 256)
        self.layer_hidden_one = nn.Linear(256, 128)
        self.layer_hidden_two = nn.Linear(128, 64)
        self.layer_output = nn.Linear(64, 10)

    def forward(self, x):
        # Flatten the input so it fits the input layer
        x = x.view(x.shape[0], -1)
        # Pass the input through the layers (forward propagation)
        x = F.relu(self.layer_input(x))
        x = F.relu(self.layer_hidden_one(x))
        x = F.relu(self.layer_hidden_two(x))
        # Log-softmax across the class dimension (dim=1)
        x = F.log_softmax(self.layer_output(x), dim=1)
        return x

This neural network uses ReLU as the non-linear activation function for the hidden layers, a log-softmax activation for the output, and the negative log-likelihood function as the loss. If we look at the documentation for cross-entropy loss in the PyTorch library, that criterion combines nn.LogSoftmax() and nn.NLLLoss() in one single class. The loss can be described as:

loss(x, class) = -log( exp(x[class]) / Σ_j exp(x[j]) ) = -x[class] + log( Σ_j exp(x[j]) )

where x holds the raw scores for each class and class is the correct label.
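To convince ourselves of that equivalence, here is a quick standalone check (a small sketch with random logits, not part of the original code):

import torch
from torch import nn

logits = torch.randn(4, 10)          # raw scores for 4 samples, 10 classes
labels = torch.randint(0, 10, (4,))  # random class indices

# Cross-entropy on raw logits...
ce = nn.CrossEntropyLoss()(logits, labels)
# ...matches NLL on log-softmax outputs
nll = nn.NLLLoss()(torch.log_softmax(logits, dim=1), labels)

print(torch.allclose(ce, nll))  # True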

Notice that the log-softmax at the end of the forward pass uses dim=1, meaning the probabilities across each row of the output sum to 1. The element with the highest probability gives the class index the image is most likely to belong to.

We have to make sure that the output of our model is the correct shape:

# Instantiate the model
model = FashionNeuralNetwork()
# Get the images and labels from the test loader
images, labels = next(iter(testloader))
# Get the log probability prediction from our model
log_ps = model(images)
# Normalize the probability by taking the exponent of the log-prob
ps = torch.exp(log_ps)
# Print out the size
print(ps.shape)

Make sure that the output is:

torch.Size([64, 10])
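As an extra sanity check (a one-line sketch using the ps tensor above), we can also confirm what dim=1 guarantees: each row sums to 1.

print(ps.sum(dim=1))  # each of the 64 rows sums to ~1.0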

Measuring the accuracy of our model

Since we want the class with the highest probability, we use ps.topk to get a tuple of the top-k values and the top-k indices. For example, if the highest value sits in the fourth element, we get back 3 as its index.

top_p, top_class = ps.topk(1,dim=1)
# Print out the most likely classes for the first 10 examples
print(top_class[:10,:])
First 10 examples after forward propagation.

The top_class is a 2D tensor of size 64x1, while our labels tensor is 1D with 64 elements. To measure the accuracy between the labels and our model’s predictions, we have to make sure the shapes of the two tensors match.

# We have to reshape the labels to 64x1 using the view() method
equals = top_class == labels.view(*top_class.shape)
print(equals.shape)

The output of the comparison tensor will be:

torch.Size([64, 1])

To calculate the accuracy of our model, we simply count how often it predicts correctly. The == operator above checks, row by row, whether our prediction matches the label, producing a binary result: 0 for a wrong prediction and 1 for a correct one. We can use torch.mean to take the mean, but first we need to convert equals to a FloatTensor.

# Calculate the mean of the binary comparison tensor
accuracy = torch.mean(equals.type(torch.FloatTensor))
# Print the accuracy
print(f'Accuracy: {accuracy.item()*100}%')

Train our model

Since our model outputs log-probabilities from the log-softmax layer, we use the negative log-likelihood loss, which expects exactly that as its input.

from torch import optim

# Instantiate the model
model = FashionNeuralNetwork()
# Use Negative Log Likelihood as our loss function
loss_function = nn.NLLLoss()
# Use the Adam optimizer to utilize momentum
optimizer = optim.Adam(model.parameters(), lr=0.003)
# Train the model for 30 cycles
epochs = 30
# Initialize two empty lists to hold the train and test losses
train_losses, test_losses = [], []

# Start the training
for i in range(epochs):
    running_loss = 0
    # Loop through all of the training set, forward and back propagate
    for images, labels in trainloader:
        optimizer.zero_grad()
        log_ps = model(images)
        loss = loss_function(log_ps, labels)
        loss.backward()  # Backpropagate
        optimizer.step()
        running_loss += loss.item()

    # Initialize test loss and accuracy to be 0
    test_loss = 0
    accuracy = 0

    # Turn off the gradients
    with torch.no_grad():
        # Loop through all of the validation set
        for images, labels in testloader:
            log_ps = model(images)
            ps = torch.exp(log_ps)
            test_loss += loss_function(log_ps, labels).item()
            top_p, top_class = ps.topk(1, dim=1)
            equals = top_class == labels.view(*top_class.shape)
            accuracy += torch.mean(equals.type(torch.FloatTensor))

    # Append the average losses to the lists for plotting
    train_losses.append(running_loss/len(trainloader))
    test_losses.append(test_loss/len(testloader))

Print out the results:

The lowest validation loss occurs around epoch 5, with an accuracy of about 87%.

This supports our hypothesis: over time the model trains better, but not at generalizing to images outside the training dataset. The training loss decreases significantly over the 30 epochs, but the validation loss fluctuates between roughly 0.36 and 0.48. This is a sign of overfitting: the model has learned the specific characteristics and patterns of the training dataset so closely that it cannot correctly classify images outside of it. That is bad in general, because it means the model will misclassify at inference time.

To see this clearly, let’s plot the losses:

# Plot the graph here
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import matplotlib.pyplot as plt

plt.plot(train_losses, label='Training Loss')
plt.plot(test_losses, label='Validation Loss')
plt.legend(frameon=True)
Number of epochs vs. Loss

Overfitting

We can see clearly from the figure above that our model doesn’t generalize well enough, meaning it does a poor job of classifying images outside the training dataset. This is really bad: the model learns only the specifics of the training data, becoming so specialized that it might only recognize images from the training set. The training loss decreases significantly every cycle, but the validation loss shows the opposite.

Regularization

This is where regularization comes in. One option is L2 regularization (weight decay); another simple one is early stopping, which means we stop training the model when the validation loss is at its lowest. In this case, our validation loss is at its best after 3–5 epochs, so training beyond 5 epochs makes the model’s generalization worse every cycle.
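As an illustration, early stopping can be as simple as watching the tail of the validation-loss list we already keep. The helper below is a hypothetical sketch, not part of the training loop above:

# Hypothetical early-stopping check (illustrative; not in the original loop)
def should_stop(val_losses, patience=3):
    """Stop when validation loss hasn't improved for `patience` epochs."""
    if len(val_losses) <= patience:
        return False
    best_so_far = min(val_losses[:-patience])
    return all(loss >= best_so_far for loss in val_losses[-patience:])

# Inside the epoch loop, after appending to test_losses:
#     if should_stop(test_losses):
#         break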

However, there’s another way to attack this problem: we can add dropout to the model so that it generalizes more. Without it, the network tends to act greedily, snowballing the large weights while sidelining the others. With random dropout, the nodes with smaller weights get their chance to be trained during each cycle, giving a more general model at the end. In other words, dropout forces the network to share information across its weights, which improves its ability to generalize.

Note:

During training we want dropout active; during validation, however, we want the full capability of the model, since that is when we measure how well it generalizes. Calling model.eval() turns dropout off, and don’t forget to turn it back on for the next training pass with model.train().
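To see what those two modes actually do, here is a tiny standalone sketch (not from the original code) showing nn.Dropout in both modes:

import torch
from torch import nn

drop = nn.Dropout(p=0.2)
x = torch.ones(1, 10)

drop.train()       # training mode: ~20% of elements are zeroed,
print(drop(x))     # and the survivors are scaled by 1/(1-p) = 1.25

drop.eval()        # evaluation mode: dropout is a no-op
print(drop(x))     # prints the input unchanged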

# Define our new network with dropout
class FashionNeuralNetworkDropout(nn.Module):
    def __init__(self):
        super().__init__()
        # Create layers here
        self.layer_input = nn.Linear(784, 256)
        self.layer_hidden_one = nn.Linear(256, 128)
        self.layer_hidden_two = nn.Linear(128, 64)
        self.layer_output = nn.Linear(64, 10)

        # 20% dropout here
        self.dropout = nn.Dropout(p=0.2)

    def forward(self, x):
        # Flatten the input so it fits the input layer
        x = x.view(x.shape[0], -1)
        # Forward propagate, applying dropout after each hidden activation
        x = self.dropout(F.relu(self.layer_input(x)))
        x = self.dropout(F.relu(self.layer_hidden_one(x)))
        x = self.dropout(F.relu(self.layer_hidden_two(x)))
        # Log-softmax across the class dimension (dim=1)
        x = F.log_softmax(self.layer_output(x), dim=1)
        return x

This neural network is very similar to the first model; the only change is the 20% dropout applied after each hidden layer. Now let’s train this model!

from torch import optim

# Instantiate the dropout model
model = FashionNeuralNetworkDropout()
# Use Negative Log Likelihood as our loss function
loss_function = nn.NLLLoss()
# Use the Adam optimizer to utilize momentum
optimizer = optim.Adam(model.parameters(), lr=0.003)
# Train the model for 30 cycles
epochs = 30
# Initialize two empty lists to hold the train and test losses
train_losses, test_losses = [], []

# Start the training
for i in range(epochs):
    running_loss = 0
    # Loop through all of the training set, forward and back propagate
    for images, labels in trainloader:
        optimizer.zero_grad()
        log_ps = model(images)
        loss = loss_function(log_ps, labels)
        loss.backward()  # Backpropagate
        optimizer.step()
        running_loss += loss.item()

    # Initialize test loss and accuracy to be 0
    test_loss = 0
    accuracy = 0

    # Turn off the gradients
    with torch.no_grad():
        # Turn on evaluation mode (disables dropout)
        model.eval()
        # Loop through all of the validation set
        for images, labels in testloader:
            log_ps = model(images)
            ps = torch.exp(log_ps)
            test_loss += loss_function(log_ps, labels).item()
            top_p, top_class = ps.topk(1, dim=1)
            equals = top_class == labels.view(*top_class.shape)
            accuracy += torch.mean(equals.type(torch.FloatTensor))

    # Turn training mode (and dropout) back on
    model.train()

    # Append the average losses to the lists for plotting
    train_losses.append(running_loss/len(trainloader))
    test_losses.append(test_loss/len(testloader))

Print the results:

Accuracy increases over time and the model is not overfitting.

The target here is to have the validation loss track the training loss, which indicates the model generalizes well. Even though the accuracy only increases by about 0.3% overall, the model no longer overfits, because dropout keeps the nodes in balance so all of them get trained. Let’s plot the graph and see the difference:

# Plot the graph here
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import matplotlib.pyplot as plt

plt.plot(train_losses, label='Training Loss')
plt.plot(test_losses, label='Validation Loss')
plt.legend(frameon=True)
Overfitting is gone!

Inference

Now that our model generalizes better, let’s feed it an image it was not trained on and visualize the classification.

# Make sure our model is in evaluation mode
model.eval()
# Get the next batch of images and labels
images, labels = next(iter(testloader))
img = images[0]
# Convert the 2D image to a 1D vector
img = img.view(1, 784)
# Calculate the class probabilities (log-softmax) for img
with torch.no_grad():
    output = model.forward(img)
# Normalize the output by taking the exponent of the log-probabilities
ps = torch.exp(output)
# Plot the image and probabilities
helper.view_classify(img.view(1, 28, 28), ps, version='Fashion')
Awesome!

Conclusion

This is great! The training and validation losses now stay in balance. It’s reasonable to expect that training for more cycles and fine-tuning the hyperparameters would bring the validation loss down further. We can see from the graph above that the model generalizes better over time, reaching its best accuracy after 6–8 epochs, and it’s safe to say that adding dropout kept the model from overfitting.

Thank you so much for your time, and please check out this repository for the full code!

This is my Portfolio and LinkedIn profile :)
