Trials, errors and trade-offs in my deep learning model

Hong Min
Udacity PyTorch Challengers
6 min read · Jan 18, 2019

In this post, I will briefly explain the 'Bias-Variance Tradeoff', 'Regularization', and 'Learning rate decay', and tell you about my trials and errors in improving the performance of my deep learning model, including the reasoning behind each step and the PyTorch code. I trained a classifier on the Oxford flowers dataset for the final lab challenge of the PyTorch Scholarship Challenge from Facebook.

I hope it will be helpful when you start training your first model, or when you are struggling to improve one!

Bias-Variance Tradeoff

Let’s start with the meaning of bias and variance.

model complexity spectrum, from high bias (left) to high variance (right)

Bias is an error from wrong assumptions. High bias can cause the model to miss relevant relations between the features and the target. In short, high bias shows up when the model is underfitted.

Variance is an error from sensitivity to small fluctuations in the training set. High variance is related to overfitting: overfitting happens when a highly flexible model fits the random noise in the training dataset.

overfitting after the yellow marker; blue line = training loss, red line = validation loss

What we want to get is low bias and low variance.

But there is a bias-variance tradeoff: models with lower bias tend to have higher variance across samples, and vice versa.
Take a look at the graphic above that represents overfitting. The model fits the training data almost perfectly (low bias on the training set), but not the validation data, so it does not generalize well (high variance).

Variance gets lower with
a bigger training dataset (data augmentation is one way to increase the amount of data; see the sketch below), dimensionality reduction, and feature selection (regularization can help with this).

Bias gets lower as features are added.
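
For example, with image data you can increase the effective amount of training data by applying random transformations. A minimal sketch using torchvision (the specific transforms and sizes here are assumptions, not the settings used later in this post):

from torchvision import transforms

# Randomly rotate, crop, and flip training images so the model sees
# a slightly different version of each image in every epoch.
train_transforms = transforms.Compose([
    transforms.RandomRotation(30),
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225]),
])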

Regularization methods

Regularization is the process of adding extra information, typically a penalty on the weights, in order to prevent overfitting and improve generalization.

Early stopping

where to stop?

: A training procedure like gradient descent will tend to learn more and more complex functions as the number of iterations increases. By regularizing on time, the complexity of the model can be controlled. In practice, just terminate training when the validation error has increased repeatedly, for example N times in a row (the patience).
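
A minimal early-stopping sketch in PyTorch (the helpers train_one_epoch() and validate() and the numbers are assumptions, just to show the pattern):

import torch

max_epochs, patience = 50, 5               # assumed budget and patience
best_val_loss, bad_epochs = float('inf'), 0
for epoch in range(max_epochs):
    train_one_epoch(model)                 # assumed training helper
    val_loss = validate(model)             # assumed validation helper
    if val_loss < best_val_loss:
        best_val_loss, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), 'best_model.pth')  # keep the best weights
    else:
        bad_epochs += 1
        if bad_epochs >= patience:         # validation error kept increasing
            break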

L1 regularization vs. L2 regularization

L1 regularization

: Penalize the loss function with the L1 norm of the weights (the sum of their absolute values).
It drives the weights of unimportant features toward zero first, so it has the effect of variable selection.

# Creates a criterion that measures the mean absolute error (MAE)
# between the input and the target, i.e. an L1 loss.
criterion = torch.nn.L1Loss()
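
Note that L1Loss penalizes the prediction error, not the weights themselves. To apply L1 regularization to the model's weights, a common pattern is to add the penalty term to the loss by hand. A minimal sketch (model, criterion, output, and target are assumed to exist already, and the strength is a placeholder):

l1_lambda = 1e-5   # assumed regularization strength; tune it for your problem
l1_penalty = sum(p.abs().sum() for p in model.parameters())
loss = criterion(output, target) + l1_lambda * l1_penalty
loss.backward()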

L2 regularization

: Penalize the loss function with the square of the L2 norm of the weights (the sum of their squares).
It shrinks large (outlier) weights first, which improves the generalization ability of the model.

# Creates a criterion that measures the mean squared error (squared L2 norm)
# between the input and the target.
criterion = torch.nn.MSELoss()
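
Likewise, MSELoss measures the prediction error rather than penalizing the weights. In PyTorch, L2 regularization of the weights is usually applied through the weight_decay argument of the optimizer; a minimal sketch (the learning rate and decay strength are placeholder values):

from torch import optim

# weight_decay adds an L2 penalty on the weights at every update step
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)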

Dropout


: Randomly deactivate units during training.
In practice, add it after the layer that you want to regularize.

dropout1 = torch.nn.Dropout(p=0.5) 
# p is the probability of an element being zeroed.
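
Keep in mind that dropout is only active in training mode, so switch the model between modes in your training and validation loops:

model.train()   # dropout layers are active while training
model.eval()    # dropout layers are disabled for validation/inference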

Batch normalization

: Normalize the activations of a layer over each mini-batch.
In practice, add it after the layer that you want to regularize.

#  Applies Batch Normalization over a 2D or 3D input
bn_1d = nn.BatchNorm1d(num_features)
# Applies Batch Normalization over a 4D input
bn_2d = nn.BatchNorm2d(num_features)
# Applies Batch Normalization over a 5D input
bn_3d = nn.BatchNorm3d(num_features)
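
For example, a BatchNorm1d layer could be placed after a fully connected layer in a classifier head like this (a sketch only; the layer sizes are assumptions and this is not the classifier used later in this post):

from torch import nn

classifier = nn.Sequential(
    nn.Linear(1024, 512),
    nn.BatchNorm1d(512),   # normalizes the 512 activations over each mini-batch
    nn.ReLU(),
    nn.Linear(512, 102),
)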

Learning rate decay

If the learning rate is too small, convergence is very slow, and training may not reach the optimum in a reasonable time.
If the learning rate is too big, the optimizer overshoots and the loss gets worse.

Slowly reducing the learning rate over time is called learning rate decay. It is helpful when your algorithm heads towards the optimal minimum but ends up wandering around it and never exactly converging.

learning rate decay functions for linear cosine decay, stepwise decay, and cosine decay (from the Google blog)
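
PyTorch ships several of these schedules in torch.optim.lr_scheduler. A minimal sketch of stepwise and cosine decay (the optimizer, learning rate, and schedule lengths are placeholders):

from torch import optim

optimizer = optim.Adam(model.parameters(), lr=0.01)
# stepwise decay: multiply the learning rate by 0.1 every 10 epochs
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
# alternative, cosine decay over 50 epochs:
# scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
# call scheduler.step() once per epoch in the training loop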

Trials and Errors

  • I worked on the flower dataset from the Oxford Visual Geometry Group, with 102 different species (i.e., 102 target classes).
  • There are 6,552 training images and 818 validation images.
  • I checked the training loss and validation loss while training the model.
  • You can get my whole code for training here.

1. To build an image classifier, we usually use pretrained models. You can load a pretrained network here. I started training with a DenseNet pretrained on the ImageNet dataset.

from torchvision import models
model = models.densenet121(pretrained=True)

2. I froze the parameters of the feature extraction part of the model (the convolutional layers).

for param in model.parameters():
    param.requires_grad = False

Then I defined a new architecture for the classifier part and trained it.
To decide on the architecture, I experimented by training each candidate for around 10 epochs.
Just as I expected, the result was quite bad when the classifier consisted of only one fully connected layer: without an activation function it cannot capture any non-linearity. So I connected two fully connected layers with an activation function in between.

from collections import OrderedDict
from torch import nn

classifier = nn.Sequential(OrderedDict([
    ('fc1', nn.Linear(1024, 512)),
    ('relu', nn.ReLU()),
    ('dropout', nn.Dropout(0.3)),
    ('fc2', nn.Linear(512, 102))  # one output per flower class
]))

model.classifier = classifier

3. Training loss and validation loss kept decreasing, but slowly. I thought the model was close to underfitting, so I tried the things below.

3–1. I adjusted the dropout rate and even removed the dropout layer. Performance improved only marginally.
3–2. I kept the dropout layer and tried learning rate decay.
I applied a step function that multiplies the learning rate by a gamma of 0.1 every 10 epochs, and set the initial learning rate higher than before. I expected it to converge quickly toward the optimal minimum, taking large steps at the beginning and small steps at the end. As a result, training was a little faster at the very beginning, but the loss soon stopped falling.

from torch import optim

optimizer = optim.Adam(model.classifier.parameters(), lr=0.01)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
# Don't forget to call scheduler.step() once per epoch in your training loop,
# in addition to optimizer.step() for each batch.

4. The learning rate I tried in 3–2 wasn't overshooting like the yellow line, but it was still too high, like the green one.

when your learning rate is …

So I reduced the learning rate and used AMSGrad, a variant of the Adam optimizer. It keeps a long-term memory of past gradients when optimizing.

optimizer = optim.Adam(model.classifier.parameters(), lr=0.0001, amsgrad=True)

Then I got above 90% accuracy at 11 epochs. After 11 epochs the loss kept dropping, but its rate of decrease shrank and the oscillation increased.

My model was saved at epoch 26, with its best validation accuracy of 94.99%.
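
A minimal sketch of how such a checkpoint can be saved and restored (the file name and the accuracy bookkeeping are assumptions, not the exact code I used):

# save the weights whenever the validation accuracy improves
if val_acc > best_acc:
    best_acc = val_acc
    torch.save(model.state_dict(), 'best_checkpoint.pth')

# later: rebuild the same architecture and load the saved weights
model.load_state_dict(torch.load('best_checkpoint.pth'))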

5. Finally, I unfroze the feature extractor part!

for param in model.parameters():
    param.requires_grad = True

After unfreezing, one epoch of training took far more time than before, because the number of trainable parameters increased.

around 95% at epoch 41.

Anyway, the loss started to decrease again. But the model stopped improving after epoch 41, so I stopped the training.

Still, it is true that there are ways to improve the performance of the model further.
Here are some resources for further reading.

General

Hyperparameter optimization

How about starting to train your own model?

I hope it was helpful for you.
Good luck with your training, and your study!
