PyTorch Zero to GANs Assignment-3

Rishit Chaudhary
9 min read · Jun 12, 2020


Starting this blog, I was honestly unsure how to summarize everything that went down in the past week: my attempts to optimize a feed-forward neural network on the CIFAR-10 dataset led to 82 versions of the notebook (and counting). As you might expect, the amount of experimentation needed to get the best results can be quite taxing.

So let’s have the results, shall we? I got an accuracy of 60% with a loss of 1.31. That’s pretty good for a feed-forward neural network; it’s close to the best such an architecture typically achieves on CIFAR-10.

In this blog, I’m going to try my best to explain everything I learned while working on this assignment. As a bonus, I will also talk about how you can utilize a GPU to speed up your ML training, but more on that at the end.

Regarding the format: each section of this blog answers one of the questions I faced and explains how I solved it.

So let’s begin…

What is this dataset?

The dataset I worked on in this assignment was the CIFAR-10 dataset, which contains RGB pictures of everyday objects and animals. Per the official description:

The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.

Here are 256 normalized images from the CIFAR-10 dataset

The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.

How did you prepare the data?

In the first few versions of the project, I just used PyTorch’s `ToTensor` transformation, which converts PIL images to tensors with pixel values between 0 and 1 for every channel.

Only many versions later did I realize that, to improve the model’s learning ability and make it easier for it to find the important distinguishing features of the images, I should normalize the images and randomly flip them horizontally.

Preparing and Transforming the Data from the Training Set and the Test Set
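Here is a minimal sketch of such a pipeline. The normalization statistics below are the commonly quoted CIFAR-10 channel means and standard deviations, an assumption on my part rather than necessarily the exact values from the notebook:

```python
import torchvision.transforms as tt
from torchvision.datasets import CIFAR10

# Channel-wise mean and std commonly used for CIFAR-10 normalization
# (assumed here; for exact values, compute them from the training set).
stats = ((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616))

train_tfms = tt.Compose([
    tt.RandomHorizontalFlip(),   # flip images left-right at random
    tt.ToTensor(),               # PIL image -> tensor with values in [0, 1]
    tt.Normalize(*stats),        # zero-center and scale each channel
])

test_tfms = tt.Compose([tt.ToTensor(), tt.Normalize(*stats)])

train_ds = CIFAR10(root='data/', train=True, download=True, transform=train_tfms)
test_ds = CIFAR10(root='data/', train=False, download=True, transform=test_tfms)
```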

There were other transformations I played with, such as the ones commented out in the cells above. After a lot of trial and error, I found that they did not improve the result. These transformations were too much for a feed-forward neural network to deal with: rather than helping the model generalize, they mostly got in the way of it recognizing the images! They are better suited to CNNs, though.

There are some important observations I took away from using these transformations that I feel are worth talking about (a combined code sketch of them follows this list):

RandomPerspective():

Performs a perspective transformation of the given PIL image with a given probability. It essentially changes the angle from which we see the image. For easier understanding, imagine you are holding a photo frame in front of your eyes; now tilt the frame side to side and up and down while keeping your head fixed. How you see the photo at these different angles is what RandomPerspective() does to the images.

What the RandomPerspective() transformation does to the images of the dataset

ColorJitter()

This transformation changes the hue, brightness, contrast and other such features of the images.

RandomGrayscale()

This transformation sets all three RGB channels to the same value. It was not very useful: some images that were originally too dark and lacking in contrast were completely ruined by it, reduced to near-black squares with no information left for the model to work on.

An example of what the RandomGrayscale() transformation did to one of the images in the dataset

RandomErasing()

This transformation erases random patches of an image so that the model cannot simply memorize the training images and is instead forced to look for more intricate patterns in the dataset.

Here are some example images of what the RandomErasing() transformation does to the dataset
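For reference, a combined sketch of these experimental transforms; the parameter values are illustrative assumptions, not the exact ones from the notebook. Note that RandomErasing operates on tensors, so it has to come after ToTensor:

```python
import torchvision.transforms as tt

# The augmentations experimented with above; parameter values are
# illustrative assumptions, not the notebook's exact settings.
experimental_tfms = tt.Compose([
    tt.RandomPerspective(distortion_scale=0.5, p=0.5),  # random viewing-angle tilt
    tt.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
    tt.RandomGrayscale(p=0.1),   # collapse RGB channels to one intensity
    tt.ToTensor(),
    tt.RandomErasing(p=0.5),     # blank out a random rectangle (tensor-only)
])
```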

The data looks well prepared and transformed, but show us some examples from the dataset!

So here are some randomly picked examples from the dataset, after the transformations have been applied:

Not really the clearest image of a dog, but the model gets it.

This image is barely recognizable, but we can still stretch our imagination a bit and conclude that it best matches a dog.

There are some images that are modified beyond recognition, for example

This is supposed to be an image of a frog but even with all my imagination, I just can’t see it.

This image of a frog is much clearer than the previous one

This image looks like an ostrich or an emu, but nonetheless a bird for sure

Here is a quick look at the distribution of the dataset, i.e. how many images belong to each class:
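A quick sketch of how such a count can be produced (assuming `train_ds` is the CIFAR10 dataset object from earlier):

```python
from collections import Counter

# Count how many training images fall into each class.
counts = Counter(train_ds.targets)
for idx, name in enumerate(train_ds.classes):
    print(f'{name:10s} {counts[idx]}')
```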

Ok, I understood the dataset, now show me the Model!

The final model was built using multilevel inheritance. What I mean by this is that I first made a general base class by inheriting from PyTorch’s torch.nn.Module class; this class is called ImageClassificationBase. I then inherited from it to make the final class used to instantiate the model trained on the dataset, aptly called CIFAR10Model.

If you’re wondering, here is the code for the two classes:

The accuracy function measures the accuracy of the model against the validation set or the test set, depending on its arguments. It is used quite a lot in the two classes, so it is worth showing here.
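A sketch of such an accuracy helper, in the style used throughout the course notebooks:

```python
import torch

def accuracy(outputs, labels):
    # Take the index of the highest score in each row as the prediction,
    # then compute the fraction of predictions that match the labels.
    _, preds = torch.max(outputs, dim=1)
    return torch.tensor(torch.sum(preds == labels).item() / len(preds))
```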
The parent class ImageClassificationBase
The child class used to instantiate the model aptly called the CIFAR10Model
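And a sketch of the two classes as described, building on the accuracy helper above. The layer constants and exact architecture details are assumptions based on the description in this post:

```python
import torch.nn as nn
import torch.nn.functional as F

# Layer sizes kept in one place (the 'Single Source of Truth' discussed
# below); these values are the small 4-layer configuration mentioned later.
LAYER_1, LAYER_2, LAYER_3, LAYER_4 = 256, 128, 128, 64

class ImageClassificationBase(nn.Module):
    def training_step(self, batch):
        images, labels = batch
        out = self(images)                   # forward pass
        return F.cross_entropy(out, labels)  # training loss

    def validation_step(self, batch):
        images, labels = batch
        out = self(images)
        loss = F.cross_entropy(out, labels)
        acc = accuracy(out, labels)          # helper shown above
        return {'val_loss': loss.detach(), 'val_acc': acc}

class CIFAR10Model(ImageClassificationBase):
    def __init__(self):
        super().__init__()
        self.network = nn.Sequential(
            nn.Flatten(),                    # 3x32x32 image -> 3072-vector
            nn.Linear(3 * 32 * 32, LAYER_1), nn.ReLU(),
            nn.Linear(LAYER_1, LAYER_2), nn.ReLU(),
            nn.Linear(LAYER_2, LAYER_3), nn.ReLU(),
            nn.Linear(LAYER_3, LAYER_4), nn.ReLU(),
            nn.Linear(LAYER_4, 10),          # 10 output classes
        )

    def forward(self, xb):
        return self.network(xb)
```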

Also just for reference below is the diagram for multilevel inheritance:

A diagram of Multilevel inheritance, in case you forgot

Here the Base Class is torch.nn.Module, the Intermediary Class is ImageClassificationBase and the Derived Class is CIFAR10Model

Hey! Where are the layer values? I don’t understand LAYER_1, LAYER_2, etc.

For rapid prototyping, I kept the layer sizes in constants outside the classes so that there is always a ‘Single Source of Truth’ for the values. This helps prevent unneeded confusion and hair-pulling when you forget to change a value in one of several places.

To throw in some insights: initially I also used a large model, like the one behind my final results, but I had no real reason to; the choice was completely arbitrary. I looked through almost every other notebook to see what values others thought were right, and noticed everyone used similarly large values. So I challenged myself to find the smallest combination of layers that could still give good results, and by my findings, the smallest you can go without losing much accuracy is four hidden layers of sizes (256, 128, 128, 64). This gave a test-set accuracy of 55.14% and a loss of 1.47796; not bad, right? Any smaller than this, though, and you slowly but steadily start to lose accuracy while your loss creeps up.

Also, when going for bigger models, keep this in mind: the bigger the model, the more it can memorize, and if it finds memorizing easier it will simply remember your training set instead of learning anything general from it. Make sure your model size is not too big, and preprocess your training data so that the model does not find memorizing it worthwhile.

To share some insights on the activation functions: I tried pretty much every activation function in the PyTorch library, and even added dropout layers to the model, but that did not help much; dropout was overkill for a feed-forward NN of this size. In terms of activation functions, ReLU is pretty much the best one for the job. There are also research papers describing it as quite close to how neurons in the human brain are activated.

The graph of the ReLU function
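In code, ReLU simply clamps negative activations to zero:

```python
import torch

relu = torch.nn.ReLU()
x = torch.tensor([-2.0, -0.5, 0.0, 1.5])
print(relu(x))  # tensor([0.0000, 0.0000, 0.0000, 1.5000]), i.e. max(0, x) element-wise
```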

Ugh, My head hurts with all this data, show me some graphs of the results!

So here is a graph of the losses as the epochs went by. An interesting note: as the accuracy improved and stabilized, the loss only went one way, and that was up! I found that quite interesting behaviour. I tried stopping training when the loss was lowest and then applying the model to the test set, but that was not very useful; the model did not perform well. I guess accuracy is the better judgement factor here.

Graph of Loss Vs No. of epochs

Graph of Accuracy Vs No. of epochs
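Plots like these take only a few lines of matplotlib, assuming `history` is a list of per-epoch metric dicts like the fit() sketch in the next section returns:

```python
import matplotlib.pyplot as plt

# Assumes `history` is a list of per-epoch dicts with
# 'val_loss' and 'val_acc' keys.
plt.plot([h['val_loss'] for h in history], '-x')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.title('Loss vs. No. of epochs')
plt.show()

plt.plot([h['val_acc'] for h in history], '-o')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.title('Accuracy vs. No. of epochs')
plt.show()
```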

Nice, so what were the learning rates you used, and how many epochs for each?

Learning rates that were useful (not commented out) and the ones that were not (commented out)

After trying a lot of different learning rates, sometimes descending too slowly towards the local minimum and sometimes jumping right over it, the combination of 1e-1 for 20 epochs followed by 5e-4 for 60 epochs turned out to be ideal. The first learning rate moves quickly towards the local minimum, and the second slowly descends into it to an optimal spot.
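A minimal sketch of this two-phase schedule, assuming the model and data loaders from earlier (names are assumptions); the notebook’s actual fit() also prints the metrics after each epoch:

```python
import torch

def evaluate(model, loader):
    # Average the validation metrics over all batches.
    with torch.no_grad():
        outputs = [model.validation_step(batch) for batch in loader]
    val_loss = torch.stack([o['val_loss'] for o in outputs]).mean().item()
    val_acc = torch.stack([o['val_acc'] for o in outputs]).mean().item()
    return {'val_loss': val_loss, 'val_acc': val_acc}

def fit(epochs, lr, model, train_loader, val_loader):
    # Minimal SGD training loop; a sketch, not the notebook's exact code.
    history = []
    optimizer = torch.optim.SGD(model.parameters(), lr)
    for epoch in range(epochs):
        for batch in train_loader:
            loss = model.training_step(batch)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        history.append(evaluate(model, val_loader))
    return history

# The two-phase schedule described above:
history = fit(20, 1e-1, model, train_loader, val_loader)   # fast descent toward the minimum
history += fit(60, 5e-4, model, train_loader, val_loader)  # fine-grained settling into it
```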

For the curious ones out there, here are the results of the epochs as they went by:

Results of the epochs as they went by

Many times I was just staring at my screen, waiting for the results of the next epoch to see whether the learning rates were good enough, whether I was descending into a local minimum or missing it completely by jumping over it.

This was the most painful part of the process: training and testing take a lot of time, and it gets taxing waiting for the values to come in so you can analyze them, make the necessary corrections, and run the model again to see whether the changes yield a better result, or at least match the previous runs.

So, what was your result on the test set?

A test-set accuracy of 60% with a loss of 1.31. I’m really proud of this result, especially as I used a pure feed-forward neural network; this is supposed to be about the highest you can get with such a network, and I am really happy I reached it.
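Evaluating on the test set is just the evaluate() helper from the training sketch applied to a test loader (a hypothetical name here, assumed to wrap the test dataset from earlier):

```python
# Evaluate the trained model on the held-out test set.
result = evaluate(model, test_loader)
print(result)  # dict with 'val_loss' and 'val_acc' keys
```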

Cool, now share some more insights!

To train your model faster, it is better to use a GPU rather than a CPU, as GPUs are built for fast matrix calculations, which is most of what we do in machine learning.

First, we check if the GPU is available

The variable device is assigned to the available device.
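A sketch of the device-selection step:

```python
import torch

def get_default_device():
    # Prefer the GPU when one is available, otherwise fall back to the CPU.
    return torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

device = get_default_device()
```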

The class DeviceDataLoader takes the existing dataloader and moves it to the GPU

The DeviceDataLoader moves the contents of the data loader, i.e. the tensors, to the GPU recursively.

Passing the dataloaders to the DeviceDataLoader class and instantiating the objects we will use in training the model.
Moving the model to the GPU
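A sketch of these pieces together; to_device and the wrapping of the loaders follow the pattern described in the captions above (the loader names are assumptions):

```python
def to_device(data, device):
    # Recursively move tensors (or lists/tuples of tensors) to the device.
    if isinstance(data, (list, tuple)):
        return [to_device(x, device) for x in data]
    return data.to(device, non_blocking=True)

class DeviceDataLoader:
    """Wraps a DataLoader and moves each batch to the device on the fly."""
    def __init__(self, dl, device):
        self.dl = dl
        self.device = device

    def __iter__(self):
        for batch in self.dl:
            yield to_device(batch, self.device)

    def __len__(self):
        return len(self.dl)

# Wrap the existing loaders and move the model's parameters to the GPU.
train_loader = DeviceDataLoader(train_loader, device)
val_loader = DeviceDataLoader(val_loader, device)
model = to_device(model, device)
```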

Conclusion

At the end of this whirlwind tour, reliving the experience of working on the assignment while writing this blog, I feel really happy that I tried quite a few things, learnt a lot through repeated experimentation, and got to share all of it here.

Thanks for reading, see you in the next one!

References

My Notebook hosted on Jovian: https://jovian.ml/rishit-c-rc/kernel4e8ac24d6c-c63b0

The records of all the results of the different model designs I used: https://jovian.ml/rishit-c-rc/03-cifar10-feedforward/compare

The link to the forum containing the links to the notebooks and blogs of all the other participants of the PyTorch Course, who all inspired me to push myself to make the best possible model for the assignment I could: https://jovian.ml/forum/t/share-your-work-assignment-3/5706
