Introduction to image classification with PyTorch (CIFAR10)

Dev Bhartra · The Startup · Jun 12, 2020

Image classification is a fundamental problem that is trivial for the human brain, yet a seemingly impossible task for a computer. But with the right techniques, it can be done quite easily!

The aim of this article is to give you a brief summary of how to get started with any image classification task with the help of PyTorch. I have gone with a fairly simple linear layer architecture so that the focus is on the broad idea and not on specifics such as a convolutional neural network.

OK, so let's get started.

I will assume you have PyTorch already set up. If you wish to carry out this task on your local setup (NVIDIA CUDA-supported GPUs), this link should help you set it up: https://pytorch.org/get-started/locally/. If you want to run this on the free compute provided by Kaggle or Google Colab, which is what I prefer, then you are welcome to do so. They are fairly easy to set up, with an import torch being all you need. OK, let's get to the fun bit.

The ability to try many different neural network architectures on a problem is what makes deep learning really powerful, especially compared to shallow learning techniques like linear regression, logistic regression, etc. In this tutorial we will experiment with a linear network, leaving convolutional setups for a later article.

The dataset: CIFAR10

The CIFAR-10 dataset consists of 60000 32x32 color images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.

The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.

Importing the libraries needed:
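The original notebook cell isn't reproduced here, but a sketch of the imports that everything below relies on would look roughly like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
from torch.utils.data import random_split, DataLoader
from torchvision.datasets import CIFAR10
from torchvision.transforms import ToTensor
from torchvision.utils import make_grid
```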

These will all help us along the way for various purposes.

Now let’s get the dataset we have been talking about:
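torchvision ships CIFAR10 as a built-in dataset class; something like the following (the root='data/' download path is just an illustrative choice):

```python
# Download the training split and load the test split as tensors
dataset = CIFAR10(root='data/', download=True, transform=ToTensor())
test_dataset = CIFAR10(root='data/', train=False, transform=ToTensor())
```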

The test_dataset variable contains only the testing part of the dataset, which we downloaded in the line above. We get this split by passing train=False.

Let us first understand this dataset which we have just downloaded. First we shall see how many images we are dealing with here:
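Something as simple as:

```python
len(dataset)
# 50000
```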

50000 images, as we have read above

How about the test partition of our dataset?
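Similarly:

```python
len(test_dataset)
# 10000
```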

10000 images, for a total of 60000 images in the whole dataset

Now let’s have a look at the classes that are present in the dataset. These are the types of the images that are present, or the labels given to the images.
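The class names are stored on the dataset object itself:

```python
dataset.classes
# ['airplane', 'automobile', 'bird', 'cat', 'deer',
#  'dog', 'frog', 'horse', 'ship', 'truck']
```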

There are a total of 10 classes

Understanding how each instance in the dataset is represented is also of great importance, in order to understand how to manipulate the tensors later in the code.
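Indexing into the dataset gives us an (image, label) pair:

```python
img, label = dataset[0]
img.shape
# torch.Size([3, 32, 32])
```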

Here 3 stands for the channels in the image: R, G and B. 32 x 32 are the dimensions of each individual image, in pixels

matplotlib expects channels to be the last dimension of the image tensor (whereas in PyTorch they are the first dimension), so we'll use the .permute tensor method to shift channels to the last dimension. Let's also print the label for the image.
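A sketch of that:

```python
img, label = dataset[0]
plt.imshow(img.permute(1, 2, 0))  # (C, H, W) -> (H, W, C) for matplotlib
print('Label:', dataset.classes[label])
```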

Preparing the data for training

We’ll use a validation set with 5000 images (10% of the dataset). To ensure we get the same validation set each time, we’ll set PyTorch’s random number generator to a seed value of 43.

Let's use the random_split method to create the training & validation sets. The split is random, which is exactly why we fixed the seed above; without it, you would get a different validation set on every run.
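A sketch, with the seed set just before the split:

```python
torch.manual_seed(43)
val_size = 5000
train_size = len(dataset) - val_size

train_ds, val_ds = random_split(dataset, [train_size, val_size])
len(train_ds), len(val_ds)
# (45000, 5000)
```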

We will set the batch size as 128.

We can now use DataLoader to load the data from the datasets in batches of the size defined above. The parameters are fairly self-explanatory.
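Roughly like this; the num_workers and pin_memory settings, the doubled batch size for validation, and the test_loader (added here so we can evaluate at the end) are my assumptions:

```python
batch_size = 128

train_loader = DataLoader(train_ds, batch_size, shuffle=True,
                          num_workers=4, pin_memory=True)
val_loader = DataLoader(val_ds, batch_size * 2,
                        num_workers=4, pin_memory=True)
test_loader = DataLoader(test_dataset, batch_size * 2,
                         num_workers=4, pin_memory=True)
```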

Now we can visualize the data:
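For instance, by grabbing one batch and tiling it into a grid:

```python
for images, _ in train_loader:
    fig, ax = plt.subplots(figsize=(12, 6))
    ax.set_xticks([]); ax.set_yticks([])
    ax.imshow(make_grid(images, nrow=16).permute(1, 2, 0))
    break
```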

make_grid is a helper function from torchvision.

Clearly, a resolution of 32x32 pixels is not ideal for identifying the contents of an image with the naked eye, which gives us an idea of the difficulty of the task at hand.

Now we can create a base model class:

This class will contain everything except the model architecture, i.e. it will not contain the __init__ and forward methods. We will later extend this class to try out different architectures. In fact, you can extend this class to solve any image classification problem.
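A sketch of what such a class can look like; the name ImageClassificationBase is illustrative, and it relies on the accuracy helper defined just below:

```python
class ImageClassificationBase(nn.Module):
    def training_step(self, batch):
        images, labels = batch
        out = self(images)                   # forward pass
        loss = F.cross_entropy(out, labels)  # compute loss
        return loss

    def validation_step(self, batch):
        images, labels = batch
        out = self(images)
        loss = F.cross_entropy(out, labels)
        acc = accuracy(out, labels)          # helper defined below
        return {'val_loss': loss.detach(), 'val_acc': acc}

    def validation_epoch_end(self, outputs):
        batch_losses = [x['val_loss'] for x in outputs]
        batch_accs = [x['val_acc'] for x in outputs]
        return {'val_loss': torch.stack(batch_losses).mean().item(),
                'val_acc': torch.stack(batch_accs).mean().item()}

    def epoch_end(self, epoch, result):
        print("Epoch [{}], val_loss: {:.4f}, val_acc: {:.4f}".format(
            epoch, result['val_loss'], result['val_acc']))
```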

Next, a simple function that helps us determine the accuracy of the predictions by comparing each prediction with the actual label assigned to that image:
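```python
def accuracy(outputs, labels):
    # Take the class with the highest score as the prediction
    _, preds = torch.max(outputs, dim=1)
    return torch.tensor(torch.sum(preds == labels).item() / len(preds))
```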

We have written 4 functions above, which we will be using later in the code. The first 2 help us calculate the loss of the model at each stage: during training and during evaluation on the validation dataset, respectively. We are using cross_entropy to measure the loss. It is also often called log loss. It measures the performance of a classification model whose output is a probability value between 0 and 1, and it increases as the predicted probability diverges from the actual label. There are other loss functions, such as L1 loss, smooth L1 loss, etc., but for classification, cross-entropy is the standard choice.

A question that you may have at this point: what does .detach() do? When we compute gradients automatically with loss.backward(), as we will in the fit() function later in the code, the loss tensor holds a reference to the entire computation graph up to that point. We calculate the loss in batches, so at each step the previous intermediate variables are no longer of any use; without the detach, they would only take up memory, which would eventually run out.

The last function is just used to print the loss and accuracy at the end of each epoch.

Now we define two more functions:

evaluate calls the validation_step() on each batch and returns the output.

fit is a really important function here. It is the function that performs the training; after a model is trained, it can be used to make predictions. The optimizer used in this function is Stochastic Gradient Descent (SGD). It performs better than vanilla gradient descent and is the go-to optimization technique when dealing with simple neural networks such as this one. As mentioned above, we can see the loss.backward() call being made here: this is where the 'learning' of the weights happens. It continues for the number of epochs we specify, with the weights changing based on the learning rate we specify, and it takes place in the inner for loop. Then the same is carried out on the validation dataset. We keep track of this learning in a variable called history.
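A sketch of both functions; the opt_func keyword argument is an assumption that makes it easy to swap optimizers later:

```python
def evaluate(model, val_loader):
    outputs = [model.validation_step(batch) for batch in val_loader]
    return model.validation_epoch_end(outputs)

def fit(epochs, lr, model, train_loader, val_loader, opt_func=torch.optim.SGD):
    history = []
    optimizer = opt_func(model.parameters(), lr)
    for epoch in range(epochs):
        # Training phase: the inner loop where the weights are updated
        for batch in train_loader:
            loss = model.training_step(batch)
            loss.backward()        # compute gradients
            optimizer.step()       # update weights
            optimizer.zero_grad()  # reset gradients for the next batch
        # Validation phase
        result = evaluate(model, val_loader)
        model.epoch_end(epoch, result)
        history.append(result)
    return history
```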

Moving to GPUs

This code can be run entirely on a CPU, but in case you have a GPU lying around, this is its time to shine. GPUs have an edge over CPUs in parallel computations, and are very useful for deep learning problems such as this one.
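First, check whether torch can see the GPU:

```python
torch.cuda.is_available()
# True
```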

True tells us that torch was able to find the GPU. If you see False here and have a CUDA-capable GPU, you might want to look into your drivers. If you are using Kaggle, set your accelerator to GPU; in Colab, change the runtime type to GPU.

Now we move all the tensors to the GPU:
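Two small helpers along these lines do the job; they fall back to the CPU if no GPU is found:

```python
def get_default_device():
    """Pick the GPU if available, else the CPU."""
    return torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

def to_device(data, device):
    """Move tensor(s) to the chosen device, recursing into lists/tuples."""
    if isinstance(data, (list, tuple)):
        return [to_device(x, device) for x in data]
    return data.to(device, non_blocking=True)

device = get_default_device()
```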

While we are at it, let us also define some helper functions to get a visual representation of our loss and accuracy of the predictions as we go over the epochs.
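For example:

```python
def plot_losses(history):
    losses = [x['val_loss'] for x in history]
    plt.plot(losses, '-x')
    plt.xlabel('epoch')
    plt.ylabel('loss')
    plt.title('Loss vs. No. of epochs')

def plot_accuracies(history):
    accuracies = [x['val_acc'] for x in history]
    plt.plot(accuracies, '-x')
    plt.xlabel('epoch')
    plt.ylabel('accuracy')
    plt.title('Accuracy vs. No. of epochs')
```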

Moving the data to the device:
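A common pattern is a thin wrapper that moves each batch to the device as it is yielded; the name DeviceDataLoader is illustrative:

```python
class DeviceDataLoader():
    """Wrap a DataLoader to move each batch to the device on the fly."""
    def __init__(self, dl, device):
        self.dl = dl
        self.device = device

    def __iter__(self):
        for batch in self.dl:
            yield to_device(batch, self.device)

    def __len__(self):
        return len(self.dl)

train_loader = DeviceDataLoader(train_loader, device)
val_loader = DeviceDataLoader(val_loader, device)
test_loader = DeviceDataLoader(test_loader, device)
```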

Training the model

So we have made a model. But does it know anything now? No. We need it to learn. Let’s train it.

Remember the dimensions of each image we checked in the beginning? Let’s define it in a variable. Also remember there were 10 output classes? That is the output size of our network.
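```python
input_size = 3 * 32 * 32  # 3 color channels, 32x32 pixels each
output_size = 10          # one score per class
```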

Here is the architecture of our model; we are going ahead with linear layers. A linear layer without a bias is capable of learning an average rate of correlation between the output and the input: for instance, if x and y are positively correlated, w will be positive; if x and y are negatively correlated, w will be negative; if x and y are totally independent, w will be around 0. Source: https://medium.com/datathings/linear-layers-explained-in-a-simple-way-2319a9c2d1aa

Here we have a basic neural network with 3 hidden layers of 256, 128 and 64 neurons. This is the architecture I achieved my best accuracy with after trying several variations, but there is almost certainly a combination that will give you better accuracy than mine. Experiment and try what works best for you.
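A sketch of the model, extending the base class from earlier; the class name CIFAR10Model is illustrative:

```python
class CIFAR10Model(ImageClassificationBase):
    def __init__(self):
        super().__init__()
        self.linear1 = nn.Linear(input_size, 256)
        self.linear2 = nn.Linear(256, 128)
        self.linear3 = nn.Linear(128, 64)
        self.linear4 = nn.Linear(64, output_size)

    def forward(self, xb):
        out = xb.view(xb.size(0), -1)  # flatten each image into a vector
        out = F.relu(self.linear1(out))
        out = F.relu(self.linear2(out))
        out = F.relu(self.linear3(out))
        return self.linear4(out)       # raw scores; cross_entropy applies softmax
```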

Keep in mind we have used linear layers here as opposed to convolutional layers, which might perform better, or pre-trained models such as ResNet, with the aim of keeping this article beginner friendly. I will cover convolutional layers for sure in a later article!

Moving the model to the device:
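```python
model = to_device(CIFAR10Model(), device)
```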

Before you train the model, it’s a good idea to check the validation loss & accuracy with the initial set of weights.
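Something along these lines; the exact numbers depend on the random initialization:

```python
history = [evaluate(model, val_loader)]
history
# e.g. [{'val_loss': 2.30..., 'val_acc': 0.09...}]
```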

The model is making guesses based on the initial random weights here, without any sense of learning, and hence the accuracy is terrible (9%).

Now we use the previously defined fit() function to train the model.
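```python
history += fit(10, 0.1, model, train_loader, val_loader)
```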

Here, 10 is the number of epochs, and 0.1 is the learning rate for these epochs. We can see the accuracy gradually increasing.

Using the graphing function we defined above, we can see how the accuracy of the model increased as the training continued.
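```python
plot_accuracies(history)
```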

Evaluating the model’s final performance:
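With the test_loader sketched earlier, this is a one-liner; the numbers shown are placeholders in line with the 51% reported below:

```python
evaluate(model, test_loader)
# e.g. {'val_loss': 1.36..., 'val_acc': 0.51...}
```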

And so we come to the close of this tutorial. My model reached an accuracy of 51% with so few epochs and only linear layers. This can definitely be improved upon, as we will see in my next post.

Credit to Akash NS from whom I have borrowed parts of this code. Check out his channel at https://www.youtube.com/channel/UCEkIfTA9fTlly9bq5Hg-uzg

Thanks for reading, and do point out any mistakes if you find them!
