How to Build a Landscape Photo Classifier in PyTorch | CNNs, ResNet and Transfer Learning
If you have ever used the Google Photos app, you might have seen the automatic labeling the AI performs to group your photos under categories such as cars, people, cats, etc. Here is a screenshot from my Google Photos library:
The Google algorithm has automatically labelled these images under various predefined categories. Now let’s build our own version of such an image classifier from scratch!
The Dataset
For a machine to learn, it needs data. For this problem I went ahead with the Intel Images dataset, from a competition hosted by Analytics Vidhya and Intel. The dataset can be found here: https://www.kaggle.com/puneet6060/intel-image-classification. It contains images of natural scenes from around the world, the kind of images one generally finds in a person's photo library. You can work with a dataset of your choice as well; the procedure remains the same.
This dataset contains around 25k images of size 150x150 distributed under 6 categories. Let us explore the dataset a bit more.
Before going ahead with data exploration, we first import the necessary libraries. If you do not have a PyTorch setup, I recommend reading this article before continuing with this tutorial. Alternatively, you can use an online cloud compute platform such as Kaggle or Google Colab.
Now we can get a better look at the dataset that we are dealing with.
As we can see the data folder contains 3 sub-directories, one each for training, testing and making predictions. For this tutorial, I will be using the entire test set for validation.
Before we load the images into their respective data loaders, we need to define some image transforms, for the purpose of data augmentation and to reduce over-fitting.
Let us now define custom datasets, while performing the transforms defined above.
As we can see, the training dataset has more than 14,000 images, and the test set has 3,000 images. While more data is definitely preferable, this will suffice, as long as the number of categories isn’t too high.
To load the images from the datasets into PyTorch dataloaders, we need to define a batch size.
It is important to set shuffle=True for the training loader so that the batches are composed differently in every epoch; seeing the samples in the same fixed order each time can hurt generalization and encourage over-fitting. num_workers allows the images to be loaded in parallel. If you load your samples in the Dataset on the CPU and push them to the GPU during training, you can speed up the host-to-device transfer by enabling pin_memory. This lets your DataLoader allocate the samples in page-locked memory, which speeds up the transfer (source). Needless to say, the images are loaded in batches of the size defined above.
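The options discussed above can be put together roughly as follows. This is a sketch, not the exact cell from the original post: the helper name `make_loaders`, the batch size of 64, and the worker count are assumed values that you should tune for your hardware.

```python
import torch
from torch.utils.data import DataLoader


def make_loaders(train_ds, val_ds, batch_size=64, num_workers=2):
    """Wrap two datasets in DataLoaders (batch size and workers are assumed defaults)."""
    # Page-locked memory only helps when batches are later copied to a GPU.
    pin = torch.cuda.is_available()
    train_dl = DataLoader(train_ds, batch_size=batch_size, shuffle=True,
                          num_workers=num_workers, pin_memory=pin)
    # No shuffling is needed for validation; the order does not affect the metrics.
    val_dl = DataLoader(val_ds, batch_size=batch_size, shuffle=False,
                        num_workers=num_workers, pin_memory=pin)
    return train_dl, val_dl
```

Only the training loader is shuffled, and `pin_memory` is switched on only when a CUDA device is actually present.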
Normalizing the images across the RGB channels helps in most cases, but it didn’t make much of a difference for me, so I have commented it out in the transforms code above. The code for finding the per-channel mean and standard deviation of the data is as follows:
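One way to compute these statistics is to accumulate per-channel sums over the whole loader. The sketch below assumes the loader yields `(images, labels)` batches with images shaped `(batch, channels, height, width)`; the function name `channel_mean_std` is made up for this example.

```python
import torch


def channel_mean_std(loader):
    """Per-channel mean and standard deviation over every pixel in the loader."""
    n_pixels = 0
    total = None
    total_sq = None
    for images, _ in loader:
        images = images.double()  # accumulate in float64 to limit rounding error
        b, c, h, w = images.shape
        if total is None:
            total = torch.zeros(c, dtype=torch.float64)
            total_sq = torch.zeros(c, dtype=torch.float64)
        total += images.sum(dim=(0, 2, 3))
        total_sq += (images ** 2).sum(dim=(0, 2, 3))
        n_pixels += b * h * w
    mean = total / n_pixels
    # Population standard deviation: sqrt(E[x^2] - E[x]^2)
    std = (total_sq / n_pixels - mean ** 2).sqrt()
    return mean.float(), std.float()
```

The two returned tensors can be passed directly to `transforms.Normalize(mean, std)`. Compute them on a loader without random augmentation so the statistics describe the raw images.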
We now take a look at the classes of images in this dataset.
We can also check how many images are present under each category:
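A small sketch of such a count, assuming the dataset exposes integer labels and class names the way torchvision's `ImageFolder` does through its `targets` and `classes` attributes. The helper name `images_per_class` is hypothetical.

```python
from collections import Counter


def images_per_class(targets, classes):
    """Map each class name to the number of samples carrying its label index."""
    counts = Counter(targets)
    return {name: counts.get(index, 0) for index, name in enumerate(classes)}
```

With an `ImageFolder` dataset called `train_ds`, this would be called as `images_per_class(train_ds.targets, train_ds.classes)`.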
The data is pretty balanced, and thus we do not have to worry about skewed data, which can be a big problem.
Now let us look at some images that are present in this dataset to get a better idea of what we are dealing with.
These are typical shots of various landscapes that we will be classifying on the validation set.
To get an insight on how a batch of data looks when being processed, we can write a simple function:
Moving to NVIDIA GPUs (or CPUs)
While I highly recommend running this code on a GPU, unless you want to wait for hours and hours to get results, you can also run it on a CPU. If you don’t have an NVIDIA (sorry team red) GPU, you can use the ones at Kaggle or Colab for free!
Moving the dataloader to the device of choice:
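A common pattern for this, shown here as a sketch rather than the exact code from the original post, is to pick a device once and wrap each data loader so that batches are moved as they are produced. The names `get_default_device`, `to_device` and `DeviceDataLoader` are illustrative.

```python
import torch


def get_default_device():
    """Pick the GPU if one is available, otherwise fall back to the CPU."""
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")


def to_device(data, device):
    """Move a tensor, or a list/tuple of tensors, to the chosen device."""
    if isinstance(data, (list, tuple)):
        return [to_device(x, device) for x in data]
    return data.to(device, non_blocking=True)


class DeviceDataLoader:
    """Wrap a DataLoader so every batch is moved to the device when iterated."""

    def __init__(self, dl, device):
        self.dl = dl
        self.device = device

    def __iter__(self):
        for batch in self.dl:
            yield to_device(batch, self.device)

    def __len__(self):
        return len(self.dl)
```

Wrapping the loaders, for example `train_dl = DeviceDataLoader(train_dl, get_default_device())`, keeps the training loop free of explicit `.to(device)` calls. The model itself still has to be moved with `to_device(model, device)`.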
Finally, the models
I have experimented with various types of models to see what architecture works best for this dataset. The science in data science is all about experimenting after all.
First, let us build a very simple Convolutional Neural Network, and put it to train.
# Assumes `import torch.nn as nn` and the ImageClassificationBase helper class.
class convNet(ImageClassificationBase):
    def __init__(self):
        super().__init__()
        self.network = nn.Sequential(
            # Input: 3 x 150 x 150; 3x3 convs with padding=1 keep the spatial size
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),  # output: 64 x 75 x 75
            nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.Conv2d(128, 128, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),  # output: 128 x 37 x 37
            nn.Conv2d(128, 256, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),  # output: 256 x 18 x 18
            nn.Flatten(),
            nn.Linear(82944, 1024),  # 256 * 18 * 18 = 82944
            nn.ReLU(),
            nn.Linear(1024, 512),
            nn.ReLU(),
            nn.Linear(512, 6))  # 6 output classes

    def forward(self, xb):
        return self.network(xb)
(If you are following the code until this point, don’t paste the above code block just yet! We will get to this in a while.)
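The input size of the first linear layer, 82944, follows from the 150x150 image size. The 3x3 convolutions with padding 1 keep the spatial size unchanged, and each `MaxPool2d(2, 2)` halves it, rounding down. A quick arithmetic check, using a helper name made up for this example:

```python
def flattened_features(image_size=150, channels=256, num_pools=3):
    """Number of features left after the conv stack, just before nn.Flatten()."""
    side = image_size
    for _ in range(num_pools):
        side //= 2  # MaxPool2d(2, 2) floors odd sizes: 150 -> 75 -> 37 -> 18
    return channels * side * side


print(flattened_features())  # 256 * 18 * 18 = 82944
```

If you train on images of a different size, this number changes and the first `nn.Linear` layer has to change with it. For 32x32 inputs, for example, the same stack would leave 256 * 4 * 4 = 4096 features.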
An alternative to a simple convolutional network is a residual network, or ResNet.
Why ResNets?
Deep networks are hard to train because of the notorious vanishing gradient problem: as the gradient is back-propagated to earlier layers, repeated multiplication may make it extremely small. As a result, as the network goes deeper, its performance saturates or even starts degrading rapidly.
Before ResNet, there had been several ways to deal with the vanishing gradient issue, for instance an auxiliary loss in a middle layer as extra supervision, but none seemed to really tackle the problem once and for all. Until ResNets.
A residual network consists of residual blocks. A residual block looks like this:
To understand the true beauty of ResNets, please go through this paper: https://arxiv.org/pdf/1512.03385.pdf. For now we will concentrate on the implementation.
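As an illustration of the idea, here is a minimal residual block in PyTorch. It is a sketch under assumptions (two 3x3 convolutions with batch normalization and the same number of channels in and out), not necessarily the block used for the results in this post.

```python
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Two conv layers whose output is added back to the block's input."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # Skip connection: the layers only need to learn the residual F(x) = H(x) - x.
        return self.relu(out + x)
```

Because the input is added to the output, the gradient has a direct path back through the addition, which is what makes deeper stacks of these blocks easier to train.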
We can see the above lines of code directly represent the block as shown in the image.
To get an idea of the shape of the images being passed to the network that we will be making:
Now, a few helper functions for calculating the important metrics:
These functions can be found in almost every image classification tutorial, and will work in most scenarios without much need for change.
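For reference, helpers of this kind often look like the sketch below. The method names (`training_step`, `validation_step`, `validation_epoch_end`) are assumptions about what the `ImageClassificationBase` class used in this post provides; adapt them to your own training loop.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def accuracy(outputs, labels):
    """Fraction of samples whose highest-scoring class matches the label."""
    preds = outputs.argmax(dim=1)
    return (preds == labels).float().mean()


class ImageClassificationBase(nn.Module):
    """Base class adding loss and metric helpers to a classifier (assumed API)."""

    def training_step(self, batch):
        images, labels = batch
        return F.cross_entropy(self(images), labels)

    def validation_step(self, batch):
        images, labels = batch
        out = self(images)
        return {"val_loss": F.cross_entropy(out, labels).detach(),
                "val_acc": accuracy(out, labels)}

    def validation_epoch_end(self, outputs):
        # Average the per-batch metrics into one number per epoch.
        losses = torch.stack([x["val_loss"] for x in outputs])
        accs = torch.stack([x["val_acc"] for x in outputs])
        return {"val_loss": losses.mean().item(), "val_acc": accs.mean().item()}
```

A model then only has to subclass `ImageClassificationBase` and define `forward`.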
And finally, the two architectures:
I have written both the networks together so you can compare the two architectures. The ResNet is ResNet9. You can read about the ResNet9 here: https://myrtle.ai/learn/how-to-train-your-resnet/
Let us take a better look at the ResNet we have made, while also moving it to the GPU (or CPU):
The parameters 3 and 6 represent the number of channels and the number of output classes respectively.
The network has 9 main layers, a mix of conv layers, sequential layers, and residual blocks, hence being called ResNet9. There are many variations of the ResNet, such as ResNet34, ResNet50, ResNeXt, etc.
Now before we begin training, some special techniques:
- Learning rate scheduling: Instead of using a fixed learning rate, we will use a learning rate scheduler, which will change the learning rate after every batch of training. There are many strategies for varying the learning rate during training, and the one we’ll use is called the “One Cycle Learning Rate Policy”, which involves starting with a low learning rate, gradually increasing it batch-by-batch to a high learning rate for about 30% of epochs, then gradually decreasing it to a very low value for the remaining epochs. Learn more: https://sgugger.github.io/the-1cycle-policy.html
- Weight decay: We also use weight decay, which is yet another regularization technique which prevents the weights from becoming too large by adding an additional term to the loss function. Learn more: https://towardsdatascience.com/this-thing-called-weight-decay-a7cd4bcfccab
- Gradient clipping: Apart from the layer weights and outputs, it is also helpful to limit the values of gradients to a small range to prevent undesirable changes in parameters due to large gradient values. This simple yet effective technique is called gradient clipping. Learn more: https://towardsdatascience.com/what-is-gradient-clipping-b8e815cdfb48
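The three techniques above can be combined in a single training loop. The following is a simplified sketch: the function name `fit_one_cycle`, its arguments, and the choice of Adam are assumptions, and validation is omitted for brevity. It returns the learning rate used for every batch so the schedule can be plotted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def fit_one_cycle(epochs, max_lr, model, train_loader,
                  weight_decay=0.0, grad_clip=None, opt_func=torch.optim.Adam):
    """Train with the one-cycle policy, weight decay and optional gradient clipping."""
    optimizer = opt_func(model.parameters(), max_lr, weight_decay=weight_decay)
    # The scheduler is stepped once per batch, so it needs the total step count.
    sched = torch.optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr, epochs=epochs, steps_per_epoch=len(train_loader))
    lrs = []
    for _ in range(epochs):
        model.train()
        for images, labels in train_loader:
            loss = F.cross_entropy(model(images), labels)
            optimizer.zero_grad()
            loss.backward()
            if grad_clip is not None:
                # Clamp every gradient element to [-grad_clip, grad_clip].
                nn.utils.clip_grad_value_(model.parameters(), grad_clip)
            lrs.append(optimizer.param_groups[0]["lr"])
            optimizer.step()
            sched.step()
    return lrs
```

With the default `pct_start=0.3` of `OneCycleLR`, the learning rate climbs to `max_lr` over roughly the first 30% of the steps and then anneals to a very small value.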
Time to put our network to test:
As (un?)expected, the accuracy is terrible. Why? Because we have asked our model to make predictions without teaching it how to answer them! Currently, the network is making random predictions. Now we begin the training:
In just 10 epochs and 13 minutes of training time, we have attained an accuracy of over 90%! Let us visualize this learning process:
We can see the 1 cycle policy in action with the help of the above graph.
Now, to see why we went ahead with the ResNet approach, let us also train the ConvNet that we created.
Simple ConvNet:
Evaluating it without training:
Initially we can see that it has a marginally better guess than the ResNet architecture. Let us train it.
I have kept all parameters the same, except the number of epochs, to allow it to train for longer.
As can be seen, the network plateaus at an accuracy of about 86%, even with more epochs, which suggests that the ResNet is the better choice for this dataset.
An interesting thing to note is, with the absence of Residual blocks, there is a large amount of overfitting taking place, which can be noticed with the help of this graph:
Transfer Learning
Now, what if I told you that we can achieve similar and sometimes better results without even having to make a model ourselves? That is the beauty of Transfer Learning.
It is a popular approach in deep learning where pre-trained models are used as the starting point on computer vision and natural language processing tasks given the vast compute and time resources required to develop neural network models on these problems and from the huge jumps in skill that they provide on related problems. (source)
The first architecture we will use is AlexNet. You can check the details of AlexNet here.
We can get pretrained models using the torchvision library.
Make sure you ‘freeze’ the earlier layers and train only the later layers of a pretrained model. Setting requires_grad = False on the parameters of the earlier layers ensures this.
The above cell adapts AlexNet for fine-tuning. By default, its final layer targets the 1,000 categories of the ImageNet dataset. We have 6, so we change out_features to 6.
The loss function used here is cross-entropy (CrossEntropyLoss), which performs reasonably well in single-label image classification tasks, along with the Adam optimizer.
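In code, this pairing might look like the sketch below. The learning rate of 1e-3 is an assumed placeholder, and the helper names are made up for illustration. Only parameters that still require gradients are handed to the optimizer, so frozen layers stay untouched.

```python
import torch
import torch.nn as nn


def make_criterion_and_optimizer(model, lr=1e-3):
    """Cross-entropy loss plus Adam over the trainable parameters only."""
    criterion = nn.CrossEntropyLoss()
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(trainable, lr=lr)
    return criterion, optimizer


def train_step(model, images, labels, criterion, optimizer):
    """Run one optimization step on a single batch and return the loss value."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that `nn.CrossEntropyLoss` expects raw logits, so the model should not apply a softmax to its final layer.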
And the final training step:
We can see that AlexNet reached a score of 87% in just 15 epochs, which is higher than our simple convolutional network.
I then tried the VGG19 architecture as well, and it performed the best of all the pretrained models I tested:
We again fine tune the network as we did for AlexNet:
Final training code:
The model peaks at 90%, after which there is some overfitting in the further epochs.
I have tried other models such as GoogLeNet, ResNet50, etc., but they are deeper than AlexNet and VGG19, and too many layers can lead to overfitting. Thus, in my experiments, the shallower models performed better on a dataset as small as this one.
Conclusion
So there it is. Image classification is easier than you think. Achieving an accuracy of over 90% can be done in a few minutes. Let me know in the comments if you have any questions!
Connect with me on Twitter https://twitter.com/devb183 where I update my new posts!
Thank you!