CNN for image classification

There are two approaches to learning a subject: top-down and bottom-up. In bottom-up learning, you study every minute detail of every component of the subject and only later learn to put them together. The deep learning analogy would be to learn all the linear algebra, probability and calculus you can, and then try to understand how different CNN architectures work. The course followed in this post takes the top-down approach instead. In the first lesson of the course, you’ll learn to build a decent-enough Dogs vs. Cats classifier that will put you in the top fifth of the leaderboard. As we move along, we’ll learn all the components of a good classifier.

Kaggle: Plant Seedlings Classification

In this post, we’ll try to build a classifier for Kaggle’s Plant Seedlings Classification competition. While doing so, we’ll go over the main ideas covered in the first 2 lectures of the course.
The course uses a custom library written on top of PyTorch.

The training dataset contains labeled images of 12 species of plant seedlings.

Species of plant seedlings
Seedling images

ResNet: The classifier

We’ll be using a pre-trained model called ResNet50. Instead of building a model from scratch, you take a model that was already trained on another, similar problem and use it as a starting point. The ResNet50 used here is a 50-layer residual network pre-trained on ImageNet data (1.2 million images and 1000 classes).


In the next sections, we’ll try to optimise some of the important parameters used in the classifier.

Data augmentation

One of the most important steps in building a good classifier is to give it more data to learn from. This also helps avoid overfitting. One way to do so is to flip the images along the axes. In this case, because the images were taken from the top, it makes sense to flip them along the horizontal axis as well. The same goes when you’re trying to build a classifier to identify ships or icebergs in satellite imagery. If the images, however, were taken from the side (Dogs vs. Cats), we shouldn’t flip them vertically, because upside-down cats and dogs are unlikely to appear in real photos.

Data augmentation
Data augmentation: Do
Flip the cat sideways: Don’t
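The flips described above can be sketched in plain Python. This is only an illustration, not the course library’s augmentation code: the image is a small nested list of pixel values, and the function names and the `top_down_view` flag are hypothetical.

```python
def flip_left_right(image):
    """Mirror each row: a flip along the vertical axis."""
    return [row[::-1] for row in image]


def flip_top_bottom(image):
    """Reverse the row order: a flip along the horizontal axis."""
    return [row[:] for row in image[::-1]]


def augment(image, top_down_view):
    """Return the original image plus the flips that make sense for it."""
    variants = [image, flip_left_right(image)]
    if top_down_view:
        # Seedlings or satellite images have no natural "up",
        # so an upside-down copy is still a realistic sample.
        variants.append(flip_top_bottom(image))
    return variants


# A tiny 2x2 "image" stands in for real pixel data.
image = [[1, 2],
         [3, 4]]

seedling_variants = augment(image, top_down_view=True)   # 3 variants
pet_variants = augment(image, top_down_view=False)       # 2 variants
```

In practice the flips are usually applied at random while training, so the model sees a slightly different version of each image in every epoch.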

Learning rate

The learning rate determines how large a step we take when updating the weights after each iteration. One way to pick a good learning rate is to plot the loss against the learning rate.

Loss vs. Learning rate

In the graph above, as we increase the learning rate beyond a point between 1e-3 and 1e-2, the loss starts to increase. Because we want to minimise the loss, we should pick a point in the vicinity of the minimum. We choose a point slightly less than the minimum. So, for the above case, we should choose 1e-2.
With too high a learning rate, we risk overshooting the weights for which the network fits best. With too low a learning rate, it will take a long time to find the best weights.
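To make the loss vs. learning rate idea concrete, here is a minimal sketch of such a sweep on a toy problem. It is not the course library’s implementation. The one-weight linear model, the data, the stopping rule (loss above 4 times the best seen) and the "divide by 10" heuristic are all assumptions made for illustration.

```python
def lr_range_test(xs, ys, lr_start=1e-4, lr_end=10.0, n_steps=100):
    """Train a one-weight model y = w * x while the learning rate grows
    exponentially, recording the loss seen at each learning rate."""
    w = 0.0
    ratio = (lr_end / lr_start) ** (1 / (n_steps - 1))
    lrs, losses = [], []
    best = float("inf")
    for i in range(n_steps):
        lr = lr_start * ratio ** i
        errors = [w * x - y for x, y in zip(xs, ys)]
        loss = sum(e * e for e in errors) / len(xs)   # mean squared error
        if loss > 4 * best:
            break                                     # loss has blown up, stop
        best = min(best, loss)
        lrs.append(lr)
        losses.append(loss)
        grad = sum(2 * e * x for e, x in zip(errors, xs)) / len(xs)
        w -= lr * grad                                # plain gradient descent step
    return lrs, losses


# Toy data generated from y = 3x (illustrative only).
xs = [0.5, 1.0, 1.5, 2.0]
ys = [3 * x for x in xs]

lrs, losses = lr_range_test(xs, ys)
lr_at_min = lrs[losses.index(min(losses))]
# Assumed heuristic: stay safely below the rate where the loss bottoms out.
suggested_lr = lr_at_min / 10
```

Plotting `losses` against `lrs` on a log-scaled x-axis gives a curve of the same kind as the one above. On real data the curve is noisier, so the choice is usually made by eye.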

There are some other techniques to further tweak the learning rate. One such technique is learning rate annealing. Here, we start with higher learning rates and as we move closer to the optimal weights we start decreasing the rate.

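Another variant of learning rate annealing is known as Stochastic Gradient Descent with Restarts (SGDR), described in the paper SGDR: Stochastic Gradient Descent with Warm Restarts; the related idea of cycling the learning rate appears in Cyclical Learning Rates for Training Neural Networks. In SGDR, the learning rate is lowered a little after each mini-batch and we periodically “restart”, i.e. we make the model jump back to a higher learning rate at the end of each cycle (for example, after one or more epochs).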

An example of cosine learning rate annealing with restarts
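A schedule of this shape can be sketched in a few lines of plain Python. The function name and its parameters (`cycle_len`, `cycle_mult`) are hypothetical, chosen for this example. Within a cycle the rate follows half a cosine wave from `lr_max` down towards `lr_min`, and it jumps back to `lr_max` when the next cycle starts.

```python
import math


def sgdr_lr(step, lr_max, lr_min=0.0, cycle_len=100, cycle_mult=1):
    """Learning rate for a given mini-batch step under cosine annealing
    with restarts. The first cycle lasts cycle_len steps and each later
    cycle is cycle_mult times longer than the one before."""
    t, length = step, cycle_len
    while t >= length:          # find the cycle this step falls into
        t -= length
        length *= cycle_mult
    progress = t / length       # 0.0 at a restart, close to 1.0 at the end
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))


# Two cycles of four steps each: the rate decays, then restarts at step 4.
schedule = [sgdr_lr(s, lr_max=1e-2, cycle_len=4) for s in range(8)]
```

Setting `cycle_mult=2` doubles the length of every new cycle, so the model spends more steps at low learning rates as training goes on.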

Another important learning rate optimisation technique is using differential learning rates. The idea here is that for a pre-trained model like ResNet, we don’t need to fine-tune the earlier layers as much because they already contain general-purpose features. So, we’ll use different learning rates for different layers: lower rates for the earlier layers and higher rates for the later ones.

Differential learning rates
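Here is a small, framework-free sketch of how differential learning rates work. The layer names, the three-way split and the rates 1e-4, 1e-3 and 1e-2 are assumptions for illustration, not values taken from the notebook. The list of `{"params": ..., "lr": ...}` dictionaries mirrors the per-group format that PyTorch optimizers accept, but plain floats stand in for real weights here.

```python
def split_into_groups(layer_names, n_groups):
    """Split an ordered list of layers into n_groups contiguous chunks."""
    size, extra = divmod(len(layer_names), n_groups)
    groups, start = [], 0
    for i in range(n_groups):
        end = start + size + (1 if i < extra else 0)
        groups.append(layer_names[start:end])
        start = end
    return groups


def build_param_groups(layer_names, lrs):
    """Pair each chunk of layers with its own learning rate."""
    groups = split_into_groups(layer_names, len(lrs))
    return [{"params": g, "lr": lr} for g, lr in zip(groups, lrs)]


def sgd_step(weights, grads, param_groups):
    """Apply w <- w - lr * grad, using each group's own learning rate."""
    for group in param_groups:
        for name in group["params"]:
            weights[name] -= group["lr"] * grads[name]
    return weights


# Hypothetical layer names, ordered from input to output.
layers = ["conv1", "block1", "block2", "block3", "block4", "head"]
param_groups = build_param_groups(layers, lrs=[1e-4, 1e-3, 1e-2])

# One weight per layer and the same gradient everywhere, for comparison.
weights = {name: 1.0 for name in layers}
grads = {name: 1.0 for name in layers}
sgd_step(weights, grads, param_groups)
```

After one step, the early layers have barely moved while the head has changed the most. That is the intended effect: the general-purpose features are left mostly intact and the task-specific layers are adapted faster.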

With the above techniques, I was able to get to the 22nd position in the competition.


The Jupyter notebook containing the above code is here.