Getting Computers To See Better Than Humans

This post looks at a deep learning model that predicts a dog’s breed with 93.8% accuracy on pictures of dogs it had not seen before. I explain how I built the model and how a similar model could be used in other industries.

The model was built for a Kaggle competition and, at the time of submission, ranked in the top 8%. I am currently attending the new version of the Deep Learning course as an international fellow, and I highly recommend it to anyone looking to get into Deep Learning. The new course isn’t out to the general public yet, but it should be live towards the end of December.

Building this model has helped me realise a few important things:

  1. You don’t need lots of data to build your model. With Transfer Learning, you can start from a model that someone else has already trained successfully and build on it to get better results.
  2. If a hospital wants to detect different types of cancer from CT scans, or a mobile app wants to let customers upload any image and then recommend similar products, it can start with a similar model. Computer Vision is hot right now, and most of what startups are doing in this field falls under this category.
  3. You don’t need a huge team or a huge budget to achieve production-quality results. I built this model on my own (with a lot of help along the way), and the GPUs were rented from Amazon: I used a p2.xlarge EC2 instance.
  4. And finally, computers can see better than humans :)

What is Deep Learning?

Normally when we program we tell the computer the logic, the exact steps to get a desired output. But in Machine Learning, the computer learns this logic on its own by looking at examples of the desired output. Deep Learning is one way of doing Machine Learning that uses large neural networks to mimic how the human brain processes information.
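To make the contrast concrete, here is a toy sketch in plain Python. It is not part of the dog-breed model: the function names and the tiny dataset are made up for illustration. The first function is ordinary programming, where I write the rule myself. The second is handed only examples of inputs and desired outputs and has to find the rule by gradually adjusting two numbers.

```python
def handwritten_rule(x):
    # Traditional programming: we spell out the logic ourselves.
    return 2.0 * x + 1.0


def learn_rule(examples, lr=0.05, epochs=2000):
    # Machine learning: start from a guess (w=0, b=0) and nudge it
    # so that w * x + b gets closer to the desired outputs.
    w, b = 0.0, 0.0
    n = len(examples)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in examples) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in examples) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Examples of the desired output; the learner never sees the rule itself.
examples = [(x, handwritten_rule(x)) for x in [0.0, 1.0, 2.0, 3.0, 4.0]]
w, b = learn_rule(examples)
print(round(w, 3), round(b, 3))
```

A deep learning model does the same kind of thing, only with millions of adjustable numbers instead of two.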

Analysing the results

In this dog-breed example, the dataset comprised about 200 pictures each of 120 different breeds of dogs. This was used to train the model. Having learned from just these images, it was able to predict the breed of a dog with 93.8% accuracy on pictures it had not seen before.

Below are 4 pictures it got wrong. In most of these pictures the dog blends into the background. The model could be further improved by cropping out the irrelevant details.

Most incorrect predictions

The same logic used to predict the breed of a dog could be used to detect cancer cells in CT scans, to analyse satellite images of a rainforest, or to let users ‘search by image’ in an e-commerce app. The use cases for Convolutional Neural Networks are really only limited by your imagination.

Building a better Deep Learning model

Below is a brief summary of a few of the key practices involved in building this model.

Data Augmentation:

The more images you feed to a deep learning model in training, the better that model tends to perform. Data Augmentation is a great way to increase the number of images you use to train your model. It takes an image and creates different versions of it by flipping it vertically or horizontally, zooming, or rotating.

6 Different Versions of an Image using Data Augmentation

I used the library while building this model, and turning data augmentation on took a single line: passing aug_tfms (augmentation transforms) to tfms_from_model.

tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on, max_zoom=1.1)

The only thing you have to be careful of is the way you flip an image. For example, if you were looking at a satellite image, along with the horizontal flip you might also want to flip it vertically and rotate it by 90 degrees. In that case, instead of transforms_side_on you would pass transforms_top_down.
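To show what these transforms do to the pixels, here is a small sketch in plain Python that treats an image as a list of rows. It is only an illustration and not how the library implements transforms_side_on or transforms_top_down; the function names and the top_down flag are my own.

```python
import random


def flip_horizontal(img):
    """Mirror left-to-right by reversing every row."""
    return [row[::-1] for row in img]


def flip_vertical(img):
    """Mirror top-to-bottom by reversing the order of the rows."""
    return [row[:] for row in img[::-1]]


def rotate_90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]


def augment(img, top_down=False, rng=random):
    """Return a randomly transformed copy of img.

    Side-on photos (dogs, cats) only get a horizontal flip.
    Top-down images (satellite photos) can also be flipped
    vertically and rotated in 90 degree steps.
    """
    out = [row[:] for row in img]
    if rng.random() < 0.5:
        out = flip_horizontal(out)
    if top_down:
        if rng.random() < 0.5:
            out = flip_vertical(out)
        for _ in range(rng.randrange(4)):
            out = rotate_90(out)
    return out


image = [[1, 2, 3],
         [4, 5, 6]]
print(augment(image))                 # side-on: original or mirrored
print(augment(image, top_down=True))  # top-down: any flip or rotation
```

An upside-down dog is not a picture the model will ever need to recognise, which is why the side-on version leaves vertical flips and rotations out.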

Calculating a Learning Rate:

The learning rate tells the model how quickly you want to update the weights. If you go too slow, it will take forever to train. If you go too fast, you may overshoot the optimal weights entirely.
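A toy example makes the trade-off visible. The sketch below is my own illustration, not code from the model: it minimises f(w) = w² (whose best weight is 0) with plain gradient descent, using three hand-picked learning rates.

```python
def gradient_descent(lr, steps=50, start=10.0):
    """Minimise f(w) = w**2 (gradient 2*w) and return the final w."""
    w = start
    for _ in range(steps):
        w -= lr * 2 * w
    return w


too_slow = gradient_descent(lr=0.001)   # has barely moved away from 10
about_right = gradient_descent(lr=0.1)  # ends up very close to 0
too_fast = gradient_descent(lr=1.1)     # each step overshoots further

print(too_slow, about_right, too_fast)
```

With the smallest rate the weight is still near its starting point after 50 steps, and with the largest rate every update jumps past the minimum and lands further away than before.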

A common practice used to be to pick a learning rate more or less arbitrarily, and then reduce it every few training cycles. A better approach was described by Leslie N. Smith in his paper ‘Cyclical Learning Rates for Training Neural Networks’.

Here is a blog post by Pavel Surmenok that describes this process in detail.
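As I understand it, the core of the idea is a learning rate range test: start with a tiny learning rate, raise it a little after every step, record the loss, and stop once the loss starts to blow up. The sketch below runs that on a toy one-weight problem. It is not the library's implementation; the function names, the stopping threshold of 4x the best loss, and the rule of thumb in suggest_lr are all assumptions made for illustration.

```python
def lr_range_test(loss_fn, grad_fn, w, min_lr=1e-5, max_lr=10.0, num_steps=100):
    """Raise the learning rate step by step and record (lr, loss) pairs."""
    history = []
    best_loss = float("inf")
    for i in range(num_steps):
        # Grow the learning rate exponentially from min_lr to max_lr.
        lr = min_lr * (max_lr / min_lr) ** (i / (num_steps - 1))
        loss = loss_fn(w)
        history.append((lr, loss))
        best_loss = min(best_loss, loss)
        if loss > 4 * best_loss:  # the loss is blowing up, so stop here
            break
        w -= lr * grad_fn(w)
    return history


def suggest_lr(history):
    """Assumed rule of thumb: one order of magnitude below the
    learning rate at which the recorded loss was lowest."""
    best_lr, _ = min(history, key=lambda pair: pair[1])
    return best_lr / 10


# Toy problem: loss(w) = w**2, gradient 2*w, starting at w = 10.
history = lr_range_test(lambda w: w ** 2, lambda w: 2 * w, w=10.0)
suggested = suggest_lr(history)
print(len(history), suggested)
```

In practice you would plot the recorded loss against the learning rate and pick a value where the loss is still clearly falling, rather than trusting a fixed rule.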

Stochastic Gradient Descent with Restarts (SGDR):

Instead of building this model from scratch, I used resnext101_64, a model pre-trained on ImageNet (1.2 million images and 1,000 classes), as a starting point.

SGDR is a variant of learning rate annealing, which gradually decreases the learning rate as training progresses. This is important because as we get closer to the optimal weights, we want to take as small a step as possible.

But taking ever smaller steps can leave us with a fragile model if we end up in a narrow, ‘spikey’ area of the loss surface. Instead we want our model to find areas that are broader and more accurate, where small changes in the weights will not result in big changes in the loss.

From the paper Snapshot Ensembles

To prevent this from happening, we increase our learning rate every few cycles, so that the model can jump out of a ‘spikey’ area and find a broader one. If it is already in a broad area, it will jump out and come back to the same area.
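The schedule itself is easy to write down. Here is a minimal sketch of cosine annealing with restarts, assuming the learning rate is updated once per training step. The function name is my own, and although the parameter names cycle_len and cycle_mult echo the ones I used when training, here they are measured in steps purely for illustration.

```python
import math


def sgdr_learning_rate(step, max_lr, min_lr=0.0, cycle_len=100, cycle_mult=1):
    """Learning rate at `step` under cosine annealing with warm restarts."""
    if cycle_len < 1 or cycle_mult < 1:
        raise ValueError("cycle_len and cycle_mult must be at least 1")
    # Work out which cycle this step falls in and how far into it we are.
    length = cycle_len
    while step >= length:
        step -= length
        length *= cycle_mult  # each new cycle can be longer than the last
    progress = step / length  # 0.0 at a restart, close to 1.0 at the end
    # Cosine curve from max_lr down towards min_lr within the cycle.
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))


schedule = [sgdr_learning_rate(s, max_lr=0.01) for s in range(300)]
print(schedule[0], schedule[99], schedule[100])
```

Within a cycle the rate falls smoothly, and at every restart it jumps back to max_lr, which is the kick that lets the model leave a ‘spikey’ area. Setting cycle_mult=2 doubles the length of each successive cycle.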

Differential Learning Rates:

The pre-trained model already comes with weights learned from ImageNet photos. We reuse these weights in our earlier layers, and we do not want to change them too much.

This is especially true for the dog-breed dataset, which was itself taken from ImageNet. We can train the earlier layers at a really low learning rate, and the middle and final layers at two different learning rates that are slightly bigger than the one used for the earlier layers.

This process of giving different learning rates to different layers is known as differential learning rates.
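Here is a rough sketch of what that means for a single weight update, in plain Python. The three layer groups, their weights and gradients, and the learning rates 1e-4, 1e-3 and 1e-2 are all made-up values for illustration, and sgd_step is a hypothetical helper rather than anything from the library.

```python
def sgd_step(layer_groups, grads, lrs):
    """Apply one SGD update with a separate learning rate per layer group."""
    if not (len(layer_groups) == len(grads) == len(lrs)):
        raise ValueError("need one gradient list and one learning rate per group")
    return [
        [w - lr * g for w, g in zip(weights, group_grads)]
        for weights, group_grads, lr in zip(layer_groups, grads, lrs)
    ]


# Early, middle and final layer groups, each with two toy weights.
layer_groups = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
grads = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
lrs = [1e-4, 1e-3, 1e-2]  # smallest for the early layers

updated = sgd_step(layer_groups, grads, lrs)
print(updated)
```

Even with identical gradients, the early group barely moves while the final group moves a hundred times further, so the general-purpose features learned from ImageNet stay mostly intact.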

Other Useful Mentions:

If we were training a model on pictures that were totally different from ImageNet, the initial layers would need a higher learning rate than the one used for the earlier layers in the dog-breed model.
One easy way to avoid overfitting is to train your model on small image sizes first, and then gradually increase the size of the images. Again, this is more useful when the images differ from the ImageNet dataset.

Next Steps:

A few more practices helped in building this model, and I will share them in a later post.

Deep Learning is breaking industry records in Computer Vision, Natural Language Processing, Robotics, Medicine and many more fields, and I am interested to see how it can be used to help small and medium businesses (which normally don’t have access to Google-sized datasets).