Case Study: A world class image classifier for dogs and cats (err.., anything)

It is amazing how far computer vision has come in the last couple of years. Problems that are practically intractable for classical machine learning methods are a piece of cake for the emerging field of deep learning (DL), i.e. deep neural networks (DNNs). Even more remarkable is that the foundations of deep learning models are both mathematically and conceptually easy to understand. Their implementations can admittedly be quite complex, but the proliferation of DL frameworks means that state-of-the-art results for many traditionally hard computer vision problems are achievable in a few lines of code these days.

In this blog, we’ll attempt to use the fastai library to build an image classifier that works amazingly well for a classification task. The library (thanks to Jeremy Howard and his team at USF) makes deploying a model as easy as flipping a burger, so well, you can both make your burger and eat it!

The first step is to download the dogs and cats dataset from Kaggle and extract it to a path on your machine. Then you need to get a copy of the fastai library from GitHub and set up the proper development environment (i.e. Python 3.6, CUDA 8.0, cuDNN 6.0, Jupyter Notebook etc.).

Model Architecture: We’ll be using the ResNeXt-50 architecture for this case study. ResNeXt was the runner-up in the 2016 ILSVRC classification competition, and is arguably one of the state-of-the-art models in computer vision.

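from fastai.conv_learner import *  # fastai 0.7-era import; assumed, not shown in the original
sz = 299          # intended image size (pixels per side)
arch = resnext50  # intended architecture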

Data Sampler: We’ll be using a data sampler that randomly provides horizontally flipped versions of the sample images, with random zooming by a factor of up to 1.1.

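# PATH (dataset root folder) and bs (batch size) must be defined beforehand; their values are not given here
tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on, max_zoom=1.1)  # augmentation settings
data = ImageClassifierData.from_paths(PATH, tfms=tfms, bs=bs, num_workers=4)  # data loaders built from PATH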
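To make the idea of the sampler concrete, here is a minimal sketch of what a random horizontal flip plus a random zoom of up to 1.1 does to an image. It is plain Python on a toy "image" (a list of pixel rows), and the function names are hypothetical; it is not how fastai implements its transforms, which operate on real image arrays and may apply further changes.

```python
import random

def horizontal_flip(img):
    """Mirror an image (a list of pixel rows) left to right."""
    return [row[::-1] for row in img]

def center_zoom(img, zoom):
    """Zoom in by `zoom` >= 1: crop the central 1/zoom region, then
    resize it back to the original size with nearest-neighbour sampling."""
    h, w = len(img), len(img[0])
    ch, cw = max(1, round(h / zoom)), max(1, round(w / zoom))  # crop size
    top, left = (h - ch) // 2, (w - cw) // 2                   # crop origin
    return [[img[top + (r * ch) // h][left + (c * cw) // w]
             for c in range(w)]
            for r in range(h)]

def random_augment(img, max_zoom=1.1, rng=random):
    """Flip with probability 0.5, then zoom by a random factor in [1, max_zoom]."""
    if rng.random() < 0.5:
        img = horizontal_flip(img)
    return center_zoom(img, rng.uniform(1.0, max_zoom))

# Toy 4x4 "image" whose pixel values are just their positions.
toy = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
augmented = random_augment(toy)
```

Each call returns a slightly different version of the same picture, so the network rarely sees exactly the same pixels twice.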

Learning Model: The following line instantiates the model that we’ll use to perform classification. The ResNeXt model we use is pre-trained on the ImageNet database.

learn = ConvLearner.pretrained(arch, data, precompute=True, ps=0.5)

In the following line, we actually go ahead and fit the model.

learn.fit(1e-2, 1)
>> #(epoch, train_loss, val_loss, val_acc)
>> [ 0. 0.04081 0.01944 0.99388]

And bam! We see that the model is already predicting dogs and cats with an accuracy of over 99%! It is also worth mentioning that this accuracy is measured on the validation set, i.e. images of cats and dogs that the model never got to train on. Very impressive, if you ask me.

Now, let’s take a quick look at how many cats and dogs the model got wrong. This is done using a confusion matrix, which tabulates the true labels against the predicted labels:

Confusion Matrix for the Cats and Dogs (Epoch 1)
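For readers who have not met one before, a confusion matrix is just a table of counts. The sketch below tallies one from scratch in plain Python; the labels are made up for illustration (0 for cat, 1 for dog is an assumed encoding) and are not the model's actual predictions.

```python
def confusion_matrix(y_true, y_pred, n_classes=2):
    """Return cm where cm[t][p] counts samples with true class t predicted as class p."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

# Made-up labels: 0 = cat, 1 = dog.
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0, 1]
cm = confusion_matrix(y_true, y_pred)
# cm[0][1] is the number of cats predicted as dogs,
# cm[1][0] is the number of dogs predicted as cats.
```

The diagonal entries are the correct predictions; everything off the diagonal is a mistake.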

So, it seems like we incorrectly predicted 10 images of cats as dogs. Likewise, we incorrectly predicted 5 images of dogs as cats. Hmmm.. let’s see what is really going on!

Images with truth labels of dogs, but that were labelled as cats by the network.

In the row of images above, the first two images are predictions that the network seemingly got wrong. One could argue that the first image has a dog that might resemble a large cat, while the second image is quite blurry. The image in the middle is practically indistinguishable from a cat, while the two images on the right are of neither a dog nor a cat. No surprise then that the network would not get their class labels right.

Now let us look at the images that had the truth label of cats but got classified as dogs. Looking at the first two images in the first row, we see that the model thought they were dogs with very strong confidence (i.e. the closer the number is to 1, the stronger the prediction that it is a dog). In fact, each of these images does contain a dog, and the dog occupies a much larger area of the image. No wonder then that the classifier (or for that matter, any human) would associate these images with the dog label.

Images with truth labels as cats, but which the network thought were dogs.

There are two other images that are clearly not of a dog or a cat, but mere posters. The fact that the network thinks these are also dogs with rather large confidences (0.9845 and 0.938929) suggests that the model could use some improvement. Additionally, there are a number of images of cats which the network has classified as dogs, but with much smaller confidence (i.e. the further the confidence is from 1.0, the greater the uncertainty that it is a dog). Remember that we only fitted the model for a single epoch, so we have plenty of opportunities to do better.
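Rows of misclassified images like the ones above are what one gets by sorting the wrong predictions by confidence. A minimal sketch of that selection, assuming cats are labelled 0 and dogs 1 and that the model outputs a probability of "dog" for each image; the function name and the toy numbers are hypothetical and not taken from the fastai notebook:

```python
def most_confident_mistakes(p_dog, y_true, true_class, k=4):
    """Indices of images whose true label is `true_class` (0 = cat, 1 = dog)
    but which were predicted as the other class, most confident mistake first."""
    wrong = [i for i, (p, y) in enumerate(zip(p_dog, y_true))
             if y == true_class and (p >= 0.5) != (y == 1)]

    def wrong_confidence(i):
        # Confidence placed in the wrong class:
        # p for cats called dogs, 1 - p for dogs called cats.
        return p_dog[i] if true_class == 0 else 1.0 - p_dog[i]

    return sorted(wrong, key=wrong_confidence, reverse=True)[:k]

# Toy data: probability of "dog" for six images, and their true labels.
p_dog = [0.98, 0.60, 0.10, 0.93, 0.40, 0.02]
y_true = [0, 0, 0, 0, 1, 1]

cats_called_dogs = most_confident_mistakes(p_dog, y_true, true_class=0)
dogs_called_cats = most_confident_mistakes(p_dog, y_true, true_class=1)
```

Mistakes with a probability close to 0.5 are the model being unsure; mistakes close to 0 or 1 are the ones worth inspecting by eye.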

In the next round of learning, we continue to fit our model using some very generalizable training tricks. We apply different learning rates to different sections of the ResNeXt architecture. The learning rate is also varied across iterations, and it resets to its initial magnitude at the start of each new cycle (here, each epoch). Additionally, the model weights can be saved at the end of each cycle, so that these snapshots can be used to make an averaged prediction at the end.

The code for the above is packed into these two lines:

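# assumes the earlier layer groups were unfrozen first (learn.unfreeze()); otherwise only the last group trains
lr = np.array([1e-5, 1e-4, 1e-3])  # differential learning rates: early, middle, and final layer groups
learn.fit(lr, 2, cycle_len=1)      # 2 cycles of 1 epoch each; the learning rate restarts every cycle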

The principal motivation for varying the learning rate and averaging the model’s predictions from each epoch comes from the paper on snapshot ensembles. The core idea is that a varying learning rate, together with an ensemble of predictions, helps us find a minimum that is more stable and resistant to data perturbations. Such models should therefore generalize better to unseen data, which is what makes them useful for future work.
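The two ingredients can be sketched in a few lines of plain Python. The schedule below assumes cosine annealing with restarts, a common choice for this kind of cyclic training; the function names, the number of iterations per cycle, and the toy predictions are all hypothetical, and this is not fastai's internal code.

```python
import math

def cyclic_lr(iteration, iters_per_cycle, lr_max, lr_min=0.0):
    """Cosine-annealed learning rate that restarts at lr_max every cycle."""
    t = (iteration % iters_per_cycle) / iters_per_cycle  # position in cycle, in [0, 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))

def average_predictions(snapshot_preds):
    """Average per-image predictions over several model snapshots."""
    n = len(snapshot_preds)
    return [sum(ps) / n for ps in zip(*snapshot_preds)]

# Learning rate over two cycles of 100 iterations each (assumed cycle length).
schedule = [cyclic_lr(i, 100, lr_max=1e-2) for i in range(200)]

# Probability of "dog" for two images, as predicted by two saved snapshots.
ensemble = average_predictions([[0.9, 0.2], [0.7, 0.4]])
```

Within a cycle the rate decays smoothly towards `lr_min`; at the start of the next cycle it jumps back to `lr_max`, which is what gives each snapshot a chance to settle into a different minimum.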

The idea of using different learning rates was originally introduced in version 2 of fast.ai’s Part 1 course. The intuition behind it has its roots in Matt Zeiler’s paper on visualizing convolutional networks with deconvolution layers. It shows that the shallow convolution layers (i.e. those closest to the input images) learn basic geometric edges, so we choose to barely, if at all, update the weights of these layers during training. Along the same line of thought, the middle layers are shown to be detectors of higher-order shapes (e.g. arcs, circles, patterns etc.), so we choose to update the weights of these layers with a slightly larger learning rate. The deepest layers, i.e. the fully connected layers that we added ourselves, need the most updating and thus have the largest learning rate assigned to them.
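In code, differential learning rates boil down to giving each group of layers its own step size in the gradient update. Here is a minimal sketch with three toy "layer groups" holding one weight each; the function name, weights, and gradients are made up for illustration, and a real optimizer does considerably more than this.

```python
def sgd_step(groups, grads, lrs):
    """One plain SGD update where each layer group uses its own learning rate."""
    return [[w - lr * g for w, g in zip(ws, gs)]
            for ws, gs, lr in zip(groups, grads, lrs)]

# Three toy layer groups (early, middle, final) with one weight each.
weights = [[1.0], [1.0], [1.0]]
grads = [[1.0], [1.0], [1.0]]   # identical gradients, for comparison
lrs = [1e-5, 1e-4, 1e-3]        # same values as the differential rates above

updated = sgd_step(weights, grads, lrs)
# Even with identical gradients, the early group barely moves
# while the final group moves 100 times further.
```

This is why the pre-trained edge detectors in the early layers stay almost intact while the newly added final layers adapt quickly to cats and dogs.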

After training for a couple of epochs with the differential learning rates, combined with the cyclically varied learning rate, we arrive at the following:

#(epoch, train_loss, val_loss, val_acc)
[ 0. 0.0551 0.02114 0.99355]
[ 1. 0.03563 0.01902 0.99405]

We see very slight improvements in both the validation loss and accuracy. It might not seem like a lot, but in fiercely competitive scenarios such as a Kaggle competition, every bit of gain helps. Let us look at how the cat and dog images are now classified using the confusion matrix.

Voila! Out of 2,000 images of cats and dogs, we’ve managed to fail on only 8 of them. And we managed that by running no more than 5 to 7 epochs in total. I believe this is an incredible feat, given that we were able to accomplish it with pretty much off-the-shelf components and with very little, if any, customization.

Looking at the wrongly classified images, we see that the network is pretty much behaving like a human. As can be seen, three of these images are just posters, and one of them has both a cat and a dog in it. The image in the fourth column is a dog inside a cage, which we’d expect to be a tough nut to crack for any model. Thus, with some meaningful intuition, we find ourselves almost forgiving of these misclassifications.

Left 3: Images with truth label of cats, but classified as dogs. Right 2: Images with truth label of dogs, but classified as cats

The row of images below shows images of dogs classified as cats by the network. Except for the first one, all the other images are indeed very poor representations of dogs. One is the underside of a dog’s belly, another a close-up of a dog’s teeth.

Images (all) with truth labels of dogs, but were classified as cats.

I hope this was a gentle introduction to the kinds of datasets that deep learning particularly excels at, i.e. images, text, and signals that exist in massive quantities in the increasingly digital world. The processes shown here are described with a lot more depth, discussion, and detail in this notebook, although by default it uses the ResNet-34 architecture. A sparser version of this notebook using ResNeXt-50 can be found here, and it is possible that at the time of viewing, you might see slightly different results. Deep learning, at its heart, is stochastic, so the results will vary slightly between consecutive executions.