Dogs vs Cats is too easy
A harder problem: Fine-grained classification
The Dogs vs Cats problem has become a sort of "hello world" for deep learning folks. Many courses and blog posts aimed at beginners use it as the fundamental building block for an introduction to the exciting world of deep learning.
Fastai has been no different, with past versions of the course focusing on building a Dogs vs Cats classifier in the first lesson. But this year the course is taking a different approach because, as Jeremy Howard puts it,
We have gotten to the point where deep learning is so fast and so easy that the Dogs vs Cats problem, which a few years ago was considered extremely difficult, is now too easy. 80% accuracy in the past was considered state of the art. Our models are now basically getting everything right all the time without any tuning.
So the Dogs vs Cats problem is no longer challenging, which means there is no opportunity to try out new ideas. A more challenging problem that Jeremy suggests is fine-grained classification, where we train models to distinguish between very similar categories. An example, which is also used in the Fastai course, is the Oxford-IIIT Pet Dataset with its 37 breeds of cats and dogs.
In this post, we are going to build a fine-grained image classification model that detects which species of fish appears on a fishing boat, based on images captured by boat cameras at various angles. The dataset, which is available on Kaggle, has eight target categories.
Nature Conservancy Fisheries Monitoring
Illegal, unreported, and unregulated fishing practices are threatening marine ecosystems, global seafood supplies and local livelihoods. Deep learning has the potential to dramatically scale the monitoring of fishing activities to fill critical science and compliance monitoring data gaps. — The Nature Conservancy
The development of accurate models for this problem is a critical factor in ensuring that efforts aimed at conserving our fisheries are successful.
From the images, we can tell that this problem will require a great deal of effort. With some of the images, it will be difficult even for a human to identify the species of fish that has been caught.
To build our model, we are going to use the fastai library which makes it incredibly easy (believe me) to build deep learning models.
Our first step will be to load and normalize the data.
Data normalization is an important step which ensures that each input parameter (pixel, in this case) has a similar data distribution. This makes convergence faster while training the network. Data normalization is done by subtracting the mean from each pixel, and then dividing the result by the standard deviation. — Nikhil Balaji
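Below is a minimal sketch of this step using the fastai v1 API. The dataset path and folder layout are assumptions (one subfolder per species inside a train/ directory), so adjust them to wherever you extracted the Kaggle data.

```python
from fastai.vision import *

# Hypothetical path: assumes the Kaggle images were extracted into
# data/fisheries/train, with one subfolder per species.
path = Path('data/fisheries')

# Hold out 20% of the images for validation, resize everything to 224x224,
# apply the default augmentations, and normalize with the ImageNet statistics
# that the pre-trained network expects.
data = ImageDataBunch.from_folder(path, train='train', valid_pct=0.2,
                                  ds_tfms=get_transforms(), size=224, bs=32
                                 ).normalize(imagenet_stats)
```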
This creates an 80/20 split for our data: 80% of the images are used for the training set while 20% are used for the validation set. The concept of training, validation and test sets can be likened to a classroom setting. The lessons and assignments are the training set, the quizzes are the validation set and the final exam is the test set.
In a class you learn (train) from the lessons you take as well as the assignments that are given to you. Quizzes validate what has been learned and give the teacher useful feedback about the effectiveness of her approach. The teacher (deep learning practitioner) uses this feedback to adjust her approach with the aim of getting better results (accuracy or any other metric). The final exam gives a more holistic assessment of the performance of the students in the class and shows how well the teacher was able to train the students (model).
Creating the Model
We create our model using a pre-trained resnet50 model to perform transfer learning. Transfer learning allows us to use a model built for another task as the starting point for the model we want to develop for our own task.
The idea is to take the knowledge learned by the pre-trained model and apply it to our task. In our case the network was pre-trained on ImageNet, a dataset of 1.2 million labeled images, and in the process it has captured universal features of images like curves, shapes and edges.
So instead of training a new network and having it learn all these features from scratch, we transfer the knowledge learned by the pre-trained network to our problem domain.
This pre-trained model does not know about species of fish, but it certainly already knows something about recognizing images. We don't start with a model that knows nothing; we start with one that knows how to recognize the categories of things in the ImageNet dataset.
With transfer learning we can get state-of-the-art results in a short time, even with a small dataset, because we are utilizing the knowledge gained from previous training on a larger dataset. You can read more about transfer learning here.
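Roughly, the model creation and first round of training look like this in fastai v1 (the metric choice and the data object are the assumptions from the sketch above; on older v1 releases the function was called create_cnn):

```python
# Create a CNN learner from our data, starting from resnet50 pre-trained on ImageNet.
# fastai replaces the final layers with a new head sized for our 8 fish categories.
learn = cnn_learner(data, models.resnet50, metrics=error_rate)

# Train the new head for 5 epochs; the pre-trained layers remain frozen.
learn.fit_one_cycle(5)
```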
After training for 5 epochs, we end up with an error rate of 0.093. That is, our model makes the wrong prediction about 9% of the time.
Fine tuning
For such an important problem, a 9% error rate is not good enough. So to improve our accuracy, we turn to fine-tuning. In the first phase of training, we only updated the weights of the newly initialized fully connected layer; the weights of the pre-trained model were frozen and left unchanged.
We now unfreeze the weights of the pre-trained model and train again. This fine-tunes the pre-trained weights, making them better suited to our task of identifying fish species.
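A sketch of the fine-tuning step, again with fastai v1; the number of epochs and the learning-rate range here are illustrative assumptions, not the exact values behind the results reported below.

```python
# Unfreeze all layers so the pre-trained weights can be updated too.
learn.unfreeze()

# Run the learning rate finder and plot it to pick a sensible range.
learn.lr_find()
learn.recorder.plot()

# Train again with discriminative learning rates: small updates for the early,
# generic layers and larger updates for the later, task-specific layers.
learn.fit_one_cycle(5, max_lr=slice(1e-5, 1e-4))
```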
The intuition behind fine-tuning is that the earlier layers of the network capture generic features of images like curves and edges, while the later layers learn features that are more specific to the data the network was trained on, such as recognizing dogs and cats in the case of ImageNet.
When we fine-tune the network, we are essentially adjusting the weights of these layers so that they capture the features that are more important to the task at hand.
Top losses
Let’s take a look at some of the images our model misclassified with high confidence. This helps us identify the kinds of images the network has trouble classifying, which will inform our decisions about the next steps to take to improve accuracy.
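In fastai v1 this takes only a couple of lines (the grid size here is arbitrary):

```python
# Collect predictions on the validation set and show the images with the
# highest losses, i.e. confident but wrong predictions.
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_top_losses(9, figsize=(12, 12))
```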
Our final error rate is 0.05, which can still be improved upon. However, this shows how challenging fine-grained classification problems can be. Hopefully, as we turn our focus to these kinds of problems, we will develop techniques that help us achieve even higher levels of accuracy.
The code for this project is available on GitHub.
If you enjoyed this piece, please recommend and share