Can we beat the state of the art from 2013 with only 0.046% of training examples?

In 2013 Kaggle ran the very popular dogs vs cats competition. The objective was to train an algorithm to detect whether an image contains a cat or a dog.

At that time, as stated on the competition website, the state-of-the-art algorithm could tell a cat from a dog with an accuracy of 82.7% after being trained on 13 000 cat and dog images.

My results

I applied transfer learning, a technique where you take a model trained on a different but similar task and retrain it to do well on the task at hand.

I fine-tuned a VGG19 model on a total of 6 randomly selected images (you can find the pictures of our protagonists below).

I achieved an accuracy of 89.97% after 41 epochs of training. The validation set size was 24 994.
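The headline figures can be checked with a few lines of Python. This is a small sketch that uses only the numbers quoted in this post; the variable and function names are made up for illustration, and the count of correctly classified images is an approximation derived from the rounded accuracy.

```python
# Figures quoted in the post.
baseline_train_images = 13_000  # images used by the 2013 state of the art
my_train_images = 6             # images used for fine-tuning here
baseline_accuracy = 0.827
my_accuracy = 0.8997
validation_images = 24_994


def data_fraction_percent(small, large):
    """Return `small` as a percentage of `large`."""
    return 100 * small / large


def approx_correct(accuracy, n_images):
    """Approximate number of correctly classified images."""
    return round(accuracy * n_images)


fraction = data_fraction_percent(my_train_images, baseline_train_images)
ratio = baseline_train_images // my_train_images  # whole-number "1/N" ratio
gain = round(100 * (my_accuracy - baseline_accuracy), 2)
correct = approx_correct(my_accuracy, validation_images)

print(f"{fraction:.3f}% of the data (about 1/{ratio})")
print(f"{gain} percentage points above the 2013 figure")
print(f"roughly {correct} of {validation_images} validation images correct")
```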

I am a fan of reproducible research, so you can find everything you need to run the experiment yourself in my repository on GitHub.

What happened

This is thoroughly unexpected. The technique that I used is covered in the first lecture of Practical Deep Learning for Coders, part 1. In the Jupyter notebook provided with the course, it takes 7 lines of code to perform transfer learning.
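To give a sense of how little code is involved, here is a minimal sketch of fine-tuning in Keras. It is not the course notebook and not the code from my repository: the function name, the 224x224 input size, the 256-unit hidden layer and the Adam optimizer are all assumptions chosen for illustration. The idea it shows is the one described above: keep a pretrained VGG19 convolutional base fixed and train only a small new classifier on top of it.

```python
def build_finetune_model(num_classes=2, input_shape=(224, 224, 3),
                         weights="imagenet"):
    """Frozen VGG19 convolutional base plus a small trainable head.

    Hypothetical helper for illustration; layer sizes are assumptions.
    """
    # Imported lazily so this file can be loaded without TensorFlow installed.
    from tensorflow import keras

    # Pretrained convolutional layers only (no ImageNet classifier on top).
    base = keras.applications.VGG19(
        weights=weights, include_top=False, input_shape=input_shape
    )
    base.trainable = False  # keep the pretrained filters fixed

    inputs = keras.Input(shape=input_shape)
    x = base(inputs, training=False)
    x = keras.layers.Flatten()(x)
    x = keras.layers.Dense(256, activation="relu")(x)  # assumed head size
    outputs = keras.layers.Dense(num_classes, activation="softmax")(x)

    model = keras.Model(inputs, outputs)
    model.compile(
        optimizer="adam",  # assumed; default learning rate, no tuning
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Training would then be a single call such as model.fit(images, labels, epochs=41), where images is an array of shape (6, 224, 224, 3) passed through keras.applications.vgg19.preprocess_input and labels is an integer array of 0s and 1s.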

This means that anyone who can move files around on a computer can apply this cutting-edge technology to a problem of their choosing. Medical diagnosis, anomaly detection, industrial applications of image recognition, you name it. Yes, you still need some data and you still need to have some high level understanding of what supervised learning is and how it works, but that’s about it.


The results are staggering. I didn’t have to apply data augmentation, adjust the learning rate, or worry about regularization. I didn’t even test different architectures; this is literally the first one I tried.

And yes, one could say that telling a cat from a dog in a picture is not rocket science. But let me remind you that we managed to land a man on the moon, and 40 years later we were still unable to get our computers to perform this seemingly simple task with above 85% accuracy. And yes, it is true that the model I picked to fine-tune was trained to perform well on visual recognition tasks.

But wait, think about the first two paragraphs of this post for a second. We are beating state-of-the-art results from 4 years ago and doing so effortlessly. I am running a supercomputer in the cloud at a cost of ~ $0.20 an hour (that is how much I pay Amazon for renting the virtual machine). And state of the art means literally the best technique in the world applied to a specific problem. This is very significant.

This demonstrates that the limits of applications of Deep Learning today are no longer driven by technology — we have the hardware and the software needed. And yes, for some tasks we will need even faster processing units, even more data, even better algorithms. But there exists a universe of applications of Deep Learning today that is waiting to be discovered and the limiting factor is how quickly the knowledge of this technology spreads.

So coming from a person who quit college after a year and a half of majoring in sociology, who learned to program on his own as an adult and is by no means a programming guru, and who with just one afternoon’s worth of work beat the state of the art results from 4 years ago with only 1/2166th the data, my question to you today is this — what application of this technology will you invent to make the world a better place?

PS. Machine Learning Attacks Against the Asirra CAPTCHA by Philippe Golle is the paper on the state of the art solution from 2013.

PS. 2 The winning entry to the Kaggle dogs vs cats competition had an accuracy of 98.914% and was achieved after carefully training a machine learning system on 25 000 images.

Further discussion of results

After I shared the article on Twitter, a very interesting discussion followed, which you can find here.

One very valuable comment pointed out that the original VGG19 model was trained on classes that included cat and dog breeds. I was hoping to use the convolutional layers only for shape and low-level feature detection, but quite likely they also contain higher-level information. If that is the case, then the fully connected layers I added might not be doing a lot of original work and could just be learning to listen to the original convolutional layers providing them the answers.

If you found this article interesting and would like to connect, you can find me on Twitter here.