Making my own deep learning image recogniser

Chris Nielsen
May 26, 2017

Right now, there is a huge opportunity for anyone who can figure out how to use AI in their business. Microsoft, Amazon and Google have all released easy-to-use AI cloud services, which means these features are suddenly within reach of everyone. But before you can think about where to use AI, you need to understand how it works. At least at a high level.

So in this post, we’re going to learn about AI by making an image recognition service. It’s a great way to get a feel for how the big companies probably made their services. We’ll also get to figure out the kind of things AI is good at and the kind of things that it is bad at.

Convolutional Neural Networks

We’re going to make our image recogniser with a convolutional neural network (or CNN for short). CNNs are a kind of neural network inspired by how we think the visual cortex works in animals. CNNs had a major breakout moment in 2012 when Alex Krizhevsky won the annual ImageNet Large Scale Visual Recognition Challenge. It’s like the Olympics for deep learning. He won with a CNN that could identify the subject of 100,000 photos with about 85% accuracy. This was a shocking improvement over previous years.

Now CNNs are everywhere. In fact, you’ve probably already used them without knowing it. Rumour has it that Facebook uses CNNs for its automatic tagging, Google uses them for photo search, Amazon uses them for product recommendations and Pinterest uses them for image search.

The way CNNs work is quite intuitive. When people look at a picture of a face, we subconsciously know it’s a face because we see features like eyes, eyebrows and a mouth. CNNs work the same way. They start by identifying simple small features like lines and circles. They then build these simple features into more complex ones like eyes and noses. Finally the CNN aggregates those features into complete faces. The figure below shows how a CNN can be trained to understand what a face looks like. Features are built up from left to right.

VGG16

For our image recognition service, we’re going to use a CNN designed by the Visual Geometry Group at the University of Oxford. They used it to win the ImageNet challenge in 2014. The benefit of using their design is that we know the structure works: they demonstrated that it is 92.5% accurate (top-5) when identifying objects in the ImageNet photos. Even better, they’ve already trained it on 1.3 million training images and provided the network weights for download.

Their research paper describes how they arrived at this specific design. From what I understand, they just tried a bunch of different CNNs and picked the one that performed the best.

We joke, but there is an important point here. Designing deep learning systems is a discovery process. This is why companies with a lot of data and processing power (Google, Amazon, etc) have a huge advantage. They’re just able to search more.

Show me the code

Ok, enough talk. Let’s build a CNN to figure out what is in this picture.

Spoiler alert: It’s an elephant

We’re going to write our CNN in Python using Google’s open source machine learning library TensorFlow. We’ll also use a library called Keras that provides a really nice API on top of TensorFlow. Don’t worry if you’re not a programmer: the 16 layers of the model are quite simple to read.

The full code is on GitHub here.

In this code we’re creating a model and adding the layers one at a time. A CNN has a few kinds of layers.

Conv2D layers are convolutional layers that identify and build up features. MaxPooling2D layers pool adjacent features together, shrinking the representation as it flows through the network and helping to prevent overfitting. Finally, we have Dense layers: fully connected layers that map our image features onto the 1,000 object classes.
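
Here’s a condensed sketch of what those 16 layers look like, using the current tensorflow.keras API. I’ve folded the repeated convolution blocks into a loop to keep it short, so it isn’t line-for-line the code on GitHub, but the structure is the published VGG16 design.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential()

# Block 1: two 64-filter convolutions, then downsample with max pooling
model.add(Conv2D(64, (3, 3), activation='relu', padding='same',
                 input_shape=(224, 224, 3)))
model.add(Conv2D(64, (3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D((2, 2), strides=(2, 2)))

# Blocks 2-5: the same pattern with progressively more filters
for filters, n_convs in [(128, 2), (256, 3), (512, 3), (512, 3)]:
    for _ in range(n_convs):
        model.add(Conv2D(filters, (3, 3), activation='relu', padding='same'))
    model.add(MaxPooling2D((2, 2), strides=(2, 2)))

# Classifier: two fully connected layers, then a 1,000-way softmax
model.add(Flatten())
model.add(Dense(4096, activation='relu'))
model.add(Dense(4096, activation='relu'))
model.add(Dense(1000, activation='softmax'))
```

That’s 13 convolutional layers plus 3 fully connected ones, which is where the “16” in VGG16 comes from.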

The details of CNNs are complicated and way more than I can handle, especially in this short post. If you’d like to learn more I’ve put some references at the end.

Now that we’ve created the model, we need to load in the layer weights. Normally you’d have to train the CNN yourself, but the Visual Geometry Group from Oxford were nice enough to provide their pre-trained weights for download. They’re pretty big: 553.5MB of layer weights alone.
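
Loading them is a single call once the file is on disk. The filename below is just a placeholder for wherever you saved the download; newer versions of Keras can also build the pre-trained model for you in one line.

```python
# Load the pre-trained VGG16 weights into the model we just built.
# 'vgg16_weights.h5' is a placeholder for the downloaded weights file.
model.load_weights('vgg16_weights.h5')

# Alternatively, Keras ships VGG16 with ImageNet weights built in:
# from tensorflow.keras.applications import VGG16
# model = VGG16(weights='imagenet')
```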

Before we can send an image into our CNN we need to scale it to match the first layer of the model, which expects 224x224x3: 224 pixels wide and tall, with 3 channels for red, green and blue.
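
In Keras that preprocessing looks roughly like this. The filename is a placeholder for whichever photo you’re testing, and preprocess_input applies the same channel adjustments VGG16 was trained with.

```python
import numpy as np
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.vgg16 import preprocess_input

img = image.load_img('elephant.jpg', target_size=(224, 224))  # resize to 224x224
x = image.img_to_array(img)            # array of shape (224, 224, 3)
x = np.expand_dims(x, axis=0)          # add a batch dimension: (1, 224, 224, 3)
x = preprocess_input(x)                # match the preprocessing used in training
```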

The output of the CNN is a set of predictions. We can look them up in a list of 1000 objects.
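
Keras also ships a small helper that maps the 1,000 raw scores back to readable labels, so the lookup is roughly:

```python
from tensorflow.keras.applications.vgg16 import decode_predictions

preds = model.predict(x)               # shape (1, 1000): one score per object class
for _, label, score in decode_predictions(preds, top=3)[0]:
    print(f"{label}: {score:.1%}")
```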


Let’s test it out

Now let’s try it out and see how it does.

The model is 93.0% sure that the picture contains an elephant. Ok, let’s try another random picture.

Cool, it’s 91.3% sure that this is a picture of a banana.

Surprisingly, the program actually ran quite quickly on my four-year-old MacBook Pro. It only took about 4.5 seconds to create the model and load the weights, then only about 0.5 seconds to classify an image. It also needed 600MB of RAM, mostly to load the 500MB of model weights. I guess running a CNN isn’t too computationally expensive; it’s only the training that takes a lot of processing power.
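
If you want to reproduce those timings on your own machine, a rough sketch with Python’s perf_counter would be (build_model is a hypothetical helper wrapping the layer code above):

```python
import time

t0 = time.perf_counter()
model = build_model()                    # hypothetical helper that builds the 16 layers
model.load_weights('vgg16_weights.h5')   # placeholder filename, as before
print(f"Build + load weights: {time.perf_counter() - t0:.1f}s")

t0 = time.perf_counter()
preds = model.predict(x)                 # x is the preprocessed (1, 224, 224, 3) batch
print(f"Classify one image: {time.perf_counter() - t0:.1f}s")
```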

I could imagine that if you had a few trained models you could easily run them on the same image in parallel. For example, you could use this CNN for object identification and then run a few other trained models for things like facial recognition and reading text.

Ok, let’s try some other things.

ImageNet doesn’t include faces, so VGG16 wasn’t trained to identify people. But let’s see what it does when it sees a face.

No-one has ever told me I look like a burrito before. At least it’s only 1.8% sure I’m a burrito. But if you think about it, this is actually a good result. This is the CNN saying that I just don’t look like the 1000 things that it’s been trained to identify. But if it had to pick, I’m closest to a crutch, spatula or burrito.

So how does our model deal with abstractions, like symbols and illustrations?

Our CNN has only been trained to identify apples in ImageNet photos, so it obviously doesn’t see apples the way we do.

Ok, now for the final and most difficult test.

For humans, the task of recognising objects is one of the first things we learn. We do it effortlessly from a very young age. When my daughter was 2 we would read a “First Words” book together. So let’s see if our model can outperform a 2 year old human at identifying objects in pictures.

So that’s 6/9 (66.7%) accuracy. I’m certain my 2-year-old daughter could have identified a teddy bear, but the AI isn’t too far off. I’m impressed, to be honest. It is kind of magical to have an AI reading a sight words book.

In conclusion

I really had a lot of fun writing this post. I now feel like I know enough to confidently use a cloud AI service, and to at least have a sense of where this kind of technology would work and where it wouldn’t.

Deep learning algorithms are incredibly complicated to design and train. But using them can be fast and effective as long as you only use them to do exactly what they were trained to do. Otherwise, well, the mistakes are hilarious.
