Hot Dog or Not? A lesson on image classification using convolutional neural networks
I’ll begin this post with one of my favorite Silicon Valley scenes ever:
When this episode came out, I was pretty far removed from the Data Science world. I thought it was hilarious and clever, but I’d never even begun to think about the technicalities of how Jian Yang was able to detect whether his photo was a hot dog, or not a hot dog.
So when I learned about Neural Networks, specifically using convolutional layers, I connected the dots immediately and knew I had to create this “technology” for myself. I partnered with a classmate of mine, Vishal Patel, to build this out as a side project so we could present our process as a technical talk.
This post will basically be a written-out version of our talk. I’ll go over how images are stored as data, give a basic understanding of Neural Networks, and walk through how to implement a Convolutional Neural Network for image processing, all while using our “Hot Dog Or Not Hot Dog” example. Ready?
How images are stored in data formats
Images, like all pieces of data, are actually just broken down into a bunch of numbers. Instead of thinking of an entire image as “one piece of data”, think of each pixel of an image as representing a number between 0 and 255. The number is the pixel’s intensity: how much that pixel’s color contributes to the image. Then when you have an entire image (let’s say 120x120 pixels), you have a 120x120 matrix of data, like so:
This is a black-and-white photo. See how the darkest pixels are 255 and the white space is all 0s?
With colored photos, it’s the same idea, but this time you have three layers stacked on top of each other: a red, a green, and a blue layer. Each pixel is still a number from 0 to 255. Now you have a 120x120x3 matrix:
And for hot dog purposes, here are my RGB layers for a hot dog photo:
All three photos on the right come together to create the beautiful hot dog on the left, vibrant cheesy colors and all. Also, note how I can plot each photo on a graph!
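If you want to poke at this yourself, here’s a minimal sketch (assuming the Pillow and NumPy libraries, and a placeholder file called hotdog.jpg) that loads a photo and looks at its pixel matrices:

```python
from PIL import Image
import numpy as np

# Load a photo and resize it to 120x120 pixels ("hotdog.jpg" is a placeholder)
img = Image.open("hotdog.jpg").resize((120, 120))

# As a grayscale image: a 120x120 matrix of numbers from 0 to 255
gray = np.array(img.convert("L"))
print(gray.shape)   # (120, 120)

# As a color image: a 120x120x3 matrix, one layer each for red, green, blue
rgb = np.array(img.convert("RGB"))
print(rgb.shape)    # (120, 120, 3)

# Each color layer is itself a 120x120 matrix you can plot on its own,
# for example with matplotlib's plt.imshow(red)
red, green, blue = rgb[:, :, 0], rgb[:, :, 1], rgb[:, :, 2]
print(red.min(), red.max())  # values fall between 0 and 255
```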
Intro to Neural Networks
Now that we know how images are stored in data formats, we can understand how they are processed and fed into Neural Networks.
Neural Networks are named the way they are because they parallel the architecture of actual biological brains. The network is based on a simple flow of inputs and outputs. Let’s think about how the biological neural networks in the human brain process information.
Do you see the similarities? That “cell body” in the image on the right is called a neuron for our purposes. It takes in different features and their weights (the lines pointing toward it), combines them, and produces an output.
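To make that concrete, here’s a tiny sketch of a single artificial neuron: it multiplies each input feature by a weight, adds them up, and passes the total through an activation function. The feature values and weights below are made up purely for illustration.

```python
import numpy as np

def sigmoid(z):
    # Squashes any number into the range 0 to 1
    return 1 / (1 + np.exp(-z))

# Five made-up input features
features = np.array([1.0, 0.0, 1.0, 1.0, 0.5])

# Made-up weights: how much this particular neuron cares about each feature
weights = np.array([0.8, 0.6, 0.9, 0.2, 0.1])
bias = -1.5

# The neuron: weighted sum of the inputs, plus a bias, through an activation
output = sigmoid(np.dot(features, weights) + bias)
print(output)  # a single number between 0 and 1
```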
A neural network is made up of many, many neurons, all connected to each other through what we call “hidden layers”. Let’s take a look at the architecture of what our Hot Dog neural network might look like:
You start by feeding all of your features into that first input layer. Those features pass through the hidden layers, which find patterns and similarities, and then produce a certain output.
Let’s put this into context using our own brains. Let’s say I’m walking down the street and see this thing that 1) has four legs, 2) has a tail, 3) is barking, 4) is super fluffy, and 5) is very cute. Those are the five features we’ll feed into our brain’s input layer. My brain will do all the work to connect those features together. On the way to one neuron in one hidden layer, it could be thinking, “Four legs and has a tail! Could be a horse.” And at another neuron, “Super fluffy and very cute! Could be a Pusheen plush toy!”. My brain would do this over and over again for all of the features across the hidden layers and then decide that it’s a dog.
For Hot Dog image purposes, each feature is a number. Remember that 120x120x3 matrix? That’s 43,200 numbers the network is taking in. The hidden layers identify edges, color clusters, and much more. For example, if they see a long cluster of meaty colors sandwiched between long clusters of bready colors, the output layer might decide that it’s a hot dog.
The hot dog example is binary, meaning there are only two outputs. It’s either a hot dog or it’s not. This is what really disappointed the rest of the Silicon Valley crew, because they wanted the app to be able to detect *any* type of food. That would require a multi-class neural network, where the output layer would cover millions of different types of food.
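As a rough sketch of what this looks like in Keras (the layer sizes here are arbitrary illustrative choices, not the exact network we used):

```python
from tensorflow import keras
from tensorflow.keras import layers

# A plain fully-connected network for the binary hot dog question
model = keras.Sequential([
    keras.Input(shape=(120, 120, 3)),
    layers.Rescaling(1.0 / 255),             # scale pixels from 0-255 down to 0-1
    layers.Flatten(),                         # 120 * 120 * 3 = 43,200 input numbers
    layers.Dense(128, activation="relu"),     # hidden layer looking for patterns
    layers.Dense(64, activation="relu"),      # another hidden layer
    layers.Dense(1, activation="sigmoid"),    # one output: hot dog or not hot dog
])

# For a multi-class version (many foods instead of a yes/no answer), the
# last layer would instead look something like:
#   layers.Dense(number_of_foods, activation="softmax")
```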
Here is an example of a multi-class neural network, where the input layer takes in all the different features for each of these images of faces and then categorizes them by who it thinks the image is of. The output classes could be individual people, but the same setup could also classify, for example, people’s age ranges, ethnicities, or hair colors.
Of course, this means that you have to train your model on all of these people. Or in our case, we trained on over 1,000 photos of hot dogs and over 3,000 photos of things that weren’t hot dogs. Yeah, right now my computer has over 4,000 images of food on it.
Convolutional Neural Networks
In mathy terms, a convolution measures how much two functions overlap as one passes over the other. So in terms of image processing, the convolutional layer of the neural network passes a filter (or window) over a single image and searches for specific features.
This process essentially turns this photo:
Into another image, broken down into equally-sized image tiles:
And then we can feed each tiny image tile into the network again. This step reduces the dimensionality of the original image. To reduce it again, we introduce a method called “Max Pooling,” which slides a window over the array and keeps only the most important feature in each window (aka the biggest number). So, we started with one giant image and kept breaking it down to end up with a small(er) array that we can finally feed into our fully-connected neural network.
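Stacked up in Keras, the convolution-then-pool pattern looks roughly like this (again, the filter counts and sizes are illustrative, not our final architecture):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(120, 120, 3)),
    layers.Rescaling(1.0 / 255),

    # Convolution: slide 32 small 3x3 filters (windows) over the image,
    # each one searching for a particular feature like an edge or color blob
    layers.Conv2D(32, (3, 3), activation="relu"),
    # Max pooling: in every 2x2 window, keep only the biggest number
    layers.MaxPooling2D((2, 2)),

    # Do it again on the smaller, summarized image
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),

    # Flatten what's left and hand it to a regular fully-connected network
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),                      # drop-out regularization
    layers.Dense(1, activation="sigmoid"),    # hot dog or not hot dog
])
```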
That was a ton of information! For a more in-depth explanation of Convolutional Neural Networks, Max Pooling and Downsampling, I highly recommend this Medium post (where I got the images of the kid). It was so helpful in our research process!
Results
After messing around with multiple convolutional layers and drop-out regularization, testing out different hyperparameters, and waiting around for many, many epochs, we reached accuracy scores of around 70% to 74%. This was just *okay*, considering we had a (purposeful!) class imbalance.
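In code, the training step itself is fairly simple: compile the model, point it at the labeled photos, and wait through the epochs. Here’s a simplified sketch, assuming the CNN sketch above as model and placeholder folders (data/train, data/val) with one subfolder per class:

```python
from tensorflow import keras

# Load labeled photos from disk; each directory has one subfolder per class
# ("hot_dog" and "not_hot_dog"); the paths and numbers here are placeholders
train_ds = keras.utils.image_dataset_from_directory(
    "data/train", label_mode="binary", image_size=(120, 120), batch_size=32)
val_ds = keras.utils.image_dataset_from_directory(
    "data/val", label_mode="binary", image_size=(120, 120), batch_size=32)

model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])

# Train for many, many epochs and keep an eye on validation accuracy
history = model.fit(train_ds, validation_data=val_ds, epochs=30)
```

If you wanted to counteract the class imbalance, model.fit also accepts a class_weight dictionary so the rarer class counts for more during training.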
Transfer Learning
In the end, we got the best scores by applying a method called Transfer Learning. This is the re-use of a pre-trained model, and we got ours through Keras. These models are pre-trained on millions of images spanning thousands of classes. Vishal tested several pre-trained models (VGG16, InceptionV3, and MobileNetV2) and then fine-tuned them for our binary hot dog classification.
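Here’s roughly what that looks like with one of the Keras pre-trained models (MobileNetV2 in this sketch; the 128x128 input size and the new top layers are illustrative choices, not necessarily our exact setup):

```python
from tensorflow import keras
from tensorflow.keras import layers

# MobileNetV2 pre-trained on ImageNet, with its original 1,000-class
# output layer chopped off (include_top=False)
base = keras.applications.MobileNetV2(
    input_shape=(128, 128, 3), include_top=False, weights="imagenet")
base.trainable = False  # freeze the pre-trained weights

model = keras.Sequential([
    keras.Input(shape=(128, 128, 3)),
    layers.Rescaling(1.0 / 127.5, offset=-1),  # MobileNetV2 expects pixels in [-1, 1]
    base,                                      # the borrowed, pre-trained layers
    layers.GlobalAveragePooling2D(),
    layers.Dropout(0.2),
    layers.Dense(1, activation="sigmoid"),     # our new hot dog / not hot dog head
])

model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])
```

Only that last little Dense layer starts from scratch; everything else has already learned generic visual features from those millions of pre-training images.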
With Transfer Learning models, our accuracy scores went up to over 90%, and almost every image we tested it on was classified correctly.
Testing, Testing!
The model guessed with 100% confidence that this image is a hot dog. And it is!
Am I a hot dog?
Is this sub sandwich a hot dog?
We tried over and over to “break” our model. During our talk, we took live image suggestions, and someone very smartly asked us to test an eclair!
Our model guessed with nearly 100% confidence that this eclair was a hot dog! It is a very hot dog-looking eclair. But still not a hot dog.
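Under the hood, every one of these tests is the same few lines: load the photo, resize it, turn it back into that matrix of numbers, and ask the model for a confidence score. A rough sketch (the filename is a placeholder, and the input size should match whichever model you trained):

```python
import numpy as np
from tensorflow import keras

# Load a single test photo at the size the model expects
img = keras.utils.load_img("eclair.jpg", target_size=(128, 128))
x = keras.utils.img_to_array(img)   # back to a matrix of pixel numbers
x = np.expand_dims(x, axis=0)       # the model expects a batch of images

# A single number between 0 and 1: the model's confidence for the positive
# class (which end of the scale means "hot dog" depends on how the
# training labels were set up)
confidence = float(model.predict(x)[0][0])
print(f"Hot dog confidence: {confidence:.2%}")
```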
This project was such an amazing learning experience for us. The goal of this post was to share a bit of the knowledge we picked up along the way. Next time Facebook auto-tags your photos, or your iPhone groups your photos by person or event, I hope you have a better idea of how it’s done!