Teaching Machines to Recognize Man’s Best Friend

How machines can understand the world around us

Allen Uy
The Startup
7 min read · Oct 17, 2019


This is one good boi

A dog is a dog. That sounds like a stupid thing to say, but what is a dog? Sure, you could probably instantly recognize a dog in an image if you saw one, but how did you get to that conclusion? If you break down the question, what really makes a dog? If you went back a few thousand years, how would you explain the difference between how a cat looks and how a dog looks?

You could say dogs have 4 legs and a tail, but so do cats. Even with that description there’s a lot of room for error. A dog with 3 legs is still a dog. Maybe you could look at the shape of their ears or the colour of their fur. Turns out it’s not that easy to describe how a dog physically looks, and your description might only apply to a certain breed.

The idea of having a computer try to classify an object in an image has been around for a long time, but advancements in the field of Deep Learning have drastically improved how accurate these systems are. Why exactly is object detection important?

  • Self-Driving Cars and other autonomous vehicles seeing roads
  • Increasing concerns over facial recognition
  • Diagnosis from images in healthcare

So how exactly are computers able to figure this out?

How computers show images

Right now you’re reading this article on your computer or your phone (if it wasn’t obvious), and when you zoom in really closely you can see the screen is made up of tiny dots/squares called pixels. Each of these pixels projects a single colour that, when combined with the rest, can form entire images. That colour value is typically RGB or (red, green, blue), and each channel goes from 0 to 255, darkest to brightest. So something like (255, 0, 0) would be a completely red pixel, while (0, 0, 0) would be a black pixel and (255, 255, 255) would be a white pixel. Everything you see on screen is being created by millions of pixels.

A grid of RGB pixels. These fake pixels are made from real pixels.
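To make that concrete, here’s a tiny sketch (NumPy is my choice here, not something from the post) of an image as a grid of (red, green, blue) values:

```python
import numpy as np

# A 2x2 "image": each pixel is an (R, G, B) triple from 0 (darkest) to 255 (brightest).
image = np.array([
    [[255,   0,   0], [  0,   0,   0]],   # a red pixel and a black pixel
    [[255, 255, 255], [  0, 255,   0]],   # a white pixel and a green pixel
], dtype=np.uint8)

print(image.shape)   # (2, 2, 3): height, width, and the three colour channels
```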

For a computer trying to understand the contents of an image, having colours isn’t actually useful. Humans can still tell there’s a dog in the picture if it was black and white. So we can preprocess the data by turning the picture into grayscale. Now each pixel has a single value from 0 to 1 indicating its brightness. It is way more efficient to deal with one number per pixel than managing three RGB values. This also makes sure that colour doesn’t play a major role, e.g. a yellow dog is still the same as a red one.

Grayscale example
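As a minimal sketch of that preprocessing step (the 0.299/0.587/0.114 weights below are the standard luminance weights, an assumption on my part rather than anything from the post), converting to grayscale collapses the three channels into one brightness value per pixel and scales it to the 0 to 1 range:

```python
import numpy as np

def to_grayscale(rgb_image):
    # Weighted average of the R, G, B channels (standard luminance weights),
    # then scale from 0-255 down to 0-1.
    weights = np.array([0.299, 0.587, 0.114])
    return (rgb_image @ weights) / 255.0

rgb = np.random.randint(0, 256, size=(4, 4, 3))   # a stand-in 4x4 colour image
gray = to_grayscale(rgb)
print(gray.shape)   # (4, 4): one brightness value per pixel
```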

While we can see a picture on screen and identify it, the computer still only knows an image as numbers. So how can a computer figure out how these numbers make a dog?

Your brain on math

Let’s take inspiration from how humans work. When we see something, our brain does some of its own calculations. First of all, light that bounces off an object enters our eyes. We can see our phone because light bounces off of it. Then when that light enters our eyes, it’s converted into electric signals transmitted by neurons so our brain can interpret the world we see. Those signals are also interpreted as objects we know. So not only do we get a picture of our dog, but we recognize it as one.

Wow, brain math.

If our brain learned how to classify a dog, then a computer can do the same with a neural network that imitates how our own brain works. We aren’t discussing any complicated math here, just an understanding of how a neural net works. When we feed the neural net preprocessed information, it learns how important certain pixels are and then it spits out an answer. In this case we give it an image and it applies weights to the pixels, finally telling us how likely the image is to be a dog, from 0 to 1.

If we want our neural network to classify multiple things, then we can have it output multiple numbers, with the largest value being what the neural network believes is in the picture. Who knew probability could be so useful!

Here the Neural Net has one output neuron. If we want to detect multiple objects we can add more neurons to the output layer.

At first it’s going to perform horribly since it has absolutely no idea how to get from seemingly random pixel values to a single output. But with backpropagation (math not explained here), it can learn to adjust its calculation and become very accurate.
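If you want to see what that looks like in code, here’s a hedged sketch (the post doesn’t tie itself to any library or image size; Keras, 28x28 grayscale inputs, and the random stand-in data below are my assumptions), where the framework handles backpropagation for us:

```python
import numpy as np
import tensorflow as tf

# A plain fully connected network: flatten the grayscale pixels,
# weight them in a hidden layer, and output one number from 0 to 1.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # how likely the image is a dog
])

# Stand-in data: 100 random "images" with made-up dog / not-dog labels.
images = np.random.rand(100, 28, 28)
labels = np.random.randint(0, 2, size=(100, 1))

# Backpropagation adjusts the weights a little on every pass over the data.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(images, labels, epochs=5)

# For several classes, the last layer would instead be
# Dense(num_classes, activation="softmax"), one output per class.
```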

Mattered order knew who?

The previous neural network has an interesting property called permutation invariance. If we shuffled around the pixels in every image the same way, the network would learn to arrive at the same answer as if the pixels weren’t touched at all, because it never uses where a pixel sits in the image. Humans don’t interpret data this way; it’s like trying to compare a finished jigsaw puzzle with one that has its pieces scattered everywhere. Nothing makes sense. In short, pixels in an image are related to the pixels around them.
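As a small illustration (the tiny array here is made up), the same fixed shuffle applied to every image looks like nonsense to us, but a fully connected network fed the flattened values would learn just as well from it:

```python
import numpy as np

rng = np.random.default_rng(0)

image = np.arange(16).reshape(4, 4)        # a stand-in 4x4 grayscale image
flat = image.flatten()

permutation = rng.permutation(flat.size)   # one fixed shuffle, reused for every image
shuffled = flat[permutation]

# To us the shuffled image looks like noise, but to a fully connected
# network it's just the same 16 numbers arriving in a different order.
print(shuffled.reshape(4, 4))
```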

A dog makes sense because all of its parts are organized in a reasonable way. We can clearly see a body, legs, and a face with all the other stuff on it.

NOW THIS. I have no clue what this is, why did I even bother to create this?? I think the picture speaks for itself.

(Left) Dog | Not Dog (Right)

If the neural network could learn to understand features like we do, it could improve drastically.

What is a convolution and what does it want with me

Each hidden layer in the neural network is doing math with the pixels on an individual level. Instead, what if we applied the idea that pixels in an area are related to each other? This is the idea of a convolution, where a math operation is applied to an area of pixels. So our neural network is now a Convolutional Neural Network (CNN).

An arbitrary convolution

In this GIF the yellow square is called a kernel/filter. Both terms are used a bit interchangeably in 2D, but filter is usually used in 3D. The kernel has weights attached to it that determine the importance of each pixel in the square. The kernel slides over the image and applies the convolution to each area; the distance it moves is called the stride. Here stride = 1, but it could be 2 or 3, meaning it would skip a few pixels between each convolution.
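If you’d like to see that sliding in code, here’s a rough from-scratch sketch of a 2D convolution with a configurable stride (plain NumPy, not any particular library the post uses):

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide the kernel over the image and sum the weighted pixels at each stop."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out_h = (ih - kh) // stride + 1
    out_w = (iw - kw) // stride + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The patch of pixels currently under the kernel.
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            output[i, j] = np.sum(patch * kernel)
    return output

image = np.random.rand(6, 6)                 # stand-in 6x6 grayscale image
edge_kernel = np.array([[-1, 0, 1],          # a simple vertical-edge detector
                        [-1, 0, 1],
                        [-1, 0, 1]])
print(convolve2d(image, edge_kernel, stride=1).shape)   # (4, 4)
```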

There are other properties a convolution can have such as dilation and padding, but what you just read covers the necessities. Take a look at the GIFs below to get an idea of what those two are.

(Left) Dilation | Padding (Right)

Because the hidden layers in a CNN use convolutions, they can learn different features. The first layer might learn to find edges and the next one might find shapes. And because the neural net is trained to learn on its own, it can find its own features. We may only have an understanding of a dog’s body parts, but the CNN might find something even better.

Sometimes less is more.

If we go crazy and add hundreds of convolutions, the CNN becomes unnecessarily complicated and may encounter overfitting. The neural net might just start to memorize the features of a single dog instead of the general features.

Information overload

Is this all starting to hurt your head? I hope not. Similarly for CNNs, even when we scale down pictures from millions of pixels to a few thousand, that’s still a lot of information that needs to be processed. If we want to keep the key info but reduce the density, we can use pooling. We want to get the most important features, so we use a max pool to extract the highest pixel values from an area.
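As a minimal sketch (again plain NumPy, my choice), 2x2 max pooling keeps only the largest value in each 2x2 block, shrinking the feature map while keeping its strongest responses:

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Keep only the largest value from each size x size block."""
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % size, :w - w % size]   # drop any leftover edge
    blocks = trimmed.reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

feature_map = np.random.rand(4, 4)
print(max_pool(feature_map))   # a 2x2 map holding the biggest value from each block
```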

Most CNN architectures weave convolutions and pooling like below. We find the features with convolutions and get the most important parts of the features with pooling, rinse and repeat. Then at the end, we can use a regular neural network to get our desired output.
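In code, a typical stack might look like the hedged Keras sketch below (the layer sizes and the 28x28 input are assumptions, not anything from the post): convolutions to find features, pooling to condense them, then a regular dense network at the end.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    # Find features with convolutions, condense them with pooling, repeat.
    tf.keras.layers.Conv2D(16, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    # Then a regular neural network on what's left.
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # probability of "dog"
])
model.summary()
```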

The future of object detection

As it stands, CNNs are the best neural networks for object detection. They have also been used in other areas such as Natural Language Processing and Time Series Forecasting, but Recurrent Neural Networks (RNNs) are typically preferred there, since what matters isn’t how neighbouring data points relate to each other but how the current data relates to previous data.

There’s still a huge fundamental flaw in how object detection works, seen in adversarial attacks: an image can be altered with tiny, nearly undetectable pixel changes that completely change how it’s recognized. This can be a huge issue in computer vision for self-driving cars, like if someone modified a stop sign so it gets read as something else.

It’s really fascinating how, even though CNNs are pretty close to how our own brains recognize stuff, there’s still a large difference in how we work. Perhaps there’s more to an image than we realize.

What is a dog? Who can really tell?

Feel free to reach out to me at allenuy0211@gmail.com for feedback or questions!
