How Does a Computer Know That This is a Cat?

You look at this picture for a fraction of a second and you know it’s a cat. But how does a computer know?

Chang Xu
Upfront Insights
6 min read · Sep 20, 2018


You're probably thinking: just feed a computer many, many images of cats and voilà! The computer can figure out what's a cat and what's not. After all, that's where the buzz phrases Machine Learning, Big Data, and Artificial Intelligence all seem to come together. But have you wondered exactly how that works? You don't need a Ph.D. In fact, you don't even need any math. Let me break it down for you.

How does it know this is a cat in front of a door…

…not a tiger and a sunset? The colors in the pictures and the striations on the animals are quite similar.

How does it know that this line delineates the chest of this cat against a door, when the image it sees is completely flat?

Plus, the computer has never seen this particular cat in this pose with this background, because you just took this photo!

And that's only if you're lucky enough to get the whole cat in the frame; you might just get part of a cat.

How does a computer know that this is a cat?

Understanding images is incredibly hard. We take it for granted because our phones unlock the moment they see our faces.

To a computer, an image is a collection of pixels. Each pixel is represented by three numbers from 0 to 255, its RGB values, or how much Red, Green, and Blue is in that pixel.
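
If you want to see this for yourself, here's a tiny sketch (assuming you have Python with the Pillow and NumPy libraries installed; "cat.jpg" is just a placeholder filename) that loads a photo and prints the three RGB numbers of one pixel:

```python
from PIL import Image
import numpy as np

# Load a photo and turn it into a grid of numbers.
# "cat.jpg" is a placeholder; point it at any photo you have.
img = np.array(Image.open("cat.jpg").convert("RGB"))

print(img.shape)   # (height, width, 3): one Red, Green, and Blue value per pixel
print(img[0, 0])   # the three 0-255 values of the top-left pixel, e.g. [182 154 131]
```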

Each photo you take on your iPhone X is 12 megapixels, which means that it has 12 million pixels.

As each pixel is represented by three numbers, each photo is represented by 36 million numbers. To even know it’s a cat, your computer needs to analyze 36 million numbers.
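
The arithmetic is simple, but the scale is not. A quick back-of-the-envelope check (the 4032 × 3024 resolution below is just one typical 12-megapixel layout):

```python
height, width = 3024, 4032   # a typical 12-megapixel photo
pixels = height * width      # about 12.2 million pixels
numbers = pixels * 3         # about 36.6 million RGB values
print(f"{pixels:,} pixels -> {numbers:,} numbers to analyze")
```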

It's actually even more complicated than that, because just one pixel doesn't tell you that it's a cat. Groups of pixels together give you the eyes, ears, and whiskers, and now you know it's a cat. The computer needs to analyze the relationships of these 36 million numbers to each other, and 36 million numbers paired with each other is already on the order of hundreds of trillions of pairs. This means that computers need to analyze trillions of relationships just to look at a single image.

So how is a computer able to tell you whether this is a cat in a split second, as quickly as a human can? As you can see, understanding what is in an image is staggeringly complex.

Only recently have computers been able to do this. This is where algorithms, or intelligence, come in. We have developed algorithms to simplify the computation so that it doesn't take days or months to tell you whether you took a photo of a cat.

Convolution

I’m going to explain one algorithmic innovation to you that is critical to understanding images, which is convolution. It is so foundational to image recognition that you’ll be glad that you can understand it. Plus, after this bit, you can throw around the term “Convolutional Neural Networks” like it’s nobody’s business.

If you look at an image, you’ll notice two things:

One. You look at each area separately. The door ledge on the bottom right is not relevant to the fact that there are cat ears on the top.

Two. If you know what cat ears look like, then you can find all occurrences of cat ears in the entire image.

Convolution is a method that allows you to easily and quickly do both of those things to an image. Otherwise, as mentioned before, you might have to look at how each pixel is related to every other pixel in the entire image, or work out from scratch each time whether you're looking at cat ears. Using convolution saves you a lot of work.

How does it do that? Convolution takes an image, which, if you remember, is a matrix of 36 million numbers, and applies an operation to each number that takes into account the numbers immediately surrounding it, but not the numbers that are farther away.

So if we zoom in on a cat ear, I take the pixels in this small box and I convolve all the numbers in these pixels to come up with a new number. Now this number has the information from itself and the eight pixels surrounding it. But it doesn’t have any information from the pixels that are far away and not adjacent to it, which makes this an efficient operation. This operation is called a convolution. I convolve the pixels and it tells me that I’m looking at a line at a certain angle.
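
Here's a minimal sketch of that single step in NumPy (the patch and the kernel are made up for illustration): a 3×3 patch of brightness values is combined with a small hand-picked "vertical edge" kernel, and the nine multiplications are summed into one new number that is large exactly when there is a line running through the patch.

```python
import numpy as np

# A 3x3 patch of brightness values: dark on the left, bright on the right,
# i.e. a vertical line (edge) running through the patch.
patch = np.array([[10, 10, 200],
                  [10, 10, 200],
                  [10, 10, 200]], dtype=float)

# A hand-picked kernel that responds to vertical edges:
# it subtracts the left column from the right column.
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)

# One convolution step: multiply element-wise and add everything up.
# The result summarizes the pixel and its eight neighbors in a single number.
response = np.sum(patch * kernel)
print(response)   # 570.0 here: a big positive number means "strong vertical edge"
```

Slide that same kernel across the whole photo and you get a map of where the vertical lines are, without ever comparing far-apart pixels to each other.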

I convolve over larger and larger areas and I see two lines at these angles with a furry texture in the middle. This tells me that I’m looking at the ear of a cat.

I convolve all over the image and now I see that there are two cat ears, eyes, whiskers, and paws. Now I have pretty good confidence that I’m looking at a cat. This is why you often hear the words Convolutional Neural Networks, or CNNs, or ConvNets, when you hear discussions of computer vision.
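
If you're curious what that looks like in code, here's a toy sketch (assuming PyTorch; the layer sizes are illustrative, not taken from any real model). Each convolutional layer learns its own kernels rather than using hand-picked ones, and each pooling step lets the next layer look at a larger area of the photo:

```python
import torch
import torch.nn as nn

# A toy ConvNet: convolutions find edges, pooling zooms out, deeper layers
# combine edges into ears and whiskers, and the last layer guesses "cat or not".
tiny_convnet = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # 3 RGB channels in, 16 feature maps out
    nn.ReLU(),
    nn.MaxPool2d(2),                              # halve the resolution: look at bigger areas
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # combine edges into textures and parts
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 56 * 56, 2),                   # final scores for "cat" vs "not a cat"
)

# A fake batch with one 224x224 RGB image, just to check that the shapes line up.
fake_photo = torch.randn(1, 3, 224, 224)
print(tiny_convnet(fake_photo).shape)             # torch.Size([1, 2])
```

Notice that every layer only ever looks at small neighborhoods. That is exactly the trick described above, repeated until the network has effectively seen the whole photo.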

There we go! Now when you see an image, you see the 1's and 0's (actually numbers from 0 to 255) the same way computers see them. Plus, now you have a sense of the computational weightlifting they do to understand what's in an image.

If you are interested in computer vision, check out my other posts:

If you are building a startup using computer vision, I’d love to talk to you. Shoot me an email at chang@upfront.com.

