How Does A Convolutional Neural Network Work

satyabrata pal

Published in

ML and Automation

5 min read · Feb 24, 2020

Behind The Scenes Of A Convolutional Neural Network

Photo by Markus Spiske temporausch.com from Pexels

Update: I am so excited to announce that my deep learning course is now available on Udemy.

In my earlier posts here and here I talked about building a neural network using fastai.

Over there I talked about the “code-first” approach of creating a neural network and I didn’t talk about any theory.

Sometimes it becomes necessary to get into the theory of a concept to get it.

Don’t worry! I won’t bore you with several lines of maths and deep philosophy about how a neural network works on a subatomic level. Rather, I am planning to keep it as visual as possible.

How Computer Sees An Image

A digital image is actually a matrix.

No! not this matrix. Actually this one →

I have taken the matrix representation of Felix the Cat from the Klein Project Blog. Here the cat image on the left is represented as a 35*35 matrix on the right. The elements of this matrix are the numbers 0 and 1, and each number represents the colour of one pixel in the image.

0 means black and 1 means white.

How Does A CNN Process This Image Data

A deep neural network is nothing but a collection of matrix multiplications and additions.

If you cut open a CNN and look inside it then you can see the following operations being performed →

The image matrix goes into the network as an input.
The network does not act on the large image matrix all at once. Rather it goes chunk by chunk, taking a couple of pixels at a time. To select these pixels we use what is known as a “filter”. A filter is another, smaller matrix, usually a 3*3 matrix.
Each element of this smaller matrix is multiplied with the corresponding element of that part of the image matrix which is covered by the filter.
Finally we add these products together to get a single number as the result.
This computation is known as convolution.
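
If you prefer code to words, the steps above can be sketched in a few lines of plain Python. This is only a minimal sketch of one convolution step: the numbers in `patch` and `kernel` are made up, and `convolve_patch` is a name I picked for illustration, not something from a library.

```python
# One convolution step: multiply the filter with the image patch it covers,
# element by element, then add all the products together.
# The numbers below are made up purely for illustration.

patch = [  # the 3*3 part of the image currently covered by the filter
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
]

kernel = [  # the 3*3 "filter"
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]


def convolve_patch(patch, kernel):
    """Return the sum of element-wise products of two equally sized matrices."""
    total = 0
    for patch_row, kernel_row in zip(patch, kernel):
        for pixel, weight in zip(patch_row, kernel_row):
            total += pixel * weight
    return total


result = convolve_patch(patch, kernel)
print(result)  # 3
```

Nine multiplications and one sum give a single number. The network repeats this for every position of the filter on the image.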

Got Confused? Me Too!

There’s no confusion that a good visualization can’t resolve. Suppose that we have a 4*4 image and a 2*2 filter. Going by the above description a CNN would perform the following operation.

Notice how we got the following equations as a result of each convolution operation →

aA+bB+cE+dF = x1
aC+bD+cG+dH = x2
aI+bJ+cM+dN = x3
aK+bL+cO+dP = x4

The result of these four equations reduces to a single matrix →

This way the convolution operation reduces the 4*4 image matrix to a 2*2 matrix (the filter here moves two pixels at a time, so the four patches do not overlap), and thus the final matrix takes up less memory in further operations down the network.
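
As a cross-check of the four equations, here is a small sketch that spells them out with letters instead of numbers. It assumes the 16 image pixels are labelled A to P row by row and the filter weights a to d, which is my reading of the equations; `patch_equation` is a hypothetical helper name.

```python
# Image pixels labelled A..P row by row, filter weights labelled a..d.
# This labelling is an assumption based on the equations in the text.
image = [
    ["A", "B", "C", "D"],
    ["E", "F", "G", "H"],
    ["I", "J", "K", "L"],
    ["M", "N", "O", "P"],
]
kernel = [
    ["a", "b"],
    ["c", "d"],
]


def patch_equation(image, kernel, top, left):
    """Spell out the sum of products for the patch whose top-left corner is (top, left)."""
    terms = []
    for i, kernel_row in enumerate(kernel):
        for j, weight in enumerate(kernel_row):
            terms.append(weight + image[top + i][left + j])
    return "+".join(terms)


# Move the filter two pixels at a time so that the four patches do not overlap.
equations = [
    patch_equation(image, kernel, top, left)
    for top in range(0, 4, 2)
    for left in range(0, 4, 2)
]

for number, equation in enumerate(equations, start=1):
    print(f"{equation} = x{number}")
```

Four filter positions give four numbers, x1 to x4, which is exactly the 2*2 output matrix.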

That’s it! This is what a CNN is.

What if the input matrix (i.e. the image) and the filter are of the same size? In such a case we cheat. We add zeros around the input image to make it bigger. This cheating is known as “padding”. This way it sounds more technical. Just kidding!!
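
Padding is easy to see in code. The sketch below is a plain Python illustration, not how a deep learning library does it internally; `zero_pad` and its `pad` parameter are names I made up for this example.

```python
def zero_pad(image, pad):
    """Return a copy of `image` with `pad` rings of zeros added around it."""
    width = len(image[0]) + 2 * pad
    # Rows of zeros above the image.
    padded = [[0] * width for _ in range(pad)]
    # Original rows with zeros added on the left and on the right.
    for row in image:
        padded.append([0] * pad + list(row) + [0] * pad)
    # Rows of zeros below the image.
    padded.extend([0] * width for _ in range(pad))
    return padded


image = [
    [1, 2],
    [3, 4],
]

padded = zero_pad(image, 1)
for row in padded:
    print(row)
```

The 2*2 image becomes a 4*4 matrix with the original pixels in the middle, so a 2*2 filter now has more than one position to visit.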

The Reality

All this was a simplified version of a CNN. In reality, a colour image is a 3D matrix: a height, a width and one layer for each colour channel (e.g. red, green and blue).

So, in practice we have to do an element-wise multiplication of the 3*3*3 = 27 elements of the 3D filter with the corresponding 27 elements of the part of the image which the filter covers.
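
Here is a rough sketch of that 27-element step in plain Python. I am assuming the layout `patch[channel][row][column]` (channels first); libraries differ on this, and the pixel values and the helper name `convolve_colour_patch` are made up for illustration.

```python
# A 3*3 patch of a colour image, stored as patch[channel][row][column].
# The channels-first layout and the values are assumptions for this sketch.
patch = [
    [[1, 1, 1], [1, 1, 1], [1, 1, 1]],  # red channel
    [[2, 2, 2], [2, 2, 2], [2, 2, 2]],  # green channel
    [[3, 3, 3], [3, 3, 3], [3, 3, 3]],  # blue channel
]

# A 3*3*3 filter with every weight set to 1, to keep the arithmetic easy.
kernel = [[[1] * 3 for _ in range(3)] for _ in range(3)]


def convolve_colour_patch(patch, kernel):
    """Return (sum of products, number of multiplications) for a 3D patch."""
    total = 0
    multiplications = 0
    for patch_channel, kernel_channel in zip(patch, kernel):
        for patch_row, kernel_row in zip(patch_channel, kernel_channel):
            for pixel, weight in zip(patch_row, kernel_row):
                total += pixel * weight
                multiplications += 1
    return total, multiplications


total, multiplications = convolve_colour_patch(patch, kernel)
print(total, multiplications)  # 54 27
```

All three channels are folded into a single number, so one filter position still produces one output value, just with 27 multiplications instead of 9.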

Now, think of the image matrix of Felix the Cat which we saw in the first section. It’s a 35*35 matrix. Sliding a 3*3 filter over this whole matrix takes a lot of computation. Think of how much computation you would need when you have a colour image. The short answer is: a lot more.
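
To put rough numbers on “a lot”, here is a back-of-the-envelope sketch. It assumes the filter moves one pixel at a time and that no padding is used; the colour version of Felix is hypothetical, and `count_multiplications` is a name I made up.

```python
def count_multiplications(image_size, filter_size, channels=1):
    """Count the multiplications needed to slide one filter over a square image.

    Assumes the filter moves one pixel at a time and no padding is used.
    """
    positions_per_side = image_size - filter_size + 1
    positions = positions_per_side ** 2
    return positions * filter_size * filter_size * channels


grey = count_multiplications(35, 3)                # 35*35 image, 3*3 filter
colour = count_multiplications(35, 3, channels=3)  # same image with 3 colour channels

print(grey)    # 9801
print(colour)  # 29403
```

The filter fits in 33*33 = 1089 positions, so even this tiny image needs thousands of multiplications for a single filter.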

What do we do in such a scenario? Well! We cheat again. This time we cheat by jumping over a couple of pixels while moving the filter through the image.

This jumping is known as a “stride”. Another fancy word to sound more technical.

If we jump 2 pixels at a time then it’s known as stride 2. If we jump 3 pixels at a time then it’s known as stride 3 and so on.

The doodle below will help you visualize it. Yes! I call it a doodle because it’s nowhere close to a “visualization”.

In the first doodle the blue shaded region is the area which the filter covers. Here it skips two columns, i.e. it jumps 2 pixels. This is stride 2.

In the second doodle the blue shaded region is the area which the filter covers. Here it skips three columns, i.e. it jumps 3 pixels. This is stride 3.
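
The effect of the stride on the output size can also be sketched in code. This is a minimal plain Python version for square, single-channel images without padding; the 6*6 image, the 2*2 filter and the function name `convolve` are all made up for illustration.

```python
def convolve(image, kernel, stride=1):
    """Slide a square filter over a square image, jumping `stride` pixels each time."""
    k = len(kernel)
    size = len(image)
    output = []
    for top in range(0, size - k + 1, stride):
        output_row = []
        for left in range(0, size - k + 1, stride):
            total = 0
            for i in range(k):
                for j in range(k):
                    total += kernel[i][j] * image[top + i][left + j]
            output_row.append(total)
        output.append(output_row)
    return output


# A made-up 6*6 image whose pixels are simply numbered 0..35.
image = [[row * 6 + col for col in range(6)] for row in range(6)]
kernel = [
    [1, 0],
    [0, 1],
]

for stride in (1, 2, 3):
    result = convolve(image, kernel, stride)
    print(f"stride {stride}: {len(result)}*{len(result[0])} output")
```

With stride 1 the output is 5*5, with stride 2 it is 3*3 and with stride 3 it is 2*2. Bigger jumps mean fewer filter positions, which means less computation.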

Conclusion

Well! What did you expect? That I would write more about a CNN? I won’t. At least not in this article, because this is pretty much all of it. This is how a CNN works behind the scenes.

There’s no magic and no fancy stuff going on inside it. Sorry about that.

Announcement

My deep learning course is available on Udemy at a 95% discount till 31st May midnight. Use this link to avail the discount.