Convolutional Neural Networks (Part 1)

Many of the images are taken from other resources and blogs. I have mentioned them in the resources. They are used for educational purposes only.

What is this post about?

Prerequisite: basic knowledge of neural networks.

In this post we will focus on a basic but important question: why do we need convolutional neural networks when artificial neural networks (ANNs) already exist? What is the fundamental difference between a simple neural network and a convolutional neural network? What do we gain in the transition from ANN to CNN? After answering these questions, we will look at the architecture of CNNs. So let's start and join the fray.


First, we will discuss computer vision. It is a very wide field, so we will only cover the parts necessary for understanding CNNs.

Computer Vision…

In simple language: we receive some type of visual input, and we have to make predictions, draw conclusions, etc. about that input.

For example, the most important tasks of computer vision are:

  • Object classification: identifying and classifying different objects in an image. Its applications include autonomous driving, facial recognition, etc. It is the backbone of many CNN applications.
  • Object segmentation: figuring out where exactly different objects lie in an image.

An object classification algorithm will take an image like this as input and correctly predict that there is a dog in it. It can even identify other objects like grass, trees, etc. For humans this seems a very simple task, but for a computer it is not. Why? Because the computer sees the image as a matrix of values.

Making predictions or drawing conclusions from this matrix is not an easy task.
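To make this concrete, here is a toy sketch of what an image looks like to a computer. The pixel values below are invented purely for illustration; they are just a grid of numbers.

```python
import numpy as np

# A toy 4x4 grayscale "image": to a computer it is nothing but numbers.
# (Values are made up; real grayscale pixels run 0-255.)
image = np.array([
    [  0,   0, 255,   0],
    [  0, 255,   0,   0],
    [255,   0,   0,   0],
    [  0,   0,   0, 255],
])

print(image.shape)  # (4, 4)
print(image.max())  # 255
```

Every question we want to answer about the picture has to be answered from this grid alone.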

To segment an image into generic categories (sky, grass, road, etc.), the computer has to look at the image as a whole. When dealing with video, motion tracking helps figure out how a particular object moves from frame to frame.


It is comparatively easy to make computers exhibit adult-level performance on intelligence tests or at playing checkers, and difficult or impossible to give them the skills of a one-year-old when it comes to perception and mobility.

That's why it is very easy for a computer to do mathematical calculations, yet difficult for it to perceive and draw conclusions from visual input the way even a small child learning from its environment can.


So, when we already have artificial neural networks and even deep neural networks, why introduce convolutional neural networks for computer vision? The fact is that traditional neural networks do not work well for computer vision.

We will use three different examples to show why the move from ANNs to CNNs is an upgrade.

First example

So let's suppose we have made a simple ANN to recognise whether an image contains an X (output 1) or not (output 0). We have trained our ANN on the left image. Now what happens when we ask for a prediction on the right image?

Our network won't be able to recognise the second image (a shrunken version of the X). This is because an ANN is, in this sense, a dumb network: any rotation, translation, etc. of the objects in an image makes the ANN give the wrong output.

To a computer, an image looks like a two-dimensional array of pixels (think giant checkerboard) with a number in each position. In our example a pixel value of 1 is white, and -1 is black. When comparing two images, if any pixel values don’t match, then the images don’t match, at least to the computer. Ideally, we would like to be able to see X’s and O’s even if they’re shifted, shrunken, rotated or deformed.
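A quick sketch of that pixel-by-pixel comparison, using the same 1 = white / -1 = black convention. The images here are invented stand-ins (a vertical bar rather than the X from the figure), but the failure mode is the same:

```python
import numpy as np

# Two 5x5 binary images of a vertical bar: 1 = white, -1 = black.
original = -np.ones((5, 5), dtype=int)
original[:, 2] = 1            # bar in the middle column

shifted = -np.ones((5, 5), dtype=int)
shifted[:, 3] = 1             # the same bar, shifted one pixel right

# Exact pixel-wise comparison: any mismatch means "different" to the computer.
print(np.array_equal(original, shifted))  # False
```

A human sees the same shape in both; the exact-match comparison sees two completely different matrices.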

This happens because the ANN takes the image as a whole: we feed each pixel in as a separate input. The 2-D image is converted into a single 1-D vector, and as a result the structural properties of the image are lost.

Second example

Now let's suppose we have the MNIST dataset of digits.

Representation of an 8 in matrix form

In the same way, the image is flattened and we get a single input vector. An image of dimension 8*8 or 64*64 gets converted to a 64*1 or 4096*1 input vector. Now train on all these digit images and you will get a network which recognises '8' with high accuracy.
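As a rough sketch of that flattening step (using a random array as a stand-in for a real 8x8 digit image):

```python
import numpy as np

# Toy stand-in for an 8x8 digit image (invented values, not real MNIST data).
digit = np.random.randint(0, 16, size=(8, 8))

flat = digit.reshape(-1)   # the 64-element vector the ANN actually sees
print(flat.shape)          # (64,)

# Vertically adjacent pixels are now 8 positions apart in the vector,
# so the 2-D neighbourhood structure is no longer explicit.
```

Notice that `flat[8]` is the pixel directly below `flat[0]` in the original image; the network has no way of knowing that.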



Well, the answer is the same as before: because we have flattened the image, losing its structural properties, our dumb ANN will only recognise images which are almost identical to the training images. So we feed it as many forms of '8' as possible for better accuracy. If we miss any variant, there is a high probability we will get the wrong answer when that variant appears as a test case.

Now let's expand on this point.

First, our "8" recognizer really does work well on simple images where the digit is right in the middle of the image:

This is because our network only learned the pattern of a perfectly-centered “8”. It has absolutely no idea what an off-center “8” is. It knows exactly one pattern and one pattern only.

In the real world everything is not simple and clean. We can't expect every object to be centred, and an ANN is not invariant to translation or rotation.

Third example

In the same way as before, the input image matrix is flattened and we get a one-dimensional input array. First, we'll train with lots of spoons. The network will look at all these spoons and try to figure out what makes a spoon a spoon, based on patterns in the images (e.g. the reflection from the spoon head, the indent in the spoon head, the handle, etc.). We might also add images of forks and knives to cover other cases.

Again, the accuracy depends on the images used for training. When inputting any image of a spoon, we might assume its pixels will have similar values to the training images, because both images are of spoons. So we expect the ANN to compare the pixels of both images and identify both as spoons; in other words, it should find distinctive patterns in these pixel input vectors that tell it when an input vector is likely to be that of a spoon. The ANN will try to do exactly this, but it will fail.


Suppose image 2 was not present during training. We might think that since both are spoons, their pixels will be similar and the ANN will identify it. No, because:

These are both images of a spoon, but they are different types of spoons: the indented parts have different reflections, the orientations are different, the angles are different, the backdrops are different. As a result, the actual pixel input vectors will not be even remotely similar. Thus, an ANN will have a hard time associating these two images.

All the above examples are simple. In the real world, images will be much larger, there will be noise, objects may be located in different parts of the image, objects may be of different sizes, and objects may be shrunken, rotated, etc. A simple ANN won't give good results in these cases.


FIRST APPROACH: Searching with a sliding window

For the MNIST dataset we have already made a program that identifies an '8', but only when it is centred in the image. One brute-force approach is to search for the object across the whole image by sliding a window from left to right and then down. It's like dividing the image into segments and hoping to find the object in one of them, scanning one segment at a time.
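A minimal sketch of this brute-force scan. The window size, stride, and image size here are arbitrary choices for illustration, and the classifier itself is left out:

```python
import numpy as np

# Slide a fixed-size window across the image; each crop would be fed to the
# centred-"8" classifier.
def sliding_windows(image, win_h, win_w, stride=1):
    h, w = image.shape
    for top in range(0, h - win_h + 1, stride):
        for left in range(0, w - win_w + 1, stride):
            yield top, left, image[top:top + win_h, left:left + win_w]

image = np.zeros((28, 28))  # blank stand-in image
crops = list(sliding_windows(image, 8, 8, stride=4))
print(len(crops))  # 36 windows the classifier must be run on
```

And this count is for a single window size; to handle objects of different sizes, the whole scan has to be repeated with several window sizes.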

But this approach is not efficient: we have to scan the same image again and again, looking for objects of different sizes.

SECOND APPROACH: Input a lot of data and use a deep neural network

This approach is basically about covering every aspect and every scenario related to the objects in the image.

When we trained our network, we only showed it "8"s that were perfectly centred. What if we trained it with more data, including "8"s in all different positions and sizes all around the image?

Using this approach we would need an endless supply of data, and therefore bigger networks, so we turn to deep neural networks. Without a GPU this technique is much slower, and it does not make sense to train our network separately for every orientation of the same object. We would be treating the orientations as if they were different objects.


SCALING of large images in an ANN: if each value in the pixel input vector is fed into a separate node, we essentially get a new input feature per pixel. Imagine a 250 x 250 grayscale image: that's 62,500 input features! That would mean something at least on the order of hundreds of thousands of weights per layer (if not more), which is simply infeasible. Such a large number of parameters would mean slow training and very likely overfitting as well.
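A back-of-the-envelope count makes this vivid. The hidden-layer width of 1,000 below is an arbitrary assumption, not a figure from the text:

```python
# Fully connected first layer on a 250x250 grayscale image.
pixels = 250 * 250               # 62,500 input features
hidden_units = 1000              # assumed width of a modest hidden layer
weights = pixels * hidden_units  # weights in the first layer alone

print(f"{weights:,}")  # 62,500,000
```

Even this modest first layer already has tens of millions of weights, before any further layers are added.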

You may ask: why are we inputting raw pixels into neural networks? Why can't we extract or compute our own features?

This approach can work, but historically hand-crafted features have not performed well. You need deep knowledge of computer vision, and it's not always possible to design features that generalise to all objects.


So let's go back to our first step. From the beginning we took the whole image as one unit and assumed that individual pixels tell us whether a particular object is present. That is partly true, but not entirely: what really defines an object is its special characteristics, the ones which distinguish it from other things, such as its boundary structure, edges, lines, corners, etc. An important property of these features is that most of them are translation- and rotation-invariant most of the time. So instead of considering the whole image at once, we should break the image into small overlapping parts and try to find these features. By doing this, we no longer have to worry about the problems with ANNs described above.
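The "small overlapping parts" idea can be sketched as simple patch extraction. The patch size and stride below are arbitrary choices for illustration:

```python
import numpy as np

# Break an image into small overlapping patches: the idea that leads
# towards convolution is to look for local features inside each patch
# instead of comparing whole images.
def extract_patches(image, size=3, stride=1):
    h, w = image.shape
    return [image[i:i + size, j:j + size]
            for i in range(0, h - size + 1, stride)
            for j in range(0, w - size + 1, stride)]

image = np.arange(36).reshape(6, 6)   # toy 6x6 image
patches = extract_patches(image, size=3, stride=1)
print(len(patches), patches[0].shape)  # 16 (3, 3)
```

With a stride smaller than the patch size, neighbouring patches overlap, so a feature sitting on a patch boundary is still fully contained in some other patch.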


First example

Now, instead of taking the whole image at once, we try to find special features which only an X possesses: features like the two diagonals, one to the left and one to the right, shown above in the green and purple boxes. Another feature is the small centred X pattern. It is the central piece of the X and will remain invariant in most scenarios. So if you learn these features in the first image and then test on the second image, which is a shrunken version of the first, it will be classified correctly.
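Here is a rough sketch of that idea: slide the small centred-X feature over the image and score how well each location matches. The arrays are invented, and the score is a plain element-wise product, a precursor to the convolution operation covered in the next post:

```python
import numpy as np

# The small centred "X" feature, with 1 = white, -1 = black.
feature = np.array([[ 1, -1,  1],
                    [-1,  1, -1],
                    [ 1, -1,  1]])

image = -np.ones((7, 7), dtype=int)  # black background
image[2:5, 2:5] = feature            # place the feature at row 2, col 2

# Slide the feature over every 3x3 window and score the match.
best_score, best_pos = -np.inf, None
for i in range(image.shape[0] - 3 + 1):
    for j in range(image.shape[1] - 3 + 1):
        score = np.sum(image[i:i + 3, j:j + 3] * feature)
        if score > best_score:
            best_score, best_pos = score, (i, j)

print(best_pos, best_score)  # (2, 2) 9: a perfect match, wherever the feature sits
```

Because we search for the small local pattern rather than compare whole images, the feature is found no matter where in the image it appears.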

The same approach can be applied to the second and third examples. In the case of spoons, the unique features could be the handle, the circular bowl-type shape at the front, and its edges, corners, etc.

So remember: operations on individual pixels are not enough to figure out whether those pixels represent some object. Instead, we need to find the pixel groupings that make up edges/lines/corners, and see how groupings of those edges/lines/corners go on to form an object's characteristics. Furthermore, flattening images into single vectors, though it retains the pixel values, loses information such as the structure of the image. Our neural network cannot exploit this structure, which is certainly important when it comes to recognising objects.


For the biological aspect of CNNs, please go to the link.

The link clearly explains the scientific, biological background of CNNs. After reading it, you will get a basic idea of how researchers came up with CNNs in the first place. We will not discuss the biological aspects in this post.

In my next post we will discuss the architecture of CNNs and the intuition behind it.


THANKS for reading………..

