A Comprehensive Guide to Convolutional Neural Networks

Gaurav Rajpal
Published in The Startup · 11 min read · Nov 20, 2020

In this blog we will focus on what convolutional neural networks are and how they work.

The Convolutional Neural Network, or CNN as it is popularly known, is the most commonly used deep learning algorithm. Before we get into how a CNN works, let us first understand the problems faced by the traditional MLP and why we need CNNs in the first place.

ISSUES WITH THE TRADITIONAL MLP & WHY WE NEED CNN?

Let us start with a simple example,

Source: Analytics Vidhya

In the example shown above we would find it difficult to explain what the image on the left means, but when we look at the image on the right we immediately recognize that it is an image of a dog. The interesting thing is that both images are the same: the image on the right is the 2D image of a dog, whereas the image on the left is that same image flattened into 1D. Not only humans but computers too find it difficult to recognize an image represented in 1D.

Let us consider another scenario,

Source: Analytics Vidhya

In the above figure, the first image is a normal image of a dog while the second is a manipulated one in which the nose and an eye have been swapped. In this 2D form it is easy for us to identify the abnormality, but in 1D it is very difficult to spot. This is the first problem with the MLP, i.e. losing the spatial orientation of the image. We must remember that a dog is a dog only when the nose, eyes, ears etc. are present where they should be relative to one another. Any change in these relative positions disqualifies the image from being a dog.

  • MLP uses a 1D representation of an image to identify or classify it, whereas CNN uses the 2D representation. Thus CNN preserves the spatial orientation.
Source: Analytics Vidhya
  • The other issue with MLP is more on the computational side of things.

If we consider the adjoining image and create a neural network with 1000 neurons in the hidden layer, the number of parameters (the weight matrix) would be about 10⁶.

If we consider the adjoining image with many more pixels and build the same kind of neural network, the number of parameters would be around 600 x 10⁶ (600 million).

Either situation would be a nightmare for our computer system.

Thus the issue we saw when building a neural network with even a single hidden layer over images of these dimensions is called parameter explosion.

The CNN is built to deal with both of the above problems, i.e. losing spatial orientation and parameter explosion. It preserves the spatial orientation and also reduces the number of trainable parameters in the network.
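To make the scale of the parameter-explosion problem concrete, here is a quick back-of-the-envelope calculation. The image sizes below are illustrative assumptions, not taken from the original figures; they are chosen so the counts land near the 10⁶ and 600 x 10⁶ figures mentioned above.

```python
# Parameters in a single fully-connected layer: every input pixel
# connects to every hidden neuron (biases ignored for simplicity).
def fc_params(height, width, channels, neurons):
    return height * width * channels * neurons

small = fc_params(32, 32, 1, 1000)    # tiny grayscale image -> ~1e6 weights
large = fc_params(448, 448, 3, 1000)  # larger RGB image -> ~600e6 weights
print(small)   # 1024000
print(large)   # 602112000
```

Even a modest increase in image resolution multiplies the weight matrix enormously, which is exactly why the MLP does not scale to images.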

BASIC STRUCTURE OF CNN.

Source: Analytics Vidhya

In MLP (multilayer perceptron), if we remember, the hidden layer was responsible for generating features. In CNN, apart from the input, hidden and output layers, we also have the convolution layer. The convolution layer works on the 2D input, which solves the spatial-orientation issue we discussed, and it also acts as a feature extractor. So, in CNN we have the convolution layer and the hidden layers acting as feature extractors. These features are extracted using filters, which we will discuss next.

ROLE OF FILTERS IN CNN.

As we saw in the structure of CNN, the convolution layer is used to extract features, and for extracting features it uses filters. So, let us now discuss how features are extracted using filters.

Consider the image below,

Source: Analytics Vidhya

In the above image we used filters like Prewitt or Sobel and obtained the edges. For a detailed understanding of working with images and extracting edges, you can refer to my blog below for the theoretical and practical implementation.

Let us understand how the filter operation works using an animated image.

HOW DOES THE FILTER OPERATION WORK?

Convolution Operation on 5x5 with 3x3 filter

Suppose we have a matrix of numbers representing an image. We take a 3x3 filter and perform element-wise multiplication of the filter with each patch of the image, summing the results. The matrix obtained after performing this filter operation is called the Feature Map.

The filter moves over the image the way we write on paper, i.e. left to right. The number of pixels the filter moves in the horizontal direction is called the column stride, and the number of pixels it moves in the vertical direction is called the row stride. This operation, where the filter slides over the image, performs the element-wise multiplication and generates a new matrix called the feature map, is known as the convolution operation.
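The sliding-window operation described above can be sketched in a few lines of NumPy. This is a minimal, unoptimized illustration of the idea (real frameworks use far faster implementations); the 5x5 image and all-ones filter are toy values chosen for demonstration:

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Valid convolution: slide the kernel over the image, multiply
    element-wise with each patch, sum, and store in the feature map."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out_h = (ih - kh) // stride + 1
    out_w = (iw - kw) // stride + 1
    fmap = np.zeros((out_h, out_w))
    for r in range(out_h):          # move down: row stride
        for c in range(out_w):      # move right: column stride
            patch = image[r*stride:r*stride+kh, c*stride:c*stride+kw]
            fmap[r, c] = np.sum(patch * kernel)
    return fmap

img = np.arange(25).reshape(5, 5)     # toy 5x5 "image"
kernel = np.ones((3, 3))              # toy 3x3 filter
print(convolve2d(img, kernel).shape)  # (3, 3)
```

Note how a 5x5 input with a 3x3 filter yields a 3x3 feature map, matching the animation above.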

UNDERSTANDING THE DIMENSIONS.

Now that we know how the feature map is calculated, let us look at the dimensions of the input image, the filter and the feature map.

Source: Analytics Vidhya

In the above figure we have an input image of size (13 x 8), followed by a filter of size (3 x 3) and a feature map of size (11 x 6) obtained by the convolution operation.

Consider the images below,

Source: Analytics Vidhya

When the filter is placed over the first patch of the input image, it combines the pixel values surrounding the target pixel 34 and stores the resultant value in the feature map.

Source: Analytics Vidhya

Basically, the feature map contains a value for each pixel highlighted in the green box, but the pixels on the edges are not taken into account.

Source: Analytics Vidhya

If we consider a pixel on an edge, e.g. pixel 36, we will notice that there are not enough pixels surrounding it, so it cannot serve as a target pixel in the convolution operation. This is why the size of the feature map becomes smaller after every convolution operation.

PADDING STRATEGIES.

As we understood in the previous section, pixels on the boundary do not contribute as target pixels in the convolution operation, so to resolve that issue let us understand padding strategies.

While building a convolution layer we can set the padding strategy, which can be of 2 types.

  • SAME: The feature map and the input image are of the same size, since zeros are added around the edges of the image.
  • VALID: The feature map is smaller than the input image, since no zeros are added.

Imagine we had an image of 1300 x 800; we cannot count every single value in the output image, so we can use the formula below to calculate the height and width of our output, i.e. the feature map. With VALID padding, output size = (input size − filter size) / stride + 1 along each dimension.

Source: Analytics Vidhya.
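The formula is easy to wrap in a small helper. The SAME branch below assumes the common framework convention (e.g. TensorFlow) of output = ceil(input / stride); this sketch covers only the standard cases:

```python
import math

def conv_output_size(in_size, filter_size, stride=1, padding="valid"):
    """Height/width of a feature map after one convolution."""
    if padding == "same":
        # zeros are added so the output keeps the input's spatial size
        return math.ceil(in_size / stride)
    # valid: only full filter placements count
    return (in_size - filter_size) // stride + 1

print(conv_output_size(13, 3))   # 11  (matches the 13x8 -> 11x6 example)
print(conv_output_size(8, 3))    # 6
print(conv_output_size(1300, 3)) # 1298
```

Running it on the earlier (13 x 8) example with a (3 x 3) filter reproduces the (11 x 6) feature map without counting pixels by hand.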

There are few important things we must note here:

  • Each convolution layer can have multiple filters.
  • The values in a filter are not fixed; they are learnt during the training process.
  • It is like MLP, where parameters such as the weight matrix were learnt during backpropagation; here in CNN the filter values are learnt during backpropagation.

Using the formula discussed above, let us try to understand the dimensions of the feature map for grayscale images.

Shape with 1 filter and 2D image.

Now, instead of a single filter, if we use n filters we will have n feature maps stacked together. The values in the filters can be different and are learnt during backpropagation, hence we can obtain different feature maps of a single input image.

Shape with n filters and 2D image.

Let us look at the scenario where our input image has more than one channel, i.e. an RGB image.

Source: Analytics Vidhya

For the image above we cannot use our 2D filter for the convolution operation, as the number of channels in the filter should be the same as the number of channels in the input image. Interestingly, if we use an RGB image along with a 2D filter definition, deep learning frameworks handle this automatically; we do not have to mention the number of channels. Note that the output of the operation will still be a 2D feature map.

Illustration:

Source: Analytics Vidhya

Now, instead of 9 values generating a single value in the feature map, we will have 27 values contributing to each single value in the feature map.
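One position of the 3-channel convolution can be sketched directly: a (3 x 3 x 3) patch of the RGB image is multiplied element-wise with a (3 x 3 x 3) filter and summed into one number. The random values here are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
patch = rng.random((3, 3, 3))    # one 3x3 patch across all 3 channels
kernel = rng.random((3, 3, 3))   # filter depth must match input channels

# all 27 products collapse into a single feature-map value
value = np.sum(patch * kernel)
print(patch.size)  # 27
```

Repeating this at every spatial position produces the 2D feature map, exactly as in the single-channel case, just with 27 multiplications per output value instead of 9.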

Source: Internet

LOCAL CONNECTIVITY & PARAMETER SHARING IN CNN.

In local connectivity, each output pixel value takes input from a (small) local group of pixel values from the complete image. This is exactly the convolution operation, i.e. element-wise filter multiplication. If we compare with MLP (multilayer perceptron), every input value used to get multiplied by its own weight.

The architecture of CNN (discussed in a later section) ensures that the learnt filters produce the strongest response to spatially local input patterns.

Source: Analytics Vidhya

In parameter sharing, all pixels in an input image share the same filter matrix. If we compare with MLP, each connection between the input and hidden layer was assigned a different weight, so the number of trainable parameters depended on the input size; in CNN, irrespective of the size of the input image, we use the same filter.
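To see how much parameter sharing helps, compare the trainable weights for a hypothetical 200 x 200 grayscale input (the layer sizes here are assumptions for illustration, not from the original figures):

```python
# Fully connected: every pixel gets its own weight per hidden neuron,
# so the count grows with the image size.
mlp_params = 200 * 200 * 1000   # 40,000,000 weights for 1000 neurons

# Convolution layer: the same 3x3 filter is reused at every position,
# so the count is independent of the image size.
cnn_params = 3 * 3 * 32         # 288 weights for 32 filters
print(mlp_params)  # 40000000
print(cnn_params)  # 288
```

The convolution layer's parameter count depends only on the filter size and the number of filters, which is exactly the point of parameter sharing.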

Source: Analytics Vidhya

ARCHITECTURE OF CONVOLUTION NEURAL NETWORK.

Now comes the exciting part of this blog where we will understand the architecture of our convolution neural network in parts.

CONVOLUTION LAYER :

Output of Successive Convolution Layer.

Consider we have 1000 images of size (200x200x3). This input is sent to a convolution layer with 32 filters, each of dimension (3x3x3). Considering the column and row stride as 1 and the padding strategy as VALID, the shape of the output from convolution layer 1 would be (1000x198x198x32), where 1000 is the number of images and (198x198x32) represents the dimensions of a single image. After the convolution operation we apply an activation function to introduce non-linearity; it does not change the dimensions of the output.

Now we introduce another convolution layer with 64 filters of size (3x3x32), since we have 32 channels in the input coming from convolution layer 1. The output after this operation would be (1000x196x196x64), where (196x196x64) represents the dimensions of an image after the second convolution layer.

So, this is how we calculate the shape of the output after a series of convolution layers.
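The shape bookkeeping above can be checked with the VALID-padding formula from earlier, applied layer by layer:

```python
def valid_out(size, filter_size, stride=1):
    # output size of a VALID convolution along one dimension
    return (size - filter_size) // stride + 1

h = w = 200                                 # input: (1000, 200, 200, 3)
h, w = valid_out(h, 3), valid_out(w, 3)     # conv1: 32 filters of 3x3x3
print(h, w)                                 # 198 198 -> (1000, 198, 198, 32)
h, w = valid_out(h, 3), valid_out(w, 3)     # conv2: 64 filters of 3x3x32
print(h, w)                                 # 196 196 -> (1000, 196, 196, 64)
print(h * w * 64)                           # 2458624 features after flattening
```

The final line also confirms the flattened feature count of 24,58,624 (about 2.4 million) used in the next paragraph's fully-connected layer.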

But note that the output of the convolution layer is a 3D matrix and is not the final output of the architecture. After the convolution layers we add the hidden layer, which is also called the fully-connected (FC) layer. As we have seen, MLP (multilayer perceptron) takes 1D inputs, so our 3D output from the convolution layer will be converted into 1D, and the shape of the input to the FC layer will be (1000, 196x196x64), i.e. 24,58,624 features per image. This process is known as Flattening.

Source: Analytics Vidhya

Considering the above image, we see that in the FC layer, for each of the 1000 images, we have almost 24 lakh (2.4 million) features. The input to the FC layer is thus a very huge number, so in order to deal with this we use another layer, called the Pooling Layer.

POOLING LAYER :

Pooling layers are used mainly for dimensionality reduction, and since they reduce the dimensions they make computation easier and training much faster. There are two main pooling techniques, i.e. Max Pooling & Average Pooling. Before we get into the details of these techniques, let us understand how pooling works.

Source: Analytics Vidhya

Let us consider a 2D input image of size 4x4 and a window of size 2x2 with a stride of one. Since the window size is 2x2, we select a 2x2 patch from the input image, perform some mathematical operation on it and generate one output value.

Let us now understand how do we calculate these values,

Source: Internet

The example we discussed so far used a 2D input. What if we have an RGB image? We must remember that pooling reduces the dimensions across the height and width of an image, not across the channels.

Source: Analytics Vidhya

There are a few more pooling techniques, like GlobalAveragePooling & GlobalMaxPooling, where we take the average or max value across each entire channel; these are generally used at the final layer to convert our 3D output into 1D.
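Both pooling techniques can be sketched with the same sliding-window loop as convolution, just with max or mean instead of a weighted sum. This minimal example assumes a stride of 2 (non-overlapping windows), which is the most common choice; the 4x4 values are illustrative:

```python
import numpy as np

def pool2d(image, window=2, stride=2, mode="max"):
    """Max or average pooling over a 2D input."""
    out_h = (image.shape[0] - window) // stride + 1
    out_w = (image.shape[1] - window) // stride + 1
    out = np.zeros((out_h, out_w))
    op = np.max if mode == "max" else np.mean
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = op(image[r*stride:r*stride+window,
                                 c*stride:c*stride+window])
    return out

img = np.array([[1, 3, 2, 1],
                [4, 6, 5, 0],
                [7, 2, 8, 3],
                [1, 0, 4, 9]])
print(pool2d(img))                # [[6. 5.] [7. 9.]]  max pooling
print(pool2d(img, mode="avg"))    # [[3.5 2. ] [2.5 6. ]]  average pooling
```

Notice that the 4x4 input shrinks to 2x2 while keeping the strongest (or average) response of each region, and that applying this per channel leaves the channel count unchanged.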

FORWARD PROPAGATION IN CNN.

In forward propagation, the convolution layers extract features from the input image with the help of filters, and the output obtained is sent to the hidden layer, where the weights and bias along with the inputs are used to calculate the output. In the backward propagation process, these filter values along with the weights and bias values are learnt and constantly updated.

For now, let us focus on forward propagation and understand it better; in the upcoming section we will discuss backward propagation.

For simplicity, I have considered a single convolution layer and a single neuron in the hidden layer.

Source: Analytics Vidhya
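The single-convolution-layer, single-neuron forward pass described above can be sketched end to end in NumPy. The random image, filter, weights and the ReLU activation are all illustrative assumptions; in a real network the filter and weights would be learnt during backpropagation:

```python
import numpy as np

rng = np.random.default_rng(42)

image = rng.random((5, 5))            # toy grayscale input
kernel = rng.random((3, 3))           # one 3x3 filter (learnt in practice)

# 1. Convolution (VALID padding, stride 1) -> 3x3 feature map
fmap = np.array([[np.sum(image[r:r+3, c:c+3] * kernel)
                  for c in range(3)] for r in range(3)])

# 2. Activation introduces non-linearity (ReLU assumed here)
relu = np.maximum(fmap, 0)

# 3. Flatten the 3x3 map into 9 features for the FC layer
flat = relu.flatten()

# 4. Single hidden neuron: weighted sum of features plus bias
weights = rng.random(9)
bias = 0.1
output = float(flat @ weights + bias)
print(output)
```

Steps 1 to 4 are exactly the pipeline in the figure: convolve, activate, flatten, then a fully-connected neuron produces the final value.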

BACKWARD PROPAGATION IN CNN

In backward propagation we compare the predicted output with the actual output and calculate the error. If the error is large, we can say that the predictions are far from the actual values. This error value depends on 3 sets of parameters, i.e. the weights, the biases and the filter values, and during backpropagation these values are updated.

Conclusion

So, in this blog we learnt about issues like losing spatial orientation and parameter explosion. Further, we discussed the convolution layer, the pooling layer, forward propagation and backward propagation.

Hope you understood the basic intuition behind all these layers, which are used for building CNNs and in Transfer Learning.

Connect with me over LinkedIn.


Aspiring Data Scientist | Blogger | ML & DL enthusiast with 5.4 years of industry-relevant experience.