Understanding CNNs!

vandan gorade · Published in Analytics Vidhya · 7 min read · Jun 26, 2020

CNN on MNIST

What goes through your mind when you, as a human, look at the world?

We can easily tell what is out there: its color, shape, size, texture… everything.

But How?

The human brain has visual areas in the visual cortex that are responsible for detecting motion, stereo depth, edges, color, texture segregation, segmentation, and so on.

Now the question is: how do we enable machines to view the world as we humans do, perceive it in a similar manner, and even use that knowledge for tasks such as image and video classification, segmentation, image generation, recommendation, and so on? The answer is Computer Vision, one of the many fields of Artificial Intelligence. The advancements in Computer Vision with Deep Learning have been constructed and perfected over time, primarily around one particular algorithm: the Convolutional Neural Network.

Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are analogous to traditional ANNs.

In an ANN, the input passes through a series of hidden layers. Each layer is made up of a set of neurons, and every layer is fully connected to the neurons of the previous layer. The notable difference is that CNNs replace this full connectivity with convolutional layers that exploit the spatial structure of the input, which is why they are primarily used in the field of computer vision.

Artificial Neural Network

Why not ANNs for image related tasks?

Well, the answer is simple:

  1. Computational cost: a fully connected ANN on raw pixels needs a huge number of parameters, so it demands high computational power.
  2. Accuracy: CNNs tend to give better accuracy than ANNs on image tasks.

But why does it matter? Surely we could just increase the number of hidden layers in our network, and perhaps increase the number of neurons within them?

The simple answer is no, for two reasons:

  1. We do not have unlimited computational power and time to train such huge ANNs.
  2. We want to stop, or at least reduce, overfitting. Overfitting happens when a model learns the noise present in the training data to such an extent that it negatively impacts performance on new data. It is especially likely when we have only a small amount of data for our model to train on.

CNN Architecture

CNN Architecture for handwritten digits

Overall Architecture

  1. The input layer holds the raw pixel values of the image.
  2. The convolutional layer determines the output of neurons that are connected to local regions of the input, by calculating the scalar product between their weights and the connected region of the input volume. The rectified linear unit (commonly shortened to ReLU) then applies an elementwise activation function, max(0, x), to the output produced by the previous layer.
  3. The pooling layer then performs downsampling along the spatial dimensions of the given input, further reducing the number of parameters within that activation.
  4. The fully-connected layers then perform the same duties found in standard ANNs and attempt to produce class scores from the activations, to be used for classification. ReLU may also be used between these layers to improve performance.

Through this simple sequence of transformations, CNNs transform the original input layer by layer, using convolution and downsampling, to produce class scores for classification and regression purposes.
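The layer-by-layer transformation above can be traced numerically. A minimal sketch, assuming a 28x28 MNIST-style input, one 3x3 convolutional layer with 8 filters, and one 2x2 pooling step (all sizes here are illustrative assumptions, not taken from this article):

```python
def conv_out(n, f, s=1, p=0):
    """Output size of an f x f convolution with stride s and padding p."""
    return (n + 2 * p - f) // s + 1

n = 28                        # input: a 28x28 grayscale digit
n = conv_out(n, f=3)          # 3x3 conv, stride 1, no padding -> 26x26
n = n // 2                    # 2x2 max pooling halves each dimension -> 13x13
channels = 8                  # assumed number of learned filters
flat = n * n * channels       # flattened vector fed to the fully-connected layer
print(n, flat)                # 13 1352
```

Each stage shrinks the spatial size while (in a real network) the channel count grows, until the result is flattened for classification.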

Input to CNN

RGB image(4,4,3)

The image on the L.H.S. is an RGB image, i.e. the image is separated into three color channels: Red, Green, and Blue. An image can also be represented in other color spaces, for example HSV or CMYK.

The layers within a CNN are comprised of neurons organised into three dimensions: the spatial dimensions of the input (height and width) and the depth, where depth is the number of channels.
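As a quick sketch, a 4x4 RGB image like the one above is just an array of shape (height, width, depth); the random pixel values here are placeholders:

```python
import numpy as np

# A toy 4x4 RGB image: height x width x depth (3 color channels).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8)

height, width, depth = image.shape
print(height, width, depth)     # 4 4 3
print(image[..., 0].shape)      # the red channel alone: (4, 4)
```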

Convolutional layer

what is convolution?

Convolution on image

matrix (f = (5,5,1)) * kernel/filter (g = (3,3,1)) = convolved feature

Mathematical definition:

Convolution is a mathematical operation on two functions (f and g) that produces a third function expressing how the shape of one is modified by the other. The term convolution refers both to the resulting function and to the process of computing it.

Intuitive definition:

Convolution is the first layer used to extract features from an input image. It preserves the relationship between pixels by learning image features over small squares of input data. It is a mathematical operation that takes two inputs: an image matrix (f) and a filter or kernel (g).
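A minimal sketch of this operation in NumPy, with the loops kept explicit for clarity (note: like most deep-learning libraries, this computes cross-correlation, i.e. the kernel is not flipped):

```python
import numpy as np

def conv2d(f, g):
    """Slide kernel g over matrix f (stride 1, no padding) and
    return the convolved feature map."""
    fh, fw = f.shape
    gh, gw = g.shape
    out = np.zeros((fh - gh + 1, fw - gw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # elementwise product of the patch and the kernel, then sum
            out[i, j] = np.sum(f[i:i + gh, j:j + gw] * g)
    return out

f = np.arange(25).reshape(5, 5)   # 5x5 input matrix
g = np.ones((3, 3))               # 3x3 kernel
feature = conv2d(f, g)
print(feature.shape)              # (3, 3): the kernel fits in 9 positions
```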

Why does the kernel (g) shift over the matrix (f) a total of 9 times, moving left to right and top to bottom?

It happens because the stride length = 1: with no padding, a 3x3 kernel fits into a 5x5 input at 3 x 3 = 9 positions.

Now what in the world is stride?

Stride is the number of pixels by which the filter shifts over the input matrix. When the stride is 1, we move the filter 1 pixel at a time; when the stride is 2, we move it 2 pixels at a time, and so on.

A larger stride reduces the size of the output matrix, which is why striding is also used to compress image and video data.
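A small sketch showing how the stride changes the number of positions the kernel visits (explicit loops for clarity; a uniform input and kernel are used so the values are easy to check):

```python
import numpy as np

def conv2d_strided(f, g, stride=1):
    """Slide kernel g over f, moving `stride` pixels at a time."""
    fh, fw = f.shape
    gh, gw = g.shape
    oh = (fh - gh) // stride + 1
    ow = (fw - gw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            r, c = i * stride, j * stride
            out[i, j] = np.sum(f[r:r + gh, c:c + gw] * g)
    return out

f = np.ones((5, 5))
g = np.ones((3, 3))
print(conv2d_strided(f, g, stride=1).shape)   # (3, 3): 9 positions
print(conv2d_strided(f, g, stride=2).shape)   # (2, 2): fewer, bigger jumps
```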

Let’s look at more practical example,

In this zoomed-in image of the dog, we first start with the patch outlined in red. The width and height of our filter define the size of this square.

We then move the square over to the right by a given stride (2 in this case) to get another patch.

What if image has multiple channels(eg.RGB)?

multiple channels conv operation

In this case we need to perform the convolution operation on each channel with its respective kernel separately, and then add all the results plus a bias to give us a squashed, one-depth-channel convolved feature output.
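A minimal sketch of the multi-channel case, assuming a toy 5x5 RGB input and one 3x3 kernel slice per channel (random values for illustration):

```python
import numpy as np

def conv2d_multichannel(image, kernels, bias=0.0):
    """Convolve each channel with its own kernel slice, then sum the
    per-channel results plus a bias into one single-depth feature map."""
    h, w, c = image.shape
    kh, kw, kc = kernels.shape
    assert c == kc, "one kernel slice per input channel"
    out = np.zeros((h - kh + 1, w - kw + 1))
    for ch in range(c):
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] += np.sum(image[i:i + kh, j:j + kw, ch]
                                    * kernels[:, :, ch])
    return out + bias

rng = np.random.default_rng(1)
rgb = rng.random((5, 5, 3))       # toy RGB input
k = rng.random((3, 3, 3))         # one 3x3 kernel per channel
feature = conv2d_multichannel(rgb, k, bias=0.1)
print(feature.shape)              # (3, 3): depth squashed to 1
```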

Why do we need convolution, and what is its objective?

feature extraction at each layer

As shown in the image above, we perform the convolution operation to extract features from the image. In the first layer we extract low-level features such as edges, colors, and gradient orientations. As we go deeper into the convolutional layers we get higher-level features; for example, if the input image is a dog, after some layers we start getting features like ears and eyes.

What if the filter/kernel does not fit the input image properly?

We need to do something called Padding.

Zero padding — pad the picture with zeros around the border so that the filter fits.

zero padding

Same padding — pad so that the output has the same spatial size as the input.

Same padding

Valid padding — drop the part of the image where the filter does not fit.

Valid padding
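As a rough sketch with NumPy, the two options differ only in whether we pad before sliding the filter (sizes assume the earlier 5x5 input and a 3x3 kernel):

```python
import numpy as np

f = np.arange(25, dtype=float).reshape(5, 5)
kernel_size = 3

# "Valid": no padding, the filter only visits positions where it fully fits.
valid_out = f.shape[0] - kernel_size + 1          # 5 - 3 + 1 = 3

# "Same": zero-pad so the output keeps the input's spatial size.
pad = (kernel_size - 1) // 2                      # one ring of zeros for 3x3
padded = np.pad(f, pad_width=pad, mode="constant", constant_values=0)
same_out = padded.shape[0] - kernel_size + 1      # 7 - 3 + 1 = 5

print(valid_out, same_out)    # 3 5
```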

Pooling Layer

Why we need it?

We need pooling when the image dimensions are high: it reduces the size of the feature maps while keeping only the dominant features. In other words, it is responsible for reducing the spatial size of the convolved feature, which decreases the computational power required to process the data through dimensionality reduction. Pooling comes in different types:

  1. Max pooling
  2. Mean (average) pooling
Max and Mean Pooling
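Both pooling types can be sketched with NumPy reshaping. This is a minimal non-overlapping pooling helper (stride equal to the window size), not a production implementation:

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping size x size pooling (stride = size)."""
    h, w = x.shape
    x = x[:h - h % size, :w - w % size]        # trim to a multiple of size
    blocks = x.reshape(h // size, size, w // size, size)
    # reduce each size x size block to a single value
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

x = np.array([[1., 3., 2., 4.],
              [5., 7., 6., 8.],
              [9., 2., 1., 0.],
              [3., 4., 5., 6.]])
print(pool2d(x, mode="max"))    # [[7. 8.] [9. 6.]]
print(pool2d(x, mode="mean"))   # [[4.  5. ] [4.5 3. ]]
```

Max pooling keeps the strongest activation in each window, while mean pooling averages it; both halve each spatial dimension here.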

Fully-Connected Layer

The fully-connected layer contains neurons that are directly connected to the neurons in the two adjacent layers, without being connected to any neurons within their own layer.

We flatten the matrix into a vector and feed it to the fully-connected layer. As you can see in the diagram above, after flattening we combine these features together, and over a series of epochs the model learns to distinguish between dominating and low-level features in images and classifies them using the softmax classification technique.
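The flatten-and-classify step can be sketched in a few lines of NumPy. The sizes (2x2x8 pooled feature maps, 10 classes) and the random weights are assumptions for illustration; a real network would learn W and b during training:

```python
import numpy as np

rng = np.random.default_rng(2)

pooled = rng.random((2, 2, 8))         # toy pooled feature maps
x = pooled.reshape(-1)                 # flatten to a 32-dimensional vector

W = rng.standard_normal((10, x.size))  # fully-connected weights, 10 classes
b = np.zeros(10)
logits = W @ x + b

def softmax(z):
    z = z - z.max()                    # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

probs = softmax(logits)
print(probs.shape, round(probs.sum(), 6))   # (10,) 1.0
print(int(np.argmax(probs)))                # predicted class index
```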

There are various CNN architectures available, each with their own pros and cons. Some of them are given below:

  • LeNet
  • AlexNet
  • VGGNet
  • GoogLeNet
  • ResNet
  • Inception

I will definitely cover them in my future blog posts so stay tuned!

Conclusion

Convolutional Neural Networks differ from other forms of Artificial Neural Network in that, instead of focusing on the entirety of the problem domain, they exploit knowledge about the specific type of input. This in turn allows a much simpler network architecture to be set up.

This blog covered the basic concepts of Convolutional Neural Networks, explaining the layers required to build one and detailing how best to structure the network for most image-analysis tasks.


Thank You!
