Convolutional Neural Network: Comprehensive Guide

Rabia Fatima
6 min read · Mar 20, 2021

We analyze the world around us without conscious effort. We visualize everything, make predictions, generate hypotheses, and act on them accordingly. Looking at a picture of, say, Lucy sitting beside Bella at a keyboard, we might assume that Lucy is helping Bella learn the piano. That is something we do every day, subconsciously: we perceive things as labels, recognize patterns, and make predictions based on them. But how are we able to interpret these things so effortlessly? The answer lies in two of nature's most essential gifts: vision and intellect. Whatever we see around us, the brain is responsible for making sense of it. The complex structure of neurons helps us create labels and predict situations. It feels so natural that we hardly think about it. But what if we want to transfer these interpretive abilities to a machine? This is where a convolutional neural network comes into the picture.

The learning process is similar to that of a child who has just begun to learn about the world. To teach a child about an apple, we show them the apple, or images of it, many times until they can remember it and label it as an apple. We can do the same for machines. However, machines only understand numbers. Luckily, advances in computer vision and deep learning have produced a range of practical algorithms, and one in particular stands out: the convolutional neural network.

What is a CNN:

A convolutional neural network, or ConvNet, is a type of algorithm that takes an image as input, assigns learnable weights to different aspects of it, and performs classification.
It differs somewhat from standard classification networks. A regular neural network processes the input image by passing it through a stack of hidden layers. Each layer consists of many neurons, which in turn are connected to the neurons of the next layer, and the final fully connected layer produces the prediction. A ConvNet, by contrast, arranges its layers in three dimensions: height, width, and depth. The general architecture of a convolutional neural network is analogous to the connectivity pattern of neurons in the brain. It can capture both the spatial and temporal dependencies in an image using relevant filters, with far less preprocessing than classical methods. It is particularly well suited to image classification because its weights are reused across the image, which greatly reduces the number of parameters.

Classic CNN Architecture:

  1. Convolutional layer
  2. Activation operation (ReLU)
  3. Pooling layer
  4. Some other layers depending on the requirements
  5. Fully connected layer
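
To make this stacking concrete, here is a minimal sketch of such an architecture using the Keras API. The filter counts, the 64x64x3 input size, and the ten-class output are illustrative assumptions, not values taken from this article.

```python
# A minimal sketch of the classic Conv -> ReLU -> Pool -> FC stacking (illustrative sizes).
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(64, 64, 3)),  # convolution + ReLU
    layers.MaxPooling2D((2, 2)),                                            # pooling layer
    layers.Conv2D(64, (3, 3), activation="relu"),                           # an additional conv layer
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                                                       # 3-D feature maps -> 1-D vector
    layers.Dense(64, activation="relu"),                                    # fully connected layer
    layers.Dense(10, activation="softmax"),                                 # classification output
])
model.summary()
```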


1. Convolutional Layer:

First of all, let's take an image of dimensions 5x5x3 pixels: the height and width are five, and the depth, which corresponds to the RGB channels, is three. In a ConvNet, the convolution is performed on the input data using a kernel, or filter. A 3x3 kernel fits over the 5x5 image in nine positions; at each position it multiplies the filter (k) element-wise with the covered portion of the input image and sums the result. The filter slides across the image, and across all RGB channels, until the image is completely traversed. This convolution produces an output called a convolved feature, or feature map. In the animations linked below, the moving filter (green) slides over the input image (blue) to generate the feature map (red).

Kernel =
1 0 1
0 1 0
1 0 1

Source: https://towardsdatascience.com/applied-deep-learning-part-4-convolutional-neural-networks-584bc134c1e2

Source: https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53

In the animation above, the filter parses all three RGB channels with valid padding, producing one result per channel; the three results are then summed to form the convolved feature matrix. A bias (intercept) is added to this linear operation to adjust the output.
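
To make those nine positions concrete, here is a minimal hand-rolled convolution of a single 5x5 channel with the 3x3 kernel shown above, using made-up pixel values, valid padding, and a stride of one:

```python
# Hand-rolled 2-D convolution of one 5x5 channel with a 3x3 kernel (valid padding, stride 1).
# The pixel values are made up for illustration.
import numpy as np

image = np.array([
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
])
kernel = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
])

out = np.zeros((3, 3))
for i in range(3):          # the 3x3 kernel fits over the 5x5 image in 3 x 3 = 9 positions
    for j in range(3):
        out[i, j] = np.sum(image[i:i + 3, j:j + 3] * kernel)

print(out)  # the 3x3 convolved feature for this channel
# For an RGB image, the same sliding is repeated on each of the three channels,
# and the three results are summed (plus a bias) into a single feature map.
```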

Same padding (left) vs. valid padding (right)

The kernel depth always matches the depth of the input image, so for an RGB input the kernel has a depth of three. The purpose of the convolutional layers is to extract features from the input image. Typically, the initial layers of a ConvNet capture low-level features such as colors, edges, and gradient orientation, while the deeper layers learn high-level features such as object parts and faces, giving the network a wholesome understanding of the images in the dataset.

There are two kinds of results we may obtain from a convolution: in one, the dimensionality of the convolved feature is reduced relative to the input; in the other, it stays the same. Padding determines which one we get. With valid padding (no padding), convolving a 5x5x1 image with a 3x3x1 kernel gives a 3x3x1 feature map. With same padding, the 5x5x1 image is first zero-padded to 7x7x1, and the same 3x3x1 kernel then produces a 5x5x1 output, preserving the original spatial size.
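
The two cases follow the standard convolution arithmetic: output size = (input + 2 × padding − kernel) / stride + 1. A small helper (my own illustration, not code from the article) makes this explicit:

```python
# Standard formula for the spatial output size of a convolution.
def conv_output_size(in_size, kernel, stride=1, pad=0):
    return (in_size + 2 * pad - kernel) // stride + 1

print(conv_output_size(5, 3, pad=0))  # valid padding: 3 -> a 3x3x1 feature map
print(conv_output_size(5, 3, pad=1))  # same padding:  5 -> a 5x5x1 feature map
```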

Source: https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53

2. ReLU Layer:

For a neural network to be robust, it needs non-linearity. We therefore pass the convolved feature matrix through the ReLU activation function, which outputs the input directly if it is positive and zero otherwise. Compared with saturating activations such as the sigmoid, ReLU also helps mitigate the vanishing gradient problem, allowing the network to reach its full potential.
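
In code, ReLU is simply an element-wise maximum with zero; a tiny NumPy sketch:

```python
# ReLU: pass positive values through unchanged, replace everything else with zero.
import numpy as np

def relu(x):
    return np.maximum(0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))  # [0.  0.  0.  1.5 3. ]
```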

3. Pooling Layer:

Like the convolutional layer, the pooling layer is equally important. It is intended to reduce the spatial size of the feature map. Besides lowering the computational power required to process the data, it extracts dominant features that are largely invariant to rotation and position, and it speeds up image processing.
Two pooling methods are generally used: average pooling and max pooling. Average pooling returns the average of all the values covered by the kernel in that portion of the image, while max pooling returns the maximum value.
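
The difference is easy to see on a small example. Here is a sketch of 2x2 max and average pooling with a stride of two over a 4x4 feature map with made-up values:

```python
# 2x2 max pooling and average pooling with stride 2 over a 4x4 feature map (made-up values).
import numpy as np

fmap = np.array([
    [1, 3, 2, 1],
    [4, 6, 5, 2],
    [7, 8, 3, 0],
    [2, 1, 9, 4],
])

max_pool = np.zeros((2, 2))
avg_pool = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        window = fmap[2 * i:2 * i + 2, 2 * j:2 * j + 2]
        max_pool[i, j] = window.max()    # max pooling keeps the strongest response
        avg_pool[i, j] = window.mean()   # average pooling keeps the mean response

print(max_pool)  # [[6. 5.]  [8. 9.]]
print(avg_pool)  # [[3.5 2.5] [4.5 4. ]]
```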

The total number of layers in a convolutional network may grow with the complexity of the images, at the cost of more computational power.
At this point we have enabled the model to learn features on its own. Next, we will look at the final classification part of the ConvNet, which begins by flattening the final output.

4. Fully Connected Layer:

After the convolution and pooling layers, we add a few more layers to wrap up the CNN architecture. The fully connected (FC) layer only accepts one-dimensional data, so we first flatten the three-dimensional feature maps into a one-dimensional vector. The flattened output is then fed into the fully connected neural network, which is trained with backpropagation. In the end, the model is able to distinguish high-level features and perform the classification.
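
For example, a 3-D block of feature maps is simply reshaped into a vector before the dense layers; the 4x4x8 shape below is an arbitrary illustration:

```python
# Flattening a 3-D block of feature maps into a 1-D vector for the fully connected layer.
# The 4x4x8 shape is an arbitrary example.
import numpy as np

features = np.random.rand(4, 4, 8)   # height x width x channels
flat = features.reshape(-1)          # plays the same role as layers.Flatten() in Keras
print(flat.shape)                    # (128,)
```
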
There are several well-known CNN architectures that have driven much of the progress in computer vision, including:

  • LeNet
  • ResNet
  • ZFNet
  • VGGNet

Conclusion:

ConvNets are used mainly for image classification and recognition, and they are an integral part of many image-based platforms such as Google Images and Amazon Rekognition. The general architecture of a CNN is simple, but it can be adapted as the complexity of the images increases.
The Breed Classifier is a notable example of computer vision in practice by Xehen AI, in which convolutional neural networks were used to classify up to 200 dog breeds. You can learn more about the project here: https://xehen.ai/project/petmypal?id=700

If you want to learn more about AI, take a look at our publications. We also provide FREE AI consultation to clients who want our AI services. Visit our website now to get a quote: https://xehen.ai/

