Vishaal Saravanan
Published in Analytics Vidhya · Apr 12, 2020 · 12 min read


So what is going on with the world right now? How are robots suddenly able to see things, process them, and classify them more accurately than humans? Can we soon expect robots to play hide and seek with us? Are we nearing judgment day?

Also, what is the real buzz that’s going on in the field of Computer Vision and CNNs?

A whole bunch of such questions must be running through our minds right now. Before answering them in the simplest way possible, it is strongly recommended to grab a cup of black coffee, sit back, and let the neural nets in our own brains do the rest.

Yes, computers are getting better day by day, that is for sure! Is not everything in the world getting better, or at least trying to become a slightly better version of what it already is? So is the field of Artificial Intelligence. It keeps evolving with more efficient ways to learn and adapt in this fast-growing world, driven by the massive velocity of data available in this data-driven age.

The field of Artificial Intelligence has existed since the early 1950s, when John McCarthy first coined the term “Artificial Intelligence”; he was later described as one of the founding fathers of A.I. However, one can strongly argue that humans throughout history have continually worked on automating things to make daily life more comfortable and to reduce errors and risks to humans. In the past, some experts have expressed concern about A.I.; nevertheless, research today will help us better prepare for and prevent potential negative consequences in the future, so that we can enjoy the benefits of A.I. while avoiding its pitfalls.

The truth lies midway between these two views: in some areas, A.I. will make jobs easier and more comfortable for humans, while in others, robots and intelligent machines will replace humans in less safe jobs. However, this constructive A.I. takeover in certain areas does not threaten human employment as a whole. Automation is the real essence of the modern-day revolution, and not all careers get destroyed; indeed, it means the creation of a lot of new job opportunities.

Image source: https://techinsight.com.vn/language/en/sparkling-profile-of-3-god-fathers-of-ai/

The increase in computational power of modern-day systems, the manufacturing capacity to mass-produce system components at comparatively low cost, and the revolutionary work of scientists like Geoffrey Hinton, Yann LeCun, Yoshua Bengio, and many others have paved the way for more powerful A.I. to come into play.

Now let us dive into the field of Computer Vision and the underlying neural concept behind it. The answer lies in how we, as humans, process the things we see in the real world. We should all be aware of how our brain functions; for those who are not, here is a simpler explanation. The human brain is a very complex structure. It continually communicates and coordinates information by sending electrical pulses through simple unit structures called neurons. Billions of such neurons attach to one another through synapses (imagine people (neurons) joining hands (connecting through dendrites) with each other) for the proper transfer of information. They regularly update themselves through these signals to produce specific responses in the brain, which in turn are responsible for our daily activities.

The concept of deploying brain-like behavior directly in the field of Artificial Intelligence is called Neural Networks. There are several types of Neural Networks, such as Artificial Neural Networks, Convolutional Neural Networks, and Recurrent Neural Networks, and most of the time one depends on another for an effective result. Today, we will concentrate on convolutional neural nets (aka CNNs) and their role in image processing.

How does the computer look at an image?

When a computer looks at an image (takes an image as input), it sees an array of pixel values. Depending on the resolution and size of the image, it might see, say, a 32 x 32 x 3 array of numbers (the 3 refers to the RGB channels). To drive home the point, suppose we have a color image in JPG form of size 480 x 480; the representative array will be 480 x 480 x 3. Each of these numbers takes a value from 0 to 255, describing the pixel intensity at that point. These numbers, while meaningless to us when we perform image classification, are the only inputs available to the computer. The idea is that we give the computer this array of numbers, and it outputs numbers describing the probability of the image being a specific class (.80 for cat, .15 for dog, .05 for bird, and so on).
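To make this concrete, here is a minimal sketch (assuming Pillow and NumPy are installed; "boat.jpg" is just a placeholder filename) of how a program sees an image as nothing more than an array of numbers:

```python
import numpy as np
from PIL import Image

# Load a colour JPG and look at it the way the computer does:
# a 3-D array of pixel intensities, one value per channel per pixel.
img = Image.open("boat.jpg").convert("RGB")   # "boat.jpg" is a placeholder
pixels = np.array(img)

print(pixels.shape)   # e.g. (480, 480, 3): height x width x RGB channels
print(pixels.dtype)   # uint8: every value lies between 0 and 255
print(pixels[0, 0])   # intensity of the top-left pixel, e.g. [142 167 201]
```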

What We Want the Computer to Do

Now that we know the problem with the inputs and outputs, let us consider how to approach it. We want the computer to be able to differentiate between all the images supplied and figure out the unique features that make a dog a dog or a cat a cat. That is the process that goes on in our minds subconsciously as well. When we look at a picture of a dog, we can classify it through identifiable features such as paws or four legs. Similarly, the computer can perform image classification by looking for low-level features such as edges and curves and then building up to more abstract concepts through a series of convolutional layers.

CNN Architecture

There are four main phases in the Convolutional Network shown above:

  1. Convolution
  2. Non Linearity (ReLU)
  3. Pooling or Sub Sampling
  4. Classification (Fully Connected Layer)
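To give a feel for how these four phases stack up in practice, here is a minimal sketch of such a network written with tf.keras; the filter counts, layer sizes, and the 32 x 32 x 3 input are illustrative assumptions, not the exact architecture in the figure:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# A toy four-phase ConvNet: Convolution -> ReLU -> Pooling (twice),
# then a Fully Connected classifier with a softmax output.
model = models.Sequential([
    layers.Conv2D(6, (3, 3), activation="relu", input_shape=(32, 32, 3)),  # convolution + ReLU
    layers.MaxPooling2D((2, 2)),                                           # pooling / sub-sampling
    layers.Conv2D(6, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                                                      # feature maps -> vector
    layers.Dense(64, activation="relu"),
    layers.Dense(4, activation="softmax"),                                 # 4 class probabilities
])
model.summary()
```

Each of these building blocks is unpacked in the four phases below.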

Phase 1: Convolution

What is Convolution?

The easiest way to understand a convolution is to think of it as a sliding window function applied to a matrix. That is a mouthful, but it becomes quite clear when looking at a visualization:

Imagine that the matrix on the left represents a black-and-white image. Each entry corresponds to one pixel, 0 for black and 1 for white (typically values range from 0 to 255 for grayscale images). The sliding window is called a kernel, filter, or feature detector. Here we use a 3×3 filter, multiply its values element-wise with the values of the original matrix beneath it, then sum them up. To get the full convolution, we repeat this process for each element by sliding the filter over the whole matrix.
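As a rough illustration, a plain "valid" convolution (no padding, written for clarity rather than speed) could be sketched in NumPy like this:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image`, multiplying element-wise and summing each window."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = image[i:i + kh, j:j + kw]   # the patch currently under the filter
            out[i, j] = np.sum(window * kernel)  # element-wise multiply, then sum
    return out
```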

One may be wondering what this is useful for. Here are some intuitive examples. Taking the difference between a pixel and its neighbors detects edges:

Convolution with a 3×3 Filter. Source: http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution

(To understand this intuitively, think about what happens in parts of the image that are smooth, where a pixel's color is equal to that of its neighbors: the additions cancel and the resulting value is 0, or black. If there is a sharp edge in intensity, for example a transition from white to black, there is a massive difference and a resulting white value.)
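Reusing the convolve2d sketch from above, an edge-detecting filter of this kind could be applied as follows (the kernel values and the random toy image are illustrative assumptions):

```python
import numpy as np

# A filter that compares each pixel with its eight neighbours:
# smooth regions cancel out to roughly 0 (black), sharp transitions give large values (white).
edge_kernel = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]], dtype=float)

image = np.random.randint(0, 256, size=(10, 10)).astype(float)  # toy grayscale image
edges = convolve2d(image, edge_kernel)                           # defined in the sketch above
print(edges.shape)                                               # (8, 8) with a 'valid' convolution
```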

The picture of the Taj Mahal expressed as a convolved matrix.

Phase 2: Non-Linearity (ReLU)

An additional operation called ReLU is used after every convolution operation. ReLU stands for Rectified Linear Unit and is a non-linear operation. Its output is given by f(x) = max(0, x):

The ReLU operation

ReLU is an element-wise operation (applied per pixel) that replaces all negative pixel values in the feature map with zero. The purpose of ReLU is to introduce non-linearity into our ConvNet, since most of the real-world data we would want our ConvNet to learn is non-linear. (Convolution is a linear operation, just element-wise matrix multiplication and addition, so we account for non-linearity by introducing a non-linear function like ReLU.)

The ReLU operation can be understood clearly from the image below, which shows ReLU applied to one of the feature maps obtained above. The output feature map here is referred to as the 'Rectified' feature map.

Other non-linear functions such as tanh or sigmoid can also be used instead of ReLU, but ReLU has been found to perform better in most situations.
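In code, ReLU is essentially a one-liner; here is a minimal NumPy sketch applied to a toy feature map:

```python
import numpy as np

def relu(feature_map):
    # Element-wise: keep positive values, replace every negative value with zero.
    return np.maximum(0, feature_map)

feature_map = np.array([[ 3.0, -1.5],
                        [-0.2,  4.0]])
print(relu(feature_map))   # [[3. 0.]
                           #  [0. 4.]]
```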

Phase 3:The Pooling Step

Spatial Pooling (also called subsampling or downsampling) reduces each feature map’s dimensionality and retains the most critical information. Spatial Pooling can be of different types: Max, Average, Sum, and many others.

In the case of Max Pooling, we define a spatial neighborhood (for example, a 2×2 window) and take the largest element from the rectified feature map within that window. Instead of taking the largest element, we could also take the average (Average Pooling) or the sum of all elements in that window. In practice, Max Pooling has been shown to work better and is therefore preferred.

The following image shows an example of Max Pooling operation on a Rectified Feature map (obtained after convolution + ReLU operation) by using a 2×2 window.

Max Pooling

We slide our 2 x 2 window by two cells (a 'stride' of 2) and take the maximum value in each region. As shown in the Figure, this reduces the dimensionality of our feature map.
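A minimal NumPy sketch of Max Pooling with a 2×2 window and a stride of 2 (the toy feature map values are made up for illustration):

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Take the maximum value in each `size` x `size` window, moving by `stride`."""
    h, w = feature_map.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

rectified = np.array([[1, 1, 2, 4],
                      [5, 6, 7, 8],
                      [3, 2, 1, 0],
                      [1, 2, 3, 4]], dtype=float)
print(max_pool(rectified))   # [[6. 8.]
                             #  [3. 4.]]
```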

In the network shown, the pooling operation is applied separately to each feature map (notice that, because of this, we get three output maps from three input maps).

Pooling applied to Rectified Feature Maps.

The function of Pooling is to reduce the spatial size of the input representation progressively. In particular,

· Pooling makes the input representations (feature dimension) smaller and more manageable.

· It reduces the number of parameters and computations in the network, therefore helping to control overfitting.

· Makes the network invariant to small transformations, distortions, and translations in the input image (a small distortion in the input will not change the output of Pooling — since we take the maximum/average value in a local neighborhood).

· And it helps us arrive at an almost scale-invariant representation of our image (the exact term is “equivariant”). This is compelling because we can detect objects in an image irrespective of their location.

Story so far

So far, we have seen how Convolution, ReLU, and Pooling work. It is essential to understand that these layers are the basic building blocks of any CNN. As shown in the above Figure, we have two sets of Convolution, ReLU & Pooling layers — the 2nd Convolution layer performs convolution on the output of the first Pooling Layer using six filters to produce a total of six feature maps. ReLU is then applied individually on all of these six feature maps. We then perform Max Pooling operation separately on each of the six rectified feature maps.

Together these layers extract the useful features from the images, introduce non-linearity in our network, and reduce feature dimension while aiming to make the features somewhat equivariant to scale and translation.

The output of the 2nd Pooling Layer acts as the input to the Fully Connected Layer; the next section addresses the concept of a fully connected layer.

Phase 4: Fully Connected Layer

The Fully Connected layer is a traditional Multi-Layer Perceptron that uses a softmax activation function in the output layer (other classifiers like SVM can also be used, but we will stick to softmax in this post). The term “Fully Connected” implies that every neuron in the previous layer is connected to every neuron in the next layer.

The output from the convolutional and pooling layers represents high-level features of the input image. The purpose of the Fully Connected layer is to use these features to classify the input image into various classes based on the training dataset. For example, the image classification task we set out to perform has four possible outputs, as shown in the Figure below (note that the Figure does not show the connections between the nodes in the fully connected layer).

Fully Connected Layer: each node is connected to every node in the adjacent layer

Apart from classification, adding a fully-connected layer is also a (usually) cheap way of learning non-linear combinations of these features. Most of the features from convolutional and pooling layers may be suitable for the classification task, but combinations of those features might be even better.

The sum of output probabilities from the Fully Connected Layer is 1. This is ensured by using Softmax as the activation function in the Fully Connected Layer's output layer. The Softmax function takes a vector of arbitrary real-valued scores and squashes it to a vector of values between zero and one that sum to one.
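A minimal NumPy sketch of the Softmax function (the example scores are arbitrary illustrative values):

```python
import numpy as np

def softmax(scores):
    # Shift by the max for numerical stability, exponentiate, then normalise to sum to 1.
    exp = np.exp(scores - np.max(scores))
    return exp / exp.sum()

scores = np.array([2.0, 1.0, 0.1, -1.0])   # arbitrary real-valued class scores
probs = softmax(scores)
print(probs)        # approx. [0.64 0.23 0.10 0.03]
print(probs.sum())  # 1.0
```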

Putting it all together — Training using Backpropagation.

As discussed above, the Convolution + Pooling layers act as Feature Extractors from the input image while the Fully Connected layer acts as a classifier.

In the Figure below, since the input image is a boat, the target probability is 1 for Boat class and 0 for the other three classes, i.e.

· Input Image = Boat

· Target Vector = [0, 0, 1, 0]

Training the ConvNet

The overall training process of the Convolution Network includes:

  • Step 1: We initialize all filters and parameters/weights with random values.
  • Step 2: The network takes a training image as input, goes through the forward propagation step (convolution, ReLU, and Pooling operations along with forward propagation in the Fully Connected layer), and finds the output probabilities for each class.
  • Let us say the output probabilities for the boat image above are [0.2, 0.4, 0.1, 0.3]
  • Since weights get randomly assigned for the first training example, output probabilities are also random.
  • Step 3: Calculate the total error at the output layer (summation over all four classes).
  • Total Error = ∑ ½ (target probability − output probability)²
  • Step 4: Use Backpropagation to calculate the gradients of the error with respect to all weights in the network, and use gradient descent to update all filter values/weights and parameter values to minimize the output error.
  • The weights get adjusted in proportion to their contribution to the total error.
  • When the same image is input again, output probabilities might now be [0.1, 0.1, 0.7, 0.1], which is closer to the target vector [0, 0, 1, 0].
  • Now the network has learned to classify this image correctly by adjusting its weights/filters such that the output error gets reduced.
  • Parameters like the number of filters, the filter sizes, and the architecture of the network are fixed before Step 1 and do not change during the training process; only the values of the filter matrices and connection weights get updated.
  • Step 5: Repeat steps 2–4 with all images in the training set.
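Using the tf.keras model sketched earlier, Steps 1 to 5 could be expressed roughly as follows (x_train and y_train are hypothetical placeholders for your image array and one-hot target vectors; mean squared error is used here only to mirror the error formula above, while categorical cross-entropy is the more common choice in practice):

```python
import tensorflow as tf

# Step 1: the filters/weights were initialised randomly when `model` was built.
# Steps 2-4: compile() sets up the loss and optimiser; fit() runs forward
# propagation, computes the error, backpropagates the gradients, and updates
# the weights with gradient descent.
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss="mean_squared_error",      # mirrors the ½ ∑ (target − output)² idea above
              metrics=["accuracy"])

# Step 5: repeat over every image in the training set, for several epochs.
# x_train: hypothetical array shaped (num_images, 32, 32, 3), scaled to [0, 1]
# y_train: hypothetical one-hot target vectors such as [0, 0, 1, 0] for "boat"
model.fit(x_train, y_train, epochs=10, batch_size=32)
```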

This post should be a good start in understanding CNNs, though it is by no means a comprehensive overview; it aims to give you a running start in the world of CNNs. On that note, an article on the programmatic implementation of CNNs will be published on Medium soon. Until then, stay safe and healthy!

For further references and thanks to:
1. https://techinsight.com.vn/language/en/sparkling-profile-of-3-god-fathers-of-ai/
2. https://cacm.acm.org/magazines/2019/6/236990-neural-net-worth/fulltext?mobile=false
3. https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53
4. https://adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding-Convolutional-Neural-Networks/
5. https://machinelearningmastery.com/convolutional-layers-for-deep-learning-neural-networks/
6. http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution
7. http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/
8. https://www.freecodecamp.org/news/an-intuitive-guide-to-convolutional-neural-networks-260c2de0a050/
9. https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
