Convolutional Neural Networks-An Intuitive approach-Part 1

Niketh Narasimhan · Published in Analytics Vidhya · Jul 27, 2020 · 8 min read

A simple yet comprehensive approach to the concepts

Convolutional Neural Networks

Artificial intelligence has seen tremendous growth over the last few years, and the gap between machines and humans is slowly but steadily decreasing. One important difference between humans and machines is (or rather was!) the human perception of images and sound. How do we train a machine to recognize images and sound as we do?

At this point we can ask ourselves a few questions!!!!

How would machines perceive images and sound?

How would machines differentiate between different images, for example between a cat and a dog?

Can machines identify and differentiate between different human beings, for example tell a male from a female, or recognize Leonardo DiCaprio or Brad Pitt just by feeding their images to them?

Let’s attempt to find out!!!

The Colour coding system:

Let's get a basic idea of what the colour coding system for machines is.

RGB decimal system: It is denoted as rgb(255, 0, 0) and consists of three channels representing RED, GREEN and BLUE respectively. RGB defines how much red, green or blue you'd like to have displayed, as a decimal value somewhere between 0, which is no representation of the colour, and 255, the highest possible concentration of the colour. So, in the example rgb(255, 0, 0), we'd get a very bright red. If we wanted all green, our RGB would be rgb(0, 255, 0). For a simple blue, it would be rgb(0, 0, 255). Since all colours can be obtained as a combination of red, green and blue, we can obtain the coding for any colour we want.
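As a quick illustration, here is a minimal numpy sketch (the pixel values are just the examples above) of how such colour triplets look as arrays of numbers:

```python
import numpy as np

# Each pixel is an (R, G, B) triplet with values from 0 to 255.
red_pixel   = np.array([255, 0, 0], dtype=np.uint8)   # rgb(255, 0, 0)
green_pixel = np.array([0, 255, 0], dtype=np.uint8)   # rgb(0, 255, 0)
blue_pixel  = np.array([0, 0, 255], dtype=np.uint8)   # rgb(0, 0, 255)

# Mixing channels gives other colours, e.g. red + green gives yellow.
yellow_pixel = np.array([255, 255, 0], dtype=np.uint8)
print(red_pixel, green_pixel, blue_pixel, yellow_pixel)
```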

Gray scale: Gray scale consists of just one channel (0 to 255), with 0 representing black and 255 representing white. The values in between represent the different shades of gray.

Computers ‘see’ in a different way than we do. Their world consists of only numbers.

Every image can be represented as an array of numbers, known as pixels: a single 2-dimensional array for a grayscale image, and one such array per colour channel for a colour image.

But the fact that they perceive images in a different way, doesn’t mean we can’t train them to recognize patterns, like we do. We just have to think of what an image is in a different way.
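For example, here is a rough sketch (with made-up pixel values) of how a tiny grayscale image and a tiny RGB image look as arrays of numbers:

```python
import numpy as np

# A 4x4 grayscale image: one channel, values 0 (black) to 255 (white).
gray_image = np.array([
    [  0,  50, 100, 150],
    [ 50, 100, 150, 200],
    [100, 150, 200, 255],
    [150, 200, 255, 255],
], dtype=np.uint8)
print(gray_image.shape)   # (4, 4) -> height x width

# A 4x4 RGB image: three channels, so the array gains a depth dimension.
rgb_image = np.zeros((4, 4, 3), dtype=np.uint8)
rgb_image[..., 0] = 255   # fill the red channel -> a solid red image
print(rgb_image.shape)    # (4, 4, 3) -> height x width x channels
```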

Now that we have a basic idea of how images can be represented, let us try to understand the architecture of a CNN.

CNN architecture

Convolutional Neural Networks have a different architecture than regular Neural Networks. Regular Neural Networks transform an input by putting it through a series of hidden layers. Every layer is made up of a set of neurons, where each layer is fully connected to all neurons in the layer before. Finally, there is a last fully-connected layer — the output layer — that represents the predictions.

Convolutional Neural Networks are a bit different. First of all, the layers are organised in 3 dimensions: width, height and depth. Further, the neurons in one layer do not connect to all the neurons in the next layer but only to a small region of it. Lastly, the final output will be reduced to a single vector of probability scores, organized along the depth dimension.

ANN vs CNN
A typical CNN architecture

As can be seen above, CNNs have two components:

  • The Hidden layers/Feature extraction part

In this part, the network performs a series of convolution and pooling operations during which the features are detected. If you had a picture of a tiger, this is the part where the network would recognize the stripes, four legs, two eyes, one nose, the distinctive orange colour, etc.

  • The Classification part

Here, the fully connected layers serve as a classifier on top of these extracted features. They assign a probability that the object in the image is what the algorithm predicts it is.

Before we proceed any further, we need to understand what "convolution" means; we will come back to the architecture later.

What do we mean by the “convolution” in Convolutional Neural Networks?

Let us decode!!!

Convolution is a simple mathematical operation which is fundamental to many common image processing operators. Convolution provides a way of 'multiplying together' two arrays of numbers, generally of different sizes but of the same dimensionality, to produce a third array of numbers of the same dimensionality.

The term convolution refers to the mathematical combination of two functions to produce a third function. It merges two sets of information.

In the case of a CNN, the convolution is performed on the input data with the use of a filter or kernel (these terms are used interchangeably) to then produce a feature map.

We execute a convolution by sliding the filter over the input. At every location, an element-wise multiplication is performed and the results are summed into the feature map.

In the animation below, you can see the convolution operation. You can see the filter (the green square) is sliding over our input (the blue square) and the sum of the convolution goes into the feature map (the red square).

The area of our filter is also called the receptive field, named after the neuron cells! The size of this filter is 3x3.

Left: the filter slides over the input. Right: the result is summed and added to the feature map.
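To make the sliding operation concrete, here is a minimal numpy sketch of the same idea (the input and filter values are made up; no padding, stride of 1):

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image; at each location multiply the
    overlapping numbers element-wise and sum them into the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]        # the receptive field
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

image  = np.arange(25).reshape(5, 5)                 # a toy 5x5 input
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])                      # a 3x3 filter
print(convolve2d(image, kernel))                     # a 3x3 feature map
```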

For the sake of explanation, I have shown you the operation in 2D, but in reality convolutions are performed in 3D. In the image below only one layer is shown; there will be many layers stacked next to one another. Each image is represented as a 3D matrix with a dimension for width, height, and depth. Depth is a dimension because of the colour channels used in an image (RGB).

Filter slides over the input and performs its operation on the destination pixel

Let us try and understand the above concept in depth!!!

Convolutions on RGB images

Left: a 2D (grayscale) image. Right: a 3D (RGB) image.

As shown above, the left-hand image is a 2D matrix, which generally represents a grayscale colour coding scheme, while the image on the right uses the RGB colour coding scheme, with the three dimensions representing height, width and the three colour channels (RGB) respectively.

A filter of size 3×3 is used to produce a 4×4 2D output matrix

Let's name them: the first 6 here is the height of the image, the second 6 is the width, and the 3 is the number of channels. Similarly, our filter also has a height, a width and a number of channels. The number of channels in our image must match the number of channels in our filter, so these two numbers have to be equal. The output of this will be a 4×4 image, and notice this is 4×4×1; there's no longer a 3 at the end.

Convolution for RGB

To perform the convolution operation, let us denote the filter (3×3×3) by a cube. The cube has 27 numbers in total. When we slide the cube over the RGB matrix, as shown in the figure above on the left, the first layer of the cube covers 9 numbers of the RED channel, the second layer covers 9 numbers of the GREEN channel exactly adjacent to RED, and the last layer of the filter covers 9 numbers of the BLUE channel exactly adjacent to RED and GREEN. We multiply the corresponding numbers that exactly overlap in the RGB matrix and the filter and add them up to obtain our first output. We keep sliding the filter to obtain the remaining outputs.
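Here is a small numpy sketch of that cube-sliding operation on a 6×6×3 input (the values are made up), producing the 4×4 output described above:

```python
import numpy as np

def convolve_rgb(image, kernel):
    """image: (H, W, 3), kernel: (kh, kw, 3).
    At each position, all 27 overlapping numbers are multiplied
    element-wise and summed into a single output value."""
    kh, kw, _ = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            cube = image[i:i + kh, j:j + kw, :]      # the 3x3x3 region
            output[i, j] = np.sum(cube * kernel)     # 27 multiplications, one sum
    return output

rgb_image = np.random.randint(0, 256, size=(6, 6, 3))   # toy 6x6x3 input
kernel = np.random.randn(3, 3, 3)                        # a 3x3x3 filter
print(convolve_rgb(rgb_image, kernel).shape)             # (4, 4) -> a 4x4x1 output
```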

Feature detection:

Supposing we want to detect the edges of an image, how can we choose a filter?

Detecting the features of an edge in one channel:

We choose the first filter as the 3×3 matrix with rows (1, 0, −1), (1, 0, −1), (1, 0, −1) (as we already did). This can be the filter for the red channel; for the green channel the values will be all zeros, and for the blue channel as well. We stack these three matrices together to form our 3×3×3 filter. This would then be a filter that detects vertical edges, but only in the red channel. This is shown in the image above.

Detecting the features in any colour or all three channels:

Alternatively, if it is not important what colour the vertical edges are, then we might have a filter with 1s and −1s in all three channels (shown in the second example in the image above). In this way we get a 3×3×3 edge detector that detects edges in any colour.

Similarly, different choices of the parameters will result in different feature detectors.
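As a rough sketch of those two filter choices (reusing the convolve_rgb helper sketched earlier), the red-only and the any-colour vertical edge detectors could be built like this:

```python
import numpy as np

# A 3x3 vertical edge pattern: +1 on the left column, -1 on the right.
vertical = np.array([[1, 0, -1],
                     [1, 0, -1],
                     [1, 0, -1]])

# Filter 1: detects vertical edges only in the red channel.
red_only_filter = np.zeros((3, 3, 3))
red_only_filter[:, :, 0] = vertical            # red channel gets the pattern
                                               # green and blue stay all zeros

# Filter 2: detects vertical edges in any colour (same pattern in all channels).
any_colour_filter = np.stack([vertical, vertical, vertical], axis=-1)  # (3, 3, 3)

print(red_only_filter.shape, any_colour_filter.shape)
```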

Detecting Multiple features:

Let us ask ourselves the question

How do we detect horizontal and vertical edges at the same time?

How do we detect features along, let's say, a 45° or 70° angle as well?

Well!! The answer is simple: we use two (or multiple) filters. We obtain a convolved output using the yellow filter and the orange filter; the resulting matrix is 4×4×2, representing the horizontal and vertical edges respectively, or edges along 45° or 70° as required, by altering the filter parameters appropriately.

Note: there can be 15, or maybe 100, or maybe several hundred different features. The output will then have a number of channels equal to the number of features (filters) we are trying to detect.
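A minimal sketch of stacking the outputs of two filters (again assuming the convolve_rgb helper from earlier and made-up filter values) to get a 4×4×2 result:

```python
import numpy as np

# Two hypothetical 3x3x3 filters, e.g. one vertical and one horizontal detector.
vertical_filter   = np.random.randn(3, 3, 3)
horizontal_filter = np.random.randn(3, 3, 3)

rgb_image = np.random.randint(0, 256, size=(6, 6, 3))

# Convolve the same input with each filter and stack the feature maps along
# the depth dimension: number of output channels == number of filters.
maps = [convolve_rgb(rgb_image, f) for f in (vertical_filter, horizontal_filter)]
output = np.stack(maps, axis=-1)
print(output.shape)   # (4, 4, 2)
```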

A real-world example is shown below: in the first image, features along the horizontal (in white) are more distinct because the first row of parameters is kept at all 1s in the filter; similarly, in the second image, features along the vertical are more distinct.

Now that we have understood the basic concepts of convolution in an RGB matrix , let us proceed further with the CNN architecture

Advantages of using convolution nets over regular feed forward Neural networks

As can be seen on the left, why not flatten a 3×3 matrix and feed it into a conventional neural net as a 9×1 input?

Well, the above approach works only for very basic binary images, and even then with sub-par accuracy and precision.

A ConvNet is able to successfully capture the spatial and temporal dependencies in an image through the application of relevant filters. The architecture fits the image dataset better due to the reduction in the number of parameters involved and the re-usability of weights. In other words, the network can be trained to understand the sophistication of the image better.
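To see the parameter reduction concretely, here is a back-of-the-envelope comparison (assuming a hypothetical 64×64×3 input; the layer sizes are arbitrary):

```python
# Fully connected: every one of the 64*64*3 inputs connects to every neuron.
inputs = 64 * 64 * 3                      # 12,288 input values
dense_params = inputs * 100 + 100         # 100 neurons -> 1,228,900 parameters

# Convolutional: each filter has 3*3*3 weights + 1 bias, reused at every position.
conv_params = (3 * 3 * 3 + 1) * 100       # 100 filters -> 2,800 parameters

print(dense_params, conv_params)
```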

So now we have understood what we mean by convolution and its basic application in the colour coding system!

Please find more details ahead in Part 2 of the article.
