Convolutional Neural Networks: Demystified

Shivam Batra · Analytics Vidhya · Sep 23, 2020


Artificial Intelligence has become very advanced over the last decade. It has even reached close to human-level accuracy on tasks like image classification and object detection.

There was a breakthrough in the field of computer vision back in 2012, when Alex Krizhevsky and his colleagues won the ImageNet competition. They built a convolutional neural network architecture, now known as AlexNet, to solve the problem of image classification.

This was the beginning of a new era in computer vision, and since then we have built on this idea with architectures like VGG16, VGG19, ResNet50, and ResNet152 that solve computer vision problems with almost human-level accuracy. The convolutional neural network was the basis for all of these architectures.

But what is this convolutional neural network and how does it work? I will explain it step by step in this article. First, though, we need some basics: what is a neural network in the first place, and how is it different from a convolutional neural network (CNN)?

What is a Neural Network?

At a very high level, a neural network is an architecture of connected neurons that can learn the relationship between an input and an output. The input can be the pixels of an image, and the output can be the class the image belongs to, such as dog or cat. If you want to read more about neural networks, you can check out my previous blog on neural networks here.

Why Convolutional Neural Networks over traditional approaches?

There were some traditional approaches for solving problems with images: machine learning algorithms built on hand-engineered features, such as template matching and unitary image transforms. They were the go-to ideas for image-related problems like face detection, face recognition, image classification, and object detection. But all of these approaches were based on hand-coded features, and when you fed those features into a machine learning algorithm, it would not generalize to a wider dataset. Thus they did not work well in the real world.

How is a Convolutional Neural Network better than traditional approaches?

[Figure: A high-level view of a convolutional neural network]

A CNN takes an image as input and learns to extract features from it instead of relying on hand-coded features, so it has far more flexibility in what it learns.

An artificial neural network (ANN) takes all the features as input and tries to learn a function between those input features and the outputs.

In this case, our input features are the raw pixels of the image, which individually may not have a good relationship with the outputs. So if we use an ANN directly for a task like image classification, it would treat every pixel as a separate feature and run into a problem called the curse of dimensionality.
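
To see the scale of the problem, here is some back-of-the-envelope arithmetic; the 224×224 image size and 1,000 hidden units are illustrative assumptions, not numbers from this article:

```python
# Rough arithmetic for feeding raw pixels straight into a dense layer.
# The 224x224 image size and 1,000 hidden units are illustrative assumptions.
height, width, channels = 224, 224, 3
input_features = height * width * channels        # 150,528 features per image
hidden_units = 1_000
first_layer_weights = input_features * hidden_units
print(input_features)        # 150528
print(first_layer_weights)   # 150528000, roughly 150 million weights in one layer
```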

But if we use a CNN, it first tries to extract features from the image using a weight matrix called a filter, creating a map of features that is generally known as a feature map. This feature map is then passed on to the ANN, or dense, layers of the network, which make the prediction.

The filters that produce the feature maps are learned during training, so we can feed in any kind of image and the network will learn which features to extract from it.

How are these maps created and how do filters help in the process?

The feature maps are created using a weight matrix called a filter or kernel. Initially these weights are randomly assigned, just like every other weight in the network, but they are learned during training.

[Figure: Feature extraction using a filter]

This is an example where the input image is 6×6 and we have a 3×3 filter extracting the features. First we take a 3×3 portion of the 6×6 image, then we calculate the dot product between the filter and that part of the image, which produces a single number. In this case, we have generated 31.

In reality, we have images with RGB channels, which means the image would be 6×6×3 in this case, and the portion would be 3×3×3, as would the filter. Imagine just flattening the 3×3×3 portion into a 1×27 vector and then taking the dot product between the kernel and that portion of the image.

Now we repeat this process for the whole image by moving the filter across it, which generates a two-dimensional map, or matrix. We call this map a feature map. In practice, we use more than one filter to extract more features from the image. The number of filters is commonly a power of 2, such as 32, 64, 128, or 256.
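
To make the sliding dot product concrete, here is a minimal NumPy sketch with stride 1 and no padding; the random image and filter values are placeholders for illustration:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide a kernel over the image (stride 1, no padding) and
    take the dot product at each position to build a feature map."""
    h, w = image.shape
    f = kernel.shape[0]
    out = np.zeros((h - f + 1, w - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + f, j:j + f]
            out[i, j] = np.sum(patch * kernel)  # element-wise product, then sum
    return out

image = np.random.randint(0, 256, size=(6, 6))  # a toy 6x6 single-channel image
kernel = np.random.randn(3, 3)                  # a 3x3 filter (learned during training)
feature_map = convolve2d(image, kernel)
print(feature_map.shape)                        # (4, 4)
```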

If we use n filters, the output of the convolution layer will be a stack of feature maps of size h × w × n, where h × w is the size of each individual feature map.

Padding

Sometimes in practice we want to extract features using every part of the image, including the borders, so we use padding. Padding is just adding zeros around the border of the image, which gives us a bigger image to convolve over so that the border pixels also contribute to the features. It is also a technique for generating a feature map of a certain size.

We also use a stride in the convolution layer, which is just how many steps the filter moves across the image at a time during feature map generation. We can use this formula to calculate the size of the feature map that will be created:

(n + 2p - f)/s + 1 → feature map size (rounded down when it is not a whole number)

n is the input size of the image (if the image is 100×100, then n is 100),

p is the padding,

f is the filter size (if the filter is 3×3, then f is 3), and

s is the stride.
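
As a quick sanity check, here is a tiny helper of my own (an illustrative snippet, not from any library) that applies this formula. Note that the 6×6 image with a 3×3 filter from earlier gives a 4×4 feature map:

```python
def output_size(n, f, p=0, s=1):
    """Feature-map size for input size n, filter size f, padding p, stride s."""
    return (n + 2 * p - f) // s + 1

print(output_size(6, 3))         # 4  -> the 6x6 image with a 3x3 filter
print(output_size(6, 3, p=1))    # 6  -> padding of 1 keeps the original size
print(output_size(100, 3, s=2))  # 49 -> a stride of 2 roughly halves the size
```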

Now that we know about feature maps and convolution layers, let's look at some extra techniques used to refine the results. There are two more components commonly seen in a CNN architecture: pooling layers and the ReLU activation function.

Pooling Layer

A pooling layer is another building block of a CNN. Its function is to progressively reduce the spatial size of the representation to reduce the number of parameters and computation in the network. The pooling layer operates on each feature map independently.

The whole idea is to reduce the size of the feature map and keep only the important features in the network.

The common ways of pooling are:

  1. Max pooling
  2. Min pooling
  3. Average pooling

Before understanding each type of pooling let’s understand how pooling works.

[Figure: Pooling layers in a CNN]

In the above diagram, assume we have created a 4×4 feature map and now want to decrease its size. We can do this with a pooling layer and a pool size.

The pool size is just the size of the window taken from the feature map at each step of downsizing. We have used 2×2 here. This crops a 2×2 patch out of the feature map and then takes the maximum number from it (max pooling).

This process is repeated for each feature map independently. We can take the maximum number using max pooling, the minimum number using min pooling, or the average of all the numbers in the region using average pooling. The most common are max and average pooling.
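
Here is a minimal NumPy sketch of pooling with non-overlapping 2×2 windows; the function name and the toy feature map are just illustrative:

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Downsample a feature map with non-overlapping size x size windows."""
    h, w = feature_map.shape
    out = np.zeros((h // size, w // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = feature_map[i*size:(i+1)*size, j*size:(j+1)*size]
            if mode == "max":
                out[i, j] = window.max()
            elif mode == "min":
                out[i, j] = window.min()
            else:                       # average pooling
                out[i, j] = window.mean()
    return out

fmap = np.arange(16).reshape(4, 4)  # a toy 4x4 feature map, as in the diagram
print(pool2d(fmap, 2, "max"))       # 2x2 output: [[ 5.  7.] [13. 15.]]
```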

Now that we have used pooling layers to downsize the feature maps, we can use the ReLU activation function to refine them.

ReLU Activation Function

The activation function of a node defines the output of that node given an input or set of inputs. The activation function decides which information should go forward in the network and which information should be neglected.

There are different activation functions used in neural networks, like sigmoid, tanh, leaky ReLU, and ReLU, but when it comes to CNNs, ReLU is the most used activation function.

[Figure: The ReLU activation function]

ReLU (rectified linear unit) simply outputs x for a positive input x and zero for a negative input, i.e. f(x) = max(0, x). This helps refine the features, and computing the gradients also becomes faster because the gradient is zero for all negative x.

[Figure: Effect of ReLU on feature maps]

The values in the feature maps are refined using the ReLU activation function: all the negative values become zero and all the positive values stay as they are. This helps us identify the important features while preserving their locations in the feature map.
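
In code, ReLU is a one-liner; here is a small NumPy sketch applied to a toy feature map:

```python
import numpy as np

def relu(x):
    """ReLU: keep positive values, zero out negatives."""
    return np.maximum(0, x)

fmap = np.array([[-3.0, 1.5],
                 [ 0.7, -0.2]])
print(relu(fmap))  # [[0.  1.5] [0.7 0. ]]
```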

Sometimes we apply ReLU after the pooling layer and sometimes before it. This is an architectural choice that depends on the practitioner and the problem statement.

This process can be repeated many layers deep, as in VGG16 and VGG19, but the basic components of the convolutional neural network remain the same.

Once we have these refined features, we use the ANN, or dense, layers at the end to finally learn the function between the inputs and the outputs.

Let us assume we end up with a feature map of 30×30×20. We can just flatten it out and pass it to the artificial neural network. In that case, the input would be 1×18,000, a very big vector that is expensive to process.

This is the reason we use pooling layers and activation functions: to reduce the size of the final vector while preserving only the information required for the problem at hand.
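
To tie all the pieces together, here is a minimal sketch of a complete CNN in Keras. The framework, layer sizes, and the dog-vs-cat style output are my assumptions for illustration; the article itself does not prescribe any of them:

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    # Convolution: 32 filters of 3x3 slide over the image to build feature maps
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(64, 64, 3)),
    layers.MaxPooling2D((2, 2)),            # downsize each feature map
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                       # feature maps -> one long vector
    layers.Dense(64, activation="relu"),    # dense (ANN) layers learn the mapping
    layers.Dense(1, activation="sigmoid"),  # e.g. dog vs. cat probability
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```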

Once we have extracted and refined all the features, we can use the neural network to do the task, whether that is image classification, object localization, or image segmentation. The architecture will vary with respect to the problem statement, but the core idea remains the same.

I hope you have learned something from this article. I will be writing more on recurrent neural networks.
