WTF is a Convolutional Neural Network

Previously we saw how basic feed-forward networks work: weight and bias initialization, back-propagation and gradient descent. Now we shall look at another type of neural network, yes, there is more than one! Specifically, the convolutional neural network (CNN). This is just a fancy name for a network that is really good at recognising images and video data.

CNNs, like regular neural networks, are made up of neurons with learnable weights and biases. Each neuron receives several inputs, takes a weighted sum over them, passes it through an activation function and responds with an output. The whole network has a loss function, and everything we learnt for neural networks still applies to CNNs.
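As a quick refresher, a single neuron's computation can be sketched in a few lines of NumPy (the input and weight values here are made up purely for illustration):

```python
import numpy as np

inputs = np.array([0.5, 0.1, 0.9])    # made-up input values
weights = np.array([0.4, -0.2, 0.7])  # learnable weights
bias = 0.1                            # learnable bias

z = np.dot(weights, inputs) + bias    # weighted sum of the inputs
output = max(0.0, z)                  # ReLU activation
```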

So, how are convolutional neural networks different from regular neural networks? The answer is that although we could use a basic feed-forward network on images, it becomes extremely computationally costly, meaning your computer gets fried while you attempt it. More importantly, when we deal with images the number of parameters explodes, and our job is to be as lazy as possible and reduce our work. In theory there is nothing stopping us from using regular networks, but why would we?
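To see the problem, here is a quick back-of-the-envelope calculation; the image and layer sizes are just illustrative choices:

```python
# A single fully-connected layer on a modest 200x200 RGB image:
inputs = 200 * 200 * 3           # 120,000 input values per image
hidden_units = 1000              # an arbitrary hidden layer size
parameters = inputs * hidden_units
print(parameters)                # 120,000,000 weights in ONE layer
```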


In general, when we talk about a CNN we usually see an image like this:

An immediate reaction is to look away from this monstrous mess. But once we understand what's going on under the hood, things start clearing up and we start to appreciate this creation. When we talk about image classification, everyone's head turns to the MNIST handwritten-digit dataset, containing around 70,000 images of 28x28 pixels, which is every beginner's starting point with CNNs. We shall use this data to understand what is going on.
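If you want to follow along, the dataset ships with most deep learning libraries; here is one way to load it with Keras (assuming TensorFlow is installed):

```python
from tensorflow import keras

# Load the MNIST digit dataset (60,000 training + 10,000 test images)
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

print(x_train.shape)  # (60000, 28, 28): greyscale 28x28 images
print(y_train[:5])    # the digit labels, e.g. [5 0 4 1 9]

# Scale pixel values from 0-255 down to the (0, 1) range
x_train = x_train / 255.0
x_test = x_test / 255.0
```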

Moving on, our next job is to create a neural network which can tell us the number when shown an image of a digit. How do we go about doing this? Well, we first have to create features for the image. Just as we gave height and weight as features to predict a dog's breed, we have to create features for the image and then feed them to a normal neural network to predict the number. To do this, let us visualize the image like this:

The image is represented with a width, height and depth based on its RGB channels, but since the MNIST images are greyscale, each has only two dimensions, with pixel values between (0, 1) representing how dark the pixel is. Next we take a tiny camera-like object called a filter or kernel and run it over the whole image like a robot.

Now this filter is basically a matrix of weight values, like we learnt previously, and it creates features by multiplying them with the pixel values of the image. This results in an image with the same dimensions (assuming we pad the borders) but a different composition. It may at first look like gibberish, since we choose random weight values for the filter, but as we go on optimizing these values, they start representing higher abstractions of the image. To better visualize this, visit:
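Here is a minimal sketch of that sliding-filter multiplication in plain NumPy; the 3x3 filter values are random, just as they would be before training:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image`, taking a weighted sum at each position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise multiply the patch with the filter, then sum
            output[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return output

image = np.random.rand(28, 28)       # a stand-in for one MNIST image
kernel = np.random.randn(3, 3)       # random, not-yet-trained filter weights
feature = convolve2d(image, kernel)  # shape (26, 26), since we skip padding here
```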

Next we use a bunch of these filters with different weights on the same image, which creates new images. How many? As many as the number of filters we use. Then we stack them like pancakes, creating something called a feature map, which has a shape of (image height x image width x number of filters) and looks something like this:
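Continuing the sketch above (reusing `convolve2d` and `image` from the previous snippet), stacking the outputs of several filters is a single array operation; 32 filters is an arbitrary choice here:

```python
num_filters = 32
kernels = [np.random.randn(3, 3) for _ in range(num_filters)]

# One convolved image per filter, stacked along a new depth axis
feature_map = np.stack([convolve2d(image, k) for k in kernels], axis=-1)
print(feature_map.shape)  # (26, 26, 32): height x width x number of filters
```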

Next we generally apply an activation function such as ReLU to the pixel values, which also helps counteract diluted/saturated gradients. Since we have a basic one-channel image, we could stop there, but in other cases we would have to reduce the number of parameters in an efficient manner to be able to run this; remember, we are lazy. This is where pooling, or subsampling, comes into the picture. Pooling is simply a way to shrink the input without disturbing the image's features. Like a filter, it is of a defined size, and it goes through the three-dimensional feature map and shrinks it using various methods like max-pooling or mean-pooling. Its job is simple: simplify the input.
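Both steps are easy to sketch as well; this max-pooling pass halves the height and width by keeping only the largest value in each 2x2 window (again reusing `feature_map` from above):

```python
def relu(x):
    return np.maximum(0.0, x)

def max_pool2d(feature_map, size=2):
    """Shrink height and width, keeping the max of each size x size window."""
    h, w, d = feature_map.shape
    h, w = h - h % size, w - w % size  # trim edges that don't fit a full window
    pooled = feature_map[:h, :w, :].reshape(h // size, size, w // size, size, d)
    return pooled.max(axis=(1, 3))

activated = relu(feature_map)
pooled = max_pool2d(activated)
print(pooled.shape)  # (13, 13, 32): same depth, half the height and width
```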


This process of convolving, activating and pooling is repeated multiple times, and each time we are simply creating more abstract features so that the dense, normal feed-forward network at the end can classify a single image.

To better understand this, let us go over the entire process at a higher level. The network is divided into two parts: feature extraction, then classification. The network first takes an image (28x28 px), creates arbitrary weight values for a filter of some size, then runs it over the whole image to create variant copies of the image. Note that we are not trying to create weights in such a way that the network classifies only 3's or 4's, but every number.

As the filter window moves across the image, it creates one version of the image, and we use the desired number of filters, say 32, to create 32 versions of the original image and then stack them together. This creates a block of images of size 28x28x32, where 28 is the height and width of the image and 32 is the number of images. This process of filter rolling and activating is called convolving, hence the name convolution. Next we pool the block of images to reduce computation.

Going further, the next convolution creates more features from the block of images we just made, and this process is repeated several times. At the end we attach a normal neural network, generally called a fully-connected network, and pass it the flattened version of the last feature block, meaning one long string of its values, after which the densely connected network does its magic like we discussed last post.
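Put together in code, the whole pipeline might look like this: a minimal sketch using the Keras API, where the filter counts and layer sizes are my own illustrative choices, not the only correct ones:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    # Feature extraction: convolve, activate, pool (repeated)
    layers.Conv2D(32, kernel_size=3, activation="relu",
                  input_shape=(28, 28, 1)),  # 28x28 greyscale input
    layers.MaxPooling2D(pool_size=2),
    layers.Conv2D(64, kernel_size=3, activation="relu"),
    layers.MaxPooling2D(pool_size=2),

    # Classification: flatten the last feature block into one long string
    # of values and feed it to a normal fully-connected network
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),  # one output per digit 0-9
])

model.summary()
```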

There is still one thing left for us to do: we never adjusted the weight values anywhere. This is where an optimizer like gradient descent comes in; using back-propagation we fix the weight values in such a way that each filter, consisting of these weights, can detect and make more abstract versions of the original image. And so we train the model on a lot of images using this process, so that it can predict new, unseen images with high accuracy.
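With the model sketched above, training is just a matter of picking an optimizer and a loss and letting back-propagation do its work; the settings here are illustrative, and the data comes from the earlier loading snippet:

```python
# Add the channel dimension Keras expects: (60000, 28, 28) -> (60000, 28, 28, 1)
x_train = x_train.reshape(-1, 28, 28, 1)
x_test = x_test.reshape(-1, 28, 28, 1)

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Back-propagation adjusts every filter and dense weight on each batch
model.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test))
```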

We have finally understood what this mess represents! At least I hope you do. Having learnt all this, I would wager this creation is far more interesting to you now than when you started.


Moving on, this method of learning is not restricted to images; almost anything that can be mapped onto a grid-like structure without positional influence can be used. For example, sales data over time can be mapped onto a grid and fed through this same process. It is quite intriguing how closely this technology resembles our own vision-learning process.
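As a tiny illustration, the Keras layers used earlier have one-dimensional counterparts for sequence data like sales over time; the shapes here are hypothetical:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical setup: 30 days of sales history, one value per day
sequence_model = keras.Sequential([
    layers.Conv1D(16, kernel_size=3, activation="relu",
                  input_shape=(30, 1)),  # the filter slides along time
    layers.MaxPooling1D(pool_size=2),
    layers.Flatten(),
    layers.Dense(1),                     # e.g. predict the next day's sales
])
```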

Furthermore, you might notice that there is not much math or calculus used in these explanations, but they are an integral part of fully understanding the actual learning in these networks, and I strongly recommend at least learning to implement these techniques in code to better understand the concepts. You can visit: