How Convolutional Neural Networks view the world (ANN Series #1)
This is a series I have started from my research and study of various Neural Networks. I hope to share what I have learnt and to take part in the discussions around it. Do share your views/opinions!
‘CNNs do not suffer from the curse of dimensionality!’
CNNs are an attempt to make a computer/computing system view the world much like a human does. Convolutional networks were inspired by biological processes: the connectivity pattern between their neurons resembles the organization of the animal visual cortex.
Technically speaking, a convolutional neural network (CNN) is a class of deep, feed-forward artificial neural networks, most commonly applied to analyzing visual imagery. CNNs use a variation of multilayer perceptrons designed to require minimal preprocessing.
Why does CNN matter?
A major part of the recent research happening in the Data Science and Machine Learning community revolves around neural nets like CNNs. CNNs can help us understand the current state of our environment (say, using satellite imagery that can help democratize the power to make informed decisions about a country’s resources and policies), solve image detection problems (like self-driving cars, real-time analysis of behaviour, or describing a photo) and, cut to now, even make you dance without you actually dancing. ( Everybody dance now )
Architecture of CNN

As a CNN is a variation of the multilayer perceptron, it consists of many hidden layers. So, the design of a CNN is
Input Layer + Multiple Hidden Layers + Output Layer
Hidden layers are convolutional layers, pooling layers, fully connected layers and normalization layers.
Convolutional Layer
Convolutional layers apply a convolution operation to the input and pass the result to the next layer. The convolution emulates the response of an individual neuron to visual stimuli: each convolutional neuron processes data only for its receptive field.
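To make this concrete, here is a minimal NumPy sketch of the convolution operation (strictly, the cross-correlation that deep-learning libraries call "convolution"). The `conv2d` helper and the example filter are illustrative, not any library's API:

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' cross-correlation of a 2D image with a 2D kernel."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # each output element sees only a kh x kw receptive field
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
edge_kernel = np.array([[1.0, -1.0]])  # simple horizontal-difference filter
print(conv2d(image, edge_kernel))      # responds to horizontal intensity change
```

Note how each output value depends only on a small neighbourhood of the input — exactly the "receptive field" idea above.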
What makes CNN better?
Although fully connected feedforward neural networks can be used to learn features as well as classify data, it is not practical to apply this architecture to images. A very large number of neurons would be necessary, even in a shallow (opposite of deep) architecture, because of the very large input sizes associated with images, where each pixel is a relevant variable. For instance, a fully connected layer for a (small) 100 x 100 image has 10,000 weights for each neuron in the second layer. The convolution operation solves this problem by reducing the number of free parameters, allowing the network to be deeper with fewer parameters.
Local Connectivity Factor : In a CNN, each neuron is connected to only a small chunk of the input. This local connectivity improves computational efficiency and reduces computation time.
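The parameter savings are easy to verify with arithmetic. The layer sizes below (64 neurons, 64 filters, a 3x3 kernel) are illustrative assumptions, not from the text:

```python
# Parameter counts for a 100 x 100 single-channel input.
input_pixels = 100 * 100

# Fully connected: every neuron connects to every pixel.
fc_weights_per_neuron = input_pixels            # 10,000 weights per neuron
fc_layer_weights = fc_weights_per_neuron * 64   # 64 neurons -> 640,000 weights

# Convolutional: one 3x3 filter is shared across all spatial positions.
conv_weights_per_filter = 3 * 3                 # 9 weights, regardless of image size
conv_layer_weights = conv_weights_per_filter * 64  # 64 filters -> 576 weights

print(fc_layer_weights, conv_layer_weights)
```

Weight sharing is what makes the difference: the same small filter is reused at every position, so the count no longer scales with the image size.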
CNNs also mitigate the vanishing / exploding gradients problem. The exploding gradient problem arises when large error gradients accumulate and cause very large weight updates during training, making the model unstable or unable to learn; vanishing gradients are the opposite, where gradients shrink toward zero and learning stalls. Using ReLU units instead of sigmoidal non-linear functions helps here, because the sigmoid saturates for large inputs, where its gradient vanishes.
ReLu function :
f(x) = x for x ≥ 0; 0 otherwise — equivalently, f(x) = max(0, x).
So ReLU zeroes out all negative values and propagates only the non-negative ones. The sigmoidal function, by contrast, takes more computational power and time to evaluate.
Sigmoidal function :
f(x) = 1 / (1 + e^(−x))
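A quick NumPy sketch of the two activations makes the contrast visible — ReLU simply clips, while the sigmoid requires an exponential and squashes everything into (0, 1):

```python
import numpy as np

def relu(x):
    # zeroes out negatives, passes non-negatives through unchanged
    return np.maximum(0, x)

def sigmoid(x):
    # saturates toward 0 or 1 for large |x|, where its gradient vanishes
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
print(relu(x))       # [0. 0. 0. 1. 3.]
print(sigmoid(0.0))  # 0.5
```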
Pooling Layer
It is common to periodically insert a pooling layer between successive convolutional layers in a CNN architecture. The pooling layer progressively reduces the spatial size of the representation, which reduces the number of parameters and the amount of computation in the network, and hence also helps control overfitting. The most popular pooling function is max pooling.
To understand this better, consider the following VGG16 architecture. (VGG16, aka OxfordNet is another variation of CNN)
The layers drawn in red are the maxpooling layers. Notice how the size of the layer becomes smaller at each maxpooling layer? That’s discretization happening.
How does it work? →
Let’s say we have a 4x4 matrix representing our initial input.
Let’s say, as well, that we have a 2x2 filter that we’ll run over our input with a stride of 2 (meaning the (dx, dy) for stepping over our input is (2, 2)), so the regions don’t overlap.
For each of the regions represented by the filter, we will take the max of that region and create a new, output matrix where each element is the max of a region in the original input.
So, only the most important or relevant features are carried forward for consideration/learning.
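The three steps above can be sketched in a few lines of NumPy. The `maxpool2x2` helper is an illustration for even-sized inputs, not a library function:

```python
import numpy as np

def maxpool2x2(x):
    """2x2 max pooling with stride 2 on a matrix with even dimensions."""
    h, w = x.shape
    # carve the matrix into non-overlapping 2x2 regions, take each region's max
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 7],
              [8, 2, 0, 1],
              [3, 9, 4, 2]])
print(maxpool2x2(x))  # [[6 7]
                      #  [9 4]]
```

The 4x4 input shrinks to a 2x2 output, and only the strongest activation in each region survives — the "most relevant features carried forward" described above.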
These are a few of the fundamental ideas behind CNNs. There are many exciting variations on them. More on that, later!
I shall exit(0) now.