A quick grasp of Convolutional Neural Networks (CNNs) and their implementation on the MNIST dataset

Sai Karthik Chedella
Published in Analytics Vidhya · May 16, 2020

Convolutional Neural Networks (CNNs) are one of the most popular kinds of deep artificial neural networks. CNNs are made up of learnable weights and biases. They are very similar to ordinary neural networks, but not exactly the same.

CNNs are primarily used in image recognition, image clustering and classification, object detection, etc.

Why CNNs?

CNNs use weight sharing, are less complex, and occupy less memory.

Let’s take an MNIST dataset image and pass it to a CNN and to an ordinary NN.

Assume the CNN layer has 10 filters of size 5x5; then we have 5x5x10 + 10 (biases) = 260 params.

Assume the flattened image has 784 pixels and the NN has a fully connected layer of 25 neurons; then in the Neural Network (NN) we have 784 x 25 + 25 (biases) = 19,625 params.

So, CNNs outperform plain NNs on conventional image recognition tasks and many other tasks.
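As a quick sanity check, these parameter counts can be reproduced with a couple of Keras layers (a minimal sketch, using the layer sizes assumed above):

```python
from tensorflow.keras import layers, models

# CNN layer: 10 filters of size 5x5 on a single-channel 28x28 image
cnn = models.Sequential([layers.Input((28, 28, 1)), layers.Conv2D(10, (5, 5))])
cnn.summary()   # total params: 5*5*1*10 + 10 biases = 260

# Fully connected layer: 25 neurons on the flattened 784-pixel image
nn = models.Sequential([layers.Input((784,)), layers.Dense(25)])
nn.summary()    # total params: 784*25 + 25 biases = 19,625
```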

The idea behind the working of CNN

The convolution operation in computer vision is biologically inspired by the brain’s visual cortex. The connectivity pattern of a CNN resembles the structure of the animal visual cortex.

If an image is passed to the visual cortex, the cortex processes that information through segments/layers, and the brain extracts information from every segment/layer. The first layers learn representations such as edges or colors, intermediate-level layers learn more abstract representations such as object parts, and finally, high-level layers learn full objects such as a cat’s face. As the level of abstraction increases, the inferences become clearer. Thus, the brain makes decisions from the information it has learned across all layers.

Layers of CNN

A CNN consists of different layers: an input layer and an output layer, with multiple hidden layers in between, such as the convolution layer, activation layer, max-pooling layer, and fully connected layer. There is no limit on the number of hidden layers in the network. The input layer takes the input, the network is trained on it, and the output layer produces the prediction. With the help of CNNs, we can use large amounts of data more effectively and accurately.

In the convolution layer, we have the convolution operation, padding, and strides.

This layer mainly helps in detecting features such as edges. For a detailed explanation, refer to the Sobel edge detector.

Convolution operation

In mathematics, convolution is a mathematical operation on two functions that produces a third function expressing how the shape of one is modified by the other.

The convolution layer applies a convolution operation to the input and passes the result to the next layer. Each convolution unit processes the data only within its receptive field.

In convolution, we perform an element-wise multiplication and addition between the image matrix and the kernel/filter matrix, which produces an output matrix.

The kernel/filter matrix contains weights that are randomly initialized and get updated during backpropagation.

[Figure: convolution operation]
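To make the multiply-and-sum concrete, here is a minimal NumPy sketch of the operation, assuming a square single-channel image and a square kernel (deep-learning libraries actually compute cross-correlation, i.e. the kernel is not flipped):

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image; at each position,
    multiply element-wise and sum to get one output value."""
    n, f = image.shape[0], kernel.shape[0]
    out = np.zeros((n - f + 1, n - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + f, j:j + f] * kernel)
    return out

# Example: a 5x5 image and a 3x3 kernel give a 3x3 convolved feature
image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0           # simple averaging filter
print(convolve2d(image, kernel).shape)   # (3, 3)
```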

Padding & strides

It’s clear that after performing convolution, we obtain a convolved feature/matrix of reduced dimensions. To keep the output dimensions the same as the input, we use padding.

Padding adds 0’s symmetrically to the border of the input matrix, so the dimensions lost during convolution come from the padded border, and the output retains the original dimensions.

Dimensions before padding: for an n x n image and an f x f kernel, the output is (n − f + 1) x (n − f + 1).

Dimensions after padding: adding p pixels of padding on each side of the image gives an output of (n + 2p − f + 1) x (n + 2p − f + 1).

Stride

The kernel/filter is moved across the image matrix from left to right and top to bottom, shifting by one pixel column on each horizontal movement and by one pixel row on each vertical movement.

The amount of movement of the kernel over the image is termed the “stride”.

For stride S and padding P, the output dimension is floor((n + 2P − f) / S) + 1.

These kernel values get updated for every iteration.
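This formula can be wrapped in a small helper to sanity-check layer dimensions (a sketch; the example sizes are illustrative):

```python
def conv_output_size(n, f, p=0, s=1):
    """Output dimension for an n x n input, f x f kernel, padding p, stride s."""
    return (n + 2 * p - f) // s + 1

print(conv_output_size(28, 5))             # 24 -> no padding, stride 1
print(conv_output_size(28, 5, p=2))        # 28 -> padding keeps the input size
print(conv_output_size(28, 5, p=2, s=2))   # 14 -> stride 2 roughly halves it
```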

Convolution over RGB images

So far we have looked at convolution on grayscale images, with pixel values in the range 0–255.

RGB images have 3 color channels, so the dimensions look like (n x n x c),

where c = no. of channels (3 for RGB).

Either way, the channel dimension collapses in the output, because the element-wise multiplication and addition are performed between the c-channel image patch and a c-channel kernel, producing a single 2-D feature map per kernel.
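A small NumPy sketch of how the channel dimension collapses (the sizes are illustrative):

```python
import numpy as np

# A 3-channel 6x6 "RGB" image convolved with a single 3x3x3 kernel:
# multiply element-wise across all channels and sum -> one 2-D feature map.
image = np.random.rand(6, 6, 3)
kernel = np.random.rand(3, 3, 3)

out = np.zeros((4, 4))                   # (6 - 3 + 1) x (6 - 3 + 1)
for i in range(4):
    for j in range(4):
        out[i, j] = np.sum(image[i:i + 3, j:j + 3, :] * kernel)

print(out.shape)                         # (4, 4): the channel dimension is gone
```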

Convolution with Multiple kernels:

Let's consider, we have ‘m’ kernels

Suppose, if convolution is performed on an image with more than one kernel, then dimensions of output (i.e convoluted feature) also increases by ‘m’ dimensions.

[Figure: convolution with multiple kernels]
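A quick way to see this is to inspect the output shape of a convolution layer with m kernels, e.g. in Keras (the sizes below are assumptions for illustration):

```python
import numpy as np
from tensorflow.keras import layers, models

m = 8                                        # number of kernels/filters
model = models.Sequential([
    layers.Input((28, 28, 3)),               # an RGB input
    layers.Conv2D(m, (5, 5)),                # m kernels, each of shape 5x5x3
])

x = np.random.rand(1, 28, 28, 3).astype("float32")
print(model(x).shape)                        # (1, 24, 24, 8): one 24x24 map per kernel
```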

Activation Layer

The activation function introduces non-linearity into the model, so that it can represent more complex functions and learn more.

It is used to determine the output of a neuron (e.g. yes/no) and is applied between layers of the network. After the element-wise multiplication and addition using the kernel, we apply an activation such as ReLU, Sigmoid, Tanh, Leaky ReLU, or Softmax to that value.

Therefore, every value in the convolved feature is the output of the activation layer:

Activation(a*w1 + b*w2 + e*w3 + f*w4),

where a, b, e, f are a cluster of elements from the image and w1, w2, w3, w4 are the weights of the kernel.
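A tiny sketch of this computation with ReLU as the activation (all values are illustrative):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# One value of the convolved feature: the weighted sum of an image patch
# (a, b, e, f) and the kernel weights (w1..w4), passed through the activation.
a, b, e, f = 0.2, 1.5, 0.7, 0.4         # image patch values
w1, w2, w3, w4 = 0.5, 0.3, -0.8, 1.1    # kernel weights

z = a * w1 + b * w2 + e * w3 + f * w4   # 0.43
print(relu(z))                          # ~0.43 after the non-linearity
```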

Pooling Layer

It is responsible for reducing the spatial size of the convolved feature. Pooling combines a cluster of elements in one layer into a single element in the next layer.

[Figure: pooling layer with stride = 2 and a (2, 2) kernel]

“Max pooling” uses the maximum value from each cluster of elements at the previous layer.

“Mean/Average pooling” uses the average value from clusters of elements at the previous layer.
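A minimal NumPy sketch of max pooling with a (2, 2) window and stride 2:

```python
import numpy as np

def max_pool_2x2(x):
    """Max pooling with a (2, 2) window and stride 2 on a 2-D feature map."""
    h, w = x.shape[0] // 2, x.shape[1] // 2
    return x[:h * 2, :w * 2].reshape(h, 2, w, 2).max(axis=(1, 3))

feature = np.array([[1, 3, 2, 0],
                    [4, 6, 5, 1],
                    [7, 2, 9, 8],
                    [0, 1, 3, 4]], dtype=float)
print(max_pool_2x2(feature))
# [[6. 5.]
#  [7. 9.]]
```

Replacing `.max(...)` with `.mean(...)` gives average pooling.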

Fully Connected

The last layers in the network are fully connected, meaning that the elements of the preceding layer are connected to every element in the subsequent layer. This mimics high-level reasoning, where all possible pathways from input to output are considered.

[Figure: all layers of a CNN]

Implementation

Model architecture
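A minimal end-to-end sketch in Keras, assuming a simple architecture of two convolution + pooling blocks followed by fully connected layers (the layer sizes and hyper-parameters are illustrative choices):

```python
from tensorflow.keras import layers, models, datasets, utils

# Load MNIST and scale pixel values to [0, 1]
(x_train, y_train), (x_test, y_test) = datasets.mnist.load_data()
x_train = x_train.reshape(-1, 28, 28, 1).astype("float32") / 255.0
x_test = x_test.reshape(-1, 28, 28, 1).astype("float32") / 255.0
y_train = utils.to_categorical(y_train, 10)
y_test = utils.to_categorical(y_test, 10)

# Convolution -> activation -> pooling, twice, then fully connected layers
model = models.Sequential([
    layers.Input((28, 28, 1)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),   # one probability per digit class
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()

model.fit(x_train, y_train, epochs=5, batch_size=128, validation_split=0.1)
print(model.evaluate(x_test, y_test))         # [test loss, test accuracy]
```

A small model like this typically reaches around 99% accuracy on the MNIST test set after a few epochs.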

References:

https://en.wikipedia.org/wiki/Convolutional_neural_network

https://www.quora.com/What-is-a-convolutional-neural-network

https://www.quora.com/What-are-the-advantages-of-a-convolutional-neural-network-CNN-compared-to-a-simple-neural-network-from-the-theoretical-and-practical-perspective

https://medium.com/intuitive-deep-learning/intuitive-deep-learning-part-1a-introduction-to-neural-networks-d7b16ebf6b99

https://www.quora.com/Does-the-visual-cortex-explicitly-perform-segmentation
