Convolutional Neural Networks: An Intuitive Approach, Part 2

Niketh Narasimhan
Published in Analytics Vidhya · 6 min read · Jul 27, 2020

A continuation of an earlier article.

Please find the link to Part 1.

Now that we have learned what convolution is, let us see how we can use it for tasks such as image classification.

CNN Architecture

As can be seen above, CNNs have two components:

  • The Hidden layers/Feature extraction part

In this part, the network performs a series of convolution and pooling operations during which features are detected. If you had a picture of a tiger, this is the part where the network would recognize the stripes, four legs, two eyes, one nose, the distinctive orange colour, etc.

  • The Classification part

Here, the fully connected layers serve as a classifier on top of the extracted features. They assign a probability that the object in the image is what the algorithm predicts it is.

The input feeds into a convolution layer.

The convolution layer:

We have understood the basic concept of convolution earlier.

Let us learn two new important concepts:

Stride:

Stride is the number of pixels by which the filter shifts over the input matrix. When the stride is 1, we move the filter one pixel at a time; when the stride is 2, we move it two pixels at a time, and so on. A stride of 1 is the most common choice, meaning the filter slides pixel by pixel. By increasing the stride, the filter slides over the input at a larger interval, so there is less overlap between windows and each value in the output is more independent of its neighbouring values. This can be seen below.

The kernel moving over the input with stride 1 and padding 1 (one layer of zeros at the borders)
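The sliding behaviour described above can be sketched in NumPy. This is a minimal illustration and not code from the article: the `conv2d` helper, the single-channel square input, and the use of valid (no) padding are my own assumptions.

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Valid (no-padding) 2-D convolution of a single-channel square image."""
    f = kernel.shape[0]
    out = (image.shape[0] - f) // stride + 1  # number of window positions per axis
    result = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            window = image[i*stride:i*stride+f, j*stride:j*stride+f]
            result[i, j] = np.sum(window * kernel)  # element-wise multiply, then sum
    return result

img = np.ones((5, 5))
k = np.ones((3, 3))
print(conv2d(img, k, stride=1).shape)  # (3, 3) -- filter moves one pixel at a time
print(conv2d(img, k, stride=2).shape)  # (2, 2) -- larger steps, fewer windows
```

Note how doubling the stride roughly halves each output dimension: the windows overlap less, so fewer positions are visited.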

Padding:

Sometimes the filter does not fit the input image perfectly. We have two options:

  • Pad the image with zeros (zero-padding) so that the filter fits.
  • Drop the part of the image where the filter does not fit. This is called valid padding, which keeps only the valid part of the image.
Padding a 5x5 image with zeros to obtain a 5x5x1 output

When we pad the 5x5x1 image with a one-pixel border of zeros to obtain a 7x7x1 image and then apply the 3x3x1 kernel over it, we find that the convolved matrix turns out to be of dimensions 5x5x1, the same as the input. Hence the name: Same Padding.

On the other hand, if we perform the same operation without padding, we are presented with a 3x3x1 matrix, which in this example happens to have the same dimensions as the kernel itself. This is Valid Padding.

Note: adding padding increases the output size, whereas increasing the stride decreases it.
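The effect of padding and stride on the output size follows a single formula: for input size n, filter size f, padding p and stride s, the output is ⌊(n + 2p − f)/s⌋ + 1. A small sketch (my own helper, not from the article) reproduces the two cases above:

```python
def conv_output_size(n, f, p=0, s=1):
    # output = floor((n + 2p - f) / s) + 1
    return (n + 2 * p - f) // s + 1

# Same padding: 5x5 input, 3x3 kernel, padding 1 -> 5x5 output
print(conv_output_size(5, 3, p=1, s=1))  # 5
# Valid padding: no zeros added -> 3x3 output
print(conv_output_size(5, 3, p=0, s=1))  # 3
# Increasing the stride shrinks the output
print(conv_output_size(5, 3, p=1, s=2))  # 3
```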

Feature Map

In a CNN, convolution is performed on the input data using a filter or kernel (both terms refer to the same thing) to produce a feature map (the output).

Calculating the Number of parameters:

Below are the steps to calculate the parameters:

The input and output dimensions are assumed to be the same in the above formula.
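Since the formula itself is shown in the image above, here is the standard rule as a sketch: each filter has f × f × c_in weights plus one bias, and there is one filter per output channel. The helper name is my own.

```python
def conv_layer_params(f, c_in, c_out):
    # each filter has f*f*c_in weights plus 1 bias; there are c_out filters
    return (f * f * c_in + 1) * c_out

# e.g. 32 filters of size 3x3 over an RGB (3-channel) input:
print(conv_layer_params(3, 3, 32))  # 896
```

This matches what deep-learning frameworks report for such a layer: (3·3·3 + 1) · 32 = 896 trainable parameters.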

Non Linearity (ReLU)

ReLU stands for Rectified Linear Unit, a non-linear operation. The output is ƒ(x) = max(0, x).

Why ReLU is important: ReLU's purpose is to introduce non-linearity into our ConvNet, since the real-world data we would want our ConvNet to learn is mostly non-linear.

ReLU is important because it does not saturate; the gradient is always high (equal to 1) if the neuron activates. As long as it is not a dead neuron, successive updates are fairly effective. ReLU is also very quick to evaluate.

Compare this to sigmoid or tanh, both of which saturate (if the input is very high or very low, the gradient is very, very small).

More generally, nonlinear activation functions are important because the function you are trying to learn is usually nonlinear. If nonlinear activation functions weren’t used, the net would be a large linear classifier, and could be simplified by simply multiplying the weight matrices together (accounting for bias). It wouldn’t be able to do anything interesting, such as image classification or text prediction.
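The activation and its gradient behaviour described above can be shown in a few lines of NumPy (a minimal sketch; the function names are my own):

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x), applied element-wise
    return np.maximum(0, x)

def relu_grad(x):
    # gradient is 1 wherever the neuron activates (x > 0), else 0 -- no saturation
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(x))       # [0.  0.  0.  1.5 3. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.]
```

Compare this with sigmoid, whose gradient shrinks toward zero for large positive or negative inputs: ReLU's gradient stays at exactly 1 for every activated neuron.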

Pooling layer:

The pooling layer reduces the number of parameters when the images are too large. Spatial pooling, also called subsampling or downsampling, reduces the dimensionality of each feature map but retains the important information. Spatial pooling can be of different types:

  • Max Pooling
  • Average Pooling
Max Pooling

On the left is an example of max pooling, where the maximum value is chosen from the window formed during each slide.

As can be seen on the left, average pooling chooses the average of all numbers in the window formed during the slide.

For example, the average of 12, 20, 8 and 12 is 13.


Advantages of Pooling:

  1. Dimension Reduction: In deep learning, training a model can take a huge amount of time because of excessive data size. Now consider the use of max pooling of size 5x5 with stride 1. It reduces each successive 5x5 region of the image to a 1x1 region holding the max value of that region. Here pooling reduces 25 (5x5) pixels to a single pixel (1x1), helping to avoid the "curse of dimensionality" (look this term up; it is an important concept in machine learning).
  2. Rotational/Position-Invariant Feature Extraction: Pooling can also be used to extract rotation- and position-invariant features. Consider the same example of pooling with size 5x5. Pooling extracts the max value from the given 5x5 region; it basically extracts the dominant feature value (the max) irrespective of where that value sits inside the region. Because pooling does not capture the position of the max value, it provides rotation/position-invariant feature extraction.
  3. Reduce Overfitting: Higher dimensionality in an input means we need more parameters, which can lead to overfitting. Thus we need a method for reducing this dimensionality so that we can avoid overfitting, and this is performed by the pooling layers.
Pooling operations.
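Both pooling operations can be sketched with the article's own numbers. The `pool2d` helper below is my own illustration (single-channel, square input assumed):

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    """Max or average pooling over a single-channel square input."""
    op = np.max if mode == "max" else np.mean
    out = (x.shape[0] - size) // stride + 1
    result = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            result[i, j] = op(x[i*stride:i*stride+size, j*stride:j*stride+size])
    return result

window = np.array([[12, 20],
                   [ 8, 12]])
print(pool2d(window, size=2, mode="max"))  # [[20.]] -- keeps the dominant value
print(pool2d(window, size=2, mode="avg"))  # [[13.]] -- (12+20+8+12)/4
```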

Classification:

CNN architecture

Fully connected layers:

As can be seen in the above figure, a 14x14x3 volume has been flattened into a single column of 588x1 (14 × 14 × 3 = 588).

Now that we have converted our input image into a suitable form for our Multi-Layer Perceptron, we flatten it into a column vector. The flattened output is fed to a feed-forward neural network, and backpropagation is applied on every training iteration. Over a series of epochs, the model learns to distinguish between dominating and certain low-level features in images and classifies them using the softmax classification technique (i.e. the softmax activation function).
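The softmax step mentioned above turns the raw scores of the final fully connected layer into class probabilities. A minimal sketch (the score values are made up for illustration):

```python
import numpy as np

def softmax(z):
    # subtract the max before exponentiating for numerical stability;
    # this does not change the result
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # raw outputs of the final FC layer
probs = softmax(scores)
print(round(probs.sum(), 6))  # 1.0 -- a valid probability distribution
print(np.argmax(probs))       # 0  -- the predicted class
```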

Summary

  • Provide the input image to the convolution layer.
  • Choose parameters and apply filters with strides, and padding if required. Perform convolution on the image and apply the ReLU activation to the resulting matrix.
  • Perform pooling to reduce the dimensionality.
  • Add as many convolutional layers as needed.
  • Flatten the output and feed it into a fully connected (FC) layer.
  • Output the class using an activation function (logistic regression with a cost function) to classify the image.

There are multiple types of CNN architectures; a few are listed below. Interested folks can research them! I will try to cover them in subsequent posts!

  1. LeNet
  2. AlexNet
  3. VGGNet
  4. GoogLeNet
  5. ResNet
  6. ZFNet
