All about Convolutions

Suvodeep Sinha
Published in Nerd For Tech · Mar 20, 2023

The term Convolutional Neural Network (CNN) has gained immense popularity over the years. It is a type of deep neural network that is commonly used for image and video recognition, analysis, and processing.

The key feature of CNNs is their ability to automatically learn and extract features from images using convolutional layers. Convolutional layers consist of a set of learnable filters that are convolved with the input image to extract features at different levels of abstraction. This blog explains how convolutions work and introduces the terms you need to be familiar with to understand the architecture.

Terminology

Overview of a Convolutional Neural Network

Before introducing the architecture of CNNs, it is important to know a few terms in brief:

Kernel:

A kernel (also known as filter or feature detector) is a small matrix of numbers that is used to scan an image to extract features. By applying multiple kernels, a CNN can detect different features in the input image, such as edges, lines, and shapes, at different spatial locations.

A convolution involving a 3x3 kernel

A convolutional operation involves sliding a filter (also called a kernel or a window) over the input image, performing element-wise multiplication between the filter and the corresponding pixels in the input image, and then summing up the results to produce a single output value. The filter is typically much smaller than the input image.
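The sliding-window operation described above can be sketched in a few lines of NumPy. This is a naive "valid" convolution (no padding, stride 1) for a single-channel image, meant only to illustrate the mechanics; real frameworks use far more optimized implementations, and the vertical-edge kernel below is just an illustrative choice:

```python
import numpy as np

def conv2d(image, kernel):
    """Naive 'valid' convolution: slide the kernel over the image,
    multiply element-wise, and sum up each window."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    oh, ow = ih - kh + 1, iw - kw + 1  # output shrinks without padding
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
edge_kernel = np.array([[1., 0., -1.],
                        [1., 0., -1.],
                        [1., 0., -1.]])  # a classic vertical-edge detector
print(conv2d(image, edge_kernel))  # 4x4 input -> 2x2 output
```

Note how the 4x4 input shrinks to 2x2 — exactly the effect that padding (below) is designed to counteract.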

Stride and Padding

Stride denotes how many pixels the kernel moves at each step of the convolution; by default, it is 1.

Convolution with a stride of 1

We can observe that the output is smaller than the input. To keep the output the same size as the input, we use padding: the process of symmetrically adding zeros around the input matrix. In the following example, the extra grey blocks denote the padding.
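The effect of stride and padding on the output size follows a standard formula, floor((n - k + 2p) / s) + 1, where n is the input size, k the kernel size, p the padding, and s the stride. A small sketch:

```python
def output_size(n, k, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension:
    floor((n - k + 2*padding) / stride) + 1."""
    return (n - k + 2 * padding) // stride + 1

# Without padding the output shrinks; padding of 1 with a 3x3
# kernel keeps a 5x5 input at 5x5 ("same" padding):
print(output_size(5, 3, stride=1, padding=0))  # 3
print(output_size(5, 3, stride=1, padding=1))  # 5
```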

Stride and Padding = 1

The CNN Architecture

I have tried to keep the math behind the explanation to a minimum in this blog for ease of understanding. A glance at the network shown above gives an overview of the whole process.

  1. Input Layer: This layer accepts the input image and passes it to the subsequent layers.
  2. Convolutional Layer: This layer applies a set of filters to the input image and performs the convolution operation to extract features from it. (As shown above)
  3. Activation Layer: This layer applies a non-linear activation function to the output of the convolutional layer to introduce non-linearity into the network. The most common are the ReLU and sigmoid functions.
Common Activation Functions
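The two activations mentioned above are simple element-wise functions; a plain-Python sketch:

```python
import math

def relu(x):
    """ReLU: passes positive values through, clips negatives to zero."""
    return max(0.0, x)

def sigmoid(x):
    """Sigmoid: squashes any real value into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

print(relu(-2.0), relu(3.0))   # 0.0 3.0
print(sigmoid(0.0))            # 0.5
```

In a network, these are applied element-wise to every value in the feature map.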

4. Pooling Layer: This layer reduces the spatial size of the feature maps generated by the convolutional layer by downsampling them. It is typically used between two convolutional layers. Applying a fully connected layer (7) directly after a convolutional layer (2) without pooling would be computationally expensive and of little benefit, so pooling, most commonly max pooling, is used to reduce the spatial volume of the feature maps.

Max Pooling

In the above example, we have applied max pooling to a single depth slice with a stride of 2. You can observe that the 4x4 input is reduced to 2x2.
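That single-depth-slice max pooling can be sketched in NumPy as follows (2x2 windows with stride 2, mirroring the example above; the input values are illustrative):

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    """Max pooling over a single 2D depth slice: keep only the
    maximum value in each size x size window."""
    h, w = x.shape
    oh, ow = (h - size) // stride + 1, (w - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = x[i * stride:i * stride + size,
                       j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 2],
              [7, 9, 8, 3],
              [1, 2, 4, 0]], dtype=float)
print(max_pool(x))  # [[6. 5.] [9. 8.]]
```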

5. Dropout Layer: This layer randomly drops out some of the neurons in the network during training to prevent overfitting.
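Dropout can be sketched as zeroing out each unit with some probability p during training. The version below is "inverted" dropout, which also rescales the surviving units by 1/(1 - p) so the expected activation stays the same; this is a common convention, and the values and seed here are illustrative:

```python
import random

def dropout(vec, p=0.5, training=True, seed=0):
    """Inverted dropout sketch: during training, zero each unit with
    probability p and scale survivors by 1/(1-p). At inference time,
    pass values through unchanged."""
    if not training:
        return vec[:]
    rng = random.Random(seed)
    return [0.0 if rng.random() < p else v / (1 - p) for v in vec]

print(dropout([1.0, 2.0, 3.0, 4.0], p=0.5))
print(dropout([1.0, 2.0, 3.0, 4.0], p=0.5, training=False))  # unchanged
```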

6. Flatten Layer: This layer converts the multi-dimensional feature maps generated by the previous layers into a one-dimensional vector that can be passed to a fully connected layer.

Working of Dropout and Flatten Layers

7. Fully Connected Layer: This layer connects every neuron in the previous layer to every neuron in the current layer, and performs a linear transformation followed by a non-linear activation function.

8. Output Layer: This layer produces the final output of the network, which could be a probability distribution over different classes in the case of image classification, or a set of bounding boxes and class labels in the case of object detection.
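Layers 7 and 8 together boil down to a matrix multiply plus bias, followed by softmax to turn the scores into a probability distribution. A minimal NumPy sketch for a classification head (the feature size of 1352 and the 10 classes are illustrative assumptions, and the random weights stand in for learned ones):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1352)              # flattened feature vector (size assumed)
W = rng.standard_normal((10, 1352)) * 0.01 # weights for 10 classes (untrained)
b = np.zeros(10)

logits = W @ x + b                          # 7. fully connected: linear transform
probs = np.exp(logits) / np.exp(logits).sum()  # 8. softmax output layer

print(round(probs.sum(), 6))  # 1.0 — a valid probability distribution
```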

Final Result

The specific architecture of a CNN, including the number of layers and their configuration, depends on the task at hand and the characteristics of the input data.

I hope this blog was able to give a fair idea of the world of CNNs. Trying them out in Tensorflow/Pytorch or any other framework can be a good experience as well.

Until then, I would love to connect with you on Twitter as well as Github!

Suvodeep Sinha

AI @ Intel | CS @ UIUC. I like to write about Tech and my experiences around it