Understanding “convolution” operations in CNNs

aditi kothiya
Published in Analytics Vidhya · 5 min read · Jun 23, 2021

The convolutional neural network (CNN) is a major building block of deep learning; it powers computer vision tasks such as image classification, object detection, and image recognition. In this article, we will discuss several convolution operation techniques.


Have you ever used photo editing tools?

Tools that make your image sharper, or that remove an unwanted part of a photo?

If yes, then you have already applied a convolution operation to your photo.

Understanding the convolution operation

What is a kernel/filter?

A kernel is a small rectangular matrix that slides over the image from left to right and top to bottom.

[Image: a kernel/filter]

What is stride?

The number of pixels by which the kernel shifts over the input image at each step is called the stride.
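For example, sliding a 2×2 kernel across a 4-pixel-wide image gives three horizontal positions with a stride of 1 (the kernel moves one pixel at a time), but only two positions with a stride of 2.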


What is a convolution operation?

The convolution operation combines two functions to produce a third. Sliding a filter across the entire input image lets the filter discover a particular feature wherever it appears in the image; the resulting output is called a feature map.

Let’s understand it through a simple analogy: a jigsaw puzzle is a good illustration of how the convolution operation works.

In a jigsaw puzzle, each piece has a portion of an image that reveals something about the complete picture when assembled.

Similarly, in convolutional networks, multiple filters slide across the image one by one, and each filter learns a different portion (feature) of the input image.


For example, suppose we have a 3×4 input image I and a 2×2 kernel K. Convolution is an element-wise multiplication of the kernel with the image patch it covers, followed by a sum that produces the output value S(i, j).

We compute the output (the re-estimated value of the current pixel) using the following formula:

S(i, j) = (I * K)(i, j) = Σ_m Σ_n I(i + m, j + n) · K(m, n)

Here m and n index the rows and columns of the kernel.

At each position over the original image I, we take the neighborhood of pixels covered by the kernel, multiply it element-wise with the kernel K, and sum the products to obtain a single output value S(i, j). The kernel then slides from left to right and top to bottom across the larger image.

The output of this first operation would be (aw + bx + ey + fz). We then move the kernel horizontally with a stride of 1, which gives the weighted sum (bw + cx + fy + gz).

So, after this, the output of the first layer (a 2×3 feature map) would look like this:

[aw + bx + ey + fz   bw + cx + fy + gz   cw + dx + gy + hz]
[ew + fx + iy + jz   fw + gx + jy + kz   gw + hx + ky + lz]
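To make this concrete, here is a minimal NumPy sketch of the operation just described (illustrative code, not from the original article; the numeric values in I and K are placeholders for a..l and w, x, y, z):

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    # "Valid" 2D convolution as used in CNNs (technically cross-correlation):
    # multiply the kernel element-wise with each patch, then sum.
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh,
                          j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)  # weighted sum -> S(i, j)
    return out

I = np.arange(1, 13).reshape(3, 4)  # stand-in for the 3x4 image a..l
K = np.array([[1, 0],
              [0, 1]])              # stand-in for the 2x2 kernel w, x, y, z
print(conv2d(I, K))                 # 2x3 feature map, as derived above
```

Note that deep learning libraries implement cross-correlation (the kernel is not flipped) and still call it convolution; the sketch above does the same.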

We have different filters for blurring (average smoothing, Gaussian smoothing, median smoothing, etc.), edge detection (Laplacian, Sobel, Scharr, Prewitt, etc.), and sharpening; each of these kernels is designed to perform a particular function.

Let’s take smoothing filters as an example. With average smoothing over a 3×3 neighborhood, we divide the weighted sum by 9, so each output pixel becomes the average of its neighbors; this blurs the image and is used for smoothing and noise reduction. A Gaussian smoothing filter works the same way but gives more weight to pixels near the center of the kernel.

[Image: a Gaussian smoothing filter, and the application of different kernels/filters]
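As a rough illustration, here are a few classic 3×3 kernels (standard textbook values) and one convenient way to apply them; this is a sketch assuming NumPy and SciPy are available, not code from the original article:

```python
import numpy as np
from scipy.signal import correlate2d  # cross-correlation, as in CNN "convolution"

average  = np.ones((3, 3)) / 9.0                 # box blur / average smoothing
gaussian = np.array([[1, 2, 1],
                     [2, 4, 2],
                     [1, 2, 1]]) / 16.0          # Gaussian smoothing
sobel_x  = np.array([[-1, 0, 1],
                     [-2, 0, 2],
                     [-1, 0, 1]])                # horizontal gradient: vertical edges
sharpen  = np.array([[ 0, -1,  0],
                     [-1,  5, -1],
                     [ 0, -1,  0]])              # sharpening

image = np.random.rand(8, 8)                     # toy grayscale image
blurred = correlate2d(image, gaussian, mode="same")
edges   = correlate2d(image, sobel_x,  mode="same")
```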

2D convolution operation using a 3D filter

In the case of a 3D input (an RGB image has 3 channels, corresponding to Red, Green, and Blue; these channels are superimposed on each other to form the final image), we have 3 channels (the depth), one for each of R, G, and B.

So do we perform the convolution operation on each of the 3 channels separately?

Do we slide the filter along the depth too?

No. We use a filter with the same depth as the input, place it over the input, and compute the weighted sum across all three dimensions.


Here our input image and kernel are 3D, but the operation we are doing is still a 2D operation, because we move the kernel in only 2 directions: horizontally (left to right) and vertically (top to bottom).

[Image: a 3D filter applied to a 3D image]

Thus, we have learned that we can extract important features from the image using the convolution operation with filters. Instead of using one filter, we can use multiple filters to extract different features from the image and produce multiple feature maps. Every filter is responsible for extracting a different feature, such as horizontal edges, vertical edges, or nonlinear features, as the sketch below shows.
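Here is a minimal sketch of this idea with NumPy and random toy data (the function name conv2d_multichannel is ours for illustration, not a library API):

```python
import numpy as np

def conv2d_multichannel(image, kernel):
    # Slide one (kh, kw, C) filter over an (H, W, C) image. The filter depth
    # matches the input depth, so each position yields a single weighted sum
    # and the output is a 2D feature map.
    kh, kw, _ = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw, :] * kernel)
    return out

rgb = np.random.rand(5, 5, 3)                          # toy RGB image (H, W, C)
filters = [np.random.rand(3, 3, 3) for _ in range(4)]  # 4 random 3x3x3 filters
feature_maps = np.stack([conv2d_multichannel(rgb, f) for f in filters])
print(feature_maps.shape)  # (4, 3, 3): one 2D feature map per filter
```

Stacking the per-filter 2D outputs along a new axis is exactly what gives a convolutional layer its output depth.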

What is padding?

At the corners and edges, we cannot center the kernel on the pixel, so the output of the convolution operation is smaller than the input image.

What if we want the output to be the same size as the input? Or

what if the filter does not fit on the input image?

In that case, we add artificial padding of 0’s around the image, sized according to the desired output and the kernel size, as shown in the image below. This is also called zero-padding.


For example, let’s say we perform a convolution operation with a stride of 2. When we move the filter by two pixels, it can reach the border where it no longer fits on the input image. Zero-padding retains the image boundary in such cases.
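A quick sketch of zero-padding with NumPy, together with the standard output-size formula (a toy example of ours, not the article’s code):

```python
import numpy as np

img = np.arange(1, 10).reshape(3, 3)               # toy 3x3 image
padded = np.pad(img, pad_width=1, mode="constant",
                constant_values=0)                 # zero-padding of width 1
print(padded.shape)  # (5, 5)

# Standard output-size formula: O = (W - K + 2P) / S + 1
W, K, P, S = 3, 3, 1, 1
print((W - K + 2 * P) // S + 1)  # 3: "same" padding keeps the input size
```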


Conclusion

  • Convolutional neural networks apply filters to the input images to create feature maps that summarize the features detected in the input.
  • We saw how to calculate the feature map for 2D and 3D convolutional layers in a convolutional neural network.
  • We saw how zero-padding handles the border effect, so that feature maps built with different filter sizes can keep the size of the input.

Feel free to comment if you have any feedback for me to improve on, or if you want to share any thoughts or experiences on the same.

Do you want more? Follow me on LinkedIn and GitHub.
