Convolution, Padding, Stride, and Pooling in CNN

Published in

Analytics Vidhya

4 min read · Jun 25, 2020

Convolution operation

Convolution is a mathematical operation used to extract features from an image. It is defined by an image kernel, which is nothing more than a small matrix of weights. A 3x3 kernel is the most common choice.

In the figure below, the green matrix is the original image and the yellow moving matrix is the kernel, which is used to learn different features of the original image. The kernel first moves horizontally across the image, then shifts down and moves horizontally again. At each position, the kernel values are multiplied element-wise with the image pixel values underneath them, and the sum of these products becomes one entry of the output matrix. The kernel values are initialized randomly and are learnable parameters, updated during training.
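The sliding-window procedure described above can be sketched in plain Python. This is a minimal illustration that assumes no padding and a stride of 1. The function name `conv2d_valid` and the image and kernel values are made up for the example and are not taken from the figure.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the output matrix."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out = []
    for i in range(len(image) - k_rows + 1):          # shift down
        row = []
        for j in range(len(image[0]) - k_cols + 1):   # move horizontally
            total = 0
            for a in range(k_rows):
                for b in range(k_cols):
                    # multiply each kernel value with the pixel underneath it
                    total += image[i + a][j + b] * kernel[a][b]
            row.append(total)
        out.append(row)
    return out


# Illustrative values only (assumed, not read from the article's figure).
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

result = conv2d_valid(image, kernel)
for row in result:
    print(row)
# [4, 3, 4]
# [2, 4, 3]
# [2, 3, 4]
```

Note that the kernel is not flipped before sliding, which is the usual convention for the operation called convolution in CNNs.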

Illustration of the convolution operation

There are also some standard, hand-designed filters. The Sobel filter has the columns (1, 2, 1), (0, 0, 0) and (-1, -2, -1). Its advantage is that it puts a little more weight on the central row, the central pixel, which makes it a bit more robust. Another filter used by computer vision researchers replaces 1, 2, 1 with 3, 10, 3 and -1, -2, -1 with -3, -10, -3; it is called the Scharr filter and has slightly different properties. Both can be used for vertical edge detection. Rotated by 90 degrees, the same filters act as horizontal edge detectors.

To understand the concept of edge detection, consider the example of a simplified image.
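As a rough sketch of such an example, the code below builds a 6*6 image whose left half is bright (10) and right half is dark (0), then applies a Sobel and a Scharr kernel to it. The pixel values, the helper names `conv2d_valid` and `rotate_90`, and the layout of the kernels as vertical edge detectors are assumptions made for illustration.

```python
def conv2d_valid(image, kernel):
    """Convolve without padding, stride 1 (kernel is square)."""
    k = len(kernel)
    out_rows = len(image) - k + 1
    out_cols = len(image[0]) - k + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b] for a in range(k) for b in range(k))
            for j in range(out_cols)
        ]
        for i in range(out_rows)
    ]


def rotate_90(kernel):
    """Rotate a square kernel 90 degrees clockwise."""
    return [list(col) for col in zip(*kernel[::-1])]


# Simplified image: bright left half, dark right half, so one vertical edge.
image = [[10, 10, 10, 0, 0, 0] for _ in range(6)]

sobel_vertical = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
scharr_vertical = [[3, 0, -3], [10, 0, -10], [3, 0, -3]]

sobel_out = conv2d_valid(image, sobel_vertical)
scharr_out = conv2d_valid(image, scharr_vertical)
horizontal_out = conv2d_valid(image, rotate_90(sobel_vertical))

print(sobel_out[0])       # [0, 40, 40, 0]
print(scharr_out[0])      # [0, 160, 160, 0]
print(horizontal_out[0])  # [0, 0, 0, 0]
```

The large values in the middle columns mark the vertical edge. The rotated kernel responds with zeros because this image contains no horizontal edge.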

So if a 6*6 matrix is convolved with a 3*3 matrix, the output is a 4*4 matrix. To generalize: if an 𝑚 ∗ 𝑚 image is convolved with an 𝑛 ∗ 𝑛 kernel, the output image has size (𝑚 − 𝑛 + 1) ∗ (𝑚 − 𝑛 + 1).
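The size formula is easy to check with a few lines of Python. The function name `conv_output_size` and the 32*32 starting size are hypothetical, chosen only to show how the side length shrinks when several convolutions are applied in sequence.

```python
def conv_output_size(m, n):
    """Side length of the output when an m*m image is convolved with an n*n kernel."""
    return m - n + 1


print(conv_output_size(6, 3))  # 4

# Applying three 3*3 convolutions in a row to a hypothetical 32*32 image.
size = 32
sizes = [size]
for _ in range(3):
    size = conv_output_size(size, 3)
    sizes.append(size)
print(sizes)  # [32, 30, 28, 26]
```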

Padding

Two problems arise with convolution:

Every convolution operation shrinks the image, as in the example above, where six by six went down to four by four. An image classification network stacks multiple convolution layers, so after several convolutions the image becomes really small, and we don’t want the image to shrink every time.
When the kernel moves over the image, it touches the pixels at the edges fewer times than the pixels in the middle, where successive kernel positions overlap. So the features at the corners and edges of an image aren’t used much in the output.
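The second problem can be made concrete by counting how many kernel positions touch each pixel. The sketch below does this for a 6*6 image and a 3*3 kernel, assuming no padding and a stride of 1; `coverage_counts` is a hypothetical helper written for this illustration.

```python
def coverage_counts(m, n):
    """Count how many n*n kernel positions touch each pixel of an m*m image
    (no padding, stride 1)."""
    counts = [[0] * m for _ in range(m)]
    for i in range(m - n + 1):
        for j in range(m - n + 1):
            for a in range(n):
                for b in range(n):
                    counts[i + a][j + b] += 1
    return counts


counts = coverage_counts(6, 3)
for row in counts:
    print(row)
# [1, 2, 3, 3, 2, 1]
# [2, 4, 6, 6, 4, 2]
# [3, 6, 9, 9, 6, 3]
# [3, 6, 9, 9, 6, 3]
# [2, 4, 6, 6, 4, 2]
# [1, 2, 3, 3, 2, 1]
```

A corner pixel contributes to a single output value, while a pixel in the middle contributes to nine.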

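To solve these two issues, a concept called padding is introduced: extra pixels, usually zeros, are added around the border of the image before the convolution. With a suitable amount of padding, the output keeps the size of the original image.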

So if an 𝑛 ∗ 𝑛 matrix is convolved with an 𝑓 ∗ 𝑓 matrix with padding 𝑝, the size of the output image will be (𝑛 + 2𝑝 − 𝑓 + 1) ∗ (𝑛 + 2𝑝 − 𝑓 + 1), where 𝑝 = 1 in this case. For a 6*6 image and a 3*3 kernel, 𝑝 = 1 gives a 6*6 output, the same size as the input.
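A minimal sketch of zero padding and the resulting output size follows. It assumes the padding value is zero and that the same amount is added on every side; `zero_pad` and `padded_output_size` are hypothetical helper names.

```python
def zero_pad(image, p):
    """Return a copy of `image` with p rows/columns of zeros added on every side."""
    width = len(image[0]) + 2 * p
    padded = [[0] * width for _ in range(p)]              # top border
    for row in image:
        padded.append([0] * p + list(row) + [0] * p)      # left and right borders
    padded.extend([0] * width for _ in range(p))          # bottom border
    return padded


def padded_output_size(n, f, p):
    """Output side length for an n*n image, f*f kernel and padding p (stride 1)."""
    return n + 2 * p - f + 1


image = [[1] * 6 for _ in range(6)]
padded = zero_pad(image, 1)
print(len(padded), len(padded[0]))   # 8 8
print(padded[1])                     # [0, 1, 1, 1, 1, 1, 1, 0]
print(padded_output_size(6, 3, 1))   # 6
```

For an odd kernel size f, choosing p = (f - 1) / 2 keeps the output the same size as the input.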

Stride

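Stride is the number of pixels by which the kernel shifts over the input matrix at each step. For padding 𝑝, filter size 𝑓 ∗ 𝑓, input image size 𝑛 ∗ 𝑛 and stride 𝑠, the output image dimension will be [ ⌊(𝑛 + 2𝑝 − 𝑓) / 𝑠⌋ + 1 ] ∗ [ ⌊(𝑛 + 2𝑝 − 𝑓) / 𝑠⌋ + 1 ], where ⌊ ⌋ means rounding down. With 𝑠 = 1 this reduces to (𝑛 + 2𝑝 − 𝑓 + 1) ∗ (𝑛 + 2𝑝 − 𝑓 + 1).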
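The sketch below computes the output size under a stride, assuming the common convention floor((n + 2p - f) / s) + 1, and cross-checks it by counting the kernel positions that fit along one axis. Both function names are hypothetical.

```python
def strided_output_size(n, f, p, s):
    """Output side length: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1


def count_kernel_positions(n, f, p, s):
    """Count the valid kernel start positions along one axis directly."""
    padded = n + 2 * p
    return len(range(0, padded - f + 1, s))


print(strided_output_size(7, 3, 0, 2))  # 3
print(strided_output_size(6, 3, 1, 2))  # 3
print(strided_output_size(6, 3, 1, 1))  # 6
```

In the second call, (6 + 2 - 3) / 2 is not a whole number; rounding down reflects that a kernel position which would hang over the border is skipped.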

Pooling

A pooling layer is another building block of a CNN. Its function is to progressively reduce the spatial size of the representation, which reduces the network complexity and computational cost.

Two types of pooling are widely used in CNN layers:

Max Pooling
Average Pooling

Max Pooling

Max pooling is simply a rule that takes the maximum of each region, which helps carry the most important features of the image forward. Max pooling selects the brighter pixels from the image. It is useful when the background of the image is dark and we are interested only in the lighter pixels.
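A small sketch of max pooling, assuming a 2*2 window moved with a stride of 2 (a common choice, not one stated above). The function name `max_pool` and the input values are illustrative.

```python
def max_pool(image, size=2, stride=2):
    """Take the maximum of each size*size window, moving by `stride`."""
    out = []
    for i in range(0, len(image) - size + 1, stride):
        row = []
        for j in range(0, len(image[0]) - size + 1, stride):
            window = [image[i + a][j + b] for a in range(size) for b in range(size)]
            row.append(max(window))
        out.append(row)
    return out


# Illustrative 4*4 input.
image = [
    [1, 3, 2, 1],
    [2, 9, 1, 1],
    [1, 3, 2, 3],
    [5, 6, 1, 2],
]

pooled = max_pool(image)
print(pooled)  # [[9, 2], [6, 3]]
```

Each 2*2 block of the input is reduced to its single largest value, so the 4*4 input becomes a 2*2 output.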

Average Pooling

Average Pooling differs from Max Pooling in that it retains much of the information about the “less important” elements of a block, or pool. Whereas Max Pooling simply throws them away by picking the maximum value, Average Pooling blends them in. This can be useful in situations where the overall content of a region matters more than its single strongest value.
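A matching sketch for average pooling, again assuming a 2*2 window and a stride of 2. The function name `avg_pool` and the input values are illustrative.

```python
def avg_pool(image, size=2, stride=2):
    """Take the mean of each size*size window, moving by `stride`."""
    out = []
    for i in range(0, len(image) - size + 1, stride):
        row = []
        for j in range(0, len(image[0]) - size + 1, stride):
            window = [image[i + a][j + b] for a in range(size) for b in range(size)]
            row.append(sum(window) / len(window))
        out.append(row)
    return out


# Illustrative 4*4 input.
image = [
    [1, 3, 2, 1],
    [2, 9, 1, 1],
    [1, 3, 2, 3],
    [5, 6, 1, 2],
]

averaged = avg_pool(image)
print(averaged)  # [[3.75, 1.25], [3.75, 2.0]]
```

In the top-left block the value 9 dominates under max pooling, whereas here the smaller values 1, 3 and 2 pull the result down to 3.75.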

Written by Abhishek Kumar Pandey