Convolutional Neural Networks — Part 2: Padding and Strided Convolutions

Brighton Nkomo
The Startup
Published in
7 min read · Oct 2, 2020
Credit: Nagesh Singh Chauhan, KDnuggets

This is the second part of my blog post series on convolutional neural networks.

A prerequisite here is knowing matrix convolution, which I briefly explained in the first section of part one of this series. Don’t be intimidated by the word “convolution” or by the number of matrices you’ll see. I use the asterisk (*) to denote the convolution operator, not matrix multiplication. By the way, I think matrix convolution is simpler than matrix multiplication.

1. Padding

An athlete tries on a pair of shoulder pads.

In plain English, padding is “a piece of material used to protect something or give it shape.” The two subsections below discuss why it’s necessary to “cover” an input matrix with a border of zeros, and the formula for determining the padding amount.

1.1 Motivation

FIGURE 1

In part one we saw that if you take a 6 by 6 image and convolve it with a 3 by 3 filter, you end up with a 4 by 4 output, because there are only 4 by 4 possible positions where the 3 by 3 filter can fit inside the 6 by 6 matrix. More generally, if you convolve an n by n image with an f by f filter, the output dimension is (n - f + 1) by (n - f + 1). So for the 6 by 6 image and the 3 by 3 filter, the output size is (6 - 3 + 1) by (6 - 3 + 1), which simplifies to a 4 by 4 output. You can see all this summarized in figure 1.
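The output-size rule can be checked in a couple of lines of Python (the function name here is mine, just for illustration):

```python
def conv_output_size(n, f):
    """Output dimension of a valid convolution: an n by n image
    convolved with an f by f filter gives (n - f + 1) by (n - f + 1)."""
    return n - f + 1

print(conv_output_size(6, 3))  # 4, matching the 6 by 6 image and 3 by 3 filter
```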

There are two problems when you convolve a filter with an image:

1. The shrinking output: Every time you apply a convolution operator, your image shrinks. Going from 6 by 6 down to 4 by 4, you can only do this a few times before your image gets really small, maybe down to 1 by 1 or something. You probably don’t want your image to shrink every time you detect edges or other features on it.

FIGURE 2: How information is lost at the edges. Note that the red-shaded center pixel overlaps with many (red) filter positions, while the green-shaded pixel overlaps with just one (green) filter position.

2. Throwing away information from the edges of the image: Looking at figure 2, the green pixel in the upper left can only overlap with one filter position when you convolve, whereas a pixel in the middle, say the red pixel, is overlapped by many 3 by 3 regions. So it’s as if pixels on the corners or edges are used much less to compute the output pixels. Hence, you’re throwing away a lot of the information near the edges of the image.
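You can make the edge-information problem concrete by counting how many 3 by 3 filter positions cover each pixel of a 6 by 6 image (a small sketch of my own, not from the original figure):

```python
import numpy as np

# Count how many 3 by 3 filter positions cover each pixel of a 6 by 6 image.
n, f = 6, 3
coverage = np.zeros((n, n), dtype=int)
for i in range(n - f + 1):          # every valid top-left corner of the filter
    for j in range(n - f + 1):
        coverage[i:i + f, j:j + f] += 1

print(coverage[0, 0])  # corner pixel: covered by only 1 filter position
print(coverage[3, 3])  # central pixel: covered by 9 filter positions
```

The corner pixel contributes to a single output value while a central pixel contributes to nine, which is exactly the asymmetry figure 2 illustrates.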

Both of these problems, the shrinking output and the loss of information at the edges of the image, can be solved by padding.

FIGURE 3: Padding the 6 by 6 image.

Before applying the convolution operation, you can pad the image with an additional border of one pixel all around the edges, as shown in figure 3 (sometimes you might need more than one border). If you do that, then instead of a 6 by 6 image you now have an 8 by 8 image, and if you convolve an 8 by 8 image with a 3 by 3 filter you get a 6 by 6 output. So you’ve managed to preserve the original input size of 6 by 6. (For those familiar with the concept of deep learning layers: this is especially useful because when you build really deep neural networks, you don’t want the image to shrink at every step. With, say, a 100-layer deep net, if the image shrinks a bit on every layer, after a hundred layers you end up with a very small image.)

GIF 1: Just an illustration that the 6 by 6 input has a padding of zeros around it.

By convention, you pad with zeros, and p is the padding amount (in this case p = 1, because we’re padding all around with an extra border of one pixel). The output then becomes (n + 2p - f + 1) by (n + 2p - f + 1). So in this case, with p = 1, the output size is (6 + 2 - 3 + 1) by (6 + 2 - 3 + 1), which simplifies to a 6 by 6 image. You end up with a 6 by 6 output that preserves the size of the original image. See GIF 1.
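Here is a small sketch of the padding step with NumPy’s `np.pad`, checking the shapes against the formula:

```python
import numpy as np

n, f, p = 6, 3, 1
image = np.arange(n * n, dtype=float).reshape(n, n)  # any 6 by 6 image

# Add a border of zeros of width p all around the image.
padded = np.pad(image, pad_width=p, mode="constant", constant_values=0)
print(padded.shape)           # (8, 8): one border of zeros on every side

out_size = n + 2 * p - f + 1  # the (n + 2p - f + 1) formula
print(out_size)               # 6: the original size is preserved
```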

You can also pad the border with two pixels, in which case you add another border, and you can pad with even more pixels if you choose.

1.2 Valid and same convolution

In terms of how much to pad, there are two common choices: valid convolution and same convolution.

  • Valid convolution basically means no padding (p = 0). In that case, an n by n image convolved with an f by f filter gives an (n - f + 1) by (n - f + 1) dimensional output.
  • Same convolution means you pad so that the output size is the same as the input size. Basically, you pad a 6 by 6 image in such a way that the output is also a 6 by 6 image. In general, n + 2p - f + 1 = n, since the input image size equals the output image size. Solving for p gives the right number of pads to use: p = (f - 1)/2.
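Solving n + 2p - f + 1 = n for p can be written as a one-line helper (the function name is mine). Note that (f - 1)/2 is a whole number only when f is odd, which is one reason filter sizes like 3 and 5 are so common:

```python
def same_padding(f):
    """Padding that keeps the output size equal to the input size.
    Solving n + 2p - f + 1 = n gives p = (f - 1) / 2, which is an
    integer only when the filter size f is odd."""
    assert f % 2 == 1, "same convolution needs an odd filter size"
    return (f - 1) // 2

print(same_padding(3))  # 1, the p = 1 used in the 6 by 6 example above
print(same_padding(5))  # 2
```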

2. Striding

A man running on a road.

In plain English, a stride is a step that you take when walking or running.

FIGURE 4: You get 91 as the first entry of your output when you convolve that 7 by 7 matrix with the 3 by 3 filter.

If you want to convolve, let’s say, a 7 by 7 image with this 3 by 3 filter with a stride of two, you take the element-wise product as usual over the upper-left 3 by 3 region, then sum it up (that gives you 91 as the first element of the output). See figure 4.

GIF 2: An illustration of what happens when the stride = 2.

But then, instead of stepping the blue box over by one step, we step over by two steps, making it hop as in GIF 2. Notice how the blue box jumps over by two steps. You then do the usual element-wise product and sum (which gives 100), and then do that again, making the blue box jump over by two more steps (which gives 83).

GIF 3: If the stride = 2. Then take two steps down.

Now, when you go to the next row, you again take two steps instead of one (see GIF 3). Notice how the blue box moved two steps down. Repeating the same process gives 69, 91, and 127, and then for the final row 44, 72, and 74.
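The whole strided walkthrough can be sketched as a plain-Python loop. Since the actual 7 by 7 matrix from figure 4 isn’t reproduced in the text, the example below uses a stand-in image, but the output shape (3 by 3) matches the nine values computed above. As is usual in deep learning, this is technically cross-correlation (no filter flipping), even though we call it convolution:

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Valid 2-D convolution (cross-correlation, as is conventional in
    deep learning): slide the kernel over the image, taking the
    element-wise product and summing, stepping `stride` pixels at a time."""
    n, f = image.shape[0], kernel.shape[0]
    out = (n - f) // stride + 1          # floor((n - f) / s) + 1
    result = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            patch = image[i * stride:i * stride + f,
                          j * stride:j * stride + f]
            result[i, j] = np.sum(patch * kernel)
    return result

image = np.arange(49, dtype=float).reshape(7, 7)  # a stand-in 7 by 7 image
kernel = np.ones((3, 3))
print(conv2d(image, kernel, stride=2).shape)  # (3, 3), as in the walkthrough
```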

And if you use padding p and stride s (in this example, s = 2), you end up with an output that is ((n + 2p - f)/s + 1) by ((n + 2p - f)/s + 1). Because you’re now stepping s steps at a time instead of one step at a time, the (n + 2p - f) term gets divided by s before you add one.

Now, just one last detail: what if the (n + 2p - f)/s fraction is not an integer? In that case, we round the fraction down (i.e., take the floor of the fraction).

GIF 4: Do not do the computation if the filter (the blue box) is NOT fully contained within the image.

What does taking the floor of the (n + 2p - f)/s fraction mean? The way this is implemented is that you do the blue-box multiplication only if the blue box is fully contained within the image (or the image plus the padding); if any part of the blue box hangs outside, you just don’t do that computation. See GIF 4.
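The floor rule from the last two paragraphs can be wrapped up in one helper (again, the name is mine):

```python
import math

def strided_output_size(n, f, p=0, s=1):
    """floor((n + 2p - f) / s) + 1 -- the number of positions along one
    axis where the f by f filter fits fully inside the padded image."""
    return math.floor((n + 2 * p - f) / s) + 1

# 7 by 7 image, 3 by 3 filter, stride 2: (7 - 3)/2 + 1 = 3
print(strided_output_size(7, 3, p=0, s=2))  # 3

# When the fraction is not an integer, the last partial position is dropped:
print(strided_output_size(8, 3, p=0, s=2))  # floor(5/2) + 1 = 3
```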

Thank you for your attention and if you have any feedback/questions please send a comment. Clap and share if you liked the manner in which I presented the padding and strided convolution ideas here. :)
