Demystifying Transposed Convolutional Layers
The transposed convolutional layer is widely used in autoencoders and generative adversarial networks (GANs) as one way to upsample data.
This is a simple concept, but during my learning process I was confused by a lot of inconsistent material. Therefore, I decided to create this tutorial, using both animations and PyTorch code to clearly explain the parameters and math of transposed convolutional layers. All source files are in my GitHub repo.
Acknowledgment: the style of the animations is inspired by conv_arithmetic (although it also confused me quite a bit when I was learning).
1. Default case
Let’s start with the simplest case.
Figure 1 shows the calculation process of a transposed convolutional layer with kernel_size set to 3 and all other parameters left at their defaults. The dimensions of the input (2x2) and output (4x4) can be easily recognized.
The following is the step-by-step calculation process. As the animation shows, there are 4 steps to generate the final output.
Let’s verify the same calculation using PyTorch:
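A minimal sketch with `torch.nn.functional.conv_transpose2d`; note that the concrete input and kernel values below are example numbers I picked for illustration, not necessarily the ones shown in the figures:

```python
import torch
import torch.nn.functional as F

# Example input and kernel (values assumed for illustration).
x = torch.tensor([[1., 2.],
                  [3., 4.]]).reshape(1, 1, 2, 2)      # 2x2 input
k = torch.tensor([[1., 0., 1.],
                  [0., 1., 0.],
                  [1., 0., 1.]]).reshape(1, 1, 3, 3)  # 3x3 kernel

# All parameters at their defaults: stride=1, padding=0,
# output_padding=0, dilation=1.
y = F.conv_transpose2d(x, k)
print(y.shape)  # torch.Size([1, 1, 4, 4])
print(y.squeeze())
```

Each input cell scales the whole kernel, and the four shifted copies are summed, which is exactly the 4-step process in the animation.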
2. Stride
Next, we will change the parameter stride, leaving everything else the same as in the first case.
The PyTorch documentation says:
stride
controls the stride for the cross-correlation.
The documentation is only for reference; personally, I did not understand it at first. But the following visualization should make it clear. The default value of stride is 1; here we set stride to 2.
As you can see, after each multiplication step, the kernel matrix moves 2 steps horizontally until it hits the end, then moves 2 steps vertically and starts again from the beginning.
Let's see the calculation process:
Let's verify with PyTorch:
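A sketch of the stride-2 case (again, the input and kernel values are my own example numbers, not necessarily those in the figures):

```python
import torch
import torch.nn.functional as F

# Example input and kernel (values assumed for illustration).
x = torch.tensor([[1., 2.],
                  [3., 4.]]).reshape(1, 1, 2, 2)      # 2x2 input
k = torch.tensor([[1., 0., 1.],
                  [0., 1., 0.],
                  [1., 0., 1.]]).reshape(1, 1, 3, 3)  # 3x3 kernel

# stride=2: each scaled kernel copy is placed 2 cells apart,
# so the output grows to (2 - 1) * 2 + 3 = 5.
y = F.conv_transpose2d(x, k, stride=2)
print(y.shape)  # torch.Size([1, 1, 5, 5])
print(y.squeeze())
```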
3. Padding
We will keep building on the stride case; this time we change the parameter padding to 1. In the previous cases, padding had its default value of 0.
The final output in this case is the center 3x3 matrix. You can interpret it as: after the calculation, drop the border cells of the matrix. You should be able to imagine that if we set padding to 2, the result would be the center cell (1x1).
Figure 6 shows the calculation process; as you can see, it is almost identical to figure 4. The only difference is that we 'removed' the outer cells.
Let’s see if PyTorch agrees with us:
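A sketch of the padding case, built on the same assumed example numbers as before:

```python
import torch
import torch.nn.functional as F

# Example input and kernel (values assumed for illustration).
x = torch.tensor([[1., 2.],
                  [3., 4.]]).reshape(1, 1, 2, 2)      # 2x2 input
k = torch.tensor([[1., 0., 1.],
                  [0., 1., 0.],
                  [1., 0., 1.]]).reshape(1, 1, 3, 3)  # 3x3 kernel

# padding=1 drops one border cell on every side of the stride-2
# result, shrinking it from 5x5 to 3x3.
y = F.conv_transpose2d(x, k, stride=2, padding=1)
print(y.shape)  # torch.Size([1, 1, 3, 3])
print(y.squeeze())
```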
4. Output Padding
Yes, there is another kind of padding. The difference between the two is simple:
output_padding adds cells to one side of the output, while padding removes cells from both sides of the output.
In this case, we set the parameter output_padding to 1 (the default is 0) and stride to 2. As shown in figure 7, cells with value 0 have been added to one side of the output matrix.
If you have any difficulty understanding this, compare figure 7 with figure 3.
Below are the calculation steps:
Let’s confirm with PyTorch again.
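A sketch of this case (same assumed example numbers as before; note that output_padding must be smaller than the stride or the dilation, which stride=2 satisfies here):

```python
import torch
import torch.nn.functional as F

# Example input and kernel (values assumed for illustration).
x = torch.tensor([[1., 2.],
                  [3., 4.]]).reshape(1, 1, 2, 2)      # 2x2 input
k = torch.tensor([[1., 0., 1.],
                  [0., 1., 0.],
                  [1., 0., 1.]]).reshape(1, 1, 3, 3)  # 3x3 kernel

# output_padding=1 appends one row and one column of zeros on the
# bottom/right side, growing the stride-2 result from 5x5 to 6x6.
y = F.conv_transpose2d(x, k, stride=2, output_padding=1)
print(y.shape)  # torch.Size([1, 1, 6, 6])
print(y.squeeze())
```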
5. Dilation
Dilation influences the structure of the kernel matrix.
The PyTorch documentation says:
dilation
controls the spacing between the kernel points;
I had no idea what this meant when I first saw it, because it is very abstract. However, after looking at figure 9, you should be able to understand it. To make things easier, let's use a 2x2 kernel in this example. (In the previous examples, we used a 3x3 kernel.)
Above is what the kernel matrix looks like with different dilation values. Basically, if the dilation value is n, then n-1 cells filled with 0 are inserted between adjacent kernel points. At this point, it should not be hard to imagine the same transformation for bigger kernel matrices. The rest of the calculation remains the same as before, as shown in figure 10.
To clarify, in figure 10, I made the 0-valued kernel cells transparent.
Below are the calculation steps:
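The transformation can also be written out directly: a dilated kernel is a zero matrix of size D(K-1)+1 with the original kernel values placed every D cells. A small sketch (the `dilate_kernel` helper and the kernel values are my own, for illustration):

```python
import torch

def dilate_kernel(k, D):
    """Expand a K x K kernel with dilation D: place the original
    values every D cells in a (K-1)*D + 1 sized zero matrix."""
    K = k.shape[0]
    out = torch.zeros((K - 1) * D + 1, (K - 1) * D + 1)
    out[::D, ::D] = k
    return out

k = torch.tensor([[1., 2.],
                  [3., 4.]])  # example 2x2 kernel
print(dilate_kernel(k, 2))
# tensor([[1., 0., 2.],
#         [0., 0., 0.],
#         [3., 0., 4.]])
```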
Below is the PyTorch implementation.
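A sketch of the dilation case, assuming stride is left at its default of 1 and using example values I picked (the figures may use different numbers):

```python
import torch
import torch.nn.functional as F

# Example input and 2x2 kernel (values assumed for illustration).
x = torch.tensor([[1., 2.],
                  [3., 4.]]).reshape(1, 1, 2, 2)  # 2x2 input
k = torch.tensor([[1., 2.],
                  [3., 4.]]).reshape(1, 1, 2, 2)  # 2x2 kernel

# dilation=2 turns the effective kernel size into 2*(2-1)+1 = 3,
# so the output size is (2-1)*1 + 3 = 4.
y = F.conv_transpose2d(x, k, dilation=2)
print(y.shape)  # torch.Size([1, 1, 4, 4])
print(y.squeeze())
```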
6. Math behind the output shape
Finally, let's close this tutorial by deriving the formula for the output size. You only need to read this section if you want to develop a deeper understanding of transposed convolutional layers; otherwise, feel free to skip it.
The formula for the output size is:

n = (m - 1)S - 2P + D(K - 1) + P_out + 1    (1)

where n is the output size (an n x n matrix) and m is the input size (an m x m matrix). Besides these, there are 5 parameters in the formula: K is the kernel size, S is the stride, P is the padding, D is the dilation, and P_out is the output_padding.
It looks complicated, but it is in fact very simple. Let's go through it step by step.
(1) Only consider S (stride) and K (kernel size)
Because the input size is m, we have m x m steps of calculation in total. But we only need to consider the first m steps, since those already fix the width of the output matrix.
We can imagine the output progressively growing as the calculation proceeds, as shown in figures 2, 4, 6, 8 and 11.
- In the 1st step, the output size is K.
- In the 2nd step, the intermediate matrix shifts by S, so the output size is K + S.
- In the 3rd step, the intermediate matrix shifts by S, so the output size is K + 2S.
- ...
- In the m-th step, the intermediate matrix shifts by S, so the output size is K + (m-1)S.
Therefore, if we only consider S and K, the formula is:

n = K + (m - 1)S    (2)
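This intermediate result can be spot-checked with PyTorch; with P = 0, D = 1 and P_out = 0 (all defaults), only S and K matter:

```python
import torch
import torch.nn.functional as F

# With padding=0, dilation=1, output_padding=0, the output size
# should be K + (m - 1) * S.
m, K, S = 2, 3, 2
x = torch.randn(1, 1, m, m)
w = torch.randn(1, 1, K, K)
y = F.conv_transpose2d(x, w, stride=S)
print(y.shape[-1])  # 5, i.e. K + (m - 1) * S
```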
(2) Consider D (dilation)
As we have discussed, dilation changes the kernel size. Let's use K' to denote the transformed kernel size. As shown in figure 9, the dilation transformation inserts (K - 1)(D - 1) zero-valued cells into the kernel, so the relationship between K', K, and D is:

K' = K + (K - 1)(D - 1)

Thus we have:

K' = D(K - 1) + 1    (3)

Substituting K in (2) with K', we have:

n = (m - 1)S + D(K - 1) + 1    (4)
Now we are almost done; the remaining parameters are easy to understand.
(3) Consider P (padding) and P_out (output_padding)
Since padding removes cells from both sides, its influence on the output size is -2P. Similarly, output_padding adds cells on one side, so its influence on the output size is +P_out. Adding these pieces into (4), we have derived (1).
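The full formula can be checked against PyTorch over a handful of parameter combinations (the helper function name is my own):

```python
import torch
import torch.nn.functional as F

def transposed_out_size(m, K, S=1, P=0, D=1, P_out=0):
    """Formula (1): n = (m - 1)*S - 2*P + D*(K - 1) + P_out + 1."""
    return (m - 1) * S - 2 * P + D * (K - 1) + P_out + 1

# Compare the formula with PyTorch's actual output shapes.
for (m, K, S, P, D, P_out) in [(2, 3, 1, 0, 1, 0),   # section 1
                               (2, 3, 2, 0, 1, 0),   # section 2
                               (2, 3, 2, 1, 1, 0),   # section 3
                               (2, 3, 2, 0, 1, 1),   # section 4
                               (2, 2, 1, 0, 2, 0),   # section 5
                               (5, 4, 3, 2, 2, 1)]:  # arbitrary combo
    x = torch.randn(1, 1, m, m)
    w = torch.randn(1, 1, K, K)
    y = F.conv_transpose2d(x, w, stride=S, padding=P,
                           output_padding=P_out, dilation=D)
    assert y.shape[-1] == transposed_out_size(m, K, S, P, D, P_out)
```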
That concludes this tutorial. Thanks for reading! I hope this tutorial has helped you develop a deeper understanding of transposed convolutional layers.
Please feel free to leave comments. All suggestions and questions are welcome!