[DL] 12. Upsampling: Unpooling and Transpose Convolution
When we study neural network architectures based on an encoder and a decoder, we commonly observe that the network performs downsampling in the encoder and upsampling in the decoder, as illustrated in Fig 1.
Common methods for downsampling are max pooling and strided convolution. But what approaches are available for upsampling?
This article addresses two upsampling methods, based on pooling and on convolution, respectively.
2. Unpooling
The first type of upsampling is unpooling, which builds on the idea of pooling. The max-pooling operation keeps only the largest response from each sub-region of the feature map. Fig 2 shows the max-pooling operation on a 4⨯4 input feature map.
Unpooling, as its name implies, attempts to do exactly the opposite: restoring the size of the original input feature map (in Fig 2, from 2⨯2 back to 4⨯4).
There are three ways to do this. For simplicity and consistency, I assume the original feature map is 4⨯4 and becomes 2⨯2 after max pooling.
2.1 Nearest-Neighbor
Nearest-Neighbor is the simplest upsampling approach. It copies each pixel value (response) of the input feature map to all pixels in the corresponding sub-region of the output.
Despite its simplicity, the problem with this approach is that the output becomes blocky, since all pixels in each sub-region share the same value.
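As a minimal sketch, nearest-neighbor unpooling is just a repeat along both axes (the 2⨯2 input values below are illustrative):

```python
import numpy as np

# Nearest-neighbor upsampling of a 2x2 map back to 4x4:
# each value is copied into its entire 2x2 sub-region.
def nearest_neighbor_upsample(x, factor=2):
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

pooled = np.array([[6., 8.],
                   [3., 4.]])
up = nearest_neighbor_upsample(pooled)
print(up)
# [[6. 6. 8. 8.]
#  [6. 6. 8. 8.]
#  [3. 3. 4. 4.]
#  [3. 3. 4. 4.]]
```

The printed output makes the blockiness obvious: every 2⨯2 sub-region is a constant patch.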
2.2 Bed of Nails
Another unpooling method, the Bed-of-Nails operation, places each input response (element) in the top-left corner of the corresponding sub-region of the output, and sets all other elements in that sub-region to zero.
By doing so, it achieves a more fine-grained output structure. However, the upsampled elements always land in a fixed location, the upper-left corner in the case of Fig 4.
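A sketch of Bed of Nails in NumPy: strided assignment puts each value in the top-left corner of its sub-region and leaves the rest zero (input values are illustrative):

```python
import numpy as np

def bed_of_nails_upsample(x, factor=2):
    h, w = x.shape
    out = np.zeros((h * factor, w * factor), dtype=x.dtype)
    out[::factor, ::factor] = x  # each value goes to the top-left of its sub-region
    return out

pooled = np.array([[6., 8.],
                   [3., 4.]])
print(bed_of_nails_upsample(pooled))
# [[6. 0. 8. 0.]
#  [0. 0. 0. 0.]
#  [3. 0. 4. 0.]
#  [0. 0. 0. 0.]]
```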
2.3 Max Unpooling
To address this shortcoming of Bed of Nails, Max Unpooling is introduced. Max Unpooling performs upsampling in a similar manner to Bed of Nails, but it remembers the indices of where the largest elements came from during max pooling. This information is used later, when Max Unpooling is performed, to place each element back in the position within its sub-region that it occupied before max pooling.
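The pairing of max pooling and max unpooling can be sketched as follows: the pooling step records the index of each maximum, and the unpooling step scatters the values back to those recorded positions (the 4⨯4 input is illustrative):

```python
import numpy as np

def max_pool_with_indices(x, k=2):
    """2x2 max pooling that also records where each maximum came from."""
    h, w = x.shape
    pooled = np.zeros((h // k, w // k), dtype=x.dtype)
    idx = np.zeros((h // k, w // k, 2), dtype=int)
    for i in range(h // k):
        for j in range(w // k):
            block = x[i*k:(i+1)*k, j*k:(j+1)*k]
            r, c = np.unravel_index(np.argmax(block), block.shape)
            pooled[i, j] = block[r, c]
            idx[i, j] = (i*k + r, j*k + c)   # remember the original position
    return pooled, idx

def max_unpool(pooled, idx, out_shape):
    """Scatter each pooled value back to its remembered position; zeros elsewhere."""
    out = np.zeros(out_shape, dtype=pooled.dtype)
    for i in range(pooled.shape[0]):
        for j in range(pooled.shape[1]):
            r, c = idx[i, j]
            out[r, c] = pooled[i, j]
    return out

x = np.array([[1., 2., 6., 3.],
              [3., 5., 2., 1.],
              [1., 2., 2., 1.],
              [7., 3., 4., 8.]])
pooled, idx = max_pool_with_indices(x)
restored = max_unpool(pooled, idx, x.shape)
print(restored)
```

Unlike Bed of Nails, the nonzero entries of `restored` sit exactly where the maxima were in the original map, not always in the top-left corner.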
3. Transpose Convolution
We have taken a look at upsampling approaches based on unpooling. As one might notice, the three methods above are fixed numerical functions, and no learning takes place.
Another type of upsampling, which does make use of learning, is called transpose convolution.
Let’s first recall conventional convolution with the following example. Given an image and a filter, it computes the dot product over the pixels overlapping the window. The window then moves 2 pixels (as the stride is 2) and the dot-product computation repeats. In the backpropagation step, the parameters in the filter get updated and learn from the errors they made.
As a result of this strided convolution operation, the resolution decreases from 4⨯4 to 2⨯2.
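A minimal sketch of this downsampling step, assuming a 2⨯2 filter (the article does not specify the filter size; the values below are illustrative):

```python
import numpy as np

def strided_conv2d(x, f, stride=2):
    """Valid convolution: slide the filter with the given stride, take dot products."""
    k = f.shape[0]
    out_size = (x.shape[0] - k) // stride + 1
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            window = x[i*stride:i*stride+k, j*stride:j*stride+k]
            out[i, j] = np.sum(window * f)  # dot product over the window
    return out

x = np.arange(16.).reshape(4, 4)            # 4x4 input
f = np.array([[1., 0.],
              [0., 1.]])                    # illustrative 2x2 filter
out = strided_conv2d(x, f, stride=2)
print(out.shape)                            # (2, 2): resolution halved
```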
To undo this reduction in resolution and upscale back to the original size, the transpose convolution is introduced. For better understanding, let’s take a look at a 1D example of transpose convolution.
Suppose we have a 2⨯1 input, a 3⨯1 filter, and a transpose convolution with a stride of 2. The output of the operation then has size 5⨯1, obtained by copying each input value weighted by the filter and summing the overlaps, as illustrated in Fig 7.
Clearly, the filter parameters x, y, and z are updated in the backpropagation step, which is why transpose convolution is a learnable upsampling approach.
Note that if the stride is one, the output size becomes 4⨯1.
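The copy-weight-and-sum rule can be sketched directly; the filter values below stand in for the learnable parameters x, y, z, and the input is illustrative:

```python
import numpy as np

def transpose_conv1d(inp, filt, stride=2):
    """Copy each input value weighted by the filter; sum where copies overlap."""
    n, k = len(inp), len(filt)
    out = np.zeros((n - 1) * stride + k)
    for i, a in enumerate(inp):
        out[i*stride : i*stride + k] += a * filt
    return out

inp = np.array([2., 3.])            # 2x1 input
filt = np.array([1., 2., 1.])       # stands in for the parameters x, y, z
print(transpose_conv1d(inp, filt, stride=2))   # [2. 4. 5. 6. 3.]  -> size 5x1
print(len(transpose_conv1d(inp, filt, stride=1)))  # 4  -> size 4x1 with stride one
```

The middle entry (5 = 2·z + 3·x with these values) is where the two weighted copies overlap and are summed, exactly as in Fig 7.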
3.1 Why is it named ‘transpose’ convolution?
To answer this question, we need to represent the normal convolution operation as a matrix multiplication.
Assume we convolve a 4⨯1 input with a 3⨯1 filter with the stride set to one, as shown below. We can represent this convolution operation as a multiplication of the same input by a matrix X.
After the convolution, the output size is the same as the input size, 4⨯1, since the stride is one.
What happens if we compute the same multiplication but with transposed X?
The following is the result. As we can observe, the output size becomes 5⨯1, while the input size remains 4⨯1.
The behavior of transpose convolution becomes clearer with an example using a stride of 2.
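A sketch of the stride-2 case, kept consistent with the earlier 1D example (2⨯1 input, 3⨯1 filter): a stride-2 convolution of a 5⨯1 signal can be written as a 2⨯5 matrix X, and multiplying by Xᵀ maps a 2⨯1 input back up to 5⨯1. The numeric filter values are illustrative stand-ins for x, y, z:

```python
import numpy as np

# Stride-2 convolution of a 5x1 signal with filter [x, y, z],
# written as a 2x5 matrix X. Each row is the filter shifted by the stride.
x_, y_, z_ = 1., 2., 1.
X = np.array([[x_, y_, z_, 0., 0.],
              [0., 0., x_, y_, z_]])

a = np.array([2., 3.])   # 2x1 input to the transpose convolution
up = X.T @ a             # 5x1 output: weighted copies of the input, overlaps summed
print(up)                # [2. 4. 5. 6. 3.]
```

Multiplying by the transposed matrix reproduces exactly the copy-weight-and-sum result of the 1D example, which is where the name ‘transpose’ convolution comes from.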
Any corrections, suggestions, and comments are welcome.