[DL] 13. Convolution and Pooling Variants (Dilated Convolution, SPP, ASPP)
1. Dilated Convolution
Dilated convolution, also often called atrous convolution, was introduced at ICLR 2016. Its main idea is that the convolution does not look at directly adjacent pixels, but at pixels that are a certain distance apart.
This distance is called ‘Dilation rate r,’ and the dilated convolution can be mathematically represented as follows.
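One common way to write it, assuming a 1D input x, a filter w with K taps, and an output y (these symbols are assumptions chosen for illustration and may differ from the original notation), is:

```latex
y[i] = \sum_{k=0}^{K-1} x[i + r \cdot k] \, w[k] \tag{1}
```

Here r only rescales the offset k at which the input is sampled; the filter weights w[k] themselves are unchanged.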
Compared with the equation of standard convolution, only the term for the dilation rate r is added, and as one might notice, if r is one, then Eq. 1 is just the standard convolution.
Then what is the role of the dilation rate? This question is best answered with the visualization below.
As mentioned before, dilated convolution becomes the standard convolution if r = 1.
However, things change when r is larger than one. The following illustrates the case r = 2: the convolution computes the weighted sum over input pixels that are spaced a distance of 2 apart from each other.
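The difference between r = 1 and r = 2 can also be checked numerically. The following is a minimal NumPy sketch, assuming a 1D input, no padding, and the cross-correlation form used by most deep learning libraries; the function name dilated_conv1d is hypothetical and chosen only for this illustration.

```python
import numpy as np

def dilated_conv1d(x, w, r=1):
    """'Valid' 1D dilated convolution (cross-correlation form, no padding)."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    k = len(w)
    span = r * (k - 1) + 1            # number of input positions one output covers
    n_out = len(x) - span + 1
    if n_out <= 0:
        raise ValueError("input too short for this kernel size and dilation rate")
    # y[i] = sum_k x[i + r*k] * w[k]; the strided slice picks every r-th sample
    return np.array([np.dot(x[i : i + span : r], w) for i in range(n_out)])

x = np.arange(10.0)                   # 0, 1, ..., 9
w = np.array([1.0, 1.0, 1.0])

y1 = dilated_conv1d(x, w, r=1)        # sums x[i], x[i+1], x[i+2]
y2 = dilated_conv1d(x, w, r=2)        # sums x[i], x[i+2], x[i+4]
```

With r = 1 each output sums three adjacent samples, while with r = 2 the same three weights are applied to every second sample, so each output covers five input positions.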
What is the advantage of using dilated convolution instead of the standard one?
As we have seen, dilated convolution increases the receptive field without increasing computation, because the number of kernel weights stays the same. In other words, it can achieve the enlarged receptive field that pooling provides, without additional computation and without losing resolution.
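The growth of the receptive field can be made concrete with a little arithmetic. The sketch below assumes stride-1 convolutions with a k x k kernel; both helper names are hypothetical.

```python
def effective_kernel_size(k, r):
    """Side length of the input window spanned by a k-tap kernel with dilation rate r."""
    return k + (k - 1) * (r - 1)

def stacked_receptive_field(k, rates):
    """Receptive field of stacked stride-1 convolutions with kernel size k,
    one layer per dilation rate in `rates`."""
    rf = 1
    for r in rates:
        rf += (k - 1) * r
    return rf

# A 3x3 kernel always has 9 weights, whatever the dilation rate.
print([effective_kernel_size(3, r) for r in (1, 2, 4)])   # [3, 5, 9]
print(stacked_receptive_field(3, [1, 1, 1]))              # 7
print(stacked_receptive_field(3, [1, 2, 4]))              # 15
```

Three standard 3x3 layers see a 7x7 window, whereas three 3x3 layers with dilation rates 1, 2, and 4 see a 15x15 window using the same number of weights.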
2. Spatial Pyramid Pooling (SPP)
Given feature maps from the previous convolution layers, SPP performs max-pooling multiple times with increasing kernel sizes. Afterward, the result of each pooling operation is flattened into a vector, and those vectors are concatenated to produce the final fixed-length representation.
With multiple pooling layers at different scales, SPP captures information at varying image scales, which makes it robust to changes in scale. As a result, SPP creates a multi-scale feature representation that can be used for classification in the later parts of the network.
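The pool, flatten, and concatenate steps can be sketched in NumPy as follows. This is an illustration under stated assumptions, not a reference implementation: it assumes a feature map of shape (C, H, W), and it uses pyramid levels of 1x1, 2x2, and 4x4 bins, so that the pooling window grows as the number of bins shrinks. The function name and the choice of levels are hypothetical.

```python
import numpy as np

def spatial_pyramid_pool(fmap, levels=(1, 2, 4)):
    """Max-pool a (C, H, W) feature map into n x n bins for each n in `levels`
    and concatenate the results into one vector of length C * sum(n * n)."""
    c, h, w = fmap.shape
    pooled = []
    for n in levels:
        for i in range(n):
            # bin edges: floor for the start, ceil for the end, so bins cover the map
            r0, r1 = (i * h) // n, ((i + 1) * h + n - 1) // n
            for j in range(n):
                c0, c1 = (j * w) // n, ((j + 1) * w + n - 1) // n
                pooled.append(fmap[:, r0:r1, c0:c1].max(axis=(1, 2)))
    return np.concatenate(pooled)

rng = np.random.default_rng(0)
small = rng.standard_normal((8, 13, 17))   # 8 channels, 13 x 17 map
large = rng.standard_normal((8, 32, 24))   # same channels, different spatial size

v_small = spatial_pyramid_pool(small)
v_large = spatial_pyramid_pool(large)
print(v_small.shape, v_large.shape)        # both have length 8 * (1 + 4 + 16)
```

Both inputs yield a vector of the same length even though their spatial sizes differ, which is what allows a fixed-size classifier to follow the SPP layer.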
3. Atrous Spatial Pyramid Pooling (ASPP)
Another approach to building a multi-scale representation by looking at the input image at varying scales is Atrous Spatial Pyramid Pooling (ASPP). In other words, ASPP is an extension of the SPP concept that uses dilated convolutions instead of max-pooling.
The difference is that ASPP does not decrease the resolution: since ASPP contains no max-pooling operation, its input and output have the same spatial size.
More specifically, while the output of SPP is in the form of a single vector, the output of ASPP remains in the form of 2D feature maps, as is shown below.
Further, because dilated convolutions replace the standard convolutions in ASPP, the receptive field is increased without extra computation.
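The size-preserving behavior of ASPP can be sketched with parallel dilated convolutions. The NumPy code below is a deliberately simplified illustration: it assumes a single-channel input, odd square kernels, zero padding, and dilation rates (6, 12, 18) picked as example values. Practical ASPP modules operate on multi-channel feature maps and fuse the branches afterwards. The function names are hypothetical.

```python
import numpy as np

def dilated_conv2d_same(x, w, r):
    """Single-channel 2D dilated convolution with zero padding chosen so that
    the output has the same shape as the input (assumes an odd, square kernel)."""
    k = w.shape[0]
    pad = r * (k - 1) // 2
    xp = np.pad(x, pad)                        # zero padding on all sides
    h, wd = x.shape
    out = np.zeros((h, wd), dtype=float)
    for a in range(k):
        for b in range(k):
            # shift the padded input by (a*r, b*r) and weight it with w[a, b]
            out += w[a, b] * xp[a * r : a * r + h, b * r : b * r + wd]
    return out

def aspp(x, kernels, rates):
    """Apply one dilated convolution per rate in parallel and stack the
    resulting maps along a new leading axis: (len(rates), H, W)."""
    return np.stack([dilated_conv2d_same(x, w, r) for w, r in zip(kernels, rates)])

rng = np.random.default_rng(0)
x = rng.standard_normal((33, 45))
rates = (6, 12, 18)                            # assumed example rates
kernels = [rng.standard_normal((3, 3)) for _ in rates]

out = aspp(x, kernels, rates)
print(x.shape, out.shape)                      # spatial size is unchanged
```

Each branch uses the same nine weights per kernel but covers a different window of the input, and every branch returns a map of the input's spatial size, in contrast to the single vector produced by SPP.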
Reference
[1] RWTH Aachen, computer vision group
Any corrections, suggestions, and comments are welcome.