In the world of Deep Computer Vision, there are several types of convolutional layers that differ from the original convolutional layer discussed in the previous Deep CV tutorial. These layers appear in many popular advanced convolutional neural network architectures on the research side of Deep Learning for Computer Vision. Each of these layers works differently from the original convolutional layer, and this gives each type a specialized function.
Before getting into these advanced convolutional layers, let’s first have a quick recap on how the original convolutional layer works.
Original Convolutional Layer
In the original convolutional layer, we have an input of shape (W*H*C), where W and H are the width and height of each feature map and C is the number of channels, i.e. the total number of feature maps. The convolutional layer has a certain number of kernels which perform the convolution operation on this input. The number of kernels is equal to the number of desired channels in the output, since each kernel produces exactly one output feature map, and each feature map is a channel.
The height and width of the kernel are something we choose, and usually we keep them at 3*3. The depth of each kernel is equal to the number of channels of the input. Hence for the example below, the shape of each kernel will be (w*h*3), where w and h are the width and height of the kernels and the depth is 3 because the input, in this case, has 3 channels.
In this example, the input has 3 channels and the output has 16 channels. Hence there are a total of 16 kernels in this particular layer and each kernel has a shape of (w*h*3).
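The tutorial doesn't tie itself to a framework, but as a sketch, PyTorch's `nn.Conv2d` makes these shapes concrete. The channel counts (3 in, 16 out) match the example above; the 32*32 spatial size is an arbitrary assumption for illustration:

```python
import torch
import torch.nn as nn

# 3 input channels, 16 output channels, 3*3 kernels, as in the example above.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

x = torch.randn(1, 3, 32, 32)   # (batch, C, H, W); 32*32 is an arbitrary size
out = conv(x)
print(out.shape)                # torch.Size([1, 16, 32, 32])

# 16 kernels, each of shape (3, 3, 3): one per output channel,
# with depth equal to the 3 input channels.
print(conv.weight.shape)        # torch.Size([16, 3, 3, 3])
```

Note how the kernel tensor's first dimension is the number of output channels and its second is the number of input channels, exactly as described above.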
Advanced Convolutional Layers
The list of advanced convolutional layers that we will be covering in this tutorial are as follows:
- Depthwise Separable Convolutional Layer
- Deconvolutional Layers
- Dilated Convolution
- Grouped Convolution
Depthwise Separable Convolutional Layer
In the depthwise separable convolution layer, we are trying to drastically reduce the number of computations that are being performed in each convolutional layer. This entire layer is actually divided into two parts: i) depthwise convolution, ii) pointwise convolution.
The key difference in depthwise convolution is that each kernel is applied to a single channel of the input, not to all the input channels at once. Hence, each kernel has shape (w*h*1), since it is applied to a single channel. The number of kernels equals the number of input channels: if we have a W*H*3 input, we have 3 separate w*h*1 kernels, and each kernel is applied to one channel of the input. The output therefore has the same number of channels as the input, since each kernel outputs a single feature map. Let’s have a look at how the depthwise convolution part works:
So if we have an input with C channels, the output of the depthwise convolution part of this layer will also have C channels. The next part changes the number of channels, because we often want the output channel count of each layer to grow as we go deeper into the CNN.
Pointwise convolution will convert this intermediate C channel output of the depthwise convolution into a feature map with a different number of channels. To do this we have several 1*1 kernels that are convolved across all channels of this intermediate feature map block. Hence each 1*1 kernel will have C channels as well. Each of these kernels will output a separate feature map, hence we will have the number of kernels equal to the number of channels we want the output to have. Let’s have a look at how this works.
That sums up the entire process of depthwise separable convolutional layers. Basically, in the first step of depthwise convolution, we have 1 kernel for each input channel and convolve these with the input. The resultant output of this will be a feature map block with the same number of channels as the input. In the second step of the pointwise convolution, we have several 1*1 kernels and convolve these with the intermediate feature map block. We will choose the number of kernels according to the number of channels we want the output to have.
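The two steps above can be sketched in PyTorch as two `nn.Conv2d` layers; setting `groups` equal to the input channel count makes each kernel see exactly one channel, which is the depthwise step. The channel counts here (3 in, 16 out) are assumptions carried over from the earlier example:

```python
import torch
import torch.nn as nn

C_in, C_out = 3, 16  # assumed channel counts for illustration

# Step 1: depthwise convolution -- one w*h*1 kernel per input channel
# (groups=C_in restricts each kernel to a single channel).
depthwise = nn.Conv2d(C_in, C_in, kernel_size=3, padding=1, groups=C_in)

# Step 2: pointwise convolution -- C_out kernels, each of shape 1*1*C_in.
pointwise = nn.Conv2d(C_in, C_out, kernel_size=1)

x = torch.randn(1, C_in, 32, 32)
intermediate = depthwise(x)      # same number of channels as the input
out = pointwise(intermediate)    # channel count changed to C_out
print(intermediate.shape)        # torch.Size([1, 3, 32, 32])
print(out.shape)                 # torch.Size([1, 16, 32, 32])
```

The intermediate feature map keeps the input's channel count, and the 1*1 pointwise step alone is responsible for changing it, mirroring the two-step description above.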
This layer is much cheaper computationally than the original convolutional layer. In the first step, instead of large kernels convolved over all the channels of the input, we use single-channel kernels, which are much smaller. Then, in the next step, when we change the number of channels, we use kernels that span all the channels, but these kernels are 1*1, so they are also much smaller. Essentially, we can think of depthwise separable convolution as dividing the original convolutional layer into 2 separate parts: the first uses kernels with a larger spatial area (width and height) but only a single channel, while the second uses kernels that span all the channels but have a smaller spatial area.
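The savings can be checked with simple arithmetic, counting kernel weights only (ignoring biases) and reusing the assumed sizes from the earlier example:

```python
# Kernel parameters for a 3-channel input, 16 output channels, 3*3 kernels.
C_in, C_out, k = 3, 16, 3

standard = C_out * (k * k * C_in)       # 16 kernels, each 3*3*3  -> 432
depthwise = C_in * (k * k * 1)          # one 3*3*1 kernel per channel -> 27
pointwise = C_out * (1 * 1 * C_in)      # 16 kernels, each 1*1*3  -> 48
separable = depthwise + pointwise       # 75

print(standard, separable)              # 432 75
```

Even at these small channel counts the separable version uses under a fifth of the weights; the gap widens further as the channel counts grow.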
Depthwise separable convolutional layers are used in MobileNets, which are CNNs designed to have far fewer parameters so that they can run on mobile devices. They are also used in the Xception CNN architecture.
Deconvolutional Layers
Usually in convolutional layers, the spatial area (width and height) of the feature maps either decreases or stays the same after each layer. But sometimes we want to increase the spatial area. The special layers that increase the spatial area instead of decreasing it are called deconvolutional layers. There are 2 main types of deconv layers:
- Transposed Convolution
- Upsampling
Both are similar in certain aspects but have some differences as well. Essentially, the aim is to increase the spatial area by introducing more pixels into the feature map before applying convolutions. How these new pixel values are filled in is the main difference between transposed convolution and upsampling. The new pixels are added as follows:
We can alter the upscale ratio when resizing a feature map, but generally we do a 2x upscale. With this, the height and width of the feature map double, and hence the total number of pixels will be 4 times that of the original feature map.
In the case of transposed convolution, we simply fill all the added pixels with the value of 0. It is kind of like adding padding between the original pixels of the feature map.
After substituting all the added pixels with 0s, we then perform an ordinary convolution on the resultant enlarged feature map. This is how we can increase the feature map size while performing a convolution operation on it.
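In PyTorch this whole zero-fill-then-convolve operation is bundled into `nn.ConvTranspose2d`; a stride of 2 doubles the spatial size. The channel counts and the 14*14 input size here are arbitrary assumptions for illustration:

```python
import torch
import torch.nn as nn

# stride=2 doubles height and width; internally this corresponds to spacing
# the input pixels out with zeros and then performing a convolution.
deconv = nn.ConvTranspose2d(in_channels=16, out_channels=8,
                            kernel_size=2, stride=2)

x = torch.randn(1, 16, 14, 14)
out = deconv(x)
print(out.shape)                # torch.Size([1, 8, 28, 28])
```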
In the case of the upsampling layer, we replicate the original pixel values in the place of the added pixels. Hence each pixel value appears 4 times (a 2*2 block) if we are doing a 2x upscale. Technically, the upsampling layer by itself involves no convolution after upscaling the feature map. But we generally add convolutional layers ourselves after the upsampling layer so that the network has some learning capability, because the upsampling layer on its own has no parameters.
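The usual upsample-then-convolve pattern can be sketched in PyTorch like this (again, the channel counts and the 14*14 input size are assumptions for illustration; `mode="nearest"` is the replication behaviour described above):

```python
import torch
import torch.nn as nn

# The upsampling layer has no learnable parameters; the convolution after it
# is what gives the block learning capability.
up = nn.Upsample(scale_factor=2, mode="nearest")  # each pixel becomes a 2*2 block
conv = nn.Conv2d(16, 8, kernel_size=3, padding=1)

x = torch.randn(1, 16, 14, 14)
out = conv(up(x))
print(out.shape)                # torch.Size([1, 8, 28, 28])
```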
Both of these layers are used widely in CNNs which try to output a feature map that is the same size as the original input. Generally, there will be a few ordinary convolutions and pooling layers which will decrease the size of the feature maps. After this, we will introduce deconvolutional layers to increase the size back to the original size. Semantic segmentation CNNs, U-Net, GANs, etc use deconvolutional layers.
Dilated (Atrous) Convolution
Dilated convolutions are also known popularly in the research community as atrous convolutions. In dilated convolution, we essentially try to increase the area of each kernel while keeping the number of elements each kernel has exactly the same.
For dilated convolution, we basically take the kernel and we add spacing in between the elements of the kernel before we perform the convolution operation. By doing this, the receptive area of the kernel increases while having the same number of parameters.
This is how it looks compared with an ordinary convolutional layer:
As you can see, the number of elements in the kernel stays the same, but the effective area over which the kernel is applied increases from 3*3 to 5*5. We can also alter the dilation rate of the kernel, which controls how wide the gaps between the kernel elements are. For the dilated kernel in the example above, the dilation rate is 2. The default convolution kernel has a dilation rate of 1, which means there are no gaps between the kernel elements.
We use dilated convolution when we want the convolutions to be applied over a larger area while remaining computationally cheap. If we wanted to cover an area of 5*5 with a normal convolutional layer, then we would require a kernel with a 5*5 area, which is 25 elements. However, if we use dilated convolution with a dilation rate of 2, we can cover the same area with only 9 elements. The enlarged receptive area also lets the kernels capture context from a wider region of the input feature map.
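In PyTorch the gap width is the `dilation` argument of `nn.Conv2d`. The sketch below reuses the assumed 3-in/16-out channel counts from earlier and shows that dilation changes the kernel's coverage, not its parameter count:

```python
import torch
import torch.nn as nn

# A 3*3 kernel with dilation 2 covers a 5*5 area with only 9 weights.
# padding=2 keeps the output the same spatial size as the input.
dilated = nn.Conv2d(3, 16, kernel_size=3, dilation=2, padding=2)

x = torch.randn(1, 3, 32, 32)
out = dilated(x)
print(out.shape)                # torch.Size([1, 16, 32, 32])

# The weight tensor is identical in shape to an undilated 3*3 layer.
print(dilated.weight.shape)     # torch.Size([16, 3, 3, 3])
```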
Grouped Convolution
In grouped convolution, the basic concept is that we divide the channels of the input into equally sized groups. We then assign an equal number of kernels to each of these groups. Each kernel is applied only to the channels in its respective group, not to all the channels of the input.
For example, if we have an input feature map that has 4 channels and we want a total of 2 groups, each group will have 2 channels. Let’s assume we have 4 kernels for each group. Each kernel will have a depth of 2, since it is applied only to its own group and not to the entire input. The output feature maps of both groups are concatenated together to form the final output. Hence, in this case, each group outputs 4 feature maps, so the total number of channels in the output is 8. Let’s look at the visual explanation for this example:
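This exact example (4 input channels, 2 groups, 4 kernels per group) can be sketched with the `groups` argument of PyTorch's `nn.Conv2d`; the 32*32 spatial size is an arbitrary assumption:

```python
import torch
import torch.nn as nn

# 4 input channels split into 2 groups of 2; 4 kernels per group
# gives 8 output channels in total.
grouped = nn.Conv2d(in_channels=4, out_channels=8,
                    kernel_size=3, padding=1, groups=2)

x = torch.randn(1, 4, 32, 32)
out = grouped(x)
print(out.shape)                # torch.Size([1, 8, 32, 32])

# Each of the 8 kernels has depth 2 (its group's channels), not 4.
print(grouped.weight.shape)     # torch.Size([8, 2, 3, 3])
```

The weight shape confirms the parameter saving: each kernel spans only its group's 2 channels rather than all 4.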
With grouped convolution, we are essentially performing convolutions in parallel within each layer. This increases the number of paths the model can take while backpropagating through the network. It also reduces the computational cost of the layer, since each kernel has significantly fewer parameters and is applied to fewer input channels. These are the reasons why we use grouped convolutions. They are used in the ResNeXt architecture.
That concludes this tutorial on Advanced Convolutional Layers. Thank you for reading!