# Deep CV: Advanced Convolutional Layers

In the world of Deep Computer Vision, there are several types of convolutional layers that differ from the original convolutional layer which was discussed in the previous Deep CV tutorial. These layers are used in many popular advanced convolutional neural network implementations found in the Deep Learning research side of Computer Vision. Each of these layers has a different mechanism than the original convolutional layer and this allows each type of layer to have a particularly special function.

Before getting into these advanced convolutional layers, let’s first have a quick recap on how the original convolutional layer works.

## Original Convolutional Layer

In the original convolutional layer, we have an input of shape (W*H*C), where **W** and **H** are the **width** and **height** of each feature map and **C** is the **number of channels**, which is basically the total number of feature maps. The convolutional layer has a certain number of kernels which perform the convolution operation on this input. The **number of kernels** is **equal** to the **number of desired channels** in the **output feature map**. Basically, **each kernel** corresponds to a **particular feature map** in the output, and **each feature map is a channel**.

The height and width of the kernel are something that we decide, and usually we keep them at 3*3. The **depth** of each kernel is equal to the number of **channels of the input**. Hence, for the example below, the shape of each kernel is (w*h*3), where w and h are the width and height of the kernel, and the depth is 3 because the input in this case has 3 channels.

In this example, the input has 3 channels and the output has 16 channels. Hence there are a total of 16 kernels in this particular layer and each kernel has a shape of (w*h*3).
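The shape arithmetic above can be checked with a minimal NumPy sketch of a standard convolution (stride 1, no padding); the function name and sizes are illustrative, not from a particular library:

```python
import numpy as np

def conv2d(x, kernels):
    """Standard convolution sketch: x is (H, W, C_in), kernels is
    (N, kh, kw, C_in). Each of the N kernels spans all C_in input
    channels and yields one output channel, so the output shape is
    (H - kh + 1, W - kw + 1, N)."""
    H, W, C = x.shape
    N, kh, kw, kc = kernels.shape
    assert kc == C, "kernel depth must equal the number of input channels"
    out = np.zeros((H - kh + 1, W - kw + 1, N))
    for n in range(N):
        for i in range(H - kh + 1):
            for j in range(W - kw + 1):
                out[i, j, n] = np.sum(x[i:i+kh, j:j+kw, :] * kernels[n])
    return out

x = np.random.rand(32, 32, 3)          # a W*H*3 input, as in the example
kernels = np.random.rand(16, 3, 3, 3)  # 16 kernels, each of shape (3, 3, 3)
print(conv2d(x, kernels).shape)        # (30, 30, 16): 16 output channels
```

Note how the kernel depth (3) is forced by the input, while the kernel count (16) freely sets the number of output channels.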

## Advanced Convolutional Layers

The list of advanced convolutional layers that we will be covering in this tutorial are as follows:

- Depthwise Separable Convolutional Layer
- Deconvolutional Layers
- Dilated Convolution
- Grouped Convolution

## Depthwise Separable Convolutional Layer

In the depthwise separable convolution layer, we are trying to **drastically reduce the number of computations** that are being performed in each convolutional layer. This entire layer is actually divided into two parts: i) depthwise convolution, ii) pointwise convolution.

**Depthwise Convolution:**

The key point of difference in depthwise convolution is that each kernel is applied to a **single channel** of the input and **not all the input channels** at once. Each kernel is therefore of shape (w*h*1). The **number of kernels** is **equal** to the **number of input channels**: if we have a W*H*3 input, we have 3 separate w*h*1 kernels, and each kernel is applied to one channel of the input. The **output therefore has the same number of channels** as the input, since each kernel outputs a single feature map. Let’s have a look at how the depthwise convolution part works:

So if we have an input with C channels, the output of the depthwise convolution part of this layer will also have C channels. Now comes the next part. This part is aimed at **changing the number of channels** because we often want to increase the number of channels each layer has as an output as we go deeper into the CNN.

**Pointwise Convolution:**

Pointwise convolution will convert this intermediate C channel output of the depthwise convolution into a feature map with a **different number of channels**. To do this we have **several 1*1 kernels** that are convolved across **all channels** of this intermediate feature map block. Hence each 1*1 kernel will have C channels as well. Each of these kernels will output a separate feature map, hence we will have the **number of kernels equal to the number of channels** we want the output to have. Let’s have a look at how this works.

That sums up the entire process of depthwise separable convolutional layers. Basically, in the first step of depthwise convolution, we have **1 kernel for each input channel** and convolve these with the input. The resultant output of this will be a feature map block with the **same number of channels as the input**. In the second step of the pointwise convolution, we have **several 1*1 kernels** and convolve these with the intermediate feature map block. We will choose the number of kernels according to the number of channels we want the output to have.
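The two steps can be put together in a minimal NumPy sketch (stride 1, no padding; the function name and sizes are illustrative):

```python
import numpy as np

def depthwise_separable_conv(x, dw_kernels, pw_kernels):
    """Depthwise separable convolution sketch.
    x:          (H, W, C) input feature map.
    dw_kernels: (C, kh, kw) — one single-channel kernel per input channel.
    pw_kernels: (N, C)      — N pointwise 1*1 kernels, each spanning C channels."""
    H, W, C = x.shape
    Cd, kh, kw = dw_kernels.shape
    assert Cd == C, "depthwise step needs exactly one kernel per input channel"
    # Step 1: depthwise — each kernel convolves only its own input channel.
    mid = np.zeros((H - kh + 1, W - kw + 1, C))
    for c in range(C):
        for i in range(H - kh + 1):
            for j in range(W - kw + 1):
                mid[i, j, c] = np.sum(x[i:i+kh, j:j+kw, c] * dw_kernels[c])
    # Step 2: pointwise — 1*1 kernels mix all C channels into N output channels.
    return mid @ pw_kernels.T  # (H', W', C) @ (C, N) -> (H', W', N)

x = np.random.rand(32, 32, 3)
out = depthwise_separable_conv(x, np.random.rand(3, 3, 3), np.random.rand(16, 3))
print(out.shape)  # (30, 30, 16)
```

The intermediate block keeps the input's 3 channels; only the pointwise step changes the channel count, here to 16.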

This layer is **computationally much cheaper** than the original convolutional layer. In the first step, instead of large kernels convolved over all the channels of the input, we use only **single-channel kernels**, which are **much smaller**. In the next step, when we change the number of channels, the kernels are convolved over **all the channels**, but they are 1*1, so they are also **much smaller**. Essentially, we can think of depthwise separable convolution as splitting the original convolutional layer into **2 separate parts**: the first part uses kernels with a **larger spatial area** (width and height) but only a **single channel**; the second part uses kernels that span **all the channels** but have a **smaller spatial area**.
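The savings are easy to count for the running example (3 input channels, 16 output channels, 3*3 kernels, biases ignored for simplicity):

```python
# Parameter counts for a standard vs. a depthwise separable layer.
C_in, C_out, k = 3, 16, 3

standard = C_out * k * k * C_in   # 16 kernels of shape 3*3*3
depthwise = C_in * k * k          # one 3*3*1 kernel per input channel
pointwise = C_out * 1 * 1 * C_in  # 16 pointwise kernels of shape 1*1*3
separable = depthwise + pointwise

print(standard, separable)  # 432 vs 75 parameters
```

Even at these tiny sizes the separable layer needs fewer than a fifth of the parameters, and the gap widens as the channel counts grow.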

Depthwise separable convolutional layers are used in MobileNets, which are CNNs designed to have far fewer parameters so that they can run on mobile devices. They are also used in the Xception CNN architecture.

## Deconvolutional Layers

Usually in convolutional layers, the spatial area (width and height) of the feature maps either **decreases** or **stays the same** after each layer. But sometimes we want to **increase the spatial area**. Special layers which increase the spatial area instead of decreasing it are called **deconvolutional layers**. There are 2 main types of deconv layers:

- Transposed Convolution
- Upsampling

Both are similar in certain aspects but have some differences as well. Essentially the aim is to increase the spatial area by **introducing more pixels** in the feature map **before applying convolutions**. The way these new pixel values are **filled** forms the main difference between transposed convolution and upsampling. The way new pixels are added is as follows:

We can alter the upscale ratio when resizing a feature map, but generally we do a 2x upscale. With this, the height and width of the feature map will double, and hence the total number of pixels will be 4 times that of the original feature map.

**Transposed Convolution:**

In the case of transposed convolution, we simply fill all the added pixels with the **value of 0**. It is kind of like adding padding between the original pixels of the feature map.

After substituting all the added pixels with 0s, we then perform an **ordinary convolution** on the resultant enlarged feature map. This is how we can increase the feature map size while performing a convolution operation on it.
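The zero-filling step can be sketched in NumPy for a single channel, following the 2x description above (the function name is illustrative; an ordinary convolution would then be applied to the result):

```python
import numpy as np

def zero_insert_2x(fmap):
    """Zero-insertion step of a 2x transposed convolution (one channel):
    the original pixels land on the even rows/columns and every added
    pixel is filled with 0, like padding between the original values."""
    H, W = fmap.shape
    out = np.zeros((2 * H, 2 * W))
    out[::2, ::2] = fmap
    return out

fmap = np.array([[1., 2.],
                 [3., 4.]])
print(zero_insert_2x(fmap))
# [[1. 0. 2. 0.]
#  [0. 0. 0. 0.]
#  [3. 0. 4. 0.]
#  [0. 0. 0. 0.]]
```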

**Upsampling:**

In the case of the upsampling layer, we **replicate the original pixel values** in the place of the added pixels. Hence each pixel will be replicated 4 times if we are doing a 2x upscale. Technically if we consider just the upsampling layer, there is **no convolution** after upscaling the feature map. But we generally **add convolution layers ourselves** after the upsampling layer so that the network has some **learning capability** because the upsampling layer by itself does not have any parameters.
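Nearest-neighbour replication for a 2x upscale is a one-liner in NumPy (one channel shown; the function name is illustrative):

```python
import numpy as np

def upsample_2x(fmap):
    """Nearest-neighbour 2x upsampling: each pixel is replicated into a
    2*2 block, so every original value appears 4 times in the output."""
    return np.repeat(np.repeat(fmap, 2, axis=0), 2, axis=1)

fmap = np.array([[1., 2.],
                 [3., 4.]])
print(upsample_2x(fmap))
# [[1. 1. 2. 2.]
#  [1. 1. 2. 2.]
#  [3. 3. 4. 4.]
#  [3. 3. 4. 4.]]
```

Note that there is nothing to learn here, which is exactly why convolution layers are usually added after it.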

Both of these layers are used widely in CNNs which try to **output a feature map that is the same size as the original input**. Generally, there will be a few ordinary convolutions and pooling layers which **decrease** the size of the feature maps. After this, we introduce deconvolutional layers to **increase** the size back to the original. Semantic segmentation CNNs, U-Net, GANs, etc. use deconvolutional layers.

## Dilated (Atrous) Convolution

Dilated convolutions are also known popularly in the research community as **atrous convolutions**. In dilated convolution, we essentially try to **increase the area** of each kernel while keeping the **number of elements** each kernel has exactly the **same**.

For dilated convolution, we basically take the kernel and we **add spacing** in between the elements of the kernel before we perform the convolution operation. By doing this, the **receptive area of the kernel increases** while having the **same number of parameters**.

This is how it looks compared with an ordinary convolutional layer:

As you can see, the **number of elements** in the kernel stays the **same**, but the effective area over which the kernel is applied **increases from 3*3 to 5*5**. We can also alter the **dilation rate** of the kernel, which determines how **wide the gaps** between the kernel elements are. For the dilated kernel in the example above, the dilation rate is 2. The default convolution kernel has a dilation rate of 1, which means there are **no gaps** between the kernel elements.

We use dilated convolution when we want the convolutions to be applied over a **larger area** while remaining **computationally cheap**. If we wanted to cover a 5*5 area with a normal convolutional layer, we would require a kernel with a 5*5 area, which is **25 elements**. However, if we use dilated convolution with a dilation rate of 2, we can cover the same area with only **9 elements**. Moreover, the increased receptive area of the kernels makes them capable of **capturing wider context** from the input feature map.
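The relationship between kernel size, dilation rate, and effective area follows a simple formula, sketched here (the function name is illustrative):

```python
def effective_kernel_size(k, dilation):
    """Effective spatial extent of a k*k kernel at the given dilation
    rate: (dilation - 1) gaps are inserted between adjacent elements,
    so the kernel spans k + (k - 1) * (dilation - 1) pixels per side."""
    return k + (k - 1) * (dilation - 1)

print(effective_kernel_size(3, 1))  # 3 — ordinary convolution, no gaps
print(effective_kernel_size(3, 2))  # 5 — same 9 elements, 5*5 receptive area
print(effective_kernel_size(3, 3))  # 7 — the area keeps growing, params don't
```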

## Grouped Convolution

In grouped convolution, the basic concept is that we divide the channels present in the input into **groups of equal size**. We then assign an **equal number of kernels** to each of these groups. Each kernel is applied **only to the channels in its respective group** and **not to all the channels** of the input.

For example, if we have an input feature map with **4 channels** and we want a total of **2 groups**, each group will have **2 channels**. Let’s assume we have **4 kernels** for each group. Each kernel will have a **depth of 2**, since it is applied only to **its own group** and **not the entire input**. The output feature maps of both groups are concatenated together to form the final output feature maps. Hence, in this case, each group outputs **4 feature maps** and the **total number** of channels in the output is **8**. Let’s look at the visual explanation for this example:
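The example above can be sketched in NumPy (stride 1, no padding; the function name and sizes are illustrative):

```python
import numpy as np

def grouped_conv(x, kernels, groups):
    """Grouped convolution sketch.
    x:       (H, W, C) input feature map.
    kernels: (groups, n_per_group, kh, kw, C // groups) — each group's
             kernels see only that group's slice of the input channels.
    The per-group outputs are concatenated along the channel axis."""
    H, W, C = x.shape
    g, n, kh, kw, cg = kernels.shape
    assert g == groups and cg == C // groups
    outs = []
    for gi in range(groups):
        xs = x[:, :, gi * cg:(gi + 1) * cg]  # this group's channels only
        out = np.zeros((H - kh + 1, W - kw + 1, n))
        for ki in range(n):
            for i in range(H - kh + 1):
                for j in range(W - kw + 1):
                    out[i, j, ki] = np.sum(xs[i:i+kh, j:j+kw, :] * kernels[gi, ki])
        outs.append(out)
    return np.concatenate(outs, axis=-1)

# The example from the text: 4 input channels, 2 groups, 4 kernels per group.
x = np.random.rand(16, 16, 4)
kernels = np.random.rand(2, 4, 3, 3, 2)  # each kernel has depth 4 // 2 = 2
print(grouped_conv(x, kernels, groups=2).shape)  # (14, 14, 8): 4 + 4 channels
```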

With grouped convolution, we are essentially performing **convolutions in parallel** in each layer. This increases the number of paths the model can take while backpropagating through the network. It also **reduces the computational cost** of the layer, since each kernel has significantly **fewer parameters** and is applied to **fewer channels** of the input. These are the reasons why we use grouped convolutions. They are used in the ResNeXt architecture.

That concludes this tutorial on Advanced Convolutional Layers. Thank you for reading!