7 Different Convolutions for designing CNNs that will Level-up your Computer Vision project

In-depth review on Basic, Transposed, Dilated, Separable, Depthwise, and Pointwise convolutions and their applications.

Sieun Park
CodeX
Oct 29, 2021 · 8 min read



Recent research on CNN architectures includes so many different variants of convolution that I found myself confused while reading the papers. I thought it would be worth going through the precise definitions, effects, and use cases (in computer vision and deep learning) of some of the more popular convolution variants. These variants are designed to reduce parameter counts, speed up inference, and exploit specific characteristics of the target problem.

Most of these variants are simple and easy to understand, so I focus on the benefits and use cases of each method. This knowledge should help you understand the intuitions behind recent CNN architectures and help you design your own networks.

Convolutions

Let’s start with a short overview of the basic form of convolution. According to the description on PapersWithCode:

A convolution is a type of matrix operation, consisting of a kernel, a small matrix of weights, that slides over input data performing element-wise multiplication with the part of the input it is on, then summing the results into an output.

Such operations are advantageous for processing images because:

  1. They are extremely parameter-efficient because the same weights are shared across different positions of the image, so the number of parameters isn’t proportional to the image size.
  2. Convolution is fundamentally translation equivariant: shifting the input simply shifts the output. Small translations, which are common in images, therefore don’t disrupt the features, unlike in MLPs, which can give very different results for even a 1-pixel translation.

The output shape and complexity of the convolutions can be configured using the following parameters:

  • Kernel size: The dimensions of the kernel; typically a kernel size of 3×3 is used.
  • Padding: How the edges of the image are filled so the spatial size can be maintained after convolution. It describes both the number of pixels added at the border and the rule used to fill them; for example, the demonstration above uses 1 pixel of padding.
  • Strides: The step size of the kernel when scanning the image. Typically set to 1 to maintain the spatial size, or 2 to downsample it. The demonstration above uses a stride of 2.
Source: Eli Bendersky

Each output channel is computed by combining the results of each input channel convolved with a different kernel. Thus, C_in kernels of shape K×K are needed to compute one output channel, where K denotes the kernel size and C_in, C_out denote the numbers of input and output channels, respectively.

# Parameters: K×K×C_in×C_out

Computation: H×W×C_in×C_out×K×K (In the case of stride=1)
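As a quick sanity check of these formulas, here is a minimal PyTorch sketch (PyTorch and the channel counts are my own example choices, not something used in this post):

    import torch
    import torch.nn as nn

    # Standard convolution: C_in=64, C_out=128, K=3; bias disabled to match the formula
    conv = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1, bias=False)

    # K×K×C_in×C_out = 3*3*64*128 = 73,728 parameters
    print(sum(p.numel() for p in conv.parameters()))  # 73728

    # With stride=1 and padding=1, the spatial size is preserved
    x = torch.randn(1, 64, 32, 32)
    print(conv(x).shape)  # torch.Size([1, 128, 32, 32])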

Use cases: Such convolution layers are used in practically every sub-task of computer vision. These include supervised tasks such as image and video classification, object detection, and segmentation, as well as synthesis tasks such as image generation, image super-resolution, and image-to-image translation. There are also applications outside vision, such as 1D convolutions for sequence modeling and 3D-related applications.

Pointwise convolution (1×1 convolution)

Pointwise convolution is another name for a convolution layer with 1×1 kernels. It is also referred to as a convolution over channels or a projection layer. Why on earth would someone use this? There are two main use cases:

  1. For changing the dimensionality (i.e. the number of channels) of the input.
  • Some networks like Inception concatenate features computed with different kernels, which results in too many channels, so a pointwise convolution is applied to manage the channel count.
  • Compute-heavy modules, such as squeeze-and-excitation attention blocks, become more feasible when the features are first compressed with a pointwise convolution.
  • We sometimes need to match the number of channels when combining two feature maps with an element-wise sum or product.

The operation can be viewed as computing multiple weighted sums along the depth of the input feature maps, effectively summarizing them.

2. It creates dependencies between channels at a negligible cost. This is especially exploited in combination with depthwise convolution, which lacks such cross-channel dependencies.

# Parameters: C_in×C_out

Computation: H×W×C_in×C_out
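To make this concrete, here is a small PyTorch sketch of a pointwise convolution used as a projection layer (the 256→64 channel reduction is just an illustrative choice):

    import torch
    import torch.nn as nn

    # Pointwise (1×1) convolution projecting 256 channels down to 64,
    # e.g. before a compute-heavy block; parameters = C_in×C_out = 16,384 (no bias)
    project = nn.Conv2d(in_channels=256, out_channels=64, kernel_size=1, bias=False)

    x = torch.randn(1, 256, 56, 56)
    print(project(x).shape)                              # torch.Size([1, 64, 56, 56])
    print(sum(p.numel() for p in project.parameters()))  # 16384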

Transposed convolution (Deconvolution / Inverse convolution)

Deconvolution explicitly computes the mathematical inverse of a convolution. While it is popular in classical computer vision and signal processing, it isn’t important in deep learning, since the parameters of the operation can simply be learned through gradient descent.

Left: stride=1, Right: stride=2

Transposed convolution is a simpler approach to upsampling feature maps with convolutions. The operation is no different from classic convolution when the stride is 1 (left). For a stride of n > 1, the output shape is expanded by a factor of n. This is done by inserting zeros between the input pixels to create an expanded image of the desired size and then performing convolution on the expanded image.

While transposed convolution doesn’t explicitly compute the inverse of the convolution, that doesn’t matter for deep learning, because the filter that is needed (which could be the inverse filter) can always be learned via gradient descent. It sufficiently fulfills the function of increasing the spatial size of the data.

Important: Although they are often conflated, transposed convolution is not deconvolution/inverse convolution.

# Parameters: K×K×C_in×C_out

Use cases: Transposed convolutions are used in network architectures that need upsampling. Examples include encoder-decoder style networks for semantic segmentation, autoencoders, and image synthesis and generation networks. One issue with transposed convolution is the checkerboard artifact, which can be problematic for image generation/synthesis. The topic is out of the scope of this post and deserves one of its own; for more information, refer to this article from Google Brain.
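As a rough illustration of the upsampling behavior, here is a PyTorch sketch (the kernel size of 4 and the channel counts are arbitrary choices, not taken from any specific architecture):

    import torch
    import torch.nn as nn

    # Transposed convolution with stride=2 doubles the spatial size here.
    # Output size = (H_in − 1)×stride − 2×padding + kernel_size = (16−1)*2 − 2 + 4 = 32
    up = nn.ConvTranspose2d(in_channels=128, out_channels=64, kernel_size=4, stride=2, padding=1)

    x = torch.randn(1, 128, 16, 16)
    print(up(x).shape)  # torch.Size([1, 64, 32, 32])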


Dilated convolution (Atrous convolution)

The receptive field is the region of the original image that the model can refer to when making an inference about one output pixel. For example, the output of a model with one 3×3 convolution can consider a 3×3 receptive field around each spatial location, while a model with two stacked 3×3 convolutions has a 5×5 receptive field.

Increasing the kernel size is one way to enlarge the receptive field, but the computation grows very quickly. Downsampling the image also has the effect of increasing the receptive field, because a 3×3 convolution in, e.g., an 8×8 feature map covers a larger portion of the image. Three stacked 3×3 convolutions are enough to cover nearly the whole image in 8×8 feature space.

Computing features at a lower spatial resolution is mostly fine for image classification, but it causes significant information loss for tasks with high-resolution outputs, semantic segmentation in particular.

Dilated convolution is a type of convolution where the pixels of the kernel are spaced apart (the gaps are treated as zeros). The spacing, or dilation rate, is a hyper-parameter, ranging from 2, as in the demonstration above, to rates as large as 24 in DeepLab models. It enlarges the effective kernel size without any meaningful increase in computation. This design enables a much larger receptive field without information loss, downsampling, or additional layers.
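A minimal PyTorch sketch of the idea (dilation rate 2 and 64 channels are just example values): a 3×3 kernel with dilation 2 covers a 5×5 area while keeping the parameter count of a 3×3 kernel.

    import torch
    import torch.nn as nn

    # 3×3 kernel with dilation=2: effective receptive field 5×5, parameters unchanged.
    # padding=dilation keeps the spatial size for a 3×3 kernel.
    dilated = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3,
                        dilation=2, padding=2, bias=False)

    x = torch.randn(1, 64, 32, 32)
    print(dilated(x).shape)                              # torch.Size([1, 64, 32, 32])
    print(sum(p.numel() for p in dilated.parameters()))  # 36864 = 3*3*64*64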

Use cases of dilated convolution https://paperswithcode.com/method/dilated-convolution

Use cases: Dilated convolution sees its most significant usage in semantic segmentation, but it is also considered in lightweight/mobile CNN architectures for other tasks.

Proposed in: Multi-Scale Context Aggregation by Dilated Convolutions

Spatial separable convolution (Separable convolution)

Source: Chi-Feng Wang

Some 3×3 matrices can be expressed as the product of a column vector and a row vector. Since a 3×3 kernel is just such a matrix, it can sometimes be split into one 3×1 and one 1×3 kernel that together perform the same operation.

Source: Chi-Feng Wang

Specifically, spatial separable convolution replaces the original convolution with two stages, as described in the figure above. This way, the number of parameters and the number of operations for each kernel shrink from 9 (3×3) to 6 (3+3). However, not all 3×3 kernels can be separated this way, so spatial separable convolution can limit the capability of the model.
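As a concrete (hand-picked) example, the Sobel edge-detection kernel happens to be separable, and the two-stage version can be written as two stacked convolution layers; a minimal PyTorch sketch, with arbitrary channel sizes:

    import torch
    import torch.nn as nn

    # The Sobel x-kernel is rank-1, so it factors into a column and a row vector:
    col = torch.tensor([1., 2., 1.]).reshape(3, 1)
    row = torch.tensor([-1., 0., 1.]).reshape(1, 3)
    print(col @ row)  # the full 3×3 Sobel kernel, stored with 3+3=6 numbers instead of 9

    # The same idea as two stacked layers: a 3×1 convolution followed by a 1×3 convolution
    separable = nn.Sequential(
        nn.Conv2d(32, 32, kernel_size=(3, 1), padding=(1, 0), bias=False),
        nn.Conv2d(32, 32, kernel_size=(1, 3), padding=(0, 1), bias=False),
    )
    x = torch.randn(1, 32, 28, 28)
    print(separable(x).shape)  # torch.Size([1, 32, 28, 28])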

# Parameters: (K+K)×C_in×C_out

Computation: H×W×C_in×C_out×(K+K)

Use cases: Since the parameter count is much smaller, spatial separable convolutions are sometimes used for model compression and lightweight architectures.


Depthwise convolution

Source: Eli Bendersky

Instead of convolving every input channel and combining the results, depthwise convolution is performed independently on each channel, and the results are stacked. Intuitively, this only works when the number of input and output channels is the same.

Depthwise convolution is highly parameter- and compute-efficient: both the parameter count and the computational cost are divided by the number of output channels, which often ranges up to 1024. However, the speed benefit isn’t proportional to the decrease in the number of operations, because depthwise convolution isn’t as well optimized as traditional convolution on modern hardware.

# Parameters: K×K×C_in

Computation: H×W×C_in×K×K
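A minimal PyTorch sketch (64 channels is an arbitrary example); in PyTorch, depthwise convolution is expressed through the groups argument:

    import torch
    import torch.nn as nn

    # Depthwise convolution: groups=C_in applies one separate 3×3 filter per channel.
    depthwise = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3,
                          padding=1, groups=64, bias=False)

    x = torch.randn(1, 64, 32, 32)
    print(depthwise(x).shape)                              # torch.Size([1, 64, 32, 32])
    print(sum(p.numel() for p in depthwise.parameters()))  # 576 = 3*3*64, vs 36,864 for a full 64→64 conv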

Use cases: Depthwise convolution is a key component for building more complex variants and convolutional blocks that are parameter and compute efficient.

Depthwise separable convolution

A depthwise separable convolution is a depthwise convolution followed by a pointwise convolution. Since depthwise convolution has no connections between channels, the pointwise convolution supplies them. The Xception paper reports that omitting the non-linearity between the depthwise and pointwise convolutions works better. The full process is illustrated in the figure below.

Source: Eli Bendersky

Spatial separable convolution separates the x and y axes of the classic convolution. In this context, depthwise separable convolution can be viewed as separating the channel dimension.

The computational complexity is marginally higher than that of plain depthwise convolution, though still much smaller than that of traditional convolution. Unlike plain depthwise convolution, however, it effectively mimics regular convolution in many empirical experiments and is widely used in modern CNN architectures.

# Parameters: (K×K+C_out)×C_in

Computation: H×W×C_in×(K×K+C_out)
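Putting the two pieces together, here is a minimal sketch of a depthwise separable block in PyTorch (the 64→128 channel sizes are arbitrary; real architectures typically also add batch norm and activations around these layers):

    import torch
    import torch.nn as nn

    # Depthwise 3×3 followed by pointwise 1×1.
    # Parameters: (K×K + C_out)×C_in = (9 + 128)*64 = 8,768,
    # versus K×K×C_in×C_out = 73,728 for a standard 3×3 convolution.
    block = nn.Sequential(
        nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64, bias=False),  # depthwise
        nn.Conv2d(64, 128, kernel_size=1, bias=False),                        # pointwise
    )

    x = torch.randn(1, 64, 32, 32)
    print(block(x).shape)                              # torch.Size([1, 128, 32, 32])
    print(sum(p.numel() for p in block.parameters()))  # 8768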

Use cases: Xception, MobileNet V1/V2, EfficientNet V1 (MnasNet)/V2, and many more…

You can find the complicated history of depthwise separable convolutions in section 2 of: Xception: Deep Learning with Depthwise Separable Convolutions

In this post, we reviewed a list of convolution variants that were proposed to replace the traditional convolution layer in certain situations. Each of these blocks has its own strengths and weaknesses and is used to solve different problems. In a follow-up post, we will review convolutional designs that further enhance our toolbox for creating CNN architectures.

Please share suggestions or questions in the comments. I will try to respond to everyone within two days.

The amazing images (animations) are provided by vdumoulin under the MIT license (free of charge, as described in the license!).
