MixConv — Mixed Depthwise Convolutional Kernels from Google Brain

An overview of a new paradigm of the depthwise convolution operation developed by the Google Brain team

Shreejal Trivedi
VisionWizard
7 min read · Jun 28, 2020



Convolutional Neural Networks are computationally expensive models: the deeper the network, the higher the complexity. This property makes it nontrivial to use these models for real-time applications.

The paper Xception: Deep Learning with Depthwise Separable Convolutions by Google introduced the concept of depthwise separable convolutional kernels, which accelerated the convolution operation. It proved to be one of the crucial factors behind efficient modern ConvNets and their deployment on real-time, low-compute edge devices.

Recently, Google released a new paradigm of Depthwise Convolutional Kernels in the paper MixConv: Mixed Depthwise Convolutional Kernels[1].

In this article, we will take a detailed look at this new convolution operation and introduce the new mobile ConvNet family, MixNets, as presented in the paper[1].

A Brief Revisit of Depthwise Separable Convolutional Kernels

  • Let’s quickly revise the concept of depthwise separable convolutional kernels first.
Fig. 1 Vanilla Convolution Operation
  • Here, as shown in the figure, a vanilla convolution operation is performed between a 5x10x10 feature map (cyan) and a 3x3 kernel (red). Assume padding = 1, stride = 1, and 64 output channels.
  • In standard convolution, one 3D 5x3x3 kernel is convolved with the whole feature map to produce a 1x10x10 output (blue sheet). Each such 3D convolution costs 4,500 multiplications (3*3*5*10*10). After 64 such 3D convolutions, the outputs are stacked to form a 64x10x10 feature map. The total cost of the whole convolution is 4,500 * 64 = 288,000 FP32 multiplications.
Fig. 2 Depthwise Separable Convolution Operations.
  • Now consider the above scenario. Depthwise Separable Convolution operation divides the standard convolution into two parts: Depthwise Convolution and Pointwise Convolution.

Depthwise Convolution

  • Each 2D 3x3 filter is applied to a different channel of the input feature map to generate a separate 2D spatial map; these maps are then stacked to create an intermediate transformed output (yellow in Fig. 2).

As you can see, a depthwise convolution layer does not increase the number of channels of the output feature map, unlike standard convolution. The number of kernels used for the operation equals the number of input channels of the feature map; in our case, the number of kernels = 5.

  • The complete depthwise convolution operation costs 4,500 multiplications (5*3*3*10*10).
  • Depthwise convolutions also have a depth multiplier m: given C input channels, the operation outputs C*m channels.

For example, if we keep the depth multiplier = 2 in our case, the intermediate output (yellow) becomes (10, 10, 10), as each depthwise kernel produces a (2, 10, 10) output instead of a single 2D spatial map.
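To make this concrete, here is a minimal Keras sketch of the depthwise stage with a depth multiplier, using the shapes from our running example. This is an illustration, not the paper's code; note that Keras uses channels-last (N, H, W, C) tensors.

```python
import tensorflow as tf

# Our 5-channel 10x10 feature map, channels-last.
x = tf.random.normal([1, 10, 10, 5])

# One 3x3 filter per input channel; depth_multiplier=2 doubles the channels.
dw = tf.keras.layers.DepthwiseConv2D(
    kernel_size=3, padding='same', depth_multiplier=2, use_bias=False)

y = dw(x)
print(y.shape)  # (1, 10, 10, 10): 5 input channels * depth multiplier 2
```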

Pointwise Convolution

  • Pointwise convolution increases the number of channels of the intermediate feature map (yellow, 5x10x10 → 64x10x10).
  • It uses 3D 5x1x1 kernels, which are convolved with every spatial position of the feature map to produce the final output.
  • The number of kernels used equals the number of output channels; in our case, 64.
  • The pointwise convolution costs 32,000 multiplications (64*1*1*5*10*10).

The total cost of the depthwise separable convolution is 4,500 + 32,000 = 36,500 FP32 multiplications, far fewer than the 288,000 of standard convolution. This property yields the same output shape at much lower computational cost, which is why it is one of the most common convolution operations in present-day mobile architectures like MobileNets, ShuffleNets, etc. A quick sanity check of these counts is given below.
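A few lines of Python are enough to verify the arithmetic (the variable names are mine, not from the paper):

```python
# Multiplication counts for the running example: 5x10x10 input, 3x3 kernel,
# stride 1, "same" padding, 64 output channels.
C_in, C_out, K, H, W = 5, 64, 3, 10, 10

standard  = C_out * (K * K * C_in * H * W)   # 64 * 4,500 = 288,000
depthwise = C_in * (K * K * H * W)           # 5 * 900    = 4,500
pointwise = C_out * (C_in * H * W)           # 64 * 500   = 32,000

print(standard, depthwise + pointwise)  # 288000 36500
```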

While designing ConvNets with depthwise convolutional kernels, an essential but often overlooked factor is kernel size. Although the conventional practice is to simply use 3x3 kernels, recent research has shown that larger kernel sizes such as 5x5 and 7x7 can potentially improve model accuracy and efficiency[1].

MixConv — A new paradigm of Depthwise Separable Convolutional Kernels

  • The MixConv method is very similar to depthwise separable convolution. But instead of applying a fixed kxk kernel to all channels of the feature map, it partitions the channels into groups and convolves each group with a different kernel size. A visual representation is given below.
Fig. 3 Difference between vanilla depthwise convolution (left) and MixConv (right)

Mixing in larger kernels tends to increase the receptive field of the network, which in turn improves the model's performance on classification/detection tasks.

  • Step 1: The input feature map (C, H, W) is divided into g groups of channels C1, C2, …, Cg. A depthwise convolution (denoted by *) is applied to each of these smaller feature maps with its own kernel size, as shown in Fig. 3.
  • Step 2: After concatenating the generated feature maps, a final pointwise convolution (denoted by X) is used to set the number of channels, producing the output feature map (Co, H, W).
Fig. 4 The flow of MixConv Operation

On certain platforms, MixConv could be implemented as a single op and optimized with grouped convolution. Nevertheless, as shown in the figure, MixConv can be considered a simple drop-in replacement for vanilla depthwise convolution[1].

Snippet: Implementation of the MixConv in Tensorflow[1]
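A minimal TensorFlow sketch of the operation, assuming an equal channel split and channels-last tensors; the function signature is illustrative, not the paper's exact code.

```python
import tensorflow as tf

def mixconv(x, kernel_sizes, stride=1):
    """Mixed depthwise convolution: one kernel size per channel group."""
    # Step 1: equal partition of the channels into len(kernel_sizes) groups,
    # each depthwise-convolved with its own kernel size.
    groups = tf.split(x, len(kernel_sizes), axis=-1)
    outputs = []
    for xi, k in zip(groups, kernel_sizes):
        dw = tf.keras.layers.DepthwiseConv2D(
            kernel_size=k, strides=stride, padding='same', use_bias=False)
        outputs.append(dw(xi))
    # Step 2: concatenate; a 1x1 pointwise conv would follow in a full block.
    return tf.concat(outputs, axis=-1)

# Example: 16-channel input, 4 groups with kernels {3x3, 5x5, 7x7, 9x9}.
x = tf.random.normal([1, 32, 32, 16])
y = mixconv(x, kernel_sizes=[3, 5, 7, 9])
print(y.shape)  # (1, 32, 32, 16)
```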
  • MixConv's simple and straightforward structure leaves several design choices open, discussed below.

GROUP SIZE g:

  • The group size g determines how many different kernel sizes are used. Each group is convolved with a different kernel size in the depthwise convolution, as shown in Fig. 4.
  • If we choose g = 1, MixConv becomes equivalent to the vanilla depthwise convolution operation, as the small check after this list illustrates.
  • As mentioned in the paper, the authors use a different group size for each layer to get the best performance; the group size is usually sampled from g ∈ [1, 5].
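Reusing the mixconv sketch from above, the g = 1 case is just a single depthwise convolution over all channels:

```python
# With one group, every channel sees the same 3x3 depthwise convolution,
# which is exactly the vanilla depthwise operation (up to weight init).
y1 = mixconv(x, kernel_sizes=[3])
y2 = tf.keras.layers.DepthwiseConv2D(3, padding='same', use_bias=False)(x)
print(y1.shape, y2.shape)  # identical shapes: (1, 32, 32, 16) twice
```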

SIZE OF KERNEL PER GROUP

  • A different kernel size is used for each group i. The authors propose a simple rule to allocate kernel sizes.
  • The group indexed by i is allocated a kernel of size 2*i + 1 (note: i starts from 1).

For example, a 4-group MixConv always uses kernel sizes {3x3, 5x5, 7x7, 9x9}. With this restriction, the kernel size for each group is predefined for any group size g, thus simplifying the design process[1]. The rule is trivial to express in code, as shown below.
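The allocation rule in one line, matching the 2*i + 1 formula above:

```python
# Kernel size for each of g groups under the paper's rule (i starts at 1).
g = 4
kernel_sizes = [2 * i + 1 for i in range(1, g + 1)]
print(kernel_sizes)  # [3, 5, 7, 9]
```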

NUMBER OF CHANNELS PER GROUP

  • The paper considers two methods for channel partitioning.

Equal partition: each group contains the same number of channels. For example, with group size g = 4 and a 64-channel input feature map, the channels per group are (16, 16, 16, 16).

Exponential partition: the i-th group gets about a 1/2^i portion of the total channels, with the last group taking the remaining channels so the total is preserved. For example, with g = 4 and a 64-channel input feature map, the channels per group are (32, 16, 8, 8).
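Both schemes are easy to sketch in Python; the helper names below are mine, and the remainder handling is an assumption made to keep the channel total intact:

```python
def equal_split(channels, g):
    # Every group gets channels // g; any remainder goes to the last group.
    split = [channels // g] * g
    split[-1] += channels - sum(split)
    return split

def exponential_split(channels, g):
    # Group i (1-indexed) gets ~1/2^i of the channels; the last group
    # takes whatever is left.
    split = [channels // (2 ** i) for i in range(1, g)]
    split.append(channels - sum(split))
    return split

print(equal_split(64, 4))        # [16, 16, 16, 16]
print(exponential_split(64, 4))  # [32, 16, 8, 8]
```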

You might be wondering about the feasibility and parameter count of the MixConv operation.

Through a series of studies on the MobileNet architecture family, the authors showed that dropping the MixConv design into existing architectures outperforms the baseline mobile architectures on COCO detection.

Table. 1 Comparison between MixConv and the baseline architecture in MobileNets.

MixNet — NAS on the MixConv-Based MobileNet Architecture Family

  • Neural Architecture Search (NAS) has helped produce some of the best architectural designs of Convolutional Neural Networks.
  • The authors ran NAS on a MixConv-integrated MobileNet architecture to develop a new SOTA family of mobile architectures: MixNets.
  • The search included different choices of group size, kernel size per group, and channel allocation.

The search space started from g = 1, k = 3x3 and went up to g = 5, k ∈ {3x3, 5x5, 7x7, 9x9, 11x11}. They used the equal channel partition scheme to allocate channels per group.

  • NAS produced three architecture designs that outperformed every existing mobile architecture design, named MixNet-S, MixNet-M, and MixNet-L.
  • MixNet-L is obtained from MixNet-M by applying a depth multiplier of m = 1.3.
Fig. 5 Architecture Design of MixNet-S(Top) and MixNet-M(Bottom)

Both architectures use a variety of MixConv layers with different kernel sizes throughout the network: small kernels are more common in the early stages to save computational cost, while large kernels are more common in the later stages for better accuracy. We also observe that the bigger MixNet-M tends to use more large kernels and more layers in pursuit of higher accuracy, at the cost of more parameters and FLOPS[1].

Fig. 6 Comparison of MixNet with different architectures on ImageNet Dataset.

Conclusion

The authors studied the effect of different kernel sizes for depthwise convolution in vanilla mobile architectures such as MobileNets, ShuffleNets, etc. Based on this, they proposed MixConv, which mixes multiple kernel sizes in a single operation to take advantage of each.

They also proposed a new family of mobile architectures, MixNets, designed using neural architecture search techniques. These models achieve significantly better accuracy on ImageNet classification and on transfer learning datasets.

Official TensorFlow Code of MixNets: [github]

Pytorch Implementation of MixConv: [github]

I hope you found the content meaningful. If you want to stay updated with the latest research in AI, please follow us at VisionWizard
