Dynamic Convolution — An Exciting Innovation over Convolution Kernels from Microsoft Research

Presenting a new convolution design from Microsoft Research that increases the efficiency of light-weight convolutional neural networks

Yesha R Shastri
VisionWizard
Jul 29, 2020 · 6 min read



Achieving good performance while keeping the computational burden low has always been a trade-off for light-weight convolutional neural networks. To handle this trade-off better, a novel design called dynamic convolution is proposed in the paper ‘Dynamic Convolution: Attention over Convolution Kernels’.

Let’s explore this approach further in this article.

Table of Contents

1. Introduction

2. Dynamic Convolution Neural Networks

3. Key Insights for the Proposed Model

4. Compatibility with Existing CNN Architectures

5. Conclusion

6. References

1. Introduction

On the application side, efficient light-weight CNNs are needed, especially for mobile devices that must run various features and produce output in real time.

1.1 The motivation for Dynamic Convolution

  • Previous approaches use static convolution kernels (not adaptive to the input) and improve performance by scaling up the network architecture (more parameters, layers, channels, etc.).
  • In contrast, dynamic convolution uses dynamic convolution kernels (adaptive to the input, i.e. conv = f(x)) with a static network architecture (depth and width kept constant).
Figure 1: The proposed method of dynamic convolution [1] is a function of the input in contrast to static convolution. (Source: [2])
  • The constraint on computational cost plays a significant role in determining the accuracy of the network: shrinking the network depth and width to lower the computational cost directly hurts accuracy.

For instance, when the computational cost of MobileNetV3 is reduced from 219M to 66M Multi-Adds, the top-1 accuracy on ImageNet classification drops from 75.2% to 67.4% [1].

1.2 The Proposed Convolution Design

  • In static convolution there is a single convolution kernel per layer (W, b). The new operator design proposed by [1] instead uses K parallel convolution kernels (Wk, bk) that are aggregated dynamically for each input x.
Figure 2: Aggregated weight matrix based on k kernels. (Source: [1])
Figure 3: Aggregated bias vector based on k kernels. (Source: [1])
  • The aggregation is termed “dynamic” because the kernels are combined differently for different input images by applying input-dependent attention weights πk(x); the aggregation equations are transcribed below.
  • The higher representation power of dynamic convolution comes from the fact that aggregating the kernels with input-dependent attention is a non-linear function of the input.
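For reference, the aggregation shown in Figures 2 and 3 can be written out as follows (a transcription in the paper's notation, where πk(x) is the attention weight assigned to the k-th kernel):

$$\bar{W}(x) = \sum_{k=1}^{K} \pi_k(x)\,W_k, \qquad \bar{b}(x) = \sum_{k=1}^{K} \pi_k(x)\,b_k$$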

1.3 How is dynamic convolution effective?

  • Because the parallel convolution kernels are aggregated into a single kernel, they share the same output channels, so neither the network width nor the network depth is increased.
  • The computation of attention weights and aggregation of the kernels imposes some extra computational cost. However, this induced cost is negligible as compared to the convolution operation.

Therefore, the complexity and representation capacity of the model are boosted at only a modest cost in model size; a rough estimate of the overhead is worked out below.
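As a back-of-the-envelope sanity check of this claim, the snippet below counts Multi-Adds for one hypothetical layer. All dimensions (96 channels, 3×3 kernel, 28×28 feature map, K = 4 kernels, squeeze ratio 4) are illustrative assumptions, not numbers taken from the paper.

```python
# Rough Multi-Adds count for one layer: convolution vs. the extra
# attention and kernel-aggregation work added by dynamic convolution.
C_in = C_out = 96          # input/output channels (assumed)
k = 3                      # kernel size (assumed)
H = W = 28                 # spatial size of the feature map (assumed)
K = 4                      # number of parallel kernels (assumed)
reduction = 4              # squeeze ratio of the attention branch (assumed)

conv_madds = H * W * C_in * C_out * k * k                           # the convolution itself
attn_madds = C_in * (C_in // reduction) + (C_in // reduction) * K   # the two FC layers
agg_madds = K * C_out * C_in * k * k                                # aggregating K kernels

print(f"conv: {conv_madds/1e6:.1f}M, attention: {attn_madds/1e3:.1f}K, "
      f"aggregation: {agg_madds/1e3:.1f}K")
# conv: 65.0M, attention: 2.4K, aggregation: 331.8K  -> overhead well under 1%
```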

2. Dynamic Convolution Neural Networks

2.1 Defining a Dynamic Perceptron

A dynamic perceptron is formed by aggregating K linear functions into a single one (with aggregated weight matrix ‘W_bar’ and aggregated bias vector ‘b_bar’) and passing the result through a non-linear activation function ‘g’ such as ReLU.

For equations of ‘W_bar’ and ‘b_bar’ see figures (2) and (3).

Figure 4: Illustration of the formation of a dynamic perceptron (Source: [1])

The final equation of a dynamic perceptron is given as:

Figure 5: The dynamic perceptron (Source: [1])
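Written out (a transcription of the equation in Figure 5, together with the attention constraints discussed in Section 3), the dynamic perceptron is:

$$y = g\big(\bar{W}(x)^{\top} x + \bar{b}(x)\big), \qquad \text{s.t.}\;\; 0 \le \pi_k(x) \le 1,\;\; \sum_{k=1}^{K} \pi_k(x) = 1$$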

2.2 Dynamic Convolution

  • For computing the kernel attentions, the squeeze-and-excitation method is used.
  • First, the global average pooling layer squeezes the spatial information.
  • Next, to generate the normalized attention weights for the convolution kernels, the input is further passed through two fully connected (FC) layers with a ReLU after the first FC layer and softmax after the second.
  • After the attention weights are computed and the aggregated convolution is applied, the output is passed through a batch normalization layer followed by an activation function (ReLU).
  • This procedure builds one dynamic convolution layer; a code sketch follows the figure.
Figure 6: A dynamic convolution layer (Source: [1])
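To make the layer concrete, here is a minimal PyTorch sketch of a dynamic convolution layer following the description above. It is an illustrative re-implementation rather than the authors' code: the class name DynamicConv2d, the initialization, the squeeze ratio, and the grouped-convolution trick used to apply a different aggregated kernel to each sample are all choices made for this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    """Illustrative dynamic convolution: attention over K parallel kernels."""

    def __init__(self, in_ch, out_ch, kernel_size, K=4, reduction=4,
                 stride=1, padding=0, temperature=30.0):
        super().__init__()
        self.K, self.stride, self.padding = K, stride, padding
        self.temperature = temperature                      # softmax temperature tau
        # K parallel convolution kernels (W_k, b_k)
        self.weight = nn.Parameter(
            torch.randn(K, out_ch, in_ch, kernel_size, kernel_size) * 0.01)
        self.bias = nn.Parameter(torch.zeros(K, out_ch))
        # Attention branch: global average pool -> FC -> ReLU -> FC -> softmax
        hidden = max(in_ch // reduction, 4)
        self.fc1 = nn.Linear(in_ch, hidden)
        self.fc2 = nn.Linear(hidden, K)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        B, C, H, W = x.shape
        # pi_k(x): input-dependent attention over the K kernels (sums to 1)
        z = F.adaptive_avg_pool2d(x, 1).flatten(1)          # squeeze: (B, C)
        z = self.fc2(F.relu(self.fc1(z)))                   # (B, K)
        attn = F.softmax(z / self.temperature, dim=1)
        # Aggregate: W_bar = sum_k pi_k(x) W_k, b_bar = sum_k pi_k(x) b_k
        w = torch.einsum('bk,koiuv->boiuv', attn, self.weight)   # (B, out, in, kh, kw)
        b = torch.einsum('bk,ko->bo', attn, self.bias)           # (B, out)
        # Grouped-conv trick: a different aggregated kernel per sample in the batch
        out = F.conv2d(x.reshape(1, B * C, H, W),
                       w.reshape(-1, C, *w.shape[-2:]),
                       b.reshape(-1),
                       stride=self.stride, padding=self.padding, groups=B)
        out = out.reshape(B, -1, out.shape[-2], out.shape[-1])
        return F.relu(self.bn(out))                         # BN + ReLU, as in Figure 6

# Example usage
layer = DynamicConv2d(32, 64, kernel_size=3, padding=1)
y = layer(torch.randn(8, 32, 56, 56))                       # -> shape (8, 64, 56, 56)
```

The grouped convolution is just one convenient way to batch per-sample kernels; aggregating the weights in a plain Python loop over the batch would give the same result, only more slowly.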

3. Key Insights for the Proposed Model

Training a deep dynamic convolutional neural network is a challenging task as it requires joint optimization of all the convolutional kernels and the attentions from several layers.

Challenges come from the following reasons:

  1. Learning the attention model is difficult because the output space of the aggregated kernel is very large.
Figure 7: Large output space for learning attention model (Source: [2])

2. Further, as the network gets deeper, learning multiple kernels across different layers becomes even more difficult.

Figure 8: Network gets deeper for multiple layers (Source: [2])

Hence, two key insights are proposed to ease training by making the joint optimization more efficient.

3.1 Constraining the Output Kernel Space

  • The first key insight is to constrain the output kernel space. The constraint 0 ≤ πk(x) ≤ 1 on the attention values compresses the output space of the aggregated kernel to two pyramids, as shown in the figure below.
Figure 9: Compressed kernel space to two pyramids (Source: [2])
  • In addition, the sum of the attention values is constrained to one. This further restricts the output space to a triangle (the blue shaded region in Figure 10); the red line shown below collapses to a point once the attention values are normalized to sum to one.
Figure 10: Illustration of the sum-to-one attention constraint (Source: [2])
  • The softmax activation function is used to enforce the sum-to-one constraint.
  • This method of normalization helps ease the learning of πk(x) when it is trained by jointly optimizing with the aggregated kernel in a deep network.

3.2 Near Uniform Attention in Early Training Epochs

  • With a standard softmax, the attention output is nearly one-hot, so the convolutional kernels are not learned uniformly and training convergence is slow.
  • Therefore, to make the attention near-uniform in the early training epochs, the temperature τ in the softmax is increased to flatten the attention [formula is given below].
Figure 11: Attention formula using softmax with temperature τ and output of the second FC layer in attention branch zk (Source: [1])
  • Increasing the temperature τ from 1 to 30 makes training more efficient: it converges much faster for τ=30 than for τ=1. A small numerical illustration of how the temperature flattens the attention follows the figure below.
Figure 12: Comparison of training convergence for τ=1 and τ=30 (Source: [1])
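To see the flattening effect numerically, here is a tiny demonstration of the attention formula from Figure 11 at two temperatures (illustrative only; the logits below are made up):

```python
import torch
import torch.nn.functional as F

z = torch.tensor([2.0, 0.5, -1.0, 0.2])   # hypothetical outputs of the second FC layer
for tau in (1.0, 30.0):
    pi = F.softmax(z / tau, dim=0)        # pi_k = exp(z_k / tau) / sum_j exp(z_j / tau)
    print(f"tau={tau:>4}:", pi.tolist())
# tau=1.0  -> approx [0.70, 0.16, 0.03, 0.11]  (nearly one-hot)
# tau=30.0 -> approx [0.26, 0.25, 0.24, 0.25]  (near-uniform)
```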

For detailed experimentation details and results, refer to the original paper [1].

4. Compatibility with Existing CNN Architectures

  • The proposed dynamic convolution by [1] can be used to replace any of the static convolution kernels like 1×1, 3×3, group, or depthwise convolution.
  • Additionally, the method can also be used in advanced architectures found by Neural Architecture Search (NAS). Dynamic convolution helps to improve the performance of both human-designed (for instance, MobileNetV2) and automatically searched (for example, MobileNetV3) network architectures; a sketch of such a drop-in replacement is shown below.
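As a sketch of what a drop-in replacement could look like (reusing the illustrative DynamicConv2d module from Section 2.2; the helper name dynamify is made up for this example):

```python
import torch.nn as nn

def dynamify(module: nn.Module, K: int = 4) -> None:
    """Recursively swap standard nn.Conv2d layers for DynamicConv2d (illustrative)."""
    for name, child in module.named_children():
        if isinstance(child, nn.Conv2d) and child.groups == 1:
            # Only plain convolutions are handled here; depthwise/group
            # convolutions would need `groups` support added to DynamicConv2d.
            setattr(module, name, DynamicConv2d(
                child.in_channels, child.out_channels, child.kernel_size[0],
                K=K, stride=child.stride[0], padding=child.padding[0]))
        else:
            dynamify(child, K)
```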

5. Conclusion

  • Dynamic convolution [1] is a novel operator design that increases model complexity without changing the network depth or width, by aggregating multiple kernels dynamically using input-dependent attention.
  • It is flexible enough to get placed into the existing CNN architectures and can further help to improve their accuracy.
  • By replacing the static convolution with dynamic convolution in MobileNetV2 and MobileNetV3, the method gains a top-1 accuracy improvement of 4.5% and 2.9% respectively on a 100M Multi-Adds budget with ~4% increase in the computational cost. [see figure below]
Figure 13: The trade-off between computational cost (MAdds) and top-1 accuracy of ImageNet classification (Source: [1])
  • Finally, this method achieves significant improvement in accuracy with fewer kernels per layer, smaller model size, and negligible extra computational cost.

6. References

[1] Chen, Yinpeng, et al. “Dynamic convolution: Attention over convolution kernels.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.

[2] https://www.youtube.com/watch?v=FNkY7I2R_zM
