Assembled CNN: Trick-o-Tweaks

An amalgamation of Techniques in a Convolutional Neural Network for Performance Improvement

Mahima Modi
VisionWizard
Jul 25, 2020


Ever wondered what happens when all the best techniques and tricks are bagged into one single network? Will it improve the network or make it worse? Let's find out in this article.

In this article, we are going to discuss one such paper, Compounding the Performance Improvements of Assembled Techniques in a Convolutional Neural Network[1].

The neural network architecture proposed by the paper folds several such techniques into one network to improve classification performance at various levels.

Table of Contents

  1. Introduction
  2. Assembling CNN
  3. Experiment Results
  4. Conclusion
  5. References

Introduction

  • Most papers focus on a single area of improvement: either network-architecture tweaks or regularization techniques. This paper discusses both kinds of CNN techniques: network tweaks and regularization.
  • Network tweaks include the network architecture (ResNet-D), channel-attention modules (SE and SK), a down-sampling module (anti-aliasing), and a skip-connection module (Big-Little Network).
  • Regularization tricks such as AutoAugment, label smoothing, DropBlock, Mixup, and knowledge distillation are also discussed in this blog.
Topics we’ll dig into in this article.

Assembling CNN

In this section, we will briefly discuss the tweaks applied to the vanilla ResNet architecture and its blocks.

Source[Link]

The above figure gives an overview of what the architecture contains. We'll dissect each of these pieces and briefly explain them in this section.

ResNet-D

Downsampling block architecture in the proposed classification network is inspired by ResNet-D[2] architecture.

Left: the vanilla ResNet-50 downsampling block; Right: the ResNet-D downsampling block modification; Source[Link]

It is the vanilla ResNet structure (shown in the above-left figure) with three changes added to the model (a code sketch follows the list):

  1. The stride sizes of the first two convolutions in the residual path are switched (the stride-2 moves from the 1x1 to the 3x3 convolution).
  2. An average pooling layer with a 2x2 filter and a stride of 2 is added before the convolution in the skip-connection path.
  3. The 7x7 kernel in the stem layer is replaced with 3x3 kernels.
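To make these three changes concrete, here is a minimal PyTorch sketch of a ResNet-D style downsampling bottleneck block. This is my own illustrative code, not the authors' implementation; the channel counts, module names, and the use of three stacked 3x3 convolutions in the stem are assumptions based on the common ResNet-D formulation.

```python
import torch
import torch.nn as nn

class ResNetDDownsample(nn.Module):
    """Illustrative ResNet-D bottleneck block for downsampling (stride 2)."""
    def __init__(self, in_ch, mid_ch, out_ch):
        super().__init__()
        # Change 1: the stride 2 moves from the first 1x1 conv to the 3x3 conv.
        self.residual = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, stride=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, stride=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # Change 2: 2x2 average pooling (stride 2) before the 1x1 conv on the skip path.
        self.skip = nn.Sequential(
            nn.AvgPool2d(kernel_size=2, stride=2),
            nn.Conv2d(in_ch, out_ch, 1, stride=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.residual(x) + self.skip(x))

# Change 3 (sketch): the 7x7 stem conv is replaced by stacked 3x3 convs.
stem = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1, bias=False), nn.BatchNorm2d(32), nn.ReLU(inplace=True),
    nn.Conv2d(32, 32, 3, stride=1, padding=1, bias=False), nn.BatchNorm2d(32), nn.ReLU(inplace=True),
    nn.Conv2d(32, 64, 3, stride=1, padding=1, bias=False), nn.BatchNorm2d(64), nn.ReLU(inplace=True),
)

x = torch.randn(1, 256, 56, 56)
print(ResNetDDownsample(256, 128, 512)(x).shape)  # torch.Size([1, 512, 28, 28])
```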

Squeeze and Excitation: Channel Attention

This technique is used to increase sensitivity toward informative content and suppress noise (less useful information).

Squeeze and excitation Source[Link]

The method is performed in two-stages:

(i) The Squeeze step addresses the problem that each filter operates on a local receptive field, so its output feature map lacks global contextual information. This is mitigated by using global average pooling to generate channel-wise statistics.

(ii) The Excitation step uses the information captured in the squeeze step to fully capture channel-wise dependencies. It is achieved by F_ex(.), where F_ex in simple terms is FC → ReLU → FC → Sigmoid (FC: fully connected layer).

Source[Link]

The above figure shows a simple integration of the SE module into vanilla ResNet. Global pooling performs the Squeeze step, while the following FC-to-Sigmoid chain performs the Excitation step. A minimal code sketch is given below.
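To make this concrete, here is a minimal PyTorch sketch of an SE module; it is illustrative only, and the reduction ratio r = 16 is the common default rather than necessarily the setting used in [1].

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: global pooling (squeeze) + FC-ReLU-FC-Sigmoid (excitation)."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)          # channel-wise statistics
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.squeeze(x).view(b, c)                  # squeeze: (B, C)
        w = self.excite(w).view(b, c, 1, 1)             # excitation: per-channel gates in (0, 1)
        return x * w                                    # rescale the channels of x

x = torch.randn(2, 64, 56, 56)
print(SEBlock(64)(x).shape)  # torch.Size([2, 64, 56, 56])
```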

Selective Kernel: Channel Attention

A building block called the Selective Kernel (SK) unit is designed, in which multiple branches with different kernel sizes are fused using softmax attention guided by the information in these branches.

Source[Link]

Selective Kernel (SK) convolution contains three operators: Split, Fuse, and Select, as shown in the figure above.

→ Split: The input feature map is split into two branches, and a transformation is applied to each with a different kernel size (5x5 and 3x3).

The paper[1] instead uses two 3x3 convolutions to split the given feature map (as shown in the image below) because this reduces inference cost.

→ Fuse: The basic idea is to use gates to control how the information from the multiple branches, which carry different scales of information, flows into the neurons of the next layer.

As can be observed from the figure above, Fuse contains a chain of functions: F_gp (global average pooling) → F_fc (FC → Batch Normalization → ReLU).

The FC layer reduces the channel dimension C to d = max(C/r, L), where:
d: number of output channels of the FC layer
r: reduction ratio
C: number of input channels
L: minimal value of d

→ Select: Soft attention across channels is used to adaptively select different spatial scales of information, guided by the compact feature descriptor. This is performed by a simple softmax across the branches. A rough code sketch of the three steps is given after the figure below.

Modified Selective kernel; Source[Link]
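Below is a rough PyTorch sketch of the Split/Fuse/Select steps with the two 3x3 branches used in [1]. It is a simplification under my own assumptions (for example, the values r = 2 and L = 32 and the exact placement of BN/ReLU), not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SKUnit(nn.Module):
    """Selective Kernel unit: Split -> Fuse -> Select with two branches."""
    def __init__(self, channels, r=2, L=32):
        super().__init__()
        d = max(channels // r, L)                        # d = max(C/r, L)
        # Split: two branches; both use 3x3 kernels as in [1].
        self.branch1 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.branch2 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        # Fuse: global average pooling followed by FC -> BN -> ReLU.
        self.fc = nn.Sequential(
            nn.Linear(channels, d), nn.BatchNorm1d(d), nn.ReLU(inplace=True))
        # Select: per-branch attention logits, softmax across the branches.
        self.attn = nn.Linear(d, channels * 2)

    def forward(self, x):
        u1, u2 = self.branch1(x), self.branch2(x)        # Split
        u = u1 + u2                                      # Fuse: sum of branches
        s = u.mean(dim=(2, 3))                           # global average pooling -> (B, C)
        z = self.fc(s)                                   # compact descriptor -> (B, d)
        a = F.softmax(self.attn(z).view(-1, 2, u.size(1)), dim=1)  # Select: (B, 2, C)
        out = u1 * a[:, 0].unsqueeze(-1).unsqueeze(-1) + \
              u2 * a[:, 1].unsqueeze(-1).unsqueeze(-1)
        return out

print(SKUnit(64)(torch.randn(2, 64, 32, 32)).shape)  # torch.Size([2, 64, 32, 32])
```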

Anti-Aliasing

Downsampling strategies are among the most important choices in a CNN, as they decide which information is kept and which is discarded. Anti-aliasing is a recent downsampling strategy.

Shift invariance means the model's output is robust to small shifts of the input image; standard CNNs are surprisingly vulnerable to such shifts, and the root cause is pooling (strided downsampling). Hence, anti-aliasing is used to improve the shift-equivariance of deep networks.

For max pooling, anti-aliasing evaluates the max densely (stride 1) and then blurs and subsamples the result, as shown in the image below.

Source[Link]

The same modification applies to two more downsampling types, average pooling and strided convolution (a minimal BlurPool sketch follows this list):

  • The analogous modification applies conceptually to any strided layer: replace Conv(k, s) → ReLU with Conv(k, 1) → ReLU → BlurPool(m, s), where k: kernel size, s: stride, m: blur kernel size.
  • Blurred downsampling with a box filter is the same as average pooling; simply replace AvgPool(k, s) with BlurPool(m, s).
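As mentioned above, here is a minimal BlurPool sketch assuming a fixed 3x3 binomial blur kernel; it is illustrative only, and the official antialiased-cnns code offers several filter sizes plus the strided-conv and average-pool variants.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlurPool(nn.Module):
    """Blur with a fixed 3x3 low-pass filter, then subsample with the given stride."""
    def __init__(self, channels, stride=2):
        super().__init__()
        k = torch.tensor([1., 2., 1.])
        k = torch.outer(k, k)                                # 3x3 binomial kernel
        k = (k / k.sum()).repeat(channels, 1, 1, 1)          # one filter per channel
        self.register_buffer("kernel", k)
        self.stride, self.channels = stride, channels

    def forward(self, x):
        # Depthwise convolution with the fixed blur kernel, then stride-s subsampling.
        return F.conv2d(x, self.kernel, stride=self.stride, padding=1, groups=self.channels)

# Anti-aliased max pooling: dense max (stride 1) followed by blurred subsampling.
def aa_maxpool(channels):
    return nn.Sequential(nn.MaxPool2d(kernel_size=2, stride=1), BlurPool(channels, stride=2))

x = torch.randn(1, 64, 56, 56)
print(aa_maxpool(64)(x).shape)  # torch.Size([1, 64, 28, 28])
```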

The figure below visualizes the effect of a shift attack: the left bar graph shows how a classic CNN suffers, while the right one shows how an anti-aliased CNN resists such an attack.

Source[Link]

Big Little Network

Big-Little Net (bL-Net) is an architecture for efficient multi-scale feature representations.
The bL-Net stacks several Big-Little modules. A bL-module includes K branches (K = 2 in the figure below), where the kth branch represents an image scale of 1/2^k. 'M' here denotes a merging operation.

Source[Link]

The module includes two branches, as seen in the figure above. The Big-Branch has the same structure as the baseline model, while the Little-Branch reduces the number of convolutional layers and feature maps by α and β, respectively.

Larger values of α and β lead to lower computational complexity in Big-Little Net.

The paper[1] uses α = 2 and β = 4 for ResNet-50, and α = 1 and β = 2 for ResNet-152.
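A heavily simplified two-branch bL-module sketch is given below, purely to illustrate the idea of a low-resolution/full-width branch merged with a full-resolution/reduced branch. The real bL-Net builds each branch from residual blocks, and the exact roles of α and β follow the definitions in [6]; the single width_reduction factor here is my own simplification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BLModule(nn.Module):
    """Toy Big-Little module: a big branch at 1/2 resolution with full width and
    a little branch at full resolution with reduced width, merged by upsample + add."""
    def __init__(self, channels, width_reduction=2):
        super().__init__()
        self.big = nn.Sequential(                          # full width, half resolution
            nn.Conv2d(channels, channels, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        little_ch = channels // width_reduction            # reduced width, full resolution
        self.little = nn.Sequential(
            nn.Conv2d(channels, little_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(little_ch), nn.ReLU(inplace=True),
            nn.Conv2d(little_ch, channels, 1, bias=False),  # back to full width for merging
            nn.BatchNorm2d(channels))

    def forward(self, x):
        # 'M' merge step: upsample the big branch back to full resolution and add.
        big = F.interpolate(self.big(x), size=x.shape[-2:], mode="bilinear", align_corners=False)
        return F.relu(big + self.little(x))

print(BLModule(64)(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```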

Regularization

AutoAugment

AutoAugment uses reinforcement learning to select the sequence of image augmentation operations that gives the best accuracy, searching a discrete space of operations together with their probabilities of application and magnitudes.

Auto augmentation ImageNet Results; Source[Link]
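If you just want to apply the learned ImageNet policy rather than re-run the search, torchvision ships it; the snippet below assumes torchvision >= 0.11.

```python
from torchvision import transforms

# Standard ImageNet-style training pipeline with the learned AutoAugment policy.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.AutoAugment(policy=transforms.AutoAugmentPolicy.IMAGENET),
    transforms.ToTensor(),
])
```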

Label Smoothing

A CNN for classification is trained to minimize cross-entropy against one-hot targets, which pushes the logits of the last fully connected layer toward infinity and leads to over-fitting.

Label smoothing suppresses these extreme logits and prevents over-fitting. The paper sets the label smoothing factor to 0.1.
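A minimal sketch of smoothed cross-entropy with factor 0.1 is shown below; in recent PyTorch versions the built-in nn.CrossEntropyLoss(label_smoothing=0.1) does the same thing.

```python
import torch
import torch.nn.functional as F

def smoothed_cross_entropy(logits, targets, eps=0.1):
    """Cross-entropy against soft targets: (1 - eps) on the true class
    plus eps spread uniformly over all classes."""
    log_probs = F.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(dim=-1, index=targets.unsqueeze(1)).squeeze(1)
    uniform = -log_probs.mean(dim=-1)                 # equals (1/K) * sum_k -log p_k
    return ((1.0 - eps) * nll + eps * uniform).mean()

logits = torch.randn(8, 1000)
targets = torch.randint(0, 1000, (8,))
print(smoothed_cross_entropy(logits, targets))
```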

Mixup

Mixup is a data augmentation method where two samples are interpolated to create one sample; it helps the neural network generalize. Mixup comes in two types:

  1. Using two mini-batches, create a mixed mini-batch.
  2. Using a single mini-batch, create the mixed mini-batch by mixing it with a shuffled clone of itself.
Source[Link]

Experiments were performed using both types of mixup, and type 1 is clearly the winner. A sketch of both variants is given below.
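The following sketch of both mixup variants is illustrative only; λ is sampled from a Beta(α, α) distribution, and the mixed loss is the λ-weighted sum of the losses against the two label sets.

```python
import torch

def mixup_two_batches(x1, y1, x2, y2, alpha=0.2):
    """Type 1: mix two independent mini-batches into one."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    x = lam * x1 + (1.0 - lam) * x2
    return x, y1, y2, lam   # loss = lam * CE(pred, y1) + (1 - lam) * CE(pred, y2)

def mixup_single_batch(x, y, alpha=0.2):
    """Type 2: mix a mini-batch with a shuffled clone of itself."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    idx = torch.randperm(x.size(0))
    return lam * x + (1.0 - lam) * x[idx], y, y[idx], lam
```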

DropBlock

DropBlock is a regularization technique that injects noise by removing information: it discards contiguous blocks of correlated activations in a feature map.

Source[Link]

The paper applies DropBlock to stages 3 and 4 of ResNet-50 and linearly decays the keep_prob hyperparameter from 1.0 to 0.9 during training.
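A simplified DropBlock sketch follows; it is illustrative only, since the official version also corrects the seed probability γ for the valid seed region and feature-map size.

```python
import torch
import torch.nn.functional as F

def drop_block(x, keep_prob=0.9, block_size=7):
    """Drop contiguous block_size x block_size regions of feature map x (training only)."""
    if keep_prob >= 1.0:
        return x
    gamma = (1.0 - keep_prob) / (block_size ** 2)        # simplified seed probability
    seeds = (torch.rand_like(x) < gamma).float()         # centers of the blocks to drop
    # Expand each seed into a block_size x block_size dropped region via max pooling.
    drop_mask = F.max_pool2d(seeds, kernel_size=block_size, stride=1, padding=block_size // 2)
    keep_mask = 1.0 - drop_mask
    return x * keep_mask * (keep_mask.numel() / keep_mask.sum())  # rescale kept activations

x = torch.randn(2, 256, 14, 14)
print(drop_block(x).shape)  # torch.Size([2, 256, 14, 14])
```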

Knowledge Distillation (KD)

This method uses a teacher network to train a student network. The teacher network has high accuracy, so it can transfer its knowledge to the weaker student network.

Source[Link]

By increasing the temperature T we soften the probability output. Hence, [10] suggests keeping T = 2 or 3.

However, the proposed model[1] uses T = 1 because the teacher's signal is already smoothed by Mixup. The paper uses EfficientNet as the teacher for knowledge distillation.
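A sketch of the distillation loss with temperature T is shown below (illustrative; with T = 1 as in [1], the soft-target term is just the KL divergence between the student's and the teacher's softmax outputs, and the weighting factor alpha is an assumption).

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, targets, T=1.0, alpha=0.5):
    """Weighted sum of the hard-label loss and the soft-target (teacher) loss."""
    hard = F.cross_entropy(student_logits, targets)
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)     # T^2 keeps the gradient scale comparable
    return alpha * hard + (1.0 - alpha) * soft

student = torch.randn(4, 1000)
teacher = torch.randn(4, 1000)
targets = torch.randint(0, 1000, (4,))
print(kd_loss(student, teacher, targets, T=1.0))
```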

Experiment Results

Channel Attention

The table below shows the results for different configurations of channel attention.

Source[Link]
  • Compared with SK, SE has higher throughput but lower accuracy.
  • Between C3 and C2, the top-1 accuracy differs by only 0.08% (78.00% vs. 77.92%), but the throughput differs significantly (326 vs. 382).
  • Comparing C3 and C4, changing the reduction ratio r of the SK units from 2 to 16 yields a large degradation in top-1 accuracy relative to the improvement in throughput.
  • Applying both SE and SK (C5) not only decreases accuracy by 0.42% (from 77.92% to 77.50%) but also decreases inference throughput by 37.

Anti-Aliasing

The table below shows how AA affects different combinations of max-pooling, projection-conv, and strided-conv.

Source[Link]
  • Reducing the filter size from 5 to 3 maintains top-1 accuracy while increasing inference throughput.
  • A3 shows that not applying AA to the projection-conv does not decrease accuracy but increases throughput significantly.
  • Based on these results, the paper applies AA only to the strided-conv in the final model.

Regularization

Source[Link]
  • The results above show that regularization improves top-1 accuracy tremendously, from 77.56% to 81.40%.
  • Going one step further, KD with T = 1 adds another 0.29% to the top-1 accuracy.

Ablation study

The ablation study combines the tricks and techniques discussed above to get the most out of the model.

Source[Link]
  • With regularization, SK nearly doubles the accuracy improvement (E5 and E6).
  • BL shows a performance improvement not only in top-1 accuracy but also in mCE and mFR, without any loss of inference throughput (E9).
  • Increasing the number of epochs from 270 to 600 improves performance (E8). Because data augmentation and regularization are stacked, the overall regularization effect is stronger, so longer training seems to yield better generalization performance.

The final model (E11) is R50D + BL + SK + AA, with the regularization techniques LS + Mixup + DropBlock + AutoAugment.

Conclusion

  • Network tweaking targets both the big picture of the architecture and its smaller blocks: for example, ResNet-D and bL-Net shape the overall architecture, while anti-aliased downsampling blocks and selective kernels work at the block and filter level.
  • Among the channel-attention modules, SK is clearly better accuracy-wise; to increase throughput, r is changed from 2 to 16 at some cost in accuracy.
  • Applying the anti-aliasing downsampling technique to both max-pooling and strided-conv improves accuracy but reduces throughput. Hence, as a trade-off, AA is used in the strided-conv only to keep throughput up.
  • The Big-Little network adds accuracy without compromising throughput. A star performer!
  • The regularization techniques made the model generalize better!

References

[1] Assembled CNN: [Paper], [GitHub]

[2] ResNet-D: [Paper], [GitHub]

[3] Squeeze and Excitation: [Paper]

[4] Selective Kernel: [Paper]

[5] Anti-Aliasing: [Paper], [GitHub]

[6] Big Little Network: [Paper], [GitHub]

[7] Auto-Augmentation: [Paper], [GitHub]

[8] MixUp: [Paper], [GitHub]

[9] DropBlock: [Paper], [GitHub]

[10] Knowledge Distillation: [Paper], [GitHub]
