EfficientNetV2: Smaller Models and Faster Training

Sieun Park · Published in CodeX · Sep 16, 2021 · 5 min read

There have been many experiments in the search for the most efficient CNN architecture. An efficient architecture must strike a balance between competing objectives such as accuracy, parameter count, FLOPs, and inference time, and different papers prioritize different objectives according to their needs. For example, DenseNet and EfficientNet try to improve accuracy with fewer parameters. RegNet, ResNeSt, and MNasNet optimize inference speed. NFNets focus on improving training speed.

Training efficiency + parameter efficiency

This brand-new paper (June 2021) provides great insights on how to train CNNs both parameter-efficiently and with improved training speed. It focuses on discussing and improving the following ingredients of CNN training:

  • Neural architecture search (NAS): Use of random search/reinforcement learning to make optimal model design choices and find hyperparameters.
  • Scaling strategies: Guidelines on how to upscale small networks into bigger ones effectively, e.g. compound scaling rule of EfficientNet.
  • Training strategies: e.g. new regularization methods, guidelines for training efficiency.
  • Progressive learning: Accelerating training by progressively increasing the image size.
  • Various types of convolutions and building blocks: e.g. depthwise conv, depthwise-separable conv, squeeze-and-excitation (SE), MBConv, Fused-MBConv.

EfficientNetV2 discusses inefficiencies of the EfficientNet pipeline and refines each component with a new strategy. In this post, we will discuss the problems of EfficientNet and the solutions proposed in this paper.

What is wrong with EfficientNet (a.k.a EfficientNetV1)?

The EfficientNet pipeline uses NAS to search for a base network, EfficientNet-B0, that optimizes both FLOPs (a proxy for speed) and accuracy. A compound scaling rule is then applied to scale up this network by increasing depth (# layers), width (# channels), and image size together, producing EfficientNet B1-B7. The paper discusses three bottlenecks of EfficientNetV1.

Training with very large image sizes is slow: EfficientNet aggressively scales up the input resolution (e.g. B7 takes 600×600 inputs). This causes significant memory bottlenecks and forces smaller batch sizes, which in turn slows down training.
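As a rough back-of-the-envelope illustration (not from the paper), activation memory grows roughly with image area, so the batch size that fits in a fixed memory budget shrinks quadratically with the input side length. The base size and batch size below are placeholders.

```python
def relative_batch_size(image_size: int, base_size: int = 380, base_batch: int = 128) -> int:
    """Batch size that fits in the same memory budget if per-image
    activation memory scales roughly with image area (H * W)."""
    return max(1, int(base_batch * (base_size / image_size) ** 2))

for size in (380, 480, 600):
    print(size, relative_batch_size(size))   # 380 -> 128, 480 -> 80, 600 -> 51
```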

Depthwise convolutions are slow in early layers but effective in later stages: EfficientNetV2 considers two building blocks, the MBConv module, which uses a depthwise convolution, and the Fused-MBConv module, which does not. Depthwise convolution has significantly fewer parameters, but it is often slower in practice because it cannot fully utilize modern accelerators.

Since the parameter savings in early layers are not very large (the number of channels there is relatively small), an appropriate mix of regular MBConv and Fused-MBConv blocks could be optimal (Table 3 of the paper).
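For reference, here is a minimal PyTorch sketch of the two building blocks; squeeze-and-excitation, strides, stochastic depth, and other details of the real EfficientNetV2 blocks are omitted for brevity.

```python
import torch.nn as nn

class MBConv(nn.Module):
    """MBConv: 1x1 expansion -> depthwise 3x3 -> 1x1 projection (SE omitted)."""
    def __init__(self, in_ch, out_ch, expand=4):
        super().__init__()
        mid = in_ch * expand
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.SiLU(),
            nn.Conv2d(mid, mid, 3, padding=1, groups=mid, bias=False),  # depthwise conv
            nn.BatchNorm2d(mid), nn.SiLU(),
            nn.Conv2d(mid, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return out + x if x.shape == out.shape else out  # residual when shapes match

class FusedMBConv(nn.Module):
    """Fused-MBConv: expansion and depthwise convs fused into one regular 3x3 conv."""
    def __init__(self, in_ch, out_ch, expand=4):
        super().__init__()
        mid = in_ch * expand
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid, 3, padding=1, bias=False), nn.BatchNorm2d(mid), nn.SiLU(),
            nn.Conv2d(mid, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return out + x if x.shape == out.shape else out
```

The only structural difference is that Fused-MBConv merges the 1×1 expansion and the depthwise 3×3 into a single regular 3×3 convolution, which has more parameters but maps better to accelerator hardware.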

Equally scaling up every stage is sub-optimal: Naively scaling all stages of the network with the same compound scaling rule is not the best idea. Intuitively, each stage does not contribute equally to training speed and parameter efficiency.

Improved NAS and Scaling

As a solution to the previously discussed issues of EfficientNetV1, the paper proposes an improved NAS algorithm and scaling strategy.

NAS

The newly defined search space is a simplified one consisting of:

  • convolutional operation types: {MBConv, Fused-MBConv}
  • number of layers
  • kernel size: {3×3, 5×5}
  • expansion ratio (inside MBConv): {1, 4, 6}

As we can see, the search space is quite narrow because most unnecessary search options of EfficientNetV1 have been removed. Also, the optimal channel sizes are adopted from those already searched in EfficientNetV1.
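As a concrete (illustrative) picture, the per-stage search space can be thought of as a small set of categorical choices. The layer-count range below is a placeholder, not the paper's exact list.

```python
import random

# Illustrative encoding of the simplified per-stage search space.
SEARCH_SPACE = {
    "conv_type":       ["MBConv", "Fused-MBConv"],
    "num_layers":      [1, 2, 3, 4],          # placeholder range
    "kernel_size":     [3, 5],
    "expansion_ratio": [1, 4, 6],
}

def sample_stage(rng=random):
    """Randomly sample one stage configuration from the search space."""
    return {name: rng.choice(choices) for name, choices in SEARCH_SPACE.items()}

print(sample_stage())
```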

The new search reward combines the model accuracy A, the number of training steps S, and the number of parameters P using a simple weighted product A × S^w × P^v, where w = -0.07 and v = -0.05 are determined empirically.
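In code, the reward is a one-liner; here `accuracy`, `steps`, and `params` stand for whatever measurements the search controller feeds in, and the exponents are the values quoted above.

```python
def nas_reward(accuracy: float, steps: float, params: float,
               w: float = -0.07, v: float = -0.05) -> float:
    """Weighted-product reward A * S^w * P^v; the negative exponents
    penalize more training steps and more parameters."""
    return accuracy * (steps ** w) * (params ** v)
```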

Result

The baseline architecture EffNetV2-S searched with the modified NAS algorithm (listed as a table in the paper) shows several consistent patterns. We observe that:

  • Early stages find Fused-MBConv more efficient, while later stages prefer the original MBConv block.
  • Early layers prefer a smaller expansion ratio for MBConv.
  • Every stage prefers 3×3 kernels (compensated with more layers) over the 5×5 kernels that are partly used in EfficientNetV1.
  • The final stage (stage 7) of EfficientNetV1 is completely removed, perhaps due to its large parameter size.

Scaling strategy

The paper makes two direct modifications to the scaling strategy of EfficientNetV1.

  1. Restrict the maximum inference image size to 480.
  2. Add more layers to later stages to increase the network capacity without adding much runtime overhead.

These two slight modifications bring a significant improvement to the performance trade-off of EfficientNetV1: in the paper's figure, the black Pareto curve is improved to the grey curve just by modifying the scaling strategy. The larger counterparts EffNetV2-M and EffNetV2-L are defined using this modified scaling strategy.
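A hypothetical sketch of such a capped scaling rule, reusing the compound-scaling coefficients from EfficientNetV1 (α = 1.2 for depth, β = 1.1 for width, γ = 1.15 for resolution) purely as an assumption; the paper does not express V2's scaling in exactly this form, and the extra layers added to later stages are not modeled here.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15   # depth / width / resolution coefficients from V1
MAX_IMAGE_SIZE = 480                  # the new cap on image size

def compound_scale(phi: float, base_size: int = 224):
    """Return (depth multiplier, width multiplier, image size) for scaling factor phi,
    with the image size capped at MAX_IMAGE_SIZE."""
    depth = ALPHA ** phi
    width = BETA ** phi
    size = min(round(base_size * GAMMA ** phi), MAX_IMAGE_SIZE)
    return depth, width, size

for phi in range(8):
    print(phi, compound_scale(phi))
```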

Progressive Learning

Progressive learning is a strategy to accelerate training by progressively increasing the image size, and thus the effective model capacity, during training. However, naive progressive resizing often causes a drop in final accuracy.

EfficientNetV2 improves progressive learning by increasing the magnitude of regularization together with the image size. Intuitively, training with a small image size means small effective capacity and thus needs only weak regularization, while training with a large image size needs stronger regularization to combat overfitting due to the increased capacity.

According to the algorithm, the image size S_i and the regularization magnitude φ_i are increased together as the stage index i advances. The regularization magnitude φ_i controls the dropout rate, the RandAugment magnitude, and the Mixup ratio.
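A minimal sketch of such a schedule, assuming a fixed number of training stages and linear interpolation between start and end values; the specific numbers below are placeholders, not the paper's settings.

```python
def progressive_schedule(stage: int, num_stages: int = 4,
                         min_size: int = 128, max_size: int = 300,
                         min_dropout: float = 0.1, max_dropout: float = 0.3,
                         min_randaug: float = 5.0, max_randaug: float = 15.0,
                         min_mixup: float = 0.0, max_mixup: float = 0.2):
    """Image size and regularization magnitudes for training stage i (0-indexed),
    increased together from weak to strong."""
    t = stage / max(1, num_stages - 1)          # 0.0 at the first stage, 1.0 at the last
    lerp = lambda lo, hi: lo + t * (hi - lo)
    return {
        "image_size": int(lerp(min_size, max_size)),
        "dropout": lerp(min_dropout, max_dropout),
        "randaug_magnitude": lerp(min_randaug, max_randaug),
        "mixup_alpha": lerp(min_mixup, max_mixup),
    }

for i in range(4):
    print(i, progressive_schedule(i))
```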

Summary

The paper identifies inefficiencies in the architecture and scaling strategies of the original EfficientNet. The modified NAS algorithm utilizes the prior knowledge of EfficientNetV1 and enables an adaptive search over efficient blocks and important hyperparameters. The searched architecture reveals consistent and significant patterns that provide insights into efficient CNN design and are worth keeping in mind when designing network architectures.

The compound scaling strategy is slightly modified for parameter and memory efficiency.

The modified progressive learning algorithm, coupled with the improved network architecture, boosts the training speed of EfficientNetV2 by up to 11× compared to the baseline V1 network while achieving similar accuracy with the same computing resources.
