Mysterious EfficientNets

Rethinking Model Scaling for Convolutional Neural Networks

Akshay Shah
Nov 25, 2022

1. Introduction

The research paper Rethinking Model Scaling for Convolutional Neural Networks introduces the EfficientNet family of architectures. The authors systematically study model scaling and show that carefully balancing network depth, width, and resolution leads to better results. The paper proposes a compound scaling method that uniformly scales all three dimensions, and I feel it fills a missing piece in how CNN architectures are designed.

2. Compound Model Scaling

The key idea, as the paper states, is to balance all three dimensions, which can be done by scaling each of them with a constant ratio. If we can increase the computational resources by 2^N, then we simply increase the depth by α^N, the width by β^N, and the resolution by γ^N, where α, β, and γ are found by a small grid search on the baseline network. The intuition is that a larger input image needs more layers to increase the receptive field and more channels to capture fine-grained patterns in the bigger image. The baseline model itself is created with neural architecture search: the authors constrain memory and FLOPS and maximize accuracy.
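
To see why 2^N extra resources line up with these exponents: the FLOPS of a convolutional network grow roughly linearly with depth but quadratically with width and resolution, and the paper constrains the coefficients so that α · β² · γ² ≈ 2. Sketching the argument:

```latex
% FLOPS scale roughly as depth * width^2 * resolution^2, so under compound scaling:
\mathrm{FLOPS}\!\left(\alpha^{N} d,\ \beta^{N} w,\ \gamma^{N} r\right)
  \;\propto\; \left(\alpha \cdot \beta^{2} \cdot \gamma^{2}\right)^{N} \cdot d\, w^{2} r^{2}
  \;\approx\; 2^{N} \cdot \mathrm{FLOPS}(d, w, r),
  \quad \text{since } \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2 .
```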

Compound Formulation

Several earlier works scaled along a single dimension: depth, width, or resolution. ResNets scaled depth, with variants ranging from 32 to 1000 layers. Deeper networks can capture richer and more complex features, but they are harder to train. Width scaling is commonly used for small models: wider networks capture more fine-grained features and are easier to train, and WideResNets improved performance this way. Scaling resolution can also capture more fine-grained features, and many recent implementations use resolutions like 299x299 or 331x331 instead of 224x224. Scaling up any single dimension improves accuracy, but the gains diminish for bigger models.

The paper then carries out a few experiments that keep depth and resolution constant while varying width, and the results show that depth, width, and resolution are correlated with each other, so balancing all three is the main crux. The proposed compound scaling method uses a compound coefficient φ to uniformly scale network width, depth, and resolution in a principled way:

Compound Scaling
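
For reference, the compound scaling rule from the paper (reconstructed here, so treat the exact notation as a paraphrase rather than a verbatim copy):

```latex
% Depth, width and resolution as functions of the compound coefficient phi
d = \alpha^{\phi}, \qquad w = \beta^{\phi}, \qquad r = \gamma^{\phi},
\qquad \text{s.t.}\ \ \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2,
\quad \alpha \ge 1,\ \beta \ge 1,\ \gamma \ge 1
```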

The architecture search that produces the baseline optimizes ACC(m) × [FLOPS(m)/T]^w, where ACC(m) and FLOPS(m) denote the accuracy and FLOPS of model m, T is the target FLOPS, and w = -0.07 is a hyperparameter controlling the trade-off between accuracy and FLOPS; the resulting baseline is EfficientNet-B0. Starting from B0, the authors fix φ = 1 and do a small grid search for α, β, and γ, which yields α = 1.2, β = 1.1, and γ = 1.15. With α, β, and γ then held fixed, the baseline is scaled up with different values of φ to obtain EfficientNet-B1 through B7.
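
To make the scaling concrete, here is a small illustrative sketch (not the official implementation) that applies the searched coefficients to a B0-style baseline for a few values of φ. The real B1–B7 configurations also round channels and layers and use hand-picked input resolutions, so the printed numbers are only approximate:

```python
# Illustrative compound scaling with the coefficients found by the grid search.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution coefficients

def scale(phi, base_depth=1.0, base_width=1.0, base_resolution=224):
    """Return (depth multiplier, width multiplier, input resolution) for a given phi."""
    depth = base_depth * ALPHA ** phi
    width = base_width * BETA ** phi
    resolution = round(base_resolution * GAMMA ** phi)
    return depth, width, resolution

for phi in range(1, 8):  # roughly EfficientNet-B1 .. B7
    d, w, r = scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution ~{r}")
```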

3. Training Methodology

The EfficientNet models were trained on ImageNet using the RMSProp optimizer with decay 0.9 and momentum 0.9, batch norm momentum 0.99, weight decay 1e-5, and an initial learning rate of 0.256 that decays by 0.97 every 2.4 epochs. The models also use the SiLU (Swish-1) activation, AutoAugment, and stochastic depth with survival probability 0.8; these techniques are discussed in detail in later sections. The dropout ratio is increased linearly from 0.2 for EfficientNet-B0 to 0.5 for B7. The resulting EfficientNets became state of the art and outperformed all other types of scaling.
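
A minimal PyTorch sketch of this recipe, assuming a stand-in model (AutoAugment and stochastic depth are omitted here since they live in the data pipeline and the model definition, not the optimizer):

```python
from torch import nn, optim

# Stand-in model; any EfficientNet implementation would slot in here. Note that
# PyTorch's BatchNorm momentum is "1 - decay", so a batch-norm momentum of 0.99
# in the paper's (TensorFlow) sense corresponds to momentum=0.01 in PyTorch.
model = nn.Sequential(nn.Conv2d(3, 32, 3), nn.BatchNorm2d(32, momentum=0.01), nn.SiLU())

# RMSProp with decay 0.9, momentum 0.9 and weight decay 1e-5, as described above.
optimizer = optim.RMSprop(
    model.parameters(), lr=0.256, alpha=0.9, momentum=0.9, weight_decay=1e-5
)

# Learning rate decays by 0.97 every 2.4 epochs; call scheduler.step() once per epoch.
scheduler = optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: 0.97 ** (epoch / 2.4)
)
```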

Results

4. SiLU (Swish with β = 1) Activation

The Swish activation was introduced in the paper Searching for Activation Functions by Prajit Ramachandran et al. That paper uses exhaustive search and reinforcement learning to find effective activation functions. For the larger search spaces, an RNN controller predicts a single component of the activation function at a time; the predicted component is fed back to the controller at the next timestep, and the process repeats until every component of the activation function has been predicted.

RNN Controller

Each candidate activation function is used to train a child network on an image classification task, and the validation accuracy serves as the reward. To speed up the process, a batch of candidate functions is added to a queue; workers pull functions from the queue, train the child network, and report validation accuracy. The search space is built from various unary and binary functions. Complicated activation functions consistently under-perform simpler ones, potentially due to increased difficulty in optimization, and the best-performing functions can be represented by one or two core units. A common structure shared by the top activation functions is the use of the raw pre-activation x as input to the final binary function. Periodic functions had only been briefly explored in prior work, so the discovered functions suggest a fruitful route for further research. Functions that use division tend to perform poorly because the output explodes when the denominator is near 0; division succeeds only when the denominator is bounded away from 0, such as cosh(x), or approaches 0 only when the numerator also approaches 0, producing an output of 1.

Functions Used
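
To make the core-unit idea concrete, here is a toy sketch using only a handful of the unary and binary primitives (my own small subset, chosen for illustration):

```python
import math

# A small subset of the search space's unary and binary primitives, for illustration only.
UNARY = {
    "identity": lambda x: x,
    "sigmoid": lambda x: 1.0 / (1.0 + math.exp(-x)),
    "tanh": math.tanh,
    "relu": lambda x: max(x, 0.0),
}
BINARY = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "max": max,
}

def core_unit(x, u1, u2, b):
    """One core unit b(u1(x), u2(x)); candidate activations stack one or two of these."""
    return BINARY[b](UNARY[u1](x), UNARY[u2](x))

# Swish with beta = 1 is recovered as mul(identity(x), sigmoid(x)).
print(core_unit(1.0, "identity", "sigmoid", "mul"))  # ~0.731
```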

They found that the Swish activation function, f(x) = x · σ(βx), consistently outperformed ReLU and other activation functions; here β is a trainable parameter. Like ReLU, Swish is unbounded above and bounded below. Unlike ReLU, Swish is smooth and non-monotonic; in fact, the non-monotonicity of Swish distinguishes it from most common activation functions. The success of Swish with β = 1 implies that the gradient-preserving property of ReLU (a derivative of 1 when x > 0) may no longer be a distinct advantage in modern architectures. The most striking difference between Swish and ReLU is the non-monotonic “bump” of Swish for x < 0. A large percentage of pre-activations fall inside the domain of the bump (−5 ≤ x ≤ 0), which indicates that the bump is an important aspect of Swish, and its shape can be controlled by changing β. While fixing β = 1 is effective in practice, the experiments show that training β can further improve performance on some models: the trained β values spread out between 0 and 1.5 with a peak at β ≈ 1, suggesting the model takes advantage of the additional flexibility of a trainable β.

Pre-activation values and β distribution
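
A small PyTorch sketch of Swish with a trainable β; with β fixed at 1 it matches torch.nn.SiLU:

```python
import torch
from torch import nn

class Swish(nn.Module):
    """Swish activation f(x) = x * sigmoid(beta * x) with a trainable beta."""

    def __init__(self, beta: float = 1.0):
        super().__init__()
        self.beta = nn.Parameter(torch.tensor(beta))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.sigmoid(self.beta * x)

x = torch.linspace(-5, 5, steps=11)
print(torch.allclose(Swish()(x), nn.SiLU()(x)))  # True: SiLU is Swish with beta = 1
```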

5. Stochastic Depth

The paper Deep Networks with Stochastic Depth introduces a training technique for very deep CNN architectures that also improves their performance. The key idea is to train with a short (random) subset of layers and use the full network at test time: starting from a deep network, subsets of layers are randomly dropped for each mini-batch during training and bypassed with identity connections. This reduces training time and improves test error significantly. Deeper networks face problems like vanishing gradients and diminishing feature reuse. Diminishing feature reuse occurs during forward propagation when features computed by early layers are washed out by repeated multiplications and convolutions, making it hard for later layers to identify and learn meaningful directions. The reduction in test error from stochastic depth is attributed to two factors: shortening the expected depth during training reduces the chain of forward propagation steps and gradient computations, which strengthens the gradients, especially in earlier layers, during backward propagation; and networks trained with stochastic depth can be interpreted as an implicit ensemble of networks of different depths.

Linear decay of survival probabilities

Stochastic depth comes with one hyper-parameter, the survival probability. The authors use a linear decay rule over depth, i.e., the survival probability decreases linearly with depth: early layers extract low-level features and hence should survive more often. The expected depth during training works out to roughly 3L/4, where L is the depth of the original network; concretely, a 110-layer ResNet is trained with an average of about 40 active ResBlocks, but all 54 blocks are recovered at test time. Similar to Dropout, stochastic depth can be interpreted as training an ensemble of networks, but with different depths, possibly achieving higher diversity among ensemble members than an ensemble of networks with the same depth; this explains the improvement in test error. Unlike Dropout, it makes the network shorter rather than thinner, and Dropout loses effectiveness when combined with Batch Normalization. The authors trained ResNets with 110 and 1202 layers, with and without stochastic depth, and achieved improvements even on the 1202-layer network.
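
A minimal PyTorch sketch of a residual block with stochastic depth, using the linear-decay survival rule p_l = 1 − (l/L)(1 − p_L) with p_L = 0.5 as in the paper; the residual branch itself is just a placeholder:

```python
import torch
from torch import nn

class StochasticDepthBlock(nn.Module):
    """Residual block whose residual branch is randomly skipped during training."""

    def __init__(self, channels: int, survival_prob: float):
        super().__init__()
        self.survival_prob = survival_prob
        # Placeholder residual branch; a real ResBlock (or MBConv) would go here.
        self.residual = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            # Bernoulli gate: with probability 1 - p_l the block reduces to the identity.
            if torch.rand(()).item() < self.survival_prob:
                return torch.relu(x + self.residual(x))
            return x
        # At test time every block is active, scaled by its survival probability.
        return torch.relu(x + self.survival_prob * self.residual(x))

# Linear decay of survival probabilities: p_l = 1 - (l / L) * (1 - p_L), with p_L = 0.5.
L, p_L = 54, 0.5
blocks = nn.Sequential(
    *[StochasticDepthBlock(16, 1.0 - (l / L) * (1.0 - p_L)) for l in range(1, L + 1)]
)
print(blocks[0].survival_prob, blocks[-1].survival_prob)  # ~0.99 for block 1, 0.5 for block L
```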

6. Conclusion

The research paper Rethinking Model Scaling for Convolutional Neural Networks introduces a novel scaling technique, and the authors achieve state-of-the-art results by combining it with several existing techniques. Overall, the ideas in the paper are well thought out and well presented.

7. References

[1] Mingxing Tan, Quoc V. Le. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks.

[2] Prajit Ramachandran, Barret Zoph, Quoc V. Le. Searching for Activation Functions.

[3] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, Kilian Q. Weinberger. Deep Networks with Stochastic Depth.

