EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

Aakash Nain
7 min read · Jun 7, 2019


Since AlexNet won the 2012 ImageNet competition, CNNs (short for Convolutional Neural Networks) have become the de facto algorithms for a wide variety of tasks in deep learning, especially for computer vision. From 2012 to date, researchers have been experimenting with better and better architectures to improve model accuracy on different tasks. Today, we will take a deep dive into the latest research paper, EfficientNet, which focuses not only on improving the accuracy of models, but also on their efficiency.

Why does scaling even matter?

Before discussing “What the heck does scaling mean?”, the relevant question is: why does scaling matter at all? Well, scaling is generally done to improve a model’s accuracy on a certain task, for example, ImageNet classification. Although researchers sometimes don’t care much about efficient models, since the goal is to beat the SOTA, scaling, if done correctly, can also help improve the efficiency of a model.

What does scaling mean in the context of CNNs?

There are three scaling dimensions of a CNN: depth, width, and resolution. Depth simply means how deep the network is, which is equivalent to the number of layers in it. Width simply means how wide the network is; one measure of width, for example, is the number of channels in a Conv layer. Resolution is simply the resolution of the image that is being passed to the CNN. The figure below (from the paper itself) will give you a clear idea of what scaling means across these different dimensions. We will discuss each of them in detail as well.

Model Scaling. (a) is a baseline network example; (b)-(d) are conventional scaling that only increases one dimension of network width, depth, or resolution. (e) Proposed compound scaling method that uniformly scales all three dimensions with a fixed ratio.
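To make these three knobs concrete, here is a minimal, hypothetical sketch in Keras (my own illustration, not code from the paper); the function and coefficient names are made up, and one coefficient controls each dimension of a toy conv stack:

```python
import math
import tensorflow as tf

def toy_net(depth_coef=1.0, width_coef=1.0, resolution=224):
    """A toy conv stack whose depth, width, and input resolution are
    controlled by the three scaling coefficients."""
    base_layers = 4        # baseline depth (number of conv layers)
    base_channels = 32     # baseline width (channels per conv layer)

    num_layers = math.ceil(base_layers * depth_coef)             # depth scaling
    channels = int(base_channels * width_coef)                    # width scaling
    inputs = tf.keras.Input(shape=(resolution, resolution, 3))    # resolution scaling

    x = inputs
    for _ in range(num_layers):
        x = tf.keras.layers.Conv2D(channels, 3, padding="same", activation="relu")(x)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    outputs = tf.keras.layers.Dense(1000, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

# Width, depth, and resolution scaling, one dimension at a time:
wider = toy_net(width_coef=2.0)
deeper = toy_net(depth_coef=2.0)
higher_res = toy_net(resolution=299)
```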

Depth Scaling (d):

Scaling a network by depth is the most common way of scaling. Depth can be scaled up as well as scaled down by adding or removing layers respectively. For example, ResNets can be scaled up from ResNet-50 to ResNet-200, and they can be scaled down from ResNet-50 to ResNet-18. But why depth scaling? The intuition is that a deeper network can capture richer and more complex features, and generalizes well to new tasks.

Fair enough. Well, let’s make our network 1000 layers deep then? We don’t mind adding extra layers if we have the resources and a chance to improve on this task.

Easier said than done! Theoretically, with more layers the network’s performance should improve, but in practice it doesn’t work out that way. Vanishing gradients is one of the most common problems that arises as we go deeper. Even if you keep the gradients from vanishing and use some techniques to make training smooth, adding more layers doesn’t always help. For example, ResNet-1000 has accuracy similar to ResNet-101.

Width Scaling (w):

This is commonly used when we want to keep our model small. Wider networks tend to be able to capture more fine-grained features. Also, smaller models are easier to train.

Isn’t that what we want? Small model, improved accuracy? Go on, make it wide you idiot! What is the problem now?

The problem is that even though you can make your network extremely wide, shallow-but-wide models tend to saturate in accuracy very quickly as the width grows.

Okay, genius! You are saying that we can neither make our network very deep nor make it very wide. Fine. But can’t you just combine the two kinds of scaling above? If you didn’t get that until now, what are you good at? Machine Learning? Huh!

This is a very good question. Yes, we can do that, but before we look into it, let’s first discuss the third scaling dimension. A combination of all three can be considered as well, right?

Resolution (r):

Intuitively, we can say that in a high-resolution image the features are more fine-grained, and hence high-res images should work better. This is also one of the reasons that, for complex tasks like object detection, we use image resolutions such as 300x300, 512x512, or 600x600. But accuracy doesn’t scale linearly with resolution; the gain diminishes very quickly. For example, increasing the resolution from 500x500 to 560x560 doesn’t yield significant improvements.

The above three points lead to our first observation: scaling up any dimension of the network (width, depth, or resolution) improves accuracy, but the accuracy gain diminishes for bigger models.

Scaling Up a Baseline Model with Different Network Width (w), Depth (d), and Resolution (r) Coefficients. Bigger networks with larger width, depth, or resolution tend to achieve higher accuracy, but the accuracy gain quickly saturates after reaching 80%, demonstrating the limitation of single-dimension scaling.

What about combined scaling?

Yes, we can combine the scalings for the different dimensions, but the authors make a couple of points:

  • Though it is possible to scale two or three dimensions arbitrarily, arbitrary scaling is a tedious task.
  • Most of the time, manual scaling results in sub-optimal accuracy and efficiency.

Intuition says that as the resolution of the images is increased, the depth and width of the network should be increased as well. With greater depth, the larger receptive fields can capture similar features that span more pixels in the image; with greater width, more fine-grained features can be captured. To validate this intuition, the authors ran a number of experiments with different scaling values for each dimension. For example, as shown in the figure below from the paper, width scaling achieves much better accuracy under the same FLOPS cost when the network is also deeper and uses a higher resolution.

Scaling Network Width for Different Baseline Networks. Each dot in a line denotes a model with different width coefficient (w). All baseline networks are from Table 1. The first baseline network (d=1.0, r=1.0) has 18 convolutional layers with resolution 224x224, while the last baseline (d=2.0, r=1.3) has 36 layers with resolution 299x299

These results lead to our second observation: it is critical to balance all dimensions of a network (width, depth, and resolution) when scaling CNNs in order to get better accuracy and efficiency.

Proposed Compound Scaling

The authors proposed a simple yet very effective scaling technique, which uses a compound coefficient ϕ to uniformly scale network width, depth, and resolution in a principled way:

Proposed compound scaling: depth d = α^ϕ, width w = β^ϕ, resolution r = γ^ϕ, subject to α · β² · γ² ≈ 2 with α ≥ 1, β ≥ 1, γ ≥ 1.

ϕ is a user-specified coefficient that controls how many more resources are available for model scaling, whereas α, β, and γ specify how to assign these extra resources to network depth, width, and resolution respectively.

Aye aye, researcher! But tell me two things: first, why not alpha squared as well? Second, why constrain the product of these three to 2?

An excellent question. In a CNN, Conv layers are the most compute-expensive part of the network. Also, the FLOPS of a regular convolution op are almost proportional to d, w², and r², i.e. doubling the depth doubles the FLOPS, while doubling the width or the resolution increases the FLOPS by almost four times. Hence, in order to make sure that the total FLOPS don’t grow by more than roughly 2^ϕ, the constraint applied is that (α * β² * γ²) ≈ 2.
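Spelling out that arithmetic with the same symbols (this is just a restatement of the constraint above, not anything new): scaling depth, width, and resolution by α^ϕ, β^ϕ, and γ^ϕ multiplies the baseline FLOPS by roughly

```latex
\[
\alpha^{\phi} \cdot \left(\beta^{\phi}\right)^{2} \cdot \left(\gamma^{\phi}\right)^{2}
  \;=\; \left(\alpha \cdot \beta^{2} \cdot \gamma^{2}\right)^{\phi}
  \;\approx\; 2^{\phi}
\]
```

so pinning α * β² * γ² to roughly 2 caps the FLOPS growth at about 2^ϕ for any choice of ϕ.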

EfficientNet Architecture

Scaling doesn’t change the layer operations in a network; hence, it is better to first have a good baseline network and then scale it along the different dimensions using the proposed compound scaling. The authors obtained their baseline network by running a Neural Architecture Search (NAS) that optimizes for both accuracy and FLOPS. The resulting architecture is similar to MnasNet, as it was found using a similar search space. The network layers/blocks are as shown below:

EfficientNet-B0 baseline network

The MBConv block is nothing fancy: it is the inverted residual block used in MobileNetV2, with a squeeze-and-excitation block injected in places.
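For concreteness, here is a minimal Keras sketch of an MBConv-style block, reconstructed from the MobileNetV2 inverted-residual and squeeze-and-excitation descriptions rather than taken from the official EfficientNet code; strides, drop-connect, and other details are omitted:

```python
import tensorflow as tf
from tensorflow.keras import layers

def mbconv_block(inputs, out_channels, expand_ratio=6, kernel_size=3, se_ratio=0.25):
    """A simplified MBConv: expand -> depthwise conv -> squeeze-and-excite -> project."""
    in_channels = inputs.shape[-1]
    x = inputs

    # 1) Expansion: a 1x1 conv widens the channels (the "inverted" part).
    if expand_ratio != 1:
        x = layers.Conv2D(in_channels * expand_ratio, 1, padding="same", use_bias=False)(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("swish")(x)  # EfficientNet uses swish; MobileNetV2 used ReLU6

    # 2) Depthwise conv does the spatial filtering cheaply.
    x = layers.DepthwiseConv2D(kernel_size, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("swish")(x)

    # 3) Squeeze-and-excitation: reweight channels with a tiny bottleneck.
    se = layers.GlobalAveragePooling2D()(x)
    se = layers.Reshape((1, 1, x.shape[-1]))(se)
    se = layers.Conv2D(max(1, int(in_channels * se_ratio)), 1, activation="swish")(se)
    se = layers.Conv2D(x.shape[-1], 1, activation="sigmoid")(se)
    x = layers.Multiply()([x, se])

    # 4) Projection: a 1x1 conv narrows back to out_channels (no activation).
    x = layers.Conv2D(out_channels, 1, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)

    # 5) Residual connection when the shapes allow it.
    if in_channels == out_channels:
        x = layers.Add()([inputs, x])
    return x

# Usage: wire one block into a tiny model to check the shapes.
inp = tf.keras.Input(shape=(112, 112, 16))
out = mbconv_block(inp, out_channels=16)
model = tf.keras.Model(inp, out)
```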

Now that we have the baseline network, we can search for the optimal values of our scaling parameters. If you revisit the equation, you will quickly realize that we have a total of four parameters to search for: α, β, γ, and ϕ. In order to make the search space smaller and the search operation less costly, the search for these parameters can be completed in two steps.

  1. Fix ϕ = 1, assuming that twice as many resources are available, and do a small grid search for α, β, and γ. For the baseline network B0, the optimal values turned out to be α = 1.2, β = 1.1, and γ = 1.15, such that α * β² * γ² ≈ 2.
  2. Now fix α, β, and γ as constants (with the values found in the step above) and experiment with different values of ϕ. The different values of ϕ produce EfficientNets B1 through B7; a quick numeric sketch of both steps follows below.
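Here is that sketch (an illustration only, assuming the α, β, γ values quoted above; the released B1-B7 models round and slightly tune the resulting numbers):

```python
# alpha, beta, gamma as reported for the EfficientNet-B0 grid search.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

# Step 1 sanity check: the found values satisfy the FLOPS constraint.
print(ALPHA * BETA**2 * GAMMA**2)  # ~1.92, i.e. approximately 2

# Step 2: sweep phi; each value scales the B0 baseline into a bigger model.
for phi in range(1, 8):
    depth_coef = ALPHA ** phi
    width_coef = BETA ** phi
    resolution_coef = GAMMA ** phi
    flops_factor = (ALPHA * BETA**2 * GAMMA**2) ** phi
    print(f"phi={phi}: depth x{depth_coef:.2f}, width x{width_coef:.2f}, "
          f"resolution x{resolution_coef:.2f}, ~FLOPS x{flops_factor:.2f}")
```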

Conclusion

This is probably one of the best papers of 2019 that I have read so far. The paper not only opens new doors for searching for more accurate networks, it also emphasizes finding architectures that are much more efficient.

Although earlier research has gone in this direction, with architectures like MobileNets, ShuffleNets, and MnasNet trying to reduce the number of parameters and FLOPS so that models can run on mobile and edge devices, this is the first time we have seen such a huge reduction in parameters and FLOPS cost along with a huge gain in accuracy.

References

  1. EfficientNet paper: https://arxiv.org/abs/1905.11946
  2. Official released code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
