EfficientDet: Scalable and Efficient Object Detection

Aakash Nain
12 min read · Nov 28, 2019


Object detection has come a long way. From basic computer vision techniques to advanced modern detectors, the improvements have been amazing, and Convolutional Neural Networks (CNNs) have played a huge role in this revolution. We want our detectors to be as accurate as possible while still being fast enough to run in real time. These two aspects involve a trade-off, and most detectors do well on only one of the two metrics, either accuracy or speed. More accurate detectors have generally been found to be more compute-hungry, which isn’t ideal, especially when we are looking for more and more efficient models. This paper from the Google Brain team introduces a new family of detectors that are highly efficient, accurate, and much faster.

Why are we talking about efficiency?

A natural question that comes to mind is: why are we even concerned about efficiency at this point? Shouldn’t we be more worried about the accuracy of the detectors? After all, we know techniques like Global Filter Pruning that can be used to compress big models.

Well, the argument is completely valid. Although you can compress bigger models using pruning, it is a multi-stage process and there is always some drop in accuracy, however small. The improvements in detectors over the last few years have been incremental, and we still haven’t explored the path of designing efficient object detectors from scratch. Given that EfficientNets showed how to scale CNNs, it makes sense to push research in the same direction for object detectors as well: in real-world applications like robotics and self-driving cars, model size, memory footprint, and latency are critical for deployment.

Contributions of this paper

There are three major contributions of this paper:

  1. BiFPN: A weighted bidirectional feature network for easy and fast multi-scale feature fusion.
  2. Compound scaling: A new method, which jointly scales up backbone, feature network, box/class network, and resolution, in a principled way.
  3. EfficientDet: A new family of detectors with significantly better accuracy and efficiency across a wide spectrum of resource constraints.

The paper aims to build a scalable detection architecture with both higher accuracy and better efficiency across a wide spectrum of resource constraints (e.g., from 3B to 300B FLOPs). It tries to address two main challenges:

  1. Efficient multi-scale feature fusion: Feature Pyramid Networks (FPNs) have become the de facto approach for fusing multi-scale features. Some of the detectors that use FPNs are RetinaNet, PANet, NAS-FPN, etc. Most of the fusion strategies adopted in these networks don’t take the importance of each input feature into account while fusing; they simply sum the features up without distinction. Intuitively, not all features contribute equally to the output features, so a better strategy for multi-scale fusion is required.
  2. Model scaling: Most previous works tend to make the backbone network bigger to improve accuracy. The authors observed that scaling up the feature network and the box/class prediction networks is also critical when taking both accuracy and efficiency into account. Inspired by the compound scaling in EfficientNets, the authors propose a compound scaling method for object detectors, which jointly scales up the resolution, depth, and width of the backbone, feature network, and box/class prediction networks.

Let’s now discuss each of these points in detail to get a better understanding of what the paper does.

Feature Pyramid Networks (FPNs)

Before discussing BiFPN, it is important to revisit the idea of FPN and discuss the advantages and disadvantages of this approach. Although feature pyramids aren’t a new idea, using the inherent multi-scale, hierarchical feature maps of a deep CNN in this way was first introduced in the 2017 FPN paper.

Figure 1. Feature Pyramid Network

Woah, woah! Wait a second. Before we discuss FPNs, can you elaborate on why they are even required? Why should we combine features at all?

Great question. One of the hardest problems in object detection is detecting objects at different scales in different scenes. Earlier, people used featurized image pyramids (features built upon image pyramids) to tackle this problem. These pyramids are scale-invariant in the sense that an object’s scale change is offset by shifting its level in the pyramid. Intuitively, this property enables a model to detect objects across a large range of scales by scanning the model over both positions and pyramid levels. Although this works well, it is very expensive computationally, so in practice it is often used only at test time, which creates a mismatch between training and testing procedures.

Only an idiot would do image pyramids in 2019. Where are my CNNs!

CNNs, on the other hand, aside from being capable of representing higher-level semantics, are also more robust to variance in scale and thus facilitate recognition from features computed on a single input scale. Though CNNs are robust and provide high-level semantic features, detections from a single input scale aren’t good enough. One of the sweetest things about CNNs is that different layers compute feature maps at different levels and different scales (check the left-hand side of Figure 1 to get a better picture of this). This in-network feature hierarchy has an inherent multi-scale, pyramidal shape.

Ha! I knew it! CNNs are love. They provide a simpler solution. So, you have your so-called pyramid of features now. The problem is solved, right?

Although we now have a feature pyramid, the feature maps in this pyramid are at different scales and have large semantic gaps caused by the different depths of the layers that produce them. The high-resolution maps carry low-level features, which hurts their representational capacity for object recognition.

Hmm. I see the problem now. High-resolution maps in a CNN have low semantic features while low-resolution maps have high semantic features, and we need both in order to do robust detection at different scales.

Exactly! FPN achieves this by combining a bottom-up pathway with a top-down pathway using lateral connections. In short, high-level features are upsampled first and then combined with low-level features using a lateral connection, which is basically a 1x1 convolution followed by summation. Check out the paper if you want to dive deeper into the details.
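To make the lateral connection concrete, here is a minimal sketch of one top-down merge step in PyTorch. The function and variable names are purely illustrative (they are not from any official FPN implementation), and real FPNs typically also apply a 3x3 convolution after the merge, which is omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fpn_merge(high_level_feat, low_level_feat, lateral_conv):
    """One top-down FPN merge step (illustrative sketch).

    high_level_feat: coarser, semantically stronger map (from a deeper layer)
    low_level_feat:  finer, semantically weaker map (from a shallower layer)
    lateral_conv:    1x1 convolution projecting low_level_feat to the pyramid width
    """
    lateral = lateral_conv(low_level_feat)
    # upsample the coarse map to the spatial size of the lateral map
    upsampled = F.interpolate(high_level_feat, size=lateral.shape[-2:], mode="nearest")
    # lateral connection = 1x1 conv followed by element-wise summation
    return lateral + upsampled

# Example: merge a 256-channel P5 map with a 512-channel C4 map
lateral_conv = nn.Conv2d(512, 256, kernel_size=1)
p5 = torch.randn(1, 256, 16, 16)   # coarse, high-level map
c4 = torch.randn(1, 512, 32, 32)   # finer, low-level map
p4 = fpn_merge(p5, c4, lateral_conv)   # shape: (1, 256, 32, 32)
```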

Well, if FPNs are doing what needs to be done, then what the heck is BiFPN and why do we need it at all?

BiFPN

If we are talking about modifying FPNs, then there surely must be certain disadvantages to using them directly. Let’s take an example to elaborate on this.

The conventional FPN aggregates multi-scale features in a top-down manner:
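For levels 3–7, the paper writes this top-down fusion roughly as:

P7_out = Conv(P7_in)
P6_out = Conv(P6_in + Resize(P7_out))
…
P3_out = Conv(P3_in + Resize(P4_out))

where Resize is usually an upsampling (or downsampling) operation that matches spatial resolutions, and Conv is a convolution applied after the merge.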

The problem with the conventional FPN, as shown in Figure 2(a), is that it is limited by the one-way (top-down) information flow. To address this issue, PANet adds an extra bottom-up path aggregation network, as shown in Figure 2(b). Other works, e.g. NAS-FPN, have also studied cross-scale connections for capturing better semantics. In short, the game is all about how low-level and high-level features are connected to each other.

If they already used NAS for figuring out a better network topology, shouldn’t NAS-FPN be the best network already?

The problem with NAS is that it is based on reinforcement learning and takes thousands of GPU/TPU hours to figure out the best connections. Also, the final cross-connections found by NAS, shown in Figure 2(c), turned out to be irregular and hard to interpret. Despite all this, PANet beats both FPN and NAS-FPN in terms of accuracy, but at an additional computation cost. Thus it makes sense to take PANet as the base and improve its connections for both efficiency and accuracy. To achieve optimized cross-connections, the authors proposed the following:

  1. Remove the nodes that have only one input edge. If a node has only one input edge with no feature fusion, it will contribute less to a feature network that aims to fuse different features. This leads to a simplified PANet, as shown in Figure 2(e).
  2. Add an extra edge from the original input to the output node if they are at the same level, in order to fuse more features without adding much cost, as shown in Figure 2(f).
  3. Unlike PANet, which has only one top-down and one bottom-up path, the authors treat each bidirectional (top-down & bottom-up) path as one feature network layer and repeat the same layer multiple times to enable more high-level feature fusion. The result of these optimizations is the new feature network named BiFPN, shown in Figure 2(f).

When we started this discussion, you said that earlier papers treat all features equally while fusing, which isn’t optimal. In the above framework, you just optimized the connections; the features are still weighted equally. Are you mad?

Weighted Feature Fusion

As I said, earlier works resize feature maps to a common resolution and then sum them up, treating all input features equally. Intuitively, since the input features are at different resolutions, their contributions to the output features will usually be unequal. Hence, we need a weighting strategy that assigns a different weight to each input feature map. And if we are assigning weights, why not make them trainable and let the network learn the optimal values! The authors proposed three different weighted fusion strategies.

Unbounded Fusion
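In the paper, this is simply a weighted sum of the input features Iᵢ:

O = Σᵢ wᵢ · Iᵢ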

Here wᵢ is a learnable weight that can be a scalar (per-feature), a vector (per-channel), or a multi-dimensional tensor (per-pixel). The authors found that a scalar can achieve accuracy comparable to the other approaches with minimal computational cost. However, since the scalar weight is unbounded, it can cause training instability.

Softmax-based Fusion

If we want the unbounded weight values to be bounded in the range 0–1, one of the most natural ways is to apply a softmax, turning the values into a probability distribution in which each value represents the importance of its input feature. The only downside of this approach is that softmax is computationally expensive and increases the latency of the network.
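In equation form, with input features Iᵢ:

O = Σᵢ ( exp(wᵢ) / Σⱼ exp(wⱼ) ) · Iᵢ

which is just a softmax over the learnable weights applied before the weighted sum.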

Fast Normalized Fusion

We want our weights to be in the range 0–1. One way to ensure that wᵢ ≥ 0 is to apply a ReLU to each weight. We can then normalize the values so that they sum to (at most) 1, adding a small ε to the denominator for numerical stability. This method is extremely simple and much more efficient than the softmax-based approach.
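The resulting fusion is:

O = Σᵢ ( wᵢ / (ε + Σⱼ wⱼ) ) · Iᵢ

where each wᵢ ≥ 0 is enforced by a ReLU and a small ε (the paper uses 0.0001) keeps the denominator stable. As a concrete example, the paper describes level 6 of BiFPN with this fusion as a two-step computation: an intermediate top-down feature P6_td = Conv((w1·P6_in + w2·Resize(P7_in)) / (w1 + w2 + ε)), followed by the output P6_out = Conv((w1'·P6_in + w2'·P6_td + w3'·Resize(P5_out)) / (w1' + w2' + w3' + ε)).

Here is a minimal PyTorch sketch of the idea; the function name and the epsilon default are illustrative, not taken from an official implementation:

```python
import torch
import torch.nn.functional as F

def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse same-shape feature maps with learnable, ReLU-clamped weights."""
    w = F.relu(weights)              # enforce w_i >= 0
    w = w / (w.sum() + eps)          # normalize so the weights sum to ~1
    return sum(wi * fi for wi, fi in zip(w, features))

# Example: fuse two feature maps with two learnable scalar weights
feats = [torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)]
weights = torch.nn.Parameter(torch.ones(2))
fused = fast_normalized_fusion(feats, weights)   # shape: (1, 64, 32, 32)
```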

EfficientDet

As the name suggests, combining EfficientNet backbones with BiFPNs gives us a new family of detectors called EfficientDet.

Architecture

EfficientDet detectors are single-shot detectors much like SSD and RetinaNet. The backbone networks are ImageNet pretrained EfficientNets. The proposed BiFPN serves as the feature network, which takes level 3–7 features {P3, P4, P5, P6, P7} from the backbone network and repeatedly applies top-down and bottom-up bidirectional feature fusion. These fused features are fed to a class and box network to produce object class and bounding box predictions respectively. The class and box network weights are shared across all levels of features.
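Schematically, the data flow just described could be sketched as follows; all names here are illustrative placeholders rather than the authors’ code:

```python
def efficientdet_forward(image, backbone, bifpn_layers, class_net, box_net):
    """Illustrative sketch of the EfficientDet data flow described above."""
    # the backbone returns multi-scale features P3..P7 (strides 8 to 128)
    features = backbone(image)          # list: [P3, P4, P5, P6, P7]
    # the BiFPN layer is repeated several times for richer feature fusion
    for bifpn in bifpn_layers:
        features = bifpn(features)
    # the same class/box heads (shared weights) run on every pyramid level
    class_outputs = [class_net(f) for f in features]
    box_outputs = [box_net(f) for f in features]
    return class_outputs, box_outputs
```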

Earlier you mentioned that compound scaling is used here as well. And that it scales all the things including backbone network, BiFPNs, and box/class predictors jointly. This sounds like an utterly complex setup.

Compound Scaling

We have already seen with EfficientNets that scaling all dimensions jointly gives much better performance, and we would like to do the same for the EfficientDet family. Previous works in object detection scale only the backbone network or the FPN layers to improve accuracy, which is very limiting because it scales only one dimension of the detector. The authors proposed a new compound scaling method for object detection, which uses a simple compound coefficient ϕ to jointly scale up all dimensions of the backbone network, BiFPN network, class/box network, and resolution.

Object detectors have many more scaling dimensions than image classification models, so a grid search over all of them would be very expensive. Therefore, the authors used a heuristic-based scaling approach, but still follow the main idea of jointly scaling up all dimensions.

  1. Backbone network: The same width/depth scaling coefficients as EfficientNet-B0 to B6 are used, so that the ImageNet-pretrained checkpoints can be reused.
  2. BiFPN network: The authors grow the BiFPN width (#channels) exponentially, as done in EfficientNets, but increase the depth (#layers) linearly since depth needs to be rounded to small integers.
  3. Box/class prediction network: The width is kept the same as the BiFPN, but the depth (#layers) is increased linearly.
  4. Input image resolution: Since feature levels 3–7 are used in the BiFPN, the input resolution must be divisible by 2⁷ = 128, so resolution is also increased linearly (see the scaling equations below).
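For reference, the heuristic scaling rules from the paper, which the next paragraph refers to as equations (1), (2), and (3), are:

W_bifpn = 64 · (1.35^ϕ),  D_bifpn = 3 + ϕ        (1)
D_box = D_class = 3 + ⌊ϕ/3⌋                       (2)
R_input = 512 + ϕ · 128                           (3)

where W and D denote width (#channels) and depth (#layers), and R is the input resolution. For example, plugging in ϕ = 3 gives a BiFPN depth of 6 layers, a box/class depth of 4 layers, an input resolution of 896, and roughly 160 BiFPN channels after rounding.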

Now, using equations (1), (2), and (3) with different values of ϕ, we can go from EfficientDet-D0 (ϕ=0) to EfficientDet-D6 (ϕ=6). Models scaled up with ϕ ≥ 7 could not fit in memory without changing the batch size or other settings. Therefore, the authors expanded D6 to D7 by enlarging only the input size while keeping all other dimensions the same, so that the same training settings can be used for all models. Here is a table summarizing all these configs:

Hold on a second! Though I agree that running a grid search is computationally very expensive when we have so many things to scale up, can you elaborate on how the authors arrived at equations (1), (2), and (3)?

That’s a great question. Although the authors say they used heuristics to come up with these equations, sadly the paper never explains how these particular values were chosen. This deserves more explanation from the authors.

Experiments

All experiments were done on the COCO dataset. Each model is trained using the SGD optimizer with momentum 0.9 and weight decay 4e-5. The learning rate is first increased linearly from 0 to 0.08 over the initial 5% of training steps (warm-up) and then annealed using a cosine decay rule. Batch normalization is added after every convolution with decay 0.997 and epsilon 1e-4. An exponential moving average of the weights with decay 0.9998 is used. Focal loss with α = 0.25 and γ = 1.5, and anchor aspect ratios {1/2, 1, 2}, are used. The models are trained with batch size 128 on 32 TPUv3 chips.
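As a small illustration of that learning-rate schedule, here is a sketch of a linear-warmup-plus-cosine-decay function; the function name, signature, and the decay-to-zero floor are assumptions for illustration, not taken from the paper’s code:

```python
import math

def learning_rate(step, total_steps, peak_lr=0.08, warmup_fraction=0.05):
    """Linear warmup to peak_lr over the first 5% of steps, then cosine decay."""
    warmup_steps = int(warmup_fraction * total_steps)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps                     # 0 -> 0.08 linearly
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))  # anneal towards 0
```

Here are the results: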

We can see that EfficientDet outperforms everything in terms of efficiency!

Ablation Study

Although we have discussed the various parts of EfficientDet, we haven’t checked which part is responsible for how much gain in efficiency and accuracy.

Disentangling Backbone and BiFPN

EfficientNets are more powerful than ResNets and most of the object detectors use ResNets as their backbone. Thus it makes sense to check how much performance gain we get if we simply:

  1. Replace ResNet backbone with EfficientNet backbone.
  2. Replace FPN with BiFPN, keeping EfficientNet as the backbone.

We can see that simply replacing the ResNet backbone with an EfficientNet backbone gives an instant improvement in mAP, but adding BiFPN on top gives a huge bump in mAP compared to the baseline. Also, EfficientNet+BiFPN has fewer parameters and far fewer FLOPs.

BiFPN Cross-Scale Connections

Earlier in this discussion, we made an argument that cross-connections are much better and different feature maps contribute unequally to the output features, hence we also need a weighting strategy. But how much performance gain do we get with/without these strategies? Let’s take a look:

We can see that BiFPN without weights is almost equivalent to NAS-FPN, but with fewer parameters and FLOPs. Weighted BiFPN, on the other hand, gives better results than any other strategy.

Softmax vs Fast Normalized Fusion

We argued that softmax is computationally expensive and can introduce latency, and that Fast Normalized Fusion is the alternative. But how much does performance drop if we make this swap?

We can see that the delta in mAP is negligible as far as production performance is concerned. On the other hand, Fast Normalized Fusion is about 28–31% faster than the softmax version.

Conclusion

We saw that FPN-style strategies can give huge improvements in object detection performance, and weighted BiFPN is just another step in finding the optimal network topology. Also, EfficientNets have officially killed ResNets! This is the era of EfficientNets, and to me it seems that they are going to stay for a long time.
