Review — EfficientDet: Scalable and Efficient Object Detection

Outperforms AmoebaNet+NAS-FPN+AA, ResNet+NAS-FPN+AA, RetinaNet, Mask R-CNN, and YOLOv3

Sik-Ho Tsang
CodeX
8 min read · Jul 29, 2021


Model FLOPs vs. COCO accuracy

In this story, EfficientDet: Scalable and Efficient Object Detection (EfficientDet), by Google Research, Brain Team, is reviewed. In this paper:

  • First, a weighted bi-directional feature pyramid network (BiFPN) is proposed, which allows easy and fast multi-scale feature fusion.
  • Then, a compound scaling method is also proposed which can uniformly scale the resolution, depth, and width for all backbone, feature network, and box/class prediction networks at the same time.
  • Finally, with EfficientNet as the backbone, a family of object detectors, EfficientDet, is formed, which consistently achieves much better efficiency than prior art, as shown above.

This is a 2020 CVPR paper with over 600 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. Prior Arts for FPN
  2. Bi-directional feature pyramid network (BiFPN)
  3. Weighted BiFPN
  4. EfficientDet: Network Architecture
  5. Compound Scaling Method
  6. SOTA Comparison
  7. Ablation Study

1. Prior Arts for FPN

State-Of-The-Art FPNs
  • Given a list of multi-scale features:
      $\vec{P}^{in} = (P^{in}_{l_1}, P^{in}_{l_2}, \ldots)$
  • where $P^{in}_{l_i}$ represents the feature at level $l_i$.
  • The goal is to find a transformation $f$ that can effectively aggregate different features and output a list of new features:
      $\vec{P}^{out} = f(\vec{P}^{in})$

1.1. (a) FPN

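  • FPN takes level 3–7 input features:
      $\vec{P}^{in} = (P^{in}_3, \ldots, P^{in}_7)$
  • For instance, if the input resolution is 640×640, then $P^{in}_3$ represents feature level 3 with resolution 80×80 (640/2³ = 80), while $P^{in}_7$ represents feature level 7 with resolution 5×5 (640/2⁷ = 5).
  • The conventional FPN aggregates multi-scale features in a top-down manner:
      $P^{out}_7 = Conv(P^{in}_7)$
      $P^{out}_6 = Conv(P^{in}_6 + Resize(P^{out}_7))$
      ...
      $P^{out}_3 = Conv(P^{in}_3 + Resize(P^{out}_4))$
  • where Resize is usually an upsampling or downsampling op for resolution matching, and Conv is usually a convolutional op for feature processing.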
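As a rough illustration, the top-down aggregation can be sketched in NumPy. This is only a minimal sketch, not the paper's implementation: the helper names (level_resolution, upsample2x, top_down_fpn) are made up for this example, Resize is assumed to be nearest-neighbour 2× upsampling, and the per-level Conv is replaced by an identity placeholder.

```python
import numpy as np

def level_resolution(input_size: int, level: int) -> int:
    """Spatial size of the feature map at `level`: input_size / 2**level."""
    return input_size // (2 ** level)

def upsample2x(x: np.ndarray) -> np.ndarray:
    """Nearest-neighbour 2x upsampling of an (H, W, C) feature map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def top_down_fpn(inputs: dict, conv=lambda x: x) -> dict:
    """Top-down aggregation: each level adds the upsampled output of the
    level above it. `conv` is a stand-in for the per-level convolution
    (identity by default, which is an assumption of this sketch)."""
    levels = sorted(inputs)                      # e.g. [3, 4, 5, 6, 7]
    outputs = {levels[-1]: conv(inputs[levels[-1]])}
    for l in reversed(levels[:-1]):              # 6, 5, 4, 3
        outputs[l] = conv(inputs[l] + upsample2x(outputs[l + 1]))
    return outputs

# All-ones features for a 640x640 input, 8 channels per level (illustrative).
features = {
    l: np.ones((level_resolution(640, l), level_resolution(640, l), 8))
    for l in range(3, 8)
}
fused = top_down_fpn(features)
```

With all-ones inputs, the level-7 output stays at 1 while the level-3 output accumulates contributions from levels 4 to 7, which makes the one-way, coarse-to-fine information flow visible.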

Conventional top-down FPN is inherently limited by the one-way information flow.

1.2. (c) NAS-FPN

  • Recently, NAS-FPN employs neural architecture search (NAS) to search for better cross-scale feature network topology.

However, it requires thousands of GPU hours during the search, and the network it finds is irregular and difficult to interpret or modify, as shown in (c) above.

1.3. (b) PANet

  • To address the FPN issue, PANet adds an extra bottom-up path aggregation network, as shown in (b) above. Its network is also more regular than the one found by NAS-FPN.

PANet achieves better accuracy than FPN and NAS-FPN, but at the cost of more parameters and computation. Note that PANet has only one top-down and one bottom-up path.

2. Bi-directional Feature Pyramid Network (BiFPN)

Proposed BiFPN
  • BiFPN can be viewed as a modification of the feature network used in PANet, shown in 1.3. (b).
1) The nodes that only have one input edge are removed (Red Circle)
  • First, the nodes that only have one input edge are removed. The intuition is simple: if a node has only one input edge with no feature fusion, then it contributes less to a feature network that aims at fusing different features. This leads to a simplified bidirectional network.
2) The Extra Edge (Purple Arrow)
  • Second, an extra edge is added from the original input to the output node if they are at the same level, in order to fuse more features without adding much cost.
3) Repeated multiple times
  • Third, unlike PANet, which only has one top-down and one bottom-up path, each bidirectional (top-down & bottom-up) path is treated as one feature network layer, and the same layer is repeated multiple times to enable more high-level feature fusion.
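The three modifications can be put together in a small NumPy sketch of one BiFPN layer. It is a hedged illustration only: the function names (up, down, bifpn_layer) are invented here, the fusion is a plain unweighted sum, resizing is nearest-neighbour upsampling and strided downsampling, and the convolutions after each fusion are omitted.

```python
import numpy as np

def up(x):
    """Nearest-neighbour 2x upsampling of an (H, W, C) map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def down(x):
    """2x downsampling by striding (a stand-in for pooling)."""
    return x[::2, ::2]

def bifpn_layer(p, fuse=lambda *xs: sum(xs)):
    """One bidirectional layer; `p` maps level -> (H, W, C) array."""
    lo, hi = min(p), max(p)
    # Top-down path. The top and bottom levels get no intermediate node,
    # because such a node would have only one input edge.
    td = {hi: p[hi]}
    for l in range(hi - 1, lo, -1):
        td[l] = fuse(p[l], up(td[l + 1]))
    # Bottom-up path. Middle levels also take the original input p[l]
    # (the extra same-level edge).
    out = {lo: fuse(p[lo], up(td[lo + 1]))}
    for l in range(lo + 1, hi):
        out[l] = fuse(p[l], td[l], down(out[l - 1]))
    out[hi] = fuse(p[hi], down(out[hi - 1]))
    return out

feats = {l: np.ones((640 // 2**l, 640 // 2**l, 8)) for l in range(3, 8)}
once = bifpn_layer(feats)

# Repeating the same layer keeps every level at its own resolution,
# so layers can be stacked freely.
stacked = feats
for _ in range(3):
    stacked = bifpn_layer(stacked)
```

Because the output of a layer has the same levels and shapes as its input, the layer can be treated as a building block and repeated any number of times.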

3. Weighted BiFPN

  • Since different input features are at different resolutions, they usually contribute to the output feature unequally.
  • To address this issue, an additional weight is added for each input, so that the network learns the importance of each input feature.
  • Three weighted fusion approaches are considered.

3.1. Unbounded Fusion

  • Each input feature $I_i$ is multiplied by a weight and summed: $O = \sum_i w_i \cdot I_i$
  • where $w_i$ is a learnable weight that can be a scalar (per-feature), a vector (per-channel), or a multi-dimensional tensor (per-pixel).
  • However, since the scalar weight is unbounded, it could potentially cause training instability. Bounded range is needed.
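The three weight granularities map naturally onto array broadcasting. The following is a minimal sketch with made-up shapes and values; the function name unbounded_fusion is hypothetical, and the weights here are fixed numbers rather than learned parameters.

```python
import numpy as np

def unbounded_fusion(features, weights):
    """Weighted sum of the inputs with no constraint on the weights."""
    return sum(w * f for w, f in zip(weights, features))

h, w, c = 4, 4, 8
feats = [np.ones((h, w, c)), 2 * np.ones((h, w, c))]

# Scalar weights: one number per input feature.
per_feature = unbounded_fusion(feats, [0.5, 1.5])
# Vector weights: one number per channel, broadcast over H and W.
per_channel = unbounded_fusion(feats, [np.full(c, 0.5), np.full(c, 1.5)])
# Tensor weights: one number per pixel, broadcast over channels.
per_pixel = unbounded_fusion(
    feats, [np.full((h, w, 1), 0.5), np.full((h, w, 1), 1.5)]
)
```

Nothing stops a weight from growing arbitrarily large or negative here, which is the source of the instability concern.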

3.2. Softmax-based Fusion

  • $O = \sum_i \frac{e^{w_i}}{\sum_j e^{w_j}} \cdot I_i$
  • An intuitive idea is to apply softmax to each weight, such that all weights are normalized to a probability in the range 0 to 1.
  • However, the extra softmax leads to significant slowdown on GPU hardware.
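A minimal NumPy sketch of softmax-based fusion with scalar weights follows. The function name softmax_fusion and the example values are assumptions for illustration; subtracting the maximum before exponentiating is a standard numerical-stability trick, not something the text prescribes.

```python
import numpy as np

def softmax_fusion(features, w):
    """Fuse inputs with softmax-normalized scalar weights."""
    w = np.asarray(w, dtype=float)
    e = np.exp(w - w.max())          # shift by the max to avoid overflow
    norm = e / e.sum()               # weights now lie in [0, 1], sum to 1
    fused = sum(n * f for n, f in zip(norm, features))
    return fused, norm

feats = [np.ones((4, 4, 8)), 3 * np.ones((4, 4, 8))]
fused, norm = softmax_fusion(feats, [0.0, 0.0])   # equal raw weights
```

With equal raw weights each input gets a normalized weight of 0.5, so the fused map is the plain average of the two inputs.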

3.3. Fast Normalized Fusion

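  • $O = \sum_i \frac{w_i}{\epsilon + \sum_j w_j} \cdot I_i$
  • Finally, fast normalized fusion is adopted, where $w_i \geq 0$ is ensured by applying a ReLU after each $w_i$, and ε = 0.0001 is a small value to avoid numerical instability.
  • The value of each normalized weight also falls between 0 and 1, but since there is no softmax operation here, it is much more efficient and runs up to 30% faster on GPUs.
  • For example, the two fused features at level 6 for the BiFPN shown above are:
      $P^{td}_6 = Conv\left(\frac{w_1 \cdot P^{in}_6 + w_2 \cdot Resize(P^{in}_7)}{w_1 + w_2 + \epsilon}\right)$
      $P^{out}_6 = Conv\left(\frac{w'_1 \cdot P^{in}_6 + w'_2 \cdot P^{td}_6 + w'_3 \cdot Resize(P^{out}_5)}{w'_1 + w'_2 + w'_3 + \epsilon}\right)$
  • where $P^{td}_6$ is the intermediate feature at level 6 on the top-down pathway, and $P^{out}_6$ is the output feature at level 6 on the bottom-up pathway.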
  • Notably, to further improve the efficiency, depthwise separable convolution is used for feature fusion.
  • Batch normalization and activation are added after each convolution.
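Fast normalized fusion is simple enough to sketch in a few lines of NumPy. This is an illustrative sketch with scalar weights only: the function name fast_normalized_fusion and the example inputs are assumptions, and the depthwise separable convolution, batch normalization, and activation that follow the fusion are left out.

```python
import numpy as np

def fast_normalized_fusion(features, w, eps=1e-4):
    """Fuse inputs with ReLU-clipped weights divided by their sum."""
    w = np.maximum(np.asarray(w, dtype=float), 0.0)   # ReLU keeps w_i >= 0
    norm = w / (eps + w.sum())                        # no exponentials needed
    fused = sum(n * f for n, f in zip(norm, features))
    return fused, norm

feats = [np.ones((4, 4, 8)), 2 * np.ones((4, 4, 8)), 4 * np.ones((4, 4, 8))]
# The negative weight is clipped to zero, so the second input is ignored.
fused, norm = fast_normalized_fusion(feats, [1.0, -2.0, 3.0])
```

The normalized weights come out close to 0.25, 0, and 0.75. If every raw weight is clipped to zero, ε keeps the division well defined and the fused output is simply zero.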

4. EfficientDet: Network Architecture

EfficientDet: Network Architecture
  • The one-stage detector paradigm is used.
  • ImageNet-pretrained EfficientNets are employed as the backbone network.
  • The proposed BiFPN serves as the feature network, which takes level 3–7 features {P3, P4, P5, P6, P7} from the backbone network and repeatedly applies top-down and bottom-up bidirectional feature fusion.
  • These fused features are fed to a class and box network to produce object class and bounding box predictions respectively. Similar to RetinaNet, the class and box network weights are shared across all levels of features.
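Weight sharing across levels can be illustrated with a toy head in NumPy. Everything specific here is an assumption made for the sketch: the channel count, the number of classes and anchors, the spatial sizes, and the use of a single linear projection in place of the real stack of convolution layers.

```python
import numpy as np

rng = np.random.default_rng(0)
channels, num_classes, num_anchors = 64, 90, 9   # illustrative values only

# One set of weights for the class head and one for the box head.
w_cls = rng.normal(size=(channels, num_anchors * num_classes))
w_box = rng.normal(size=(channels, num_anchors * 4))

def shared_heads(fused):
    """Apply the same class/box weights to the fused feature of every level."""
    return {l: (f @ w_cls, f @ w_box) for l, f in fused.items()}

# Fused features at levels 3..7 with made-up spatial sizes.
sizes = {3: 16, 4: 8, 5: 4, 6: 2, 7: 1}
fused = {l: rng.normal(size=(s, s, channels)) for l, s in sizes.items()}
preds = shared_heads(fused)
```

The parameter count of the heads does not depend on how many pyramid levels are used, since every level reuses w_cls and w_box.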

5. Compound Scaling Method

  • Previous works scale up the detector by using either a larger backbone, larger input images, or more stacked FPN layers, i.e. single-factor scaling.
  • Recently, EfficientNet jointly scales up all dimensions of network width, depth, and input resolution.
  • Here, EfficientDet proposes a new compound scaling method for object detection, which uses a simple compound coefficient φ to jointly scale up all dimensions of backbone network, BiFPN network, class/box network, and resolution.
Scaling configs for EfficientDet D0-D6

5.1. Backbone Network

  • The same width/depth scaling coefficients of EfficientNet-B0 to B6 are used such that ImageNet-pretrained checkpoints can be used.

5.2. BiFPN Network

  • The BiFPN depth D_bifpn (#layers) is linearly increased since depth needs to be rounded to small integers.
  • BiFPN width W_bifpn (#channels) grows exponentially, similar to EfficientNet. Specifically, a grid search is performed over the values {1.2, 1.25, 1.3, 1.35, 1.4, 1.45}, and the best value, 1.35, is picked as the BiFPN width scaling factor.
  • Formally, BiFPN width and depth are scaled with the following equation:
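A minimal sketch of this scaling rule is given here. The width factor 1.35 comes from the text; the base width of 64 channels and the base depth of 3 layers at φ = 0 are assumptions recalled from the paper's configuration table and should be checked against it. The function name bifpn_scaling is hypothetical.

```python
def bifpn_scaling(phi, base_width=64, width_factor=1.35, base_depth=3):
    """Raw BiFPN width (exponential in phi) and depth (linear in phi).

    base_width and base_depth are assumed values, not taken from the text.
    """
    width = base_width * width_factor ** phi
    depth = base_depth + phi
    return width, depth

for phi in range(7):
    width, depth = bifpn_scaling(phi)
    print(f"phi={phi}: width ~ {width:.1f} channels, depth = {depth} layers")
```

The raw widths are not round numbers, so a practical configuration would presumably round them to convenient channel counts.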

5.3. Box/Class Prediction Network

  • Their width is fixed to be always the same as BiFPN (i.e., W_pred = W_bifpn).
  • The depth (#layers) is linearly increased using equation:
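As a hedged sketch, the head depth can be written as a slow linear function of φ. The base depth of 3 layers and the step of one extra layer per three units of φ are assumptions recalled from the paper, not stated in the text above, and head_depth is a hypothetical name.

```python
def head_depth(phi, base_depth=3, step=3):
    """Depth of the box/class networks; grows by one layer every `step` phi.

    base_depth and step are assumed values for illustration.
    """
    return base_depth + phi // step

depths = [head_depth(phi) for phi in range(7)]
```

Under these assumptions the heads deepen much more slowly than the BiFPN, going from 3 layers at φ = 0 to 5 layers at φ = 6.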

5.4. Input Image Resolution

  • Since feature levels 3–7 are used in BiFPN, the input resolution must be divisible by 2⁷ = 128, so resolutions are linearly increased:
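The divisibility constraint is easy to check in code. In this sketch the step of 128 follows from the text, while the base resolution of 512 at φ = 0 is an assumption recalled from the paper; input_resolution is a hypothetical name.

```python
def input_resolution(phi, base=512, step=128):
    """Input resolution that grows linearly with phi in steps of 128.

    base=512 is an assumed starting resolution for phi = 0.
    """
    return base + phi * step

resolutions = [input_resolution(phi) for phi in range(7)]

# Every resolution must divide cleanly down to feature level 7.
for r in resolutions:
    assert r % 2**7 == 0
```

Starting from a multiple of 128 and stepping by 128 guarantees that the level-7 feature map always has an integer size.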

6. SOTA Comparison

EfficientDet performance on COCO
  • The above table compares EfficientDet with other object detectors, under the single-model single-scale settings with no test-time augmentation.
  • Accuracy is reported for both test-dev (20K test images) and val (5K validation images).
  • EfficientDet achieves better efficiency than previous detectors, being 4× to 9× smaller and using 13× to 42× fewer FLOPs across a wide range of accuracy or resource constraints.
  • In the relatively low-accuracy regime, EfficientDet-D0 achieves similar accuracy to YOLOv3 with 28× fewer FLOPs.
  • Compared to RetinaNet [21] and Mask R-CNN [11], EfficientDet-D1 achieves similar accuracy with up to 8× fewer parameters and 21× fewer FLOPs.
  • In the high-accuracy regime, EfficientDet also consistently outperforms the recent NAS-FPN [8] and its enhanced versions in [42] (i.e. with AutoAugment) with far fewer parameters and FLOPs.
  • In particular, EfficientDet-D7 achieves a new state-of-the-art 52.2 AP on test-dev and 51.8 AP on val for single-model single-scale.

7. Ablation Study

7.1. Backbone and BiFPN

Disentangling backbone and BiFPN
  • ResNet+FPN is used as baseline.
  • Then the backbone is replaced with EfficientNet-B3, which improves accuracy by about 3 AP with slightly fewer parameters and FLOPs.

By further replacing FPN with the proposed BiFPN, an additional 4 AP gain is achieved with far fewer parameters and FLOPs.

7.2. Different Feature Networks

Comparison of different feature networks
  • While repeated FPN+PANet achieves slightly better accuracy than NAS-FPN, it also requires more parameters and FLOPs.

The proposed BiFPN achieves similar accuracy to repeated FPN+PANet, but uses far fewer parameters and FLOPs.

With the additional weighted feature fusion, the proposed BiFPN further achieves the best accuracy with fewer parameters and FLOPs.

7.3. Different Feature Fusion

Comparison of different feature fusion

The fast fusion achieves similar accuracy to softmax-based fusion, but runs 28% to 31% faster.

Softmax vs. fast normalized feature fusion
  • Three nodes are randomly selected from the BiFPN in EfficientDet-D3.

Although the normalized weights change rapidly during training, the proposed fast normalized fusion always shows very similar learning behavior to softmax-based fusion for all three nodes.

7.4. Compound Scaling

Comparison of different scaling methods
  • The above figure compares the proposed compound scaling with other alternative methods that scale up a single dimension of resolution/depth/width.

Compound scaling achieves better efficiency than the other methods, suggesting the benefit of jointly scaling all dimensions to better balance the different architecture dimensions.
