Fifteen Minutes with FiftyOne: EfficientDet
Visualizing a scalable and efficient object detector
Last year, Google researchers created EfficientNet [2], one of the most efficient and highest-performing image classification networks. This year, the same team built on EfficientNet to create a strong object detector, EfficientDet [1].
One of the most interesting parts of EfficientDet is its built-in scalability. EfficientDet comes in multiple configurations, ranging from EfficientDet-D0 to EfficientDet-D7, each offering a different accuracy/speed tradeoff.
What makes it so efficient?
In my previous post on YOLOv4, I discussed the idea that single-stage object detectors are composed of a backbone network (like a ResNet trained on ImageNet), a head that uses backbone feature maps to detect objects (like YOLO or SSD), and a neck that connects the two (like a Feature Pyramid Network). EfficientDet introduces novel approaches to constructing both the backbone and the neck.
Backbone
Instead of using a standard backbone like ResNet, ResNeXt, or AmoebaNet, the authors used their previous work, EfficientNet, as the backbone of their new object detector.
- EfficientNet is scalable with 7 provided scales pretrained on ImageNet
- It was one of the highest performing ImageNet classifiers at the time of release (EfficientNet-B7 achieves 84.4% accuracy)
BiFPN Neck
Instead of using a standard Feature Pyramid Network (FPN) [3], the authors developed a weighted bi-directional FPN (BiFPN). The purpose of all of these networks is to fuse features at multiple levels, or resolutions, of the backbone (P3–7 in the image below).
BiFPN Improvements:
- Weighted feature fusion: learn how to weight features from different resolutions instead of treating them equally (see the sketch below)
- Use depthwise separable convolutions in the BiFPN
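To make the weighted fusion concrete, here is a minimal PyTorch-style sketch of the "fast normalized fusion" described in the paper: each input feature map gets a learnable, non-negative weight, and the weights are normalized before the weighted sum. The module and variable names are my own, and the depthwise separable convolution that follows the fusion in the real BiFPN is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FastNormalizedFusion(nn.Module):
    """Fuse same-sized feature maps with learned, normalized, non-negative weights."""

    def __init__(self, num_inputs, eps=1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, inputs):
        w = F.relu(self.weights)      # keep the fusion weights non-negative
        w = w / (w.sum() + self.eps)  # normalize so they sum to ~1
        # Weighted sum of the (already resized) input feature maps
        return sum(w_i * x for w_i, x in zip(w, inputs))


# Example: fuse a backbone feature map with an upsampled higher-level feature map
fusion = FastNormalizedFusion(num_inputs=2)
p4_in = torch.randn(1, 64, 32, 32)
p5_up = torch.randn(1, 64, 32, 32)
p4_td = fusion([p4_in, p5_up])  # same shape as the inputs
```

The authors chose this ReLU-plus-normalization scheme as a cheaper alternative to a softmax over the weights, which they report behaves similarly but runs slower on GPUs.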
Digging in with FiftyOne
Let’s see what some results of EfficientDet scales D0–7 look like on the MS COCO validation set using FiftyOne.
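If you want to reproduce this kind of setup, the rough sketch below shows one way to load the data and predictions into FiftyOne. The run_efficientdet() helper and the efficientdet_d0 field name are placeholders for your own inference code, not part of any official release; the notebook linked in the conclusion loads precomputed outputs for you.

```python
import fiftyone as fo
import fiftyone.zoo as foz

# Load the MS COCO 2017 validation split from the FiftyOne dataset zoo
dataset = foz.load_zoo_dataset("coco-2017", split="validation")

# Attach one EfficientDet model's predictions to each sample.
# run_efficientdet() stands in for whatever inference code you use; FiftyOne
# expects bounding boxes as [x, y, width, height] in relative (0-1) coordinates
for sample in dataset:
    detections = [
        fo.Detection(label=label, bounding_box=box, confidence=score)
        for label, box, score in run_efficientdet(sample.filepath)
    ]
    sample["efficientdet_d0"] = fo.Detections(detections=detections)
    sample.save()

# Launch the FiftyOne App to browse the images, ground truth, and predictions
session = fo.launch_app(dataset)
```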
IoU
When analyzing the object localization quality of an object detector, it is useful to look at the IoUs of predictions with underlying ground truth objects. This evaluation data was computed and collected in FiftyOne and displayed in Matplotlib.
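Continuing from the loading sketch above, something along these lines can compute the matches and pull out the IoUs; FiftyOne's COCO-style evaluation stores the IoU of each matched prediction as a per-detection attribute (the efficientdet_d0 field and eval_d0 key are still placeholder names):

```python
import itertools

import matplotlib.pyplot as plt
from fiftyone import ViewField as F

# COCO-style evaluation; matched predictions get `eval_d0`, `eval_d0_id`,
# and `eval_d0_iou` attributes
dataset.evaluate_detections(
    "efficientdet_d0", gt_field="ground_truth", eval_key="eval_d0"
)

# Collect the IoUs of all true positive predictions
tp_view = dataset.filter_labels("efficientdet_d0", F("eval_d0") == "tp")
ious = tp_view.values("efficientdet_d0.detections.eval_d0_iou")
ious = list(itertools.chain.from_iterable(ious))  # flatten per-sample lists

# Plot the IoU distribution with Matplotlib
plt.hist(ious, bins=20)
plt.xlabel("IoU with matched ground truth")
plt.ylabel("Number of predictions")
plt.show()
```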
As you can see, bounding boxes generally get tighter as you go from D0 to D7. Surprisingly, the best model by this measure is EfficientDet-D6 rather than EfficientDet-D7.
True and False Positives
The number of true and false positives can be computed automatically in FiftyOne. These counts show a similar trend to the IoU results: EfficientDet-D7 may not be the best model in this set. EfficientDet-D7 has one of the highest raw counts of true positives, but its percentage of true positives among all detections is lower than EfficientDet-D6's (76.5% for D6 versus 74.5% for D7).
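As a rough sketch, these counts can be read straight from the per-detection labels that evaluate_detections() writes (the field and eval key names below are again placeholders for whichever model you evaluated):

```python
# Tally the "tp"/"fp" labels written by evaluate_detections() for one model
counts = dataset.count_values("efficientdet_d6.detections.eval_d6")
num_tp = counts.get("tp", 0)
num_fp = counts.get("fp", 0)

print("true positives:  %d" % num_tp)
print("false positives: %d" % num_fp)
print("TP percentage:   %.1f%%" % (100 * num_tp / (num_tp + num_fp)))
```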
EfficientDet-D6 vs. EfficientDet-D7
After the previous sections, I decided to look at some examples in FiftyOne and explicitly compare EfficientDet-D6 to D7.
Going through multiple samples in FiftyOne, I noticed a trend emerge. Both D6 and D7 occasionally made predictions that did not match the ground truth. However, more often than not, the spurious D6 predictions were actually correct but unannotated, while the spurious D7 predictions were simply incorrect, as shown in the samples above.
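One way to set up that comparison yourself is a view that keeps only each model's false positives, so the questionable predictions from D6 and D7 can be flipped through side by side in the App. This assumes both models were evaluated as in the earlier sketches, and the field and eval key names remain placeholders:

```python
import fiftyone as fo
from fiftyone import ViewField as F

# Keep only the false positive predictions from each model
fp_view = (
    dataset
    .filter_labels("efficientdet_d6", F("eval_d6") == "fp", only_matches=False)
    .filter_labels("efficientdet_d7", F("eval_d7") == "fp", only_matches=False)
    # Show the samples with the most D7 false positives first
    .sort_by(F("efficientdet_d7.detections").length(), reverse=True)
)

session = fo.launch_app(view=fp_view)
```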
The final mAP of EfficientDet-D7 is higher than that of EfficientDet-D6, as shown in the plot above from the paper. My interpretation is that while EfficientDet-D6 produces a higher percentage of correct detections and tighter bounding boxes overall, EfficientDet-D7 produces more predictions that are both correct and tightly localized. If you value tight bounding boxes or correct predictions separately over everything else, then you might want to use EfficientDet-D6 over D7.
Conclusion
EfficientDet is a state-of-the-art object detector that is flexible to the requirements of the user. Multiple design decisions were made to make EfficientDet useful to the public, including releasing models at multiple sizes and providing the model weights and open source code. Thanks to FiftyOne, we were able to quickly identify the strengths and weaknesses of each EfficientDet model, making it easier to decide which one to use in the future. Surprisingly, that choice is not always to simply pick the model with the highest mAP.
If you want to look through the outputs of EfficientDet yourself, you can load them up here: https://github.com/voxel51/fiftyone-examples/blob/master/examples/comparing_YOLO_and_EfficientDet.ipynb
References
[1] Mingxing Tan, et al., EfficientDet: Scalable and Efficient Object Detection, CVPR (2020)
[2] Mingxing Tan, et al., EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks, ICML (2019)
[3] Tsung-Yi Lin, et al., Feature Pyramid Networks for Object Detection, CVPR (2017)
About Me
My name is Eric Hofesmann. I received my master’s in Computer Science, specializing in computer vision, at the University of Michigan. During my graduate studies, I realized that it was incredibly difficult to thoroughly analyze a new model or method without serious scripting to visualize and search through outputs. Working at the computer vision startup Voxel51, I helped develop FiftyOne so that researchers (and I) can quickly load up and start looking through datasets and model results. This series of posts goes through state-of-the-art computer vision models and datasets and analyzes them with FiftyOne.