Fifteen Minutes with FiftyOne: EfficientDet
Visualizing a scalable and efficient object detector
Last year, Google researchers created EfficientNet [2], one of the most efficient and highest-performing image classification networks. This year, the same team built on EfficientNet to create a strong object detector, EfficientDet [1].
One of the most interesting parts of EfficientDet is its built-in scalability. EfficientDet comes in multiple configurations, ranging from EfficientDet-D0 to EfficientDet-D7, each offering a different accuracy/speed tradeoff.
What makes it so efficient?
In my previous post on YOLOv4, I discussed the idea that single-stage object detectors are composed of a backbone network (like a ResNet trained on ImageNet), a head that uses backbone feature maps to detect objects (like YOLO or SSD), and a neck that connects the two (like a Feature Pyramid Network). EfficientDet introduces novel approaches to constructing both the backbone and the neck.
Backbone
Instead of using a standard backbone like ResNet, ResNeXt, or AmoebaNet, the authors used their previous work, EfficientNet, as the backbone of their new object detector.
- EfficientNet is scalable with 7 provided scales pretrained on ImageNet
- It was one of the highest performing ImageNet classifiers at the time of release (EfficientNet-B7 achieves 84.4% accuracy)
BiFPN Neck
Instead of using a standard Feature Pyramid Network (FPN) [3], the authors developed a weighted bi-directional FPN (BiFPN). The purpose of all of these networks is to fuse features at multiple levels, or resolutions, of the backbone (P3–7 in the image below).
BiFPN Improvements:
- Weighted feature fusion: learn how to weight features from different resolutions instead of treating them equally (see the sketch below)
- Use depthwise separable convolutions in the BiFPN
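To make the weighted fusion concrete, here is a minimal PyTorch-style sketch of the "fast normalized fusion" described in the paper: each input feature map gets a learnable, non-negative weight, and the weights are normalized before the weighted sum. The module and variable names are my own, and the depthwise separable convolution that follows the fusion in the real BiFPN is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FastNormalizedFusion(nn.Module):
    """Fuse same-sized feature maps with learned, normalized, non-negative weights."""

    def __init__(self, num_inputs, eps=1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, inputs):
        w = F.relu(self.weights)      # keep the fusion weights non-negative
        w = w / (w.sum() + self.eps)  # normalize so they sum to ~1
        # Weighted sum of the (already resized) input feature maps
        return sum(w_i * x for w_i, x in zip(w, inputs))


# Example: fuse a backbone feature map with an upsampled higher-level feature map
fusion = FastNormalizedFusion(num_inputs=2)
p4_in = torch.randn(1, 64, 32, 32)
p5_up = torch.randn(1, 64, 32, 32)
p4_td = fusion([p4_in, p5_up])  # same shape as the inputs
```

The authors chose this ReLU-plus-normalization scheme as a cheaper alternative to a softmax over the weights, which they report behaves similarly but runs slower on GPUs.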
Digging in with FiftyOne
Let’s see what some results of EfficientDet scales D0–7 look like on the MS COCO validation set using FiftyOne.
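If you want to reproduce this kind of setup, the rough sketch below shows one way to load the data and predictions into FiftyOne. The run_efficientdet() helper and the efficientdet_d0 field name are placeholders for your own inference code, not part of any official release; the notebook linked in the conclusion loads precomputed outputs for you.

```python
import fiftyone as fo
import fiftyone.zoo as foz

# Load the MS COCO 2017 validation split from the FiftyOne dataset zoo
dataset = foz.load_zoo_dataset("coco-2017", split="validation")

# Attach one EfficientDet model's predictions to each sample.
# run_efficientdet() stands in for whatever inference code you use; FiftyOne
# expects bounding boxes as [x, y, width, height] in relative (0-1) coordinates
for sample in dataset:
    detections = [
        fo.Detection(label=label, bounding_box=box, confidence=score)
        for label, box, score in run_efficientdet(sample.filepath)
    ]
    sample["efficientdet_d0"] = fo.Detections(detections=detections)
    sample.save()

# Launch the FiftyOne App to browse the images, ground truth, and predictions
session = fo.launch_app(dataset)
```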
IoU
When analyzing the object localization quality of an object detector, it is useful to look at the IoUs of predictions with underlying ground truth objects. This evaluation data was computed and collected in FiftyOne and displayed in Matplotlib.
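Continuing from the loading sketch above, something along these lines can compute the matches and pull out the IoUs; FiftyOne's COCO-style evaluation stores the IoU of each matched prediction as a per-detection attribute (the efficientdet_d0 field and eval_d0 key are still placeholder names):

```python
import itertools

import matplotlib.pyplot as plt
from fiftyone import ViewField as F

# COCO-style evaluation; matched predictions get `eval_d0`, `eval_d0_id`,
# and `eval_d0_iou` attributes
dataset.evaluate_detections(
    "efficientdet_d0", gt_field="ground_truth", eval_key="eval_d0"
)

# Collect the IoUs of all true positive predictions
tp_view = dataset.filter_labels("efficientdet_d0", F("eval_d0") == "tp")
ious = tp_view.values("efficientdet_d0.detections.eval_d0_iou")
ious = list(itertools.chain.from_iterable(ious))  # flatten per-sample lists

# Plot the IoU distribution with Matplotlib
plt.hist(ious, bins=20)
plt.xlabel("IoU with matched ground truth")
plt.ylabel("Number of predictions")
plt.show()
```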
As you can see, bounding boxes generally get tighter as you go from D0 to D7. Surprisingly, the best model by this measure is EfficientDet-D6 rather than EfficientDet-D7.
True and False Positives
The number of true and false positives can be computed automatically in FiftyOne. These counts show a similar trend to the IoU results: EfficientDet-D7 may not be the best model in this set. EfficientDet-D7 has one of the highest raw counts of true positives, but its percentage of true positives among all detections is lower than EfficientDet-D6's (76.5% for D6 versus 74.5% for D7).
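As a rough sketch, these counts can be read straight from the per-detection labels that evaluate_detections() writes (the field and eval key names below are again placeholders for whichever model you evaluated):

```python
# Tally the "tp"/"fp" labels written by evaluate_detections() for one model
counts = dataset.count_values("efficientdet_d6.detections.eval_d6")
num_tp = counts.get("tp", 0)
num_fp = counts.get("fp", 0)

print("true positives:  %d" % num_tp)
print("false positives: %d" % num_fp)
print("TP percentage:   %.1f%%" % (100 * num_tp / (num_tp + num_fp)))
```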
EfficientDet-D6 vs. EfficientDet-D7
After the previous sections, I decided to look at some examples in FiftyOne and explicitly compare EfficientDet-D6 to D7.
Going through multiple samples in FiftyOne, I noticed a trend emerge. Both D6 and D7 occasionally made predictions that did not match the ground truth. However, more often than not, the spurious D6 predictions were actually correct but unannotated, while the spurious D7 predictions were simply incorrect, as shown in the samples above.
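One way to set up that comparison yourself is a view that keeps only each model's false positives, so the questionable predictions from D6 and D7 can be flipped through side by side in the App. This assumes both models were evaluated as in the earlier sketches, and the field and eval key names remain placeholders:

```python
import fiftyone as fo
from fiftyone import ViewField as F

# Keep only the false positive predictions from each model
fp_view = (
    dataset
    .filter_labels("efficientdet_d6", F("eval_d6") == "fp", only_matches=False)
    .filter_labels("efficientdet_d7", F("eval_d7") == "fp", only_matches=False)
    # Show the samples with the most D7 false positives first
    .sort_by(F("efficientdet_d7.detections").length(), reverse=True)
)

session = fo.launch_app(view=fp_view)
```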
The final mAP of EfficientDet-D7 is higher than that of EfficientDet-D6, as shown in the plot above from the paper. My interpretation is that while EfficientDet-D6 produces a higher percentage of correct detections and tighter bounding boxes overall, EfficientDet-D7 produces more predictions that are both correct and tightly localized. If you value tight bounding boxes or correct predictions separately over everything else, then you might want to use EfficientDet-D6 over D7.
Conclusion
EfficientDet is a state-of-the-art object detector that is flexible to the requirements of the user. Multiple design decisions were made to make EfficientDet useful to the public, including releasing models at multiple sizes and providing the model weights and open source code. Thanks to FiftyOne, we were able to quickly identify the strengths and weaknesses of each EfficientDet model, making it easier to decide which one to use in the future. Surprisingly, that choice is not always to simply pick the model with the highest mAP.
If you want to look through the outputs of EfficientDet yourself, you can load them up here: https://github.com/voxel51/fiftyone-examples/blob/master/examples/comparing_YOLO_and_EfficientDet.ipynb
References
[1] Mingxing Tan, et al., EfficientDet: Scalable and Efficient Object Detection, CVPR (2020)
[2] Mingxing Tan, et al., EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks, ICML (2019)
[3] Tsung-Yi Lin, et al., Feature Pyramid Networks for Object Detection, CVPR (2017)
About Me
My name is Eric Hofesmann. I received my master’s in Computer Science, specializing in computer vision, at the University of Michigan. During my graduate studies, I realized that it was incredibly difficult to thoroughly analyze a new model or method without serious scripting to visualize and search through outputs. Working at the computer vision startup Voxel51, I helped develop FiftyOne so that researchers (and I) can quickly load up and start looking through datasets and model results. This series of posts goes through state-of-the-art computer vision models and datasets and analyzes them with FiftyOne.