Fifteen Minutes with FiftyOne: YOLOv4
A closer look at one of the fastest object detectors yet
This is the first post in a series in which I look at state-of-the-art computer vision models and datasets through the new tool FiftyOne. FiftyOne lets you easily visualize datasets and labels to find interesting artifacts in your model predictions or annotations.
YOLOv4
YOLOv4 [1] is a recent installment of the single-shot YOLO object detection model that came out earlier this year. There has been some controversy around YOLOv5, but that is a story for a different time [2]. This iteration of YOLO comes loaded with architectural optimizations and a long list of new training tricks and methods.
YOLOv4 has a high mAP on the MS COCO dataset at speeds of 70 to 120 FPS and is designed to be trained and used on a single GPU!
The guts of YOLOv4
The improvements behind YOLOv4 fall under three categories: architectural updates, Bags of Freebies, and Bags of Specials.
Architecture
The architecture of YOLOv4 consists of a backbone (like a ResNet trained on ImageNet), a head that uses the backbone's feature maps to detect objects (like YOLO and SSD), and a neck that connects the two (like Feature Pyramid Networks).
In particular, the selected architectures are:
- Backbone: CSPDarknet53 (A different backbone is used for the CPU version)
- Neck: SPP, PAN
- Head: YOLOv3
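For a feel of what the SPP block in the neck does, here is a minimal pure-Python sketch. It assumes stride-1 max pooling with "same" padding and kernel sizes of 5, 9, and 13. Those sizes and the helper names `max_pool_same` and `spp` are my assumptions for illustration, not something taken from this post or from the Darknet code.

```python
def max_pool_same(fmap, k):
    """Stride-1 max pooling over a 2D feature map, keeping the input size.

    Windows are clipped at the border, which is equivalent to padding
    with negative infinity.
    """
    h, w = len(fmap), len(fmap[0])
    r = k // 2
    pooled = []
    for i in range(h):
        row = []
        for j in range(w):
            window = [
                fmap[y][x]
                for y in range(max(0, i - r), min(h, i + r + 1))
                for x in range(max(0, j - r), min(w, j + r + 1))
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


def spp(fmap, kernel_sizes=(5, 9, 13)):
    """Return the input map plus one pooled map per kernel size.

    In a real network these maps would be concatenated along the channel
    axis; here they are simply returned as a list of 2D maps.
    """
    return [fmap] + [max_pool_same(fmap, k) for k in kernel_sizes]


if __name__ == "__main__":
    feature_map = [[1, 2], [3, 4]]
    for channel in spp(feature_map, kernel_sizes=(3,)):
        print(channel)
```

The point of the block is that each pooled map sees a wider neighborhood of the same feature map, so the head receives context at several receptive-field sizes without changing the spatial resolution.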
Bags of Freebies (BoF)
Bags of Freebies are training methods that improve performance but are “free” at inference time. Different BoFs are used for training the backbone and the detector. Methods or augmentations that are novel to YOLOv4 are marked with “NEW”.
Backbone BoF highlights include:
- CutMix data augmentation
- Mosaic data augmentation (NEW: Mixes 4 different training images, allowing detection of objects outside their normal context)
- DropBlock regularization
- Class label smoothing
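To make the Mosaic idea concrete, here is a simplified sketch that tiles four equally sized images into a fixed 2x2 grid and shifts their boxes to match. The function name `mosaic_2x2` and the fixed grid are my assumptions for illustration. The augmentation described in the paper also involves random placement and scaling, which this sketch leaves out.

```python
def mosaic_2x2(images, boxes):
    """Tile four equally sized images into a 2x2 grid and shift their boxes.

    images: list of 4 images, each a list of rows (H x W) of pixel values.
    boxes:  list of 4 lists of (x_min, y_min, x_max, y_max) in pixels.
    Returns the (2H x 2W) canvas and a flat list of shifted boxes.
    """
    assert len(images) == 4 and len(boxes) == 4
    h, w = len(images[0]), len(images[0][0])

    # Top row of the grid holds images 0 and 1, bottom row holds 2 and 3.
    top = [images[0][r] + images[1][r] for r in range(h)]
    bottom = [images[2][r] + images[3][r] for r in range(h)]
    canvas = top + bottom

    # Each image's boxes move by the offset of its tile in the canvas.
    offsets = [(0, 0), (w, 0), (0, h), (w, h)]
    shifted = []
    for (dx, dy), image_boxes in zip(offsets, boxes):
        for x0, y0, x1, y1 in image_boxes:
            shifted.append((x0 + dx, y0 + dy, x1 + dx, y1 + dy))
    return canvas, shifted


if __name__ == "__main__":
    # Four tiny 2x2 "images", each filled with a constant value.
    imgs = [[[v] * 2 for _ in range(2)] for v in range(4)]
    bxs = [[(0, 0, 1, 1)], [], [(0, 0, 2, 2)], [(1, 1, 2, 2)]]
    canvas, shifted = mosaic_2x2(imgs, bxs)
    for row in canvas:
        print(row)
    print(shifted)
```

Because objects from four different scenes end up in one training image, the detector sees them next to backgrounds and neighbors they would not normally appear with.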
Detector BoF highlights include:
- Complete IoU loss
- Cross mini-Batch Normalization (NEW: A modified Cross-Iteration Batch Normalization)
- DropBlock regularization
- Mosaic data augmentation
- Self-Adversarial Training (NEW: A new data augmentation where the network first alters the image in an adversarial attack on itself before training to detect an object on this modified image.)
- Eliminate grid sensitivity
- Using multiple anchors for a single ground truth
- Cosine annealing scheduler
- Optimal hyperparameters
- Random training shapes
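Of the detector BoFs, Complete IoU (CIoU) loss is the easiest to show in a few lines. The sketch below follows the commonly published formulation: one minus IoU, plus a normalized center-distance term, plus an aspect-ratio term. The function name `ciou_loss` and the corner-format boxes are my assumptions, and this is not the Darknet implementation.

```python
import math


def ciou_loss(pred, target):
    """CIoU loss for two boxes given as (x_min, y_min, x_max, y_max).

    Assumes both boxes have positive width and height.
    """
    px0, py0, px1, py1 = pred
    tx0, ty0, tx1, ty1 = target

    # Plain IoU.
    inter_w = max(0.0, min(px1, tx1) - max(px0, tx0))
    inter_h = max(0.0, min(py1, ty1) - max(py0, ty0))
    inter = inter_w * inter_h
    union = (px1 - px0) * (py1 - py0) + (tx1 - tx0) * (ty1 - ty0) - inter
    iou = inter / union

    # Squared distance between box centers.
    rho2 = ((px0 + px1) / 2 - (tx0 + tx1) / 2) ** 2 \
         + ((py0 + py1) / 2 - (ty0 + ty1) / 2) ** 2

    # Squared diagonal of the smallest box enclosing both.
    c2 = (max(px1, tx1) - min(px0, tx0)) ** 2 \
       + (max(py1, ty1) - min(py0, ty0)) ** 2

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4 / math.pi ** 2) * (
        math.atan((tx1 - tx0) / (ty1 - ty0))
        - math.atan((px1 - px0) / (py1 - py0))
    ) ** 2
    alpha = v / ((1 - iou) + v) if v > 0 else 0.0

    return 1 - iou + rho2 / c2 + alpha * v


if __name__ == "__main__":
    print(ciou_loss((0, 0, 2, 2), (0, 0, 2, 2)))  # identical boxes
    print(ciou_loss((0, 0, 1, 1), (2, 2, 3, 3)))  # disjoint boxes
```

Unlike a plain IoU loss, the center-distance term still gives a useful signal when the predicted box does not overlap the ground truth at all.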
Bags of Specials (BoS)
Bags of Specials are plugin modules and post-processing methods that add a small amount of inference cost but significantly improve accuracy. Like BoFs, there are different BoSs used for the backbone and the detector.
Backbone BoS highlights include:
- Mish activation
- Cross-stage partial connections (CSP)
- Multi-input weighted residual connections (MiWRC)
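Mish is simple enough to write out directly: it is the input multiplied by the hyperbolic tangent of its softplus. Here is a small sketch using only the standard library. The helper names are mine, and the softplus is written in a numerically stable form.

```python
import math


def softplus(x):
    """Numerically stable ln(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


if __name__ == "__main__":
    for value in (-5.0, -1.0, 0.0, 1.0, 5.0):
        print(value, round(mish(value), 4))
```

For large positive inputs Mish behaves like the identity, while for negative inputs it dips slightly below zero before flattening out instead of clamping hard to zero like ReLU.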
Detector BoS highlights include:
- Mish activation
- SPP-block
- NEW: A modified SAM-block
- NEW: A modified PAN path-aggregation block
- DIoU-NMS
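DIoU-NMS is a post-processing step, so it is easy to sketch outside of a network. Compared with standard NMS, the overlap score used for suppression is IoU minus a normalized center-distance penalty, so two boxes with far-apart centers are less likely to suppress each other. The names `diou` and `diou_nms`, the tuple layout, and the 0.5 threshold below are my assumptions for illustration.

```python
def diou(a, b):
    """IoU minus the normalized squared center distance of two boxes.

    Boxes are (x_min, y_min, x_max, y_max) with positive area.
    """
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    iou = inter / union

    rho2 = ((a[0] + a[2]) / 2 - (b[0] + b[2]) / 2) ** 2 \
         + ((a[1] + a[3]) / 2 - (b[1] + b[3]) / 2) ** 2
    c2 = (max(a[2], b[2]) - min(a[0], b[0])) ** 2 \
       + (max(a[3], b[3]) - min(a[1], b[1])) ** 2
    return iou - rho2 / c2


def diou_nms(detections, threshold=0.5):
    """Greedy NMS over (box, score) pairs using DIoU as the overlap test."""
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        # Keep the box only if it is not too close to any higher-scoring kept box.
        if all(diou(box, kept_box) < threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept


if __name__ == "__main__":
    dets = [
        ((0, 0, 10, 10), 0.9),
        ((1, 1, 11, 11), 0.8),    # near-duplicate of the first box
        ((50, 50, 60, 60), 0.7),  # separate object
    ]
    for box, score in diou_nms(dets):
        print(box, score)
```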
Digging in with FiftyOne
Let’s see how YOLOv4 compares with YOLOv2 by loading up MS COCO validation in FiftyOne.
Tighter Boxes
With YOLOv4 comes a significant improvement in the tightness of bounding boxes.
YOLOv4 has more True Positives
After thresholding both YOLOv4 and YOLOv2 to roughly 29,000 detections each, YOLOv4 had 8,000 more true positives than YOLOv2 at an IoU of 0.75. Since both models contribute about the same number of detections, those extra true positives mean YOLOv4 also had significantly fewer false positives.
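The counting behind numbers like these can be sketched in plain Python. The snippet below greedily matches detections to ground truth of the same class, highest confidence first, and counts a true positive only when the IoU reaches 0.75. It is a simplified single-image sketch with hypothetical names and tuple layouts, not the evaluation code used for this post.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def count_tp_fp(detections, ground_truth, iou_threshold=0.75):
    """Count true and false positives for one image.

    detections:   list of (label, score, box)
    ground_truth: list of (label, box)
    Each ground truth box can be matched by at most one detection.
    """
    matched = set()
    true_positives = 0
    for label, _score, box in sorted(detections, key=lambda d: d[1], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, (gt_label, gt_box) in enumerate(ground_truth):
            if idx in matched or gt_label != label:
                continue
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)
            true_positives += 1
    return true_positives, len(detections) - true_positives


if __name__ == "__main__":
    gt = [("car", (0, 0, 10, 10))]
    tight = [("car", 0.9, (0, 0, 10, 9))]  # IoU 0.9
    loose = [("car", 0.9, (0, 0, 10, 7))]  # IoU 0.7
    print(count_tp_fp(tight, gt))
    print(count_tp_fp(loose, gt))
```

The two example detections are both roughly on the car, but only the tighter one counts as a true positive at an IoU of 0.75. This is why box tightness shows up so directly in the true positive counts.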
False Positives by class
While YOLOv4 has significantly fewer false positives overall than YOLOv2, the percentage of false positives by class varies. Even though fewer “car” detections were missed by YOLOv4 in total, the percentage of “cars” missed compared to other classes was higher than by YOLOv2.
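The difference between absolute counts and per-class percentages is easy to see with a small sketch. The helper `false_positive_share` and the label lists below are hypothetical and made up purely for illustration. They are not the numbers from this comparison.

```python
from collections import Counter


def false_positive_share(fp_labels):
    """Map each class label to its fraction of all false positives."""
    counts = Counter(fp_labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical false positive labels for two models.
    model_a_fps = ["car"] * 10 + ["person"] * 90
    model_b_fps = ["car"] * 8 + ["person"] * 32

    print(false_positive_share(model_a_fps)["car"])
    print(false_positive_share(model_b_fps)["car"])
```

In this made-up example the second model has fewer “car” errors in absolute terms (8 versus 10), yet “car” makes up a larger share of its errors because its total error count dropped even more.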
Conclusion
The improvements in YOLOv4 produce a high-performing model that is designed to be friendly to users with limited computing resources. Analyzing the results with FiftyOne shows that the distribution of false positives by class is similar to that of YOLOv2, which suggests the mAP increases are likely due to tighter bounding boxes.
If you want to look through the outputs of YOLOv4 yourself, you can load them up here: https://github.com/voxel51/fiftyone-examples/blob/master/examples/comparing_YOLO_and_EfficientDet.ipynb
References
[1] Alexey Bochkovskiy et al., YOLOv4: Optimal Speed and Accuracy of Object Detection (2020)
[2] Ritesh Kanjee, YOLOv5 Controversy — Is YOLOv5 Real? (2020)
About Me
My name is Eric Hofesmann. I received my master’s in Computer Science, specializing in computer vision, at the University of Michigan. During my graduate studies, I realized that it was incredibly difficult to thoroughly analyze a new model or method without serious scripting to visualize and search through outputs. Working at the computer vision startup Voxel51, I helped develop FiftyOne so that researchers, myself included, can quickly load up and start looking through datasets and model results. This series of posts goes through state-of-the-art computer vision models and datasets and analyzes them with FiftyOne.