Fifteen Minutes with FiftyOne: YOLOv4

A closer look at one of the fastest object detectors yet

Published in Voxel51 · 4 min read · Aug 29, 2020

This is the first post in a series where I will be taking a look at state-of-the-art computer vision models and datasets through the new tool FiftyOne. FiftyOne lets you easily visualize datasets and labels to find interesting artifacts in your model predictions or annotations.

YOLOv4

YOLOv4 [1] is a recent installment of the single-shot YOLO object detection model that came out earlier this year. There has been some controversy around YOLOv5 but that is a story for a different time [2]. This iteration of YOLO comes loaded with architectural optimizations and loads and loads of new tricks and methods.

YOLOv4 has a high mAP on the MS COCO dataset at speeds of 70 to 120 FPS and is designed to be trained and used on a single GPU!

The guts of YOLOv4

The improvements behind YOLOv4 fall under three categories: architectural updates, Bags of Freebies, and Bags of Specials.

Architecture

The architecture of YOLOv4 consists of a backbone (like a ResNet trained on ImageNet), a head that uses the backbone's feature maps to detect objects (like YOLO and SSD), and a neck that connects the two (like Feature Pyramid Networks).

In particular, the selected architectures are:

Backbone: CSPDarknet53 (A different backbone is used for the CPU version)
Neck: SPP, PAN
Head: YOLOv3

Bags of Freebies (BoF)

Bags of Freebies are training methods that improve performance but are “free” during inference time. There are different BoFs used for training the backbone and the detector. Methods or augmentations that are novel to YOLOv4 are marked with “NEW”.

Backbone BoF highlights include:

CutMix data augmentation
Mosaic data augmentation (NEW: Mixes 4 different training images, allowing detection of objects outside their normal context)
DropBlock regularization
Class label smoothing
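
To make the Mosaic idea concrete, here is a minimal sketch in plain Python. It is a simplification, not the YOLOv4 implementation: it assumes four images of identical size stored as lists of rows, tiles them in a fixed 2x2 grid, and shifts each image's boxes by its quadrant offset. The random scaling and random join point used in practice are left out, and the function name mosaic_2x2 is made up for this example.

```python
def mosaic_2x2(images, boxes_per_image):
    """Tile four equally sized images into one 2x2 mosaic and shift their boxes.

    images: four images, each a list of H rows with W pixel values per row.
    boxes_per_image: for each image, a list of (x_min, y_min, x_max, y_max)
        boxes in pixel coordinates of that image.
    Returns the mosaic image and the combined list of shifted boxes.
    """
    if len(images) != 4 or len(boxes_per_image) != 4:
        raise ValueError("mosaic needs exactly four images and four box lists")
    height, width = len(images[0]), len(images[0][0])
    for img in images:
        if len(img) != height or any(len(row) != width for row in img):
            raise ValueError("this sketch assumes all four images share one size")

    top_left, top_right, bottom_left, bottom_right = images
    # Join rows side by side, then stack the top half on the bottom half.
    canvas = [left + right for left, right in zip(top_left, top_right)]
    canvas += [left + right for left, right in zip(bottom_left, bottom_right)]

    # Boxes move by the pixel offset of the quadrant their image landed in.
    offsets = [(0, 0), (width, 0), (0, height), (width, height)]
    merged_boxes = []
    for (dx, dy), boxes in zip(offsets, boxes_per_image):
        for x0, y0, x1, y1 in boxes:
            merged_boxes.append((x0 + dx, y0 + dy, x1 + dx, y1 + dy))
    return canvas, merged_boxes
```

Because every training sample now holds objects from four scenes, each object is surrounded by context it would not normally appear in.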

Detector BoF highlights include:

Complete IoU loss
Cross mini-Batch Normalization (NEW: A modified Cross-Iteration Batch Normalization)
DropBlock regularization
Mosaic data augmentation
Self-Adversarial Training (NEW: A new data augmentation where the network first alters the image in an adversarial attack on itself before training to detect an object on this modified image.)
Eliminate grid sensitivity
Using multiple anchors for a single ground truth
Cosine annealing scheduler
Optimal hyperparameters
Random training shapes
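
Of the detector freebies above, Complete IoU (CIoU) loss relates most directly to box tightness. The sketch below follows the commonly published CIoU formula (1 - IoU, plus a normalized center-distance term, plus an aspect-ratio term); it is an illustration in plain Python, not the code YOLOv4 trains with. The function name, the box layout (x_min, y_min, x_max, y_max), and the eps guard are assumptions for this example.

```python
import math


def ciou_loss(pred, target, eps=1e-9):
    """Complete IoU loss between two boxes given as (x_min, y_min, x_max, y_max)."""
    px0, py0, px1, py1 = pred
    tx0, ty0, tx1, ty1 = target
    pw, ph = px1 - px0, py1 - py0
    tw, th = tx1 - tx0, ty1 - ty0

    # Plain IoU: intersection area over union area.
    inter_w = max(0.0, min(px1, tx1) - max(px0, tx0))
    inter_h = max(0.0, min(py1, ty1) - max(py0, ty0))
    inter = inter_w * inter_h
    union = pw * ph + tw * th - inter
    iou = inter / (union + eps)

    # Squared distance between the two box centers.
    center_dist_sq = ((px0 + px1) / 2 - (tx0 + tx1) / 2) ** 2 + (
        (py0 + py1) / 2 - (ty0 + ty1) / 2
    ) ** 2
    # Squared diagonal of the smallest box enclosing both boxes.
    enclose_w = max(px1, tx1) - min(px0, tx0)
    enclose_h = max(py1, ty1) - min(py0, ty0)
    diag_sq = enclose_w ** 2 + enclose_h ** 2 + eps

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4 / math.pi ** 2) * (
        math.atan(tw / (th + eps)) - math.atan(pw / (ph + eps))
    ) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + center_dist_sq / diag_sq + alpha * v
```

Unlike a plain IoU loss, this still gives a useful signal when the boxes do not overlap at all, because the center-distance term keeps shrinking as the prediction moves toward the target.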

Bags of Specials (BoS)

Bags of Specials are plugin modules and post-processing methods that add a small cost at inference time but noticeably improve accuracy. Like BoFs, there are different BoSs used for the backbone and the detector.

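Backbone BoS highlights include:

Mish activation
Cross-stage partial connections (CSP)
Multi-input weighted residual connections (MiWRC)

Detector BoS highlights include: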

Mish activation
SPP-block
NEW: A modified SAM-block
NEW: A modified PAN path-aggregation block
DIoU-NMS
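
As a small illustration of one item on this list, Mish is defined as x * tanh(softplus(x)), where softplus(x) = ln(1 + e^x). The scalar sketch below uses only the Python standard library; a real network would apply it elementwise to tensors, and the numerically stable softplus form is a choice made for this example.

```python
import math


def softplus(x):
    # Stable form of ln(1 + e^x) that avoids overflow for large |x|.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    # Mish: smooth, non-monotonic, and close to the identity for large positive x.
    return x * math.tanh(softplus(x))
```

Unlike ReLU, Mish lets small negative values pass through slightly instead of zeroing them out.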

Digging in with FiftyOne

Let’s see how YOLOv4 compares with YOLOv2 by loading up MS COCO validation in FiftyOne.

Tighter Boxes

With YOLOv4 comes a significant improvement in the tightness of bounding boxes.

Predictions visualized in FiftyOne | Blue: YOLOv2 | Green: YOLOv4

YOLOv4 has more predictions with an IoU > 0.8 than YOLOv2

YOLOv4 has more True Positives

After thresholding the predictions of both YOLOv4 and YOLOv2 to keep roughly 29,000 detections each, YOLOv4 had 8,000 more true positives than YOLOv2 at an IoU of 0.75. Because both models were held to about the same number of detections, YOLOv4 also had significantly fewer false positives.
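
For readers who want to see what counting true positives at an IoU of 0.75 involves, here is a rough sketch for a single image. It assumes a simple greedy matching rule (highest-confidence detections first, each ground-truth box used at most once, labels must agree), which may differ in detail from the evaluation behind the numbers above. The tuple layouts and function names are made up for this example.

```python
def iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_tp_fp(detections, ground_truth, iou_threshold=0.75):
    """Count true and false positives for one image.

    detections: list of (label, confidence, box).
    ground_truth: list of (label, box).
    """
    matched = set()  # indices of ground-truth boxes already claimed
    tp = fp = 0
    # Visit detections from most to least confident.
    for label, _confidence, box in sorted(detections, key=lambda d: d[1], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, (gt_label, gt_box) in enumerate(ground_truth):
            if idx in matched or gt_label != label:
                continue
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)
            tp += 1
        else:
            fp += 1
    return tp, fp
```

With a threshold this strict, a loose box around the right object counts as a false positive, so tighter boxes translate directly into more true positives.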

False Positives by class

While YOLOv4 has significantly fewer false positives overall than YOLOv2, the percentage of false positives by class varies. Even though fewer “car” detections were missed by YOLOv4 in total, “car” made up a higher percentage of its misses, relative to other classes, than it did for YOLOv2.
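
The difference between a raw count and a per-class share is easy to mix up, so here is a tiny sketch of how a per-class distribution can be computed from a list of false-positive labels. The helper name and the numbers in the usage note are invented for illustration and are not results from this comparison.

```python
from collections import Counter


def false_positive_share(fp_labels):
    """Return each class's fraction of all false positives."""
    counts = Counter(fp_labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}
```

With made-up inputs, a model with 4 “car” false positives out of 10 has a car share of 0.4, while a model with 3 out of 5 has a share of 0.6: fewer car errors in absolute terms, but a larger slice of that model's mistakes.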

False-positive class distribution displayed in FiftyOne

Conclusion

The improvements to YOLOv4 produced a high-performing model that is designed to be friendly to users with limited computing resources. After analyzing the results with FiftyOne, the distribution of false positives by class appears similar to that of YOLOv2, indicating that the mAP increases are likely due to tighter bounding boxes.

If you want to look through the outputs of YOLOv4 yourself, you can load them up here: https://github.com/voxel51/fiftyone-examples/blob/master/examples/comparing_YOLO_and_EfficientDet.ipynb

References

[1] Alexey Bochkovskiy, et al., YOLOv4: Optimal Speed and Accuracy of Object Detection (2020)

[2] Ritesh Kanjee, YOLOv5 Controversy — Is YOLOv5 Real? (2020)

About Me

My name is Eric Hofesmann. I received my master’s in Computer Science, specializing in computer vision, at the University of Michigan. During my graduate studies, I realized that it was incredibly difficult to thoroughly analyze a new model or method without serious scripting to visualize and search through outputs. Working at the computer vision startup Voxel51, I helped develop FiftyOne to help researchers and myself quickly load up and start looking through datasets and model results. This series of posts goes through state-of-the-art computer vision models and datasets and analyzes them with FiftyOne.