Published in Nerd For Tech

YOLOv4 Paper Summary & Analysis


Discussion led by Katie Yang, Evelyn Wu, and Wendy Huang, Intelligent Systems subteam

Objectives of the Paper

  • Develop a real-time object detector that can be trained on a standard GPU. The authors explore the accuracy and speed tradeoffs of adding new features such as Mosaic data augmentation, Mish activation, and DropBlock regularization to the YOLOv3 architecture, which is modified to accommodate them. YOLOv4 aims for both high accuracy and real-time detection, since most of the accurate models are not real-time.
  • Test a wide variety of new features, and combinations of them, that are claimed to improve CNN accuracy on large datasets.
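
Of the features named above, Mish activation is the easiest to write down. The following is a minimal sketch, assuming the standard definition mish(x) = x * tanh(softplus(x)), which this summary does not state. It works on plain Python floats rather than tensors, and the helper name `softplus` is our own choice.

```python
import math


def softplus(x: float) -> float:
    # Numerically stable form of log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x: float) -> float:
    # Assumed definition: mish(x) = x * tanh(softplus(x)).
    return x * math.tanh(softplus(x))
```

For large positive inputs Mish behaves almost like the identity, and for large negative inputs it approaches zero from below, so unlike ReLU it lets a small negative signal through.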

YOLOv4 is a one-stage object detection model that builds on the original YOLO models. Modern object detectors are usually composed of two components, a backbone and a head. The backbone is typically pre-trained on a large image classification dataset, usually ImageNet, and serves to encode relevant information about the input. The head predicts object classes and bounding box information. The paper also identifies a “neck”, which the authors define as the layers between the backbone and head that collect feature maps from different stages of the network.

The paper also collects methods that it categorizes into “bag-of-freebies” (BoF) and “bag-of-specials” (BoS). BoF are training methods that change only the training strategy or increase only the training cost, leaving inference cost unchanged. BoS are plugin modules and post-processing methods that increase inference cost by a small amount but can provide meaningful gains in model accuracy.

Paper Contributions

What methods did the paper propose to address the problem?
The paper proposed several training-time techniques to make model training more effective without increasing the model’s demands on computing power and RAM. The most successful methods mentioned by the authors were Mosaic data augmentation, Self-Adversarial Training (SAT), and Cross mini-Batch Normalization (CmBN). Mosaic forms each training sample by combining four individual images into one. SAT is a two-stage training regimen: the network first alters the input image to make it appear that the target object is absent, and is then trained to detect the object in that altered image. CmBN is a batch normalization variant that collects normalization statistics across the mini-batches within a single batch, rather than from one mini-batch at a time.
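
To make the Mosaic idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes images are 2D lists of pixel values at least as large as the output, that the output is at least 4 pixels on each side, and that each quadrant is filled with a simple top-left crop. The function name `make_mosaic` is hypothetical, and the bounding-box remapping that a real detector pipeline needs is left out.

```python
import random


def make_mosaic(images, out_h, out_w, rng=None):
    """Combine four 2D images (lists of rows) into one out_h x out_w sample."""
    if len(images) != 4:
        raise ValueError("a mosaic needs exactly four images")
    rng = rng or random.Random()

    # Random split point, kept away from the borders so that
    # every quadrant receives at least one pixel.
    cy = rng.randint(out_h // 4, 3 * out_h // 4)
    cx = rng.randint(out_w // 4, 3 * out_w // 4)

    # (row_start, row_end, col_start, col_end) for each quadrant:
    # top-left, top-right, bottom-left, bottom-right.
    quadrants = [
        (0, cy, 0, cx),
        (0, cy, cx, out_w),
        (cy, out_h, 0, cx),
        (cy, out_h, cx, out_w),
    ]

    canvas = [[0] * out_w for _ in range(out_h)]
    for img, (r0, r1, c0, c1) in zip(images, quadrants):
        for r in range(r0, r1):
            for c in range(c0, c1):
                # Simplifying assumption: copy the top-left crop of the source.
                canvas[r][c] = img[r - r0][c - c0]
    return canvas
```

Because one sample now contains content from four images, the model sees objects in more varied contexts and at varied positions within a single training step.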

How are the paper’s contributions different from previous related works?
The paper builds on multiple prior works in neural networks and computer vision to improve the overall performance of the detector. It does not propose a grand new architecture; rather, it combines existing findings from the field to build a stronger and more accessible model. In particular, it runs twice as fast as EfficientDet with comparable performance, and it improves YOLOv3’s AP and FPS by 10% and 12% respectively. All of this was done with training on a single GPU, which is much more accessible to individuals and puts training a real-time detection system within reach of (almost) everyone.

How did the paper assess its results?
The paper plotted its results against other state-of-the-art methods on the tradeoff curve between speed and accuracy, checking where YOLOv4 sits relative to the Pareto optimality curve. In addition, the authors reported AP (Average Precision) values at different IoU thresholds, such as AP50 and AP75. Compared with the previously mentioned methods, YOLOv4 placed higher on the overall optimality curve, and although the individual data augmentation techniques did not bring major improvements on their own, the combined method did quite well.
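
The notion of Pareto optimality used here can be illustrated with a short sketch. A detector is on the Pareto front if no other detector is at least as fast and at least as accurate while being strictly better on one of the two. The function name `pareto_front` and the example numbers are hypothetical and are not results from the paper.

```python
def pareto_front(models):
    """Return the (name, fps, ap) entries not dominated by any other entry.

    Higher is better for both fps and ap.
    """
    front = []
    for name, fps, ap in models:
        dominated = any(
            o_fps >= fps and o_ap >= ap and (o_fps > fps or o_ap > ap)
            for _, o_fps, o_ap in models
        )
        if not dominated:
            front.append((name, fps, ap))
    # Sort by speed so the front reads as a tradeoff curve.
    return sorted(front, key=lambda m: m[1])


# Made-up numbers for illustration only, not measurements from the paper.
example = [("A", 30, 40.0), ("B", 60, 38.0), ("C", 25, 39.0), ("D", 60, 35.0)]
front = pareto_front(example)
```

In this toy data, C is dominated by A (slower and less accurate) and D is dominated by B (same speed, less accurate), so only A and B remain on the front.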

Paper Limitations, Further Research, and/or Potential Applications

One of the biggest contributions of this paper is that it makes the YOLO models more practical to use for object detection. Because YOLOv4 was developed so that it could be trained and tested on only one GPU, it reduces the computational resources needed to use the model.

While the proposed framework produces state-of-the-art results at high speeds, the models were only trained on a single GPU. The results from these experiments are quite promising, but in practice there are few instances where one is limited to a single GPU during training, as opposed to inference. This raises the question of what results could be achieved when training with multiple GPUs, which also points to a potential application: scaling up the framework for industry-standard model training.




Cornell Data Science

Cornell Data Science is an engineering project team @Cornell that seeks to prepare students for a career in data science.
