YOLOv4 — Version 0: Introduction

An Introductory Guide on the Fundamentals and Algorithmic Flows of YOLOv4 Object Detector

Shreejal Trivedi
VisionWizard
6 min read · May 19, 2020


Source: Photo by Joanna Kosinska on Unsplash

First introduced in 2015, YOLO quickly rose to fame as one of the fastest dense object detectors, with surprisingly fast inference speed and decent accuracy. Until last year, it remained the king of one-stage object detectors.

This year, it has returned as the boss of one-shot object detectors. Yes, YOLOv4¹ has arrived with demanding and interesting upgrades, proving itself state-of-the-art in both accuracy and inference speed.

Revised and novel algorithms/modules are added in this version, so we have decided to break down this research into five explanatory parts. This will help readers understand and grasp the updated and newly added architectural designs.

The flow given below will be followed throughout this mini-series on YOLOv4.

YOLOv4 — Version 0: Introduction

YOLOv4 — Version 1: Bag of Freebies

YOLOv4 — Version 2: Bag of Specials

YOLOv4 — Version 3: Proposed Workflow

YOLOv4 — Version 4: Final Verdict

This article sheds light upon architectural flows used in the network design of YOLOv4.

After finishing this blog, you will have an overview of the network design and optimization strategies presented in the paper; the further articles in the series will discuss each of them in detail.

1. Introduction to Object Detectors

  • Conventional object detectors comprise two parts: a Backbone used for feature extraction, and a Head used for loss calculation during training and for predictions at the time of inference.
  • Some of the famous backbones used are VGG², ResNet³, DenseNet⁴, etc. Also, different heads suit particular networks. For example, the most famous heads for two-stage sparse object detectors are Fast-RCNN⁵, Faster-RCNN⁶, Mask-RCNN⁷, etc. Similarly, some of the different dense detector heads are RPN⁸, YOLO⁹, SSD¹⁰, etc.
  • A Neck has been introduced in recent detectors. It is attached directly to the backbone to enhance the richness and semantic representation of extracted features for objects of different shapes and sizes.
Fig. 1: Flow of Object Detection Process.
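The backbone → neck → head flow above can be sketched in code. This is a minimal illustrative skeleton with placeholder classes and made-up names, not code from any real detector library:

```python
# A minimal sketch of the backbone -> neck -> head detection flow.
# All class names and return values here are illustrative placeholders.

class Backbone:
    """Extracts feature maps at several scales (e.g. strides 8, 16, 32)."""
    def forward(self, image):
        # A real backbone (VGG, ResNet, CSPDarknet53, ...) would return
        # progressively downsampled feature maps.
        return {"stride8": "C3", "stride16": "C4", "stride32": "C5"}

class Neck:
    """Fuses multi-scale features to enrich their semantic content."""
    def forward(self, features):
        # A real neck (FPN/PAN-style) would do top-down and bottom-up fusion.
        return {k: f"fused({v})" for k, v in features.items()}

class Head:
    """Turns fused features into box/class predictions (and losses in training)."""
    def forward(self, features):
        return [f"predictions_from_{k}" for k in features]

def detect(image):
    features = Backbone().forward(image)
    fused = Neck().forward(features)
    return Head().forward(fused)
```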

2. Network Definition of YOLOv4

  • Numerous architectural design candidates were shortlisted for the YOLOv4 model and then narrowed down to the baselines given below.
Fig. 2: Modules listed and used during the ablation study of YOLOv4

— Backbone

Fig. 3: Different backbone configurations shortlisted for YOLOv4 [Source¹]
  • Three different backbones were shortlisted, as shown in Figure 3. A good detector backbone is one that has a large number of parameters and a large receptive field. Yet, other aspects such as FPS, number of floating-point operations (FLOPs), etc., should also be taken into consideration.
  • After rigorous analysis of different parameters on standard benchmarks, the authors finalized CSPDarknet53¹¹ as the network backbone of the YOLOv4 architecture.

— Neck

Fig. 4 (a) Original PAN (b) Modified PAN
  • As stated above, the neck is attached to the backbone to extract rich semantic features that are further used for accurate predictions. One important benchmark for this is the receptive field (you can get a good amount of information from this link).
  • After extensive testing, Spatial Pyramid Pooling (SPP)¹² was tightly coupled with CSPDarknet53, which drastically increased the receptive field, and a Modified Path Aggregation Network¹³ was used for the pyramidal structure instead of the Feature Pyramid Network used in YOLOv3¹⁴.
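To make the SPP idea concrete, here is a minimal pure-Python sketch: stride-1 max pooling at several kernel sizes (YOLOv4 uses 5, 9, and 13) is applied to the same feature map, and the results are concatenated channel-wise, which enlarges the receptive field without changing spatial resolution. The feature map here is a plain 2D list, purely for illustration:

```python
def max_pool_same(feat, k):
    """Stride-1 max pooling with 'same' padding on a 2D feature map
    given as a list of lists; output has the same spatial size."""
    h, w = len(feat), len(feat[0])
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            window = [
                feat[i + di][j + dj]
                for di in range(-pad, pad + 1)
                for dj in range(-pad, pad + 1)
                if 0 <= i + di < h and 0 <= j + dj < w
            ]
            out[i][j] = max(window)
    return out

def spp(feat, kernels=(5, 9, 13)):
    """SPP block: concatenate the input with its pooled versions along
    the channel axis -> 4x the channels, unchanged spatial size."""
    return [feat] + [max_pool_same(feat, k) for k in kernels]
```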

— Head

  • The YOLOv3¹⁴ head is used for loss propagation and predictions.
Fig. 5 Final Baseline Architecture of YOLOv4
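For readers unfamiliar with the YOLOv3-style head, its raw outputs (tx, ty, tw, th) for each anchor are decoded into pixel-space boxes roughly as follows. This is a simplified single-box sketch of the well-known decoding equations, not the full multi-scale head:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph, stride):
    """Decode one raw YOLOv3-style prediction into a box center and size.

    (cx, cy): integer grid-cell coordinates at this scale;
    (pw, ph): anchor width/height in pixels;
    stride:   downsampling factor of this prediction scale.
    """
    bx = (sigmoid(tx) + cx) * stride   # box center x in pixels
    by = (sigmoid(ty) + cy) * stride   # box center y in pixels
    bw = pw * math.exp(tw)             # box width in pixels
    bh = ph * math.exp(th)             # box height in pixels
    return bx, by, bw, bh
```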

3. Performance Optimizations

  • Apart from different selective approaches in architecture design, the authors also added two new “Bags” of optimization procedures to be used at the time of training and inference. These are called the Bag of Freebies (BoF) and the Bag of Specials (BoS).
Fig. 6 Two types of approaches used in YOLOv4 for performance optimizations in terms of accuracy.

3.1 Bag of Freebies

Bag of Freebies is a set of training methods that let the object detector achieve better accuracy without increasing the inference cost.

  • The Bag of Freebies contains different approaches for both the backbone and the detector modules in YOLOv4.

— Backbone

  • CutMix and Mosaic augmentations, helping the detector learn different distributions of a given image under challenging circumstances such as occlusion, noise, etc.
  • DropBlock Regularization¹⁵ for learning spatially discriminative features.
  • Class Label Smoothing for better generalization on a dataset.
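Class label smoothing from the list above is easy to sketch: a one-hot target is softened so the network is never asked to be 100% confident in a single class, which tends to generalize better. A minimal example:

```python
def smooth_labels(one_hot, eps=0.1):
    """Class label smoothing: redistribute a fraction eps of the target
    mass uniformly over all classes, discouraging over-confident logits."""
    n = len(one_hot)
    return [y * (1.0 - eps) + eps / n for y in one_hot]
```

For a 4-class one-hot target `[1, 0, 0, 0]` with `eps=0.1`, the smoothed target becomes `[0.925, 0.025, 0.025, 0.025]`, still summing to 1.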

— Detector

  • CIoU Loss¹⁶ for better convergence on a bounding box regression.
  • Cross mini-Batch Normalization (CmBN)¹ for collecting statistics across the mini-batches within an entire batch, instead of within a single mini-batch.
  • Self-Adversarial Training and Mosaic augmentation for making the network robust to adversarial attacks on a CNN.
  • Grid Sensitivity Elimination for solving the problem of objects that become undetectable near grid-cell boundaries.
  • Multiple anchors for a single ground truth, for better regression stability.
  • Cosine Annealing Scheduler for adjusting the learning rate along a cosine curve during training.
  • Random anchor shapes to cover the generalized spatial sizes of objects in an image.
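Among the detector freebies above, CIoU loss deserves a concrete sketch: on top of 1 − IoU, it penalizes the normalized distance between box centers and the mismatch in aspect ratios, which speeds up bounding-box regression convergence. A pure-Python version for boxes in (x1, y1, x2, y2) format might look like this:

```python
import math

def ciou_loss(box1, box2):
    """CIoU loss between two boxes given as (x1, y1, x2, y2) corners."""
    x1, y1, x2, y2 = box1
    X1, Y1, X2, Y2 = box2
    # intersection over union
    iw = max(0.0, min(x2, X2) - max(x1, X1))
    ih = max(0.0, min(y2, Y2) - max(y1, Y1))
    inter = iw * ih
    union = (x2 - x1) * (y2 - y1) + (X2 - X1) * (Y2 - Y1) - inter
    iou = inter / union
    # squared center distance over squared enclosing-box diagonal
    rho2 = ((x1 + x2 - X1 - X2) ** 2 + (y1 + y2 - Y1 - Y2) ** 2) / 4.0
    cw = max(x2, X2) - min(x1, X1)
    ch = max(y2, Y2) - min(y1, Y1)
    c2 = cw ** 2 + ch ** 2
    # aspect-ratio consistency term and its trade-off weight
    v = (4.0 / math.pi ** 2) * (
        math.atan((X2 - X1) / (Y2 - Y1)) - math.atan((x2 - x1) / (y2 - y1))
    ) ** 2
    alpha = v / ((1.0 - iou) + v + 1e-9)
    return 1.0 - iou + rho2 / c2 + alpha * v
```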

3.2 Bag of Specials

Bag of Specials is a set of plugin and post-processing modules that increase the inference cost only slightly but can drastically improve the accuracy of the object detector.

  • Plugin modules are algorithms that increase the receptive field, strengthen feature-integration capability, etc., whereas post-processing modules filter out the detections predicted by a detector. BoS items for the backbone and the detector are listed below.

— Backbone

  • Mish activation¹⁷, used in the final inference version, which gave a ~1% increase in Top-1 accuracy.
  • Cross Stage Partial connections¹¹ for reducing the number of multiplications with convolutional filters, i.e., the time complexity.
  • Multi-input weighted residual connections (MiWRC).
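The Mish activation mentioned above is simple to write down: x · tanh(softplus(x)), a smooth, non-monotonic alternative to ReLU:

```python
import math

def mish(x):
    """Mish activation: x * tanh(softplus(x)).

    Smooth everywhere, close to identity for large positive x, and
    allows small negative values instead of hard-zeroing like ReLU.
    """
    softplus = math.log(1.0 + math.exp(x))
    return x * math.tanh(softplus)
```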

— Detector

  • Mish Activation
  • Spatial Attention Module (SAM)¹⁸, modified in YOLOv4, for emphasizing informative features at a small computational cost.
  • Spatial Pyramid Pooling(SPP)¹² for increasing the receptive field of the overall network.
  • Path Aggregation Networks(PAN)¹³ for the better concatenation of local textures and global features of an object achieving better semanticity and accuracy.
  • DIoU NMS¹⁶ to remove the confidence-degradation and occlusion problems of the standard NMS procedure.
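DIoU NMS from the list above can be sketched in a few lines: it is greedy NMS, but the suppression criterion subtracts a normalized center-distance penalty from the IoU, so overlapping boxes with well-separated centers (e.g. two occluded objects) are less likely to be wrongly suppressed. The 0.5 threshold below is illustrative:

```python
def diou(box1, box2):
    """Distance-IoU: IoU minus normalized squared center distance."""
    x1, y1, x2, y2 = box1
    X1, Y1, X2, Y2 = box2
    iw = max(0.0, min(x2, X2) - max(x1, X1))
    ih = max(0.0, min(y2, Y2) - max(y1, Y1))
    inter = iw * ih
    union = (x2 - x1) * (y2 - y1) + (X2 - X1) * (Y2 - Y1) - inter
    iou = inter / union
    rho2 = ((x1 + x2 - X1 - X2) ** 2 + (y1 + y2 - Y1 - Y2) ** 2) / 4.0
    cw = max(x2, X2) - min(x1, X1)
    ch = max(y2, Y2) - min(y1, Y1)
    return iou - rho2 / (cw ** 2 + ch ** 2)

def diou_nms(boxes, scores, threshold=0.5):
    """Greedy NMS that suppresses by DIoU instead of plain IoU.

    boxes: list of (x1, y1, x2, y2); scores: matching confidences.
    Returns the indices of the kept boxes, highest score first.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if diou(boxes[best], boxes[i]) <= threshold]
    return keep
```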

4. Some Useful Links And Extra References To This Research

[1] YOLOv4 [code][paper]

[11] CSPNet

[12] Spatial Pyramid Pooling

[13] Path Aggregation Networks

[14] YOLOv3

[15] DropBlock Regularization

[16] CIoU Loss

[17] Mish Activation Function

[18] Self Attention Module

Here ends the first part of this series, an overview of the YOLOv4 paper. In the following parts, we will cover the working and logical explanation of its modules. In-depth learning for our readers is the end goal, so go through this series by upgrading your own version from 0 to 4 and become as accurate as YOLOv4.

Stay tuned for the detailed explanatory versions of this beautiful research.

Next Article: YOLOv4 — Version 1: Bag of Freebies.

Please check out the other parts of our entire series on YOLOv4 on our page VisionWizard.

Looks like you have a real interest in quality research work if you have reached this far. If you like our content and want more of it, please follow our page VisionWizard.

Do clap if you have learned something interesting and useful. It would motivate us to curate more quality content for you guys.

Thanks for your time :)

References

[2] VGGNet

[3] ResNet

[4] DenseNet

[5] Fast-RCNN

[6] Faster-RCNN

[7] Mask-RCNN

[9] YOLO

[10] SSD
