YOLOv4 — Version 3: Proposed Workflow

An Introductory Guide on the Fundamentals and Algorithmic Flows of YOLOv4 Object Detector

Published in VisionWizard

9 min read · May 26, 2020

Source: Photo by Joanna Kosinska on Unsplash

Welcome to the mini-series on YOLOv4. This article covers the overall algorithm and the core components proposed in the final design of YOLOv4, which aims for the best trade-off between speed and accuracy.

YOLOv4 — Version 0: Introduction
YOLOv4 — Version 1: Bag of Freebies
YOLOv4 — Version 2: Bag of Specials
YOLOv4 — Version 3: Proposed Workflow
YOLOv4 — Version 4: Final Verdict

YOLOv4 is built around achieving the best possible combination of speed, accuracy and parallel computation.

Without sacrificing any of these, it surpasses YOLOv3 and other real-time object detectors by combining techniques from BoF (Bag of Freebies) and BoS (Bag of Specials) with an architecture design that yields an efficient real-time object detector.

The authors designed the overall architecture from a GPU training perspective, which is uncommon, making it feasible to reproduce the results across conventional GPU SKUs.

I believe this is a step in the right direction, as researchers are sometimes so invested in their idea that they forget the mass applicability of their work altogether.

They optimized the model architecture for efficient use of the parallel computational power of conventional GPUs, which is not the case with many other object detectors.

Selection of Architecture

The authors' objective is to find the optimal balance among the

Input network resolution,
Convolutional layer number,
Parameter number (filter_size² * filters * channel / groups),
Number of layer outputs (filters).
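
To make the parameter formula in the list above concrete, here is a minimal sketch that evaluates filter_size² * filters * channel / groups for a single convolutional layer. The function name `conv_param_count` and the example layer sizes are illustrative assumptions, and bias terms are ignored.

```python
def conv_param_count(filter_size: int, filters: int, channels: int, groups: int = 1) -> int:
    """Weights in one conv layer: filter_size^2 * filters * channels / groups (no bias)."""
    if filters % groups or channels % groups:
        raise ValueError("filters and channels must both be divisible by groups")
    return filter_size * filter_size * filters * channels // groups


# Hypothetical layer: 3x3 kernels, 32 input channels, 64 output filters.
standard = conv_param_count(3, filters=64, channels=32)
# Same layer split into 4 groups: each filter only sees 32 / 4 input channels.
grouped = conv_param_count(3, filters=64, channels=32, groups=4)

print(standard, grouped)
```

Grouping divides the weight count by the number of groups, which is why the formula carries the `/ groups` term.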

The authors performed several ablation studies using two backbones

CSPDarknet53[9]
CSPResNext50[9]

A major finding from these studies was that CSPResNext50 is better at object classification on ILSVRC2012 (ImageNet), while CSPDarknet53 is superior on COCO object detection. CSPDarknet53 was therefore chosen as the backbone for YOLOv4.

In the YOLOv4 architecture, there are four core components.

Dense Blocks
Cross Stage Partial Networks (CSP)
Path Aggregation Network (PANet)
Spatial Pyramid Pooling (SPP)

For an in-depth explanation of the components mentioned above, please visit link. We covered them in the Bag of Specials part of this mini-series.

Dense Blocks

A Dense Block contains multiple convolution layers, with each layer Hᵢ composed of batch normalization, ReLU, and then convolution. It increases the receptive field of the entire network.

Instead of using only the output of the last layer, Hᵢ takes the outputs of all previous layers as well as the original input, i.e. x₀, x₁, …, xᵢ₋₁. Each Hᵢ below outputs four feature maps. Therefore, at each layer, the number of feature maps grows by four, which is the growth rate.[13]
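
The channel bookkeeping of a Dense Block can be sketched in a few lines. This is only a counting exercise under assumed sizes (8 input feature maps, 4 layers); the function name `dense_block_channels` is hypothetical and no actual convolution is performed.

```python
def dense_block_channels(in_channels: int, num_layers: int, growth_rate: int = 4):
    """Return (feature maps seen by each layer H_i, feature maps leaving the block)."""
    inputs_per_layer = []
    channels = in_channels
    for _ in range(num_layers):
        # H_i receives the concatenation of x_0 ... x_{i-1}.
        inputs_per_layer.append(channels)
        # H_i contributes `growth_rate` new feature maps to that concatenation.
        channels += growth_rate
    return inputs_per_layer, channels


inputs_per_layer, out_channels = dense_block_channels(in_channels=8, num_layers=4)
print(inputs_per_layer, out_channels)
```

Each layer sees four more feature maps than the previous one, which is exactly the growth rate described above.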

Cross-Stage-Partial-Networks(CSP)

CSPNet separates the input feature maps of the DenseBlock into two parts.

The first part x₀’ bypasses the DenseBlock and becomes part of the input to the next transition layer.
The second part x₀’’ goes through the Dense Block as shown below.

Fig : DenseNet Block and CSPDenseNet Block[9]

This new design reduces the computational complexity by separating the input into two parts, with only one of them going through the Dense Block.
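
The split-and-merge idea can be sketched as follows. Feature maps are represented by plain labels, `toy_dense_block` is a stand-in for a real Dense Block, and the 50/50 split ratio is an assumption; the transition layers of the actual CSPNet design are left out for brevity.

```python
def csp_forward(feature_maps, block, split_ratio=0.5):
    """Split channels in two, run only the second part through `block`, then merge."""
    split = int(len(feature_maps) * split_ratio)
    bypass = feature_maps[:split]          # x0': skips the block entirely
    to_process = feature_maps[split:]      # x0'': goes through the block
    processed = block(to_process)
    return bypass + processed              # channel-wise concatenation


def toy_dense_block(maps, growth_rate=4):
    """Placeholder block: keeps its inputs and appends `growth_rate` new maps."""
    return list(maps) + [f"new{i}" for i in range(growth_rate)]


x0 = [f"c{i}" for i in range(8)]
out = csp_forward(x0, toy_dense_block)
print(out)
```

Only half of the channels reach the block here, so the block's convolutions operate on a smaller input, which is where the saving in computation comes from.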

CSPDarknet53

YOLOv4 utilizes the CSP connections above with the Darknet-53 below as the backbone in feature extraction.

Fig : Darknet53 Architecture from Yolov3 [11]

The YOLOv4 backbone design gives priority to larger receptive fields and the best feature aggregation techniques.

The receptive field in a convolutional neural network is the region of the input image that influences a single unit (one filter response) at a given layer. It grows linearly as we stack more convolutional layers, and exponentially when we stack dilated convolutions with exponentially increasing dilation rates.

Feature aggregation is the combination of features from different levels of the backbone to account for objects of varied scales in the scene.
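
The growth of the receptive field can be checked with the standard recurrence for stacked convolutions. The sketch below is a generic calculation, not code from YOLOv4; the layer stacks (five 3×3 layers, dilation rates 1, 2, 4, 8, 16) are assumed purely for illustration.

```python
def receptive_field(layers):
    """Receptive field of a stack of conv layers given as (kernel, stride, dilation)."""
    rf, jump = 1, 1
    for kernel, stride, dilation in layers:
        # Each layer widens the field by (kernel - 1) * dilation input steps,
        # scaled by the accumulated stride ("jump") of the layers before it.
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf


plain = [(3, 1, 1)] * 5                          # five ordinary 3x3 convolutions
dilated = [(3, 1, 2 ** i) for i in range(5)]     # dilation rates 1, 2, 4, 8, 16

print(receptive_field(plain), receptive_field(dilated))
```

With the same number of layers, the plain stack adds a constant amount per layer, while the dilated stack roughly doubles its reach at every layer.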

Backbones are task specific. Unlike classification, detection has specific requirements for better accuracy and speed:

Higher input network size (resolution), for detecting multiple small-sized objects
More layers, for a higher receptive field to cover the increased size of the input network
More parameters, for greater capacity of the model to detect multiple objects of different sizes in a single image.

Fig : Classification Results for different backbone[1]

The comparison in the figure above points to CSPDarknet53[9] as the better choice among the candidates for the object detection task. It has more parameters (27.6M) and a larger receptive field size (725x725).

For extensive architecture details and pretrained versions of CSPDarknet53, please refer to link.

An SPP (Spatial Pyramid Pooling)[10] block over the CSPDarknet53 has three major advantages

Significantly increases the receptive field,
Separates out the most significant context features and
Causes almost no reduction of the network operation speed.

SPP applies a slightly different strategy in detecting objects of different scales. It replaces the last pooling layer (after the last convolutional layer) with a spatial pyramid pooling layer.

In YOLO, the SPP is modified to retain the output spatial dimension. Max pooling is applied with sliding kernels of sizes such as 1×1, 5×5, 9×9 and 13×13, using stride 1 and padding so that the spatial dimension is preserved. The feature maps from the different kernel sizes are then concatenated together as the output. [12]
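
A minimal pure-Python sketch of this size-preserving SPP on a single-channel feature map is shown below. The helper names `max_pool_same` and `spp`, and the 13×13 toy input, are assumptions for illustration; a real network would do this per channel with a tensor library.

```python
def max_pool_same(fmap, k):
    """Stride-1 max pooling; windows are clipped at the border so size is unchanged."""
    h, w = len(fmap), len(fmap[0])
    r = k // 2
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [fmap[yy][xx]
                      for yy in range(max(0, y - r), min(h, y + r + 1))
                      for xx in range(max(0, x - r), min(w, x + r + 1))]
            row.append(max(window))
        out.append(row)
    return out


def spp(fmap, kernel_sizes=(1, 5, 9, 13)):
    """One pooled map per kernel size; together they form the concatenated output."""
    return [max_pool_same(fmap, k) for k in kernel_sizes]


# Toy 13x13 feature map whose values increase from top-left to bottom-right.
fmap = [[y * 13 + x for x in range(13)] for y in range(13)]
pooled = spp(fmap)
print(len(pooled), len(pooled[0]), len(pooled[0][0]))
```

Every pooled map keeps the 13×13 shape, so one input channel becomes four output channels after concatenation, each summarizing a progressively larger neighbourhood.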

The diagram below demonstrates how SPP is integrated into YOLO.[12]

Fig: Yolov4 network pathway for input 416x416 [Link]

PANet[8] is used to aggregate parameters from different backbone levels for different detector scales, instead of the FPN used in YOLOv3.

The authors refrained from using Cross-GPU Batch Normalization (CGBN or SyncBN) or expensive specialized devices. Because of this, reproducing the YOLOv4 results on a conventional graphics processor, e.g. GTX 1080Ti or RTX 2080Ti, becomes feasible.

Selection of BoF and BoS

The next step is to choose the best techniques from BoF (Bag of Freebies) and BoS (Bag of Specials). I would highly recommend going through the detailed articles on BoS and BoF before going ahead.

A short summary of the candidates considered for the YOLOv4 architecture is as follows:

Activations: ReLU, leaky-ReLU, parametric-ReLU, ReLU6, SELU, Swish, or Mish
Bounding box regression loss: MSE, IoU, GIoU, CIoU, DIoU
Data augmentation: CutOut, MixUp, CutMix
Regularization method: DropOut, DropPath, Spatial DropOut, or DropBlock
Normalization of the network activations by their mean and variance: Batch Normalization (BN) , Cross-GPU Batch Normalization (CGBN or SyncBN), Filter Response Normalization (FRN), or Cross-Iteration Batch Normalization (CBN)
Skip-connections: Residual connections, Weighted residual connections, Multi-input weighted residual connections, or Cross stage partial connections (CSP)[9].

The authors developed the training strategies using these techniques based on wide applicability and feasibility.

PReLU and SELU are more difficult to train, and ReLU6 is specifically designed for quantized networks, so the authors removed these three from the list of candidate activation functions.
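
Of the remaining candidates, Mish is the activation that ends up in the final design. Its standard definition, which is not spelled out in this article, is mish(x) = x · tanh(softplus(x)). The sketch below implements that definition for scalars; the numerically stable form of softplus is an implementation choice on my part.

```python
import math


def softplus(x: float) -> float:
    """log(1 + e^x), written so that large |x| does not overflow."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x: float) -> float:
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


for value in (-20.0, -1.0, 0.0, 1.0, 20.0):
    print(value, round(mish(value), 4))
```

Unlike ReLU, Mish is smooth and lets small negative values through, while behaving almost like the identity for large positive inputs.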

The DropBlock paper compared the method in detail against its counterparts and reported better results, and hence DropBlock was chosen as the regularization method.

SyncBN is not considered in the selection of the normalization method, since the authors' focus was on single-GPU training.

Additional improvements

In order to make the designed detector more suitable for training on a single GPU, the authors made additional design changes and improvements as follows:

They introduce a new method of data augmentation — Mosaic

Mosaic

Mosaic is a new data augmentation strategy that mixes parts from 4 training images into a single training image so that the model learns different contexts.
CutMix[3], by comparison, mixes only 2 input images.
It encourages the model to localize different types of objects in different portions of the frame.[5]
Additionally, batch normalization calculates activation statistics from 4 different images on each layer. This significantly reduces the need for a large mini-batch size.
This matters because a large batch size at high input resolution for object detection is not feasible on off-the-shelf GPUs.
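
The core of Mosaic can be sketched as stitching four images around a random split point. This is a simplified illustration, not the Darknet implementation: images are 2D lists of equal size, the function name `mosaic` and the range of the split point are assumptions, and the rescaling of images and remapping of bounding boxes that a real pipeline needs are omitted.

```python
import random


def mosaic(images, out_h, out_w, rng=None):
    """Stitch 4 images (2D lists, each at least out_h x out_w) into one image."""
    if len(images) != 4:
        raise ValueError("mosaic needs exactly 4 images")
    rng = rng or random.Random()
    # Random split point, kept away from the borders so every image contributes.
    cy = rng.randint(out_h // 4, 3 * out_h // 4)
    cx = rng.randint(out_w // 4, 3 * out_w // 4)
    canvas = [[0] * out_w for _ in range(out_h)]
    for y in range(out_h):
        for x in range(out_w):
            # Quadrant order: 0 = top-left, 1 = top-right, 2 = bottom-left, 3 = bottom-right.
            idx = (0 if y < cy else 2) + (0 if x < cx else 1)
            canvas[y][x] = images[idx][y][x]
    return canvas, (cy, cx)


# Four toy 8x8 "images", each filled with a constant value 1..4.
images = [[[value] * 8 for _ in range(8)] for value in range(1, 5)]
canvas, (cy, cx) = mosaic(images, 8, 8, random.Random(0))
for row in canvas:
    print(row)
```

Because a single training sample now carries content from four source images, each batch-normalization step sees statistics from four times as many images as the mini-batch size suggests.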

Other Augmentation strategies like MixUp, CutOut and CutMix were also considered.

Fig : Comparison between MixUp, CutOut and CutMix methods. [3]

Self-Adversarial Training (SAT)

Self-Adversarial Training (SAT) also represents a new data augmentation technique that operates in 2 forward-backward stages.

This technique uses the state of the model to expose its vulnerabilities by transforming the input image.[5] It attacks, then it protects :P

In the 1st stage, the neural network alters the original image instead of the network weights. In this way the neural network executes an adversarial attack on itself by distorting the original signal, to create the deception that there is no desired object in the image.
The first stage can be visualized as converting the original signal into its hard form.
In the 2nd stage, the neural network is trained to detect an object in this modified image (hard sample) in the normal way.

There is an entire area of research dedicated to making neural networks robust against adversarial attacks, which can alter the detections by changing as little as a single pixel. SAT is aimed at this issue.
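
The two stages can be illustrated on a toy model. The sketch below uses a tiny logistic classifier instead of a detector, and perturbs the input with a sign-of-gradient step; both choices, along with the names `sat_step`, `epsilon` and `lr`, are my assumptions and not details taken from the YOLOv4 paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_and_grads(w, x, y):
    """Binary cross-entropy for p = sigmoid(w . x); returns loss, dL/dw, dL/dx."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    tiny = 1e-12
    loss = -(y * math.log(p + tiny) + (1 - y) * math.log(1 - p + tiny))
    err = p - y
    return loss, [err * xi for xi in x], [err * wi for wi in w]


def sat_step(w, x, y, epsilon=0.1, lr=0.5):
    # Stage 1: weights stay fixed, the *image* is changed so that the loss goes up.
    _, _, grad_x = loss_and_grads(w, x, y)
    x_hard = [xi + epsilon * ((g > 0) - (g < 0)) for xi, g in zip(x, grad_x)]
    # Stage 2: an ordinary training step, but on the hardened image.
    _, grad_w, _ = loss_and_grads(w, x_hard, y)
    w_new = [wi - lr * g for wi, g in zip(w, grad_w)]
    return w_new, x_hard


w = [0.5, -0.3, 0.8]          # toy weights
x, y = [1.0, 2.0, 0.5], 1     # toy "image" and its label
w_new, x_hard = sat_step(w, x, y)

clean_loss = loss_and_grads(w, x, y)[0]
hard_loss = loss_and_grads(w, x_hard, y)[0]
after_loss = loss_and_grads(w_new, x_hard, y)[0]
print(clean_loss, hard_loss, after_loss)
```

In this toy run the loss rises after the first stage (the image became harder) and falls again after the second stage (the weights adapted to the hard sample).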

Modification of existing methods to make design suitable for efficient training and detection

Cross mini-Batch Normalization (CmBN)

A modified version of CBN (Cross-Iteration Batch Normalization)[6].
With CBN, the effective number of examples used to compute the statistics for the current iteration is k times as large as that for the original BN.
In training, the loss gradients are backpropagated to the network weights and activations at the current iteration. Those of the previous iterations are fixed and do not receive gradients. Hence, the computation cost of CBN in back-propagation is the same as that of BN.
The main difference in Cross mini-Batch Normalization is that statistics are collected across the mini-batches inside a single batch, instead of inside a single mini-batch only.
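
A rough sketch of the bookkeeping behind this idea: statistics are pooled over the mini-batches seen so far in the current batch and reset when a new batch starts. The class name `CrossMiniBatchStats` is hypothetical, activations are plain lists of floats, and any compensation for weight changes between iterations (as in CBN) is deliberately left out.

```python
class CrossMiniBatchStats:
    """Mean/variance pooled over the mini-batches of ONE batch."""

    def __init__(self):
        self.reset()

    def reset(self):
        """Call at every batch boundary so statistics never cross batches."""
        self.count = 0
        self.total = 0.0
        self.total_sq = 0.0

    def update(self, activations):
        """Add one mini-batch and return the pooled (mean, variance) so far."""
        self.count += len(activations)
        self.total += sum(activations)
        self.total_sq += sum(a * a for a in activations)
        mean = self.total / self.count
        var = max(self.total_sq / self.count - mean * mean, 0.0)
        return mean, var


def normalize(activations, mean, var, eps=1e-5):
    return [(a - mean) / (var + eps) ** 0.5 for a in activations]


stats = CrossMiniBatchStats()
first = stats.update([1.0, 2.0])     # statistics of mini-batch 1 only
second = stats.update([3.0, 4.0])    # pooled over mini-batches 1 and 2
stats.reset()                        # next batch starts from scratch
fresh = stats.update([3.0, 4.0])
print(first, second, fresh)
```

The second mini-batch is normalized with statistics computed from four samples rather than two, which mimics a larger batch without holding all of it in memory at once.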

Fig : Cross Mini Batch Normalization [1]

Modified SAM

Changes SAM from spatial-wise attention to point-wise attention by removing the pooling operations of the former[7].

Fig : Modified Spatial Attention Module(SAM) [1]

Modified PAN

Replaces the shortcut connection of PAN[8] with concatenation.
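
The difference between the two ways of merging feature maps can be shown with a toy example in which each list entry stands for one channel. The function names `shortcut_add` and `shortcut_concat` are illustrative assumptions.

```python
def shortcut_add(a, b):
    """Additive shortcut: element-wise sum, channel count stays the same."""
    if len(a) != len(b):
        raise ValueError("addition needs the same number of channels")
    return [x + y for x, y in zip(a, b)]


def shortcut_concat(a, b):
    """Concatenation: channels are stacked, so both inputs are kept intact."""
    return list(a) + list(b)


bottom_up = [1, 2, 3]       # toy per-channel values from one path
lateral = [10, 20, 30]      # toy per-channel values from the other path

added = shortcut_add(bottom_up, lateral)
concatenated = shortcut_concat(bottom_up, lateral)
print(added, concatenated)
```

Addition blends the two paths into one set of channels, while concatenation preserves both and leaves it to the following convolution to decide how to combine them, at the cost of more channels.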

Fig : Modified Path Aggregation Network Module(PAN) [1]

Proposed Architecture

After a rigorous review of techniques, the authors finalized the following architecture for YOLOv4.

YOLOv4 consists of

Backbone: CSPDarknet53 [9]
Neck: SPP [10], PAN [8]
Head: YOLOv3 [11]

YOLO v4 uses

Bag of Freebies (BoF) for backbone: CutMix and Mosaic data augmentation, DropBlock regularization, Class label smoothing
Bag of Specials (BoS) for backbone: Mish activation, Cross-stage partial connections (CSP), Multi-input weighted residual connections (MiWRC)
Bag of Freebies (BoF) for detector: CIoU-loss, CmBN, DropBlock regularization, Mosaic data augmentation, Self-Adversarial Training, Eliminate grid sensitivity, Using multiple anchors for a single ground truth, Cosine annealing scheduler, Optimal hyper-parameters, Random training shapes
Bag of Specials (BoS) for detector: Mish activation, SPP-block, SAM-block, PAN path-aggregation block, DIoU-NMS

Next Article: YOLOv4 — Version 4: Final Verdict
Stay tuned :)


Please check out the remaining parts of our series on YOLOv4 on our page, VisionWizard.
It looks like you have a real interest in quality research work if you have read this far. If you like our content and want more of it, please follow us at VisionWizard.
Thanks for your time. Stay Safe :)