YOLOv4 — Version 3: Proposed Workflow
An Introductory Guide on the Fundamentals and Algorithmic Flows of YOLOv4 Object Detector
Welcome to the mini-series on YOLOv4. This article covers the overall algorithm and the core components in the final design of YOLOv4, which aims for the best trade-off between speed and accuracy.
YOLOv4 — Version 0: Introduction
YOLOv4 — Version 1: Bag of Freebies
YOLOv4 — Version 2: Bag of Specials
YOLOv4 — Version 3: Proposed Workflow
YOLOv4 — Version 4: Final Verdict
YOLOv4 is built around achieving the best possible combination of speed, accuracy and parallel computation.
Without sacrificing either speed or accuracy, it surpasses YOLOv3 and other real-time object detectors by combining techniques from BoF (Bag of Freebies) and BoS (Bag of Specials) with an architecture design that results in an efficient real-time object detector.
The authors designed the overall architecture from a GPU training perspective, which is unusual, and this makes the results easier to reproduce across conventional GPU SKUs.
I believe this is a step in the right direction, as researchers are sometimes so invested in their idea that they forget the mass applicability of their work.
They optimized the model architecture for efficient use of the parallel computational power of conventional GPUs, which is not the case with many other object detectors.
Selection of Architecture
The authors' objective is to find the optimal balance among the
- Input network resolution,
- Convolutional layer number,
- Parameter number (filter_size² * filters * channel / groups),
- Number of layer outputs (filters).
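As a quick illustration of the parameter-number formula in the list above, here is a minimal sketch. The function name `conv_param_count` and the layer sizes are made up for this example, and biases are ignored.

```python
def conv_param_count(filter_size, filters, channels, groups=1):
    """Weights in one convolutional layer: filter_size^2 * filters * channels / groups."""
    return filter_size ** 2 * filters * channels // groups


# A 3x3 layer mapping 32 input channels to 64 filters (hypothetical sizes).
standard = conv_param_count(3, 64, 32)             # 18432 weights
# The same layer with 32 groups needs 32x fewer weights.
grouped = conv_param_count(3, 64, 32, groups=32)   # 576 weights
print(standard, grouped)
```

Grouping trades parameters for connectivity, which is one of the knobs being balanced against resolution and depth.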
The authors performed several ablation studies using two backbones
- CSPDarknet53[9]
- CSPResNext50[9]
A major finding from these studies is that CSPResNext50 is better at object classification on ILSVRC2012 (ImageNet), while CSPDarknet53 is better at object detection on COCO. CSPDarknet53 was therefore chosen as the backbone for YOLOv4.
The YOLOv4 architecture has four core components.
- Dense Blocks
- Cross Stage Partial Networks (CSP)
- Path Aggregation Network (PANet)
- Spatial Pyramid Pooling (SPP)
For an in-depth explanation of the components above, see the Bag of Specials part of this mini-series, where we covered them.
Dense Blocks
A Dense Block contains multiple convolution layers, with each layer Hi composed of batch normalization, ReLU and then convolution. It increases the receptive field of the entire network.
Instead of using only the output of the last layer, Hi takes the outputs of all previous layers as well as the original input, i.e. x₀, x₁, …, xᵢ₋₁. Each Hi below outputs four feature maps, so the number of feature maps grows by four at each layer. This is the growth rate.[13]
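The channel bookkeeping described above can be sketched in a few lines. This only counts feature maps and runs no convolutions; the function name and the starting width of 8 channels are assumptions for illustration.

```python
def dense_block_channels(initial_channels, growth_rate, num_layers):
    """Return (inputs seen by each layer H_i, total channels after the block)."""
    inputs_per_layer = []
    channels = initial_channels
    for _ in range(num_layers):
        inputs_per_layer.append(channels)  # H_i sees x0 plus every earlier output
        channels += growth_rate            # H_i contributes `growth_rate` new maps
    return inputs_per_layer, channels


# Growth rate 4, as in the text; 8 input channels is a made-up starting point.
inputs, total = dense_block_channels(initial_channels=8, growth_rate=4, num_layers=4)
print(inputs, total)  # [8, 12, 16, 20] 24
```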
Cross Stage Partial Networks (CSP)
CSPNet separates the input feature maps of the DenseBlock into two parts.
- The first part x₀’ bypasses the DenseBlock and becomes part of the input to the next transition layer.
- The second part x₀’’ goes through the Dense Block as below.
This design reduces the computational complexity by splitting the input into two parts, with only one going through the Dense Block.
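To get a rough feel for the saving, the sketch below counts multiply-accumulates per output pixel for the dense layers only, once with the full input and once with half of it routed around the block. The even split, the 3×3 kernels and all the sizes are assumptions, and transition layers are ignored, so treat the numbers as illustrative rather than as measurements of CSPDarknet53.

```python
def dense_block_macs(in_channels, growth_rate, num_layers, kernel_size=3):
    """Multiply-accumulates per output pixel for a stack of dense conv layers."""
    total = 0
    channels = in_channels
    for _ in range(num_layers):
        total += kernel_size * kernel_size * channels * growth_rate
        channels += growth_rate  # the next layer also sees this layer's output
    return total


def csp_split(in_channels):
    """Split channels into the bypass part and the part sent through the block."""
    bypass = in_channels // 2
    return bypass, in_channels - bypass


full = dense_block_macs(64, growth_rate=32, num_layers=4)
bypass, through = csp_split(64)
partial = dense_block_macs(through, growth_rate=32, num_layers=4)
print(full, partial)  # 129024 92160
```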
CSPDarknet53
YOLOv4 utilizes the CSP connections above with the Darknet-53 below as the backbone in feature extraction.
The YOLOv4 backbone design gives priority to larger receptive fields and the best feature aggregation techniques.
The receptive field in a convolutional neural network refers to the part of the image that is visible to one filter at a time. It grows linearly as we stack more stride-1 convolutional layers, and exponentially when we stack dilated convolutions whose dilation rate doubles at each layer.
Feature aggregation combines features from different levels of the backbone to account for objects of varied scales in the scene.
Backbones are task-specific. Compared with classification, detection has specific requirements for better accuracy and speed:
- Higher input network size (resolution) for detecting multiple small-sized objects
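The two growth patterns can be checked with the standard receptive-field recurrence for stride-1 convolutions, where each layer adds (kernel_size - 1) × dilation. The helper below is a small sketch; its name and the 3×3 kernels are chosen for the example.

```python
def receptive_field(layers):
    """layers: iterable of (kernel_size, dilation) pairs, all with stride 1."""
    rf = 1
    for kernel_size, dilation in layers:
        rf += (kernel_size - 1) * dilation
    return rf


# Plain 3x3 convolutions: the receptive field grows by 2 per layer.
plain = [receptive_field([(3, 1)] * n) for n in range(1, 5)]
# 3x3 convolutions with dilation 1, 2, 4, 8: it roughly doubles per layer.
dilated = [receptive_field([(3, 2 ** i) for i in range(n)]) for n in range(1, 5)]
print(plain)    # [3, 5, 7, 9]
print(dilated)  # [3, 7, 15, 31]
```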
- More layers for a higher receptive field to cover the increased size of input network
- More parameters for greater capacity of a model to detect multiple objects of different sizes in a single image.
The ablation results in Fig 2 clearly show CSPDarknet53[9] to be superior to the rest for the object detection task. It has more parameters (27.6M) and a larger receptive field size (725x725).
For extensive architecture details and pretrained versions of CSPDarknet53, please refer to the link.
The SPP (Spatial Pyramid Pooling)[10] block over CSPDarknet53 has three major advantages
- Significantly increases the receptive field,
- Separates out the most significant context features and
- Causes almost no reduction of the network operation speed.
SPP applies a slightly different strategy in detecting objects of different scales. It replaces the last pooling layer (after the last convolutional layer) with a spatial pyramid pooling layer.
In YOLO, the SPP is modified to retain the output spatial dimension. Max pooling is applied with sliding kernels of size, say, 1×1, 5×5, 9×9 and 13×13, using stride 1 and padding so that the spatial dimension is preserved. The feature maps from the different kernel sizes are then concatenated together as the output.[12]
The diagram below demonstrates how SPP is integrated into YOLO.[12]
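The YOLO-style SPP described above can be sketched in plain Python on nested lists. This is a toy version for reading, not the Darknet implementation: the function names are made up, and padding is handled by simply clipping the pooling window at the borders.

```python
def max_pool_same(fmap, k):
    """Stride-1 max pool over a 2-D list; the window is clipped at the borders."""
    h, w = len(fmap), len(fmap[0])
    r = k // 2
    return [
        [
            max(
                fmap[yy][xx]
                for yy in range(max(0, y - r), min(h, y + r + 1))
                for xx in range(max(0, x - r), min(w, x + r + 1))
            )
            for x in range(w)
        ]
        for y in range(h)
    ]


def spp(channels, kernel_sizes=(1, 5, 9, 13)):
    """Pool every channel with every kernel size and concatenate along channels."""
    out = []
    for k in kernel_sizes:
        out.extend(max_pool_same(c, k) for c in channels)
    return out


fmap = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 9]]
pooled = spp([fmap])
print(len(pooled), len(pooled[0]), len(pooled[0][0]))  # 4 4 4
```

One input channel becomes four output channels of the same height and width, which is the "spatial dimension is preserved" property the text relies on.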
PANet[8] is used for parameter aggregation from different backbone levels for different scales, instead of the FPN used in YOLOv3.
The authors refrained from using Cross-GPU Batch Normalization (CGBN or SyncBN) or expensive specialized devices. This makes it feasible to reproduce the YOLOv4 results on a conventional graphics processor, e.g. a GTX 1080Ti or RTX 2080Ti.
Selection of BoF and BoS
The next step is to choose the best techniques from BoF (Bag of Freebies) and BoS (Bag of Specials). I would highly recommend going through the detailed articles on BoS and BoF before going ahead.
A short summary of the candidates considered for the YOLOv4 architecture:
- Activations: ReLU, leaky-ReLU, parametric-ReLU, ReLU6, SELU, Swish, or Mish
- Bounding box regression loss: MSE, IoU, GIoU, CIoU, DIoU
- Data augmentation: CutOut, MixUp, CutMix
- Regularization method: DropOut, DropPath, Spatial DropOut, or DropBlock
- Normalization of the network activations by their mean and variance: Batch Normalization (BN), Cross-GPU Batch Normalization (CGBN or SyncBN), Filter Response Normalization (FRN), or Cross-Iteration Batch Normalization (CBN)
- Skip-connections: Residual connections, Weighted residual connections, Multi-input weighted residual connections, or Cross stage partial connections (CSP)[9].
The authors developed the training strategies using these techniques based on wide applicability and feasibility.
PReLU and SELU are more difficult to train, and ReLU6 is specifically designed for quantized networks, so these activations were removed from the candidate list.
The authors of DropBlock compared their method with its counterparts in detail and reported better results, so DropBlock was chosen as the regularization method.
SyncBN is not considered as a normalization method, since the authors' focus was on training with a single GPU.
Additional improvements
To make the designed detector more suitable for training on a single GPU, the authors made the following additional design changes and improvements:
They introduce a new method of data augmentation: Mosaic.
Mosaic
- Mosaic is a new data augmentation strategy that mixes parts from 4 training images into a single training image to learn different context.
- CutMix[3] mixes only 2 input images.
- It encourages the model to localize different types of object scenes in different portions of the frame.[5]
- Additionally, batch normalization calculates activation statistics from 4 different images on each layer. This significantly reduces the need for a large mini-batch size.
- This matters because a large batch size at high input resolution is not feasible for object detection on off-the-shelf GPUs.
Other augmentation strategies such as MixUp, CutOut and CutMix were also considered.
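The Mosaic idea described above can be sketched on toy grayscale images stored as nested lists. This is a simplified illustration, not the Darknet code: the function name is made up, each source image is assumed to be at least as large as the output canvas, the crop is always taken from the top-left corner, and the bounding-box remapping that a real pipeline needs is left out.

```python
import random


def mosaic(images, out_h, out_w, rng=None):
    """Tile four 2-D images around a random split point into one image."""
    assert len(images) == 4
    rng = rng or random.Random()
    # Keep the split point away from the borders so all four images contribute.
    cy = rng.randint(out_h // 4, 3 * out_h // 4)
    cx = rng.randint(out_w // 4, 3 * out_w // 4)
    # (row start, row end, col start, col end) of each quadrant on the canvas.
    quadrants = [
        (0, cy, 0, cx), (0, cy, cx, out_w),
        (cy, out_h, 0, cx), (cy, out_h, cx, out_w),
    ]
    canvas = [[0] * out_w for _ in range(out_h)]
    for img, (y0, y1, x0, x1) in zip(images, quadrants):
        for y in range(y0, y1):
            for x in range(x0, x1):
                canvas[y][x] = img[y - y0][x - x0]  # top-left crop of the source
    return canvas, (cy, cx)


# Four constant 8x8 "images" with pixel values 1..4.
images = [[[value] * 8 for _ in range(8)] for value in (1, 2, 3, 4)]
canvas, split = mosaic(images, 8, 8, rng=random.Random(0))
print(canvas[0][0], canvas[0][7], canvas[7][0], canvas[7][7])  # 1 2 3 4
```

Every training sample built this way carries content from four images, which is why batch normalization sees more varied statistics per sample.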
Self-Adversarial Training (SAT)
Self-Adversarial Training (SAT) is also a new data augmentation technique. It operates in 2 forward-backward stages.
This technique uses the state of the model to expose its vulnerabilities by transforming the input image.[5] In short, the network first attacks itself and then defends.
- In the 1st stage the neural network alters the original image instead of the network weights. In this way the neural network executes an adversarial attack on itself by distorting the original signal, to create the deception that there is no desired object on the image.
- The first stage can be viewed as converting the original signal into its hard form.
- In the 2nd stage, the neural network is trained to detect an object on this modified image (hard sample) in the normal way.
There is an entire area of research dedicated to making neural networks robust against adversarial attacks, which can alter the detections by changing as little as a single pixel. SAT is aimed at this issue.
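The two stages can be illustrated on a toy logistic-regression "detector". The sign-of-gradient perturbation used in stage 1 is an FGSM-style choice made for this sketch, not something the article or the summary above specifies, and every name and hyper-parameter here is hypothetical.

```python
import math


def predict(w, b, x):
    """Probability that the object is present, from a linear model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def bce(p, y):
    """Binary cross-entropy loss for one sample."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def sign(v):
    return (v > 0) - (v < 0)


def sat_step(w, b, x, y, eps=0.1, lr=0.5):
    # Stage 1: weights are frozen; move the input in the direction that raises the loss.
    p = predict(w, b, x)
    grad_x = [(p - y) * wi for wi in w]  # dLoss/dx for logistic regression
    x_adv = [xi + eps * sign(g) for xi, g in zip(x, grad_x)]
    # Stage 2: ordinary gradient step on the weights, using the hard sample and true label.
    p_adv = predict(w, b, x_adv)
    w_new = [wi - lr * (p_adv - y) * xi for wi, xi in zip(w, x_adv)]
    b_new = b - lr * (p_adv - y)
    return w_new, b_new, x_adv


w, b, x, y = [1.0, -2.0], 0.0, [0.5, 0.2], 1
w2, b2, x_adv = sat_step(w, b, x, y)
print(bce(predict(w, b, x), y) < bce(predict(w, b, x_adv), y))       # True: the sample got harder
print(bce(predict(w2, b2, x_adv), y) < bce(predict(w, b, x_adv), y))  # True: training reduced its loss
```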
Modification of existing methods to make design suitable for efficient training and detection
Cross mini-Batch Normalization (CmBN)
- A modified version of CBN (Cross-Iteration Batch Normalization)[6].
- With CBN, the effective number of examples used to compute the statistics for the current iteration is k times as large as that for the original BN, because statistics from the k most recent iterations are combined.
- In training, the loss gradients are backpropagated to the network weights and activations at the current iteration. Those of the previous iterations are fixed and do not receive gradients. Hence, the computation cost of CBN in back-propagation is the same as that of BN.
- The main difference in Cross mini-Batch Normalization is that it collects statistics across all the mini-batches inside a single batch, instead of inside a single mini-batch.
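One way to picture the last point is a running accumulator that pools mean and variance over the mini-batches of the current batch and is reset when the next batch starts. This is a simplified sketch for a single scalar activation, with a made-up class name; it says nothing about how Darknet organises the computation internally.

```python
import math


class CrossMiniBatchStats:
    """Pools activation statistics over the mini-batches of one batch."""

    def __init__(self):
        self.reset()

    def reset(self):
        """Call at the start of every batch."""
        self.count = 0
        self.total = 0.0
        self.total_sq = 0.0

    def update(self, values):
        """Add one mini-batch and return the pooled (mean, variance) so far."""
        self.count += len(values)
        self.total += sum(values)
        self.total_sq += sum(v * v for v in values)
        mean = self.total / self.count
        var = self.total_sq / self.count - mean * mean
        return mean, var


def normalize(values, mean, var, eps=1e-5):
    return [(v - mean) / math.sqrt(var + eps) for v in values]


stats = CrossMiniBatchStats()
print(stats.update([1.0, 3.0]))  # statistics from the first mini-batch only
print(stats.update([5.0, 7.0]))  # statistics pooled over both mini-batches
```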
Modified SAM
- Changes SAM from spatial-wise attention to point-wise attention by removing the pooling operations of the former[7].
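A point-wise attention gate can be sketched as: run a convolution over the features, squash the result with a sigmoid, and multiply it element by element with the input, with no pooling in between. The 1×1 convolution, the function name and the tensor layout below are assumptions made for this illustration.

```python
import math


def pointwise_sam(x, weight, bias):
    """x: [C][H][W] features; weight: [C][C] 1x1 conv kernel; bias: [C]."""
    channels, height, width = len(x), len(x[0]), len(x[0][0])
    out = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for o in range(channels):
        for i in range(height):
            for j in range(width):
                z = bias[o] + sum(weight[o][c] * x[c][i][j] for c in range(channels))
                gate = 1.0 / (1.0 + math.exp(-z))  # one attention value per element
                out[o][i][j] = x[o][i][j] * gate
    return out


x = [[[2.0, 4.0]], [[6.0, 8.0]]]        # C=2, H=1, W=2
zero_w = [[0.0, 0.0], [0.0, 0.0]]
print(pointwise_sam(x, zero_w, [0.0, 0.0]))  # every gate is 0.5, so the output is x / 2
```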
Modified PAN
- Replaces the shortcut connection of PAN[8] with concatenation.
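The difference between the two fusion styles is easiest to see in the output shape. The sketch below uses nested lists in [C][H][W] layout, and the function names are made up for the example.

```python
def fuse_by_addition(a, b):
    """Element-wise sum of two [C][H][W] feature maps; channel count is unchanged."""
    return [
        [[va + vb for va, vb in zip(row_a, row_b)] for row_a, row_b in zip(ch_a, ch_b)]
        for ch_a, ch_b in zip(a, b)
    ]


def fuse_by_concatenation(a, b):
    """Stack the channels of two [C][H][W] feature maps; channel counts add up."""
    return a + b


a = [[[1, 2]], [[3, 4]]]        # 2 channels, 1x2 each
b = [[[10, 20]], [[30, 40]]]    # 2 channels, 1x2 each
print(len(fuse_by_addition(a, b)), len(fuse_by_concatenation(a, b)))  # 2 4
```

Concatenation keeps both sets of features intact and leaves it to the following layers to mix them, at the cost of more channels.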
Proposed Architecture
After a rigorous review of techniques, the authors finalized the following architecture for YOLOv4.
YOLOv4 consists of
- Backbone: CSPDarknet53 [9]
- Neck: SPP [10], PAN [8]
- Head: YOLOv3 [11]
YOLOv4 uses
- Bag of Freebies (BoF) for backbone: CutMix and Mosaic data augmentation, DropBlock regularization, Class label smoothing
- Bag of Specials (BoS) for backbone: Mish activation, Cross-stage partial connections (CSP), Multi-input weighted residual connections (MiWRC)
- Bag of Freebies (BoF) for detector: CIoU-loss, CmBN, DropBlock regularization, Mosaic data augmentation, Self-Adversarial Training, Eliminate grid sensitivity, Using multiple anchors for a single ground truth, Cosine annealing scheduler, Optimal hyper-parameters, Random training shapes
- Bag of Specials (BoS) for detector: Mish activation, SPP-block, SAM-block, PAN path-aggregation block, DIoU-NMS
Next Article: YOLOv4 — Version 4: Final Verdict
Stay tuned :)
References
[1] YOLOv4: Optimal Speed and Accuracy of Object Detection. CVPR 2020
[2] Random erasing data augmentation. 2017
[3] CutMix: Regularization Strategy to Train Strong Classifiers With Localizable Features. IEEE/CVF International Conference on Computer Vision (ICCV) 2019
[4] DropBlock: A regularization method for convolutional networks. NeurIPS 2018
[5] Article of Data Augmentation in Yolov4
[6] Cross-Iteration Batch Normalization
[11] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement, 2018
[12] Article on yolov4
[14] DC-SPP-YOLO: Dense Connection and Spatial Pyramid Pooling Based YOLO for Object Detection