YOLOv4 — Version 1: Bag of Freebies

An Introductory Guide on the Fundamentals and Algorithmic Flows of YOLOv4 Object Detector

Deval Shah
VisionWizard
8 min read · May 19, 2020


Source: Photo by Joanna Kosinska on Unsplash

Let me present to you version 1 of our mini-series on YOLOv4.

YOLOv4 — Version 0: Introduction

YOLOv4 — Version 1: Bag of Freebies

YOLOv4 — Version 2: Bag of Specials

YOLOv4 — Version 3: Proposed Workflow

YOLOv4 — Version 4: Final Verdict

What is Bag Of Freebies?

The set of techniques or methods that change the training strategy or training cost to improve model accuracy is termed the Bag of Freebies.

These are measures one can take during offline training to improve overall accuracy without increasing the inference cost.

Considering that most CNNs are trained offline, these techniques are applicable to a wide variety of architectures.

You can think of ‘Bag of Freebies’ as a general framework of training strategies for improving the overall accuracy of an object detection model.

In the YOLOv4 paper, the Bag of Freebies training strategies for object detection are grouped as follows:

  • Data Augmentation
  • Semantic Distribution Bias in Datasets
  • Objective Function of BBox Regression

Each of these tackles a different set of issues. Let's dive into each one in detail.

Data Augmentation

What is Data Augmentation?

The process to increase the variability in the input images of the data, so that the designed object detection model has higher robustness to the images obtained from different environments.[1]


Widely used data augmentation methods can be categorized as follows:

  • Photometric distortions: brightness, contrast, hue, saturation, noise
  • Geometric distortions: random scaling, cropping, flipping, and rotation

The methods above make pixel-wise adjustments to the original image, and all the original information is retained in the adjusted area.
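To make these distortions concrete, here is a minimal NumPy sketch of one photometric and one geometric augmentation. The function names and the uint8 / (x1, y1, x2, y2) box conventions are illustrative assumptions, not from the paper; note that a geometric distortion must also transform the bounding boxes:

```python
import numpy as np

def adjust_brightness(img, delta):
    """Photometric distortion: shift all pixel intensities by `delta`, clipped to [0, 255]."""
    return np.clip(img.astype(np.int16) + delta, 0, 255).astype(np.uint8)

def horizontal_flip(img, boxes):
    """Geometric distortion: mirror the image and its (x1, y1, x2, y2) boxes."""
    w = img.shape[1]
    flipped_boxes = boxes.copy()
    # After mirroring, the new x1 is w - old x2 and the new x2 is w - old x1.
    flipped_boxes[:, [0, 2]] = w - boxes[:, [2, 0]]
    return img[:, ::-1], flipped_boxes
```

Photometric changes leave the boxes untouched, which is exactly why they are cheap to apply; geometric changes are where label bookkeeping comes in.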

Using data augmentation to improve the adaptability of object detection models is not a novel concept. However, over the years new techniques have been introduced, leading to decent accuracy improvements.

Object occlusion is an important issue in the object detection and classification paradigms. To tackle it, researchers have come up with new data augmentation methods.

  • Random Erasing and CutOut: randomly select a rectangular region in an image and fill it with random values or zeros. [2, 10]
  • Hide-and-Seek and GridMask: randomly or evenly select multiple rectangular regions in an image and replace them with zeros. [11, 12]
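As a rough illustration of the idea behind these methods, the following NumPy sketch (a simplified, hypothetical implementation, not the authors' code) zeroes out a single random square, CutOut-style:

```python
import numpy as np

def cutout(img, size, rng=None):
    """CutOut-style augmentation [10]: zero out one random size x size square."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = img.shape[:2]
    cy = int(rng.integers(0, h))
    cx = int(rng.integers(0, w))
    y1, y2 = max(0, cy - size // 2), min(h, cy + size // 2)
    x1, x2 = max(0, cx - size // 2), min(w, cx + size // 2)
    out = img.copy()
    out[y1:y2, x1:x2] = 0  # fill with zeros; random fill is the Random Erasing variant [2]
    return out
```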

Similar concepts can also be applied to feature maps.

  • Dropout: disables each neuron with a random probability p during training, forcing the network to learn different pathways and avoiding overfitting. [9]
  • DropConnect: generalizes Dropout by randomly dropping the weights rather than the activations. [13]
  • DropBlock: inspired by CutOut; its main difference from Dropout is that it drops contiguous regions of a feature map instead of independent random units. [5]
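A simplified sketch of the DropBlock idea on a single 2-D feature map might look like this. The centre-sampling rate `gamma` and the rescaling used here are simplified assumptions for illustration; see [5] for the exact scheme:

```python
import numpy as np

def dropblock(feat, block_size, drop_prob, rng=None):
    """DropBlock-style masking [5]: zero contiguous block_size x block_size
    regions around sampled centres of a 2-D map, then rescale the survivors."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = feat.shape
    gamma = drop_prob / block_size ** 2  # simplified per-unit centre rate
    mask = np.ones((h, w), dtype=feat.dtype)
    half = block_size // 2
    for y, x in zip(*np.nonzero(rng.random((h, w)) < gamma)):
        mask[max(0, y - half):y + half + 1, max(0, x - half):x + half + 1] = 0
    keep = mask.mean()
    return (feat * mask / keep if keep > 0 else feat * mask), mask
```

Dropping whole blocks matters for convolutional features because neighbouring units are highly correlated: dropping isolated units barely removes any information.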

Additionally, researchers have proposed using multiple images together for data augmentation.

  • MixUp: multiplies and superimposes two images with a pair of coefficient ratios, and then adjusts the labels with the same ratios. [8]
  • CutMix: pastes a cropped patch of one image onto a rectangular region of another, and adjusts the labels according to the area of the patch. [14]
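MixUp's blending step can be sketched in a few lines. In practice the weight `lam` is typically drawn from a Beta distribution; here it is passed explicitly for clarity:

```python
import numpy as np

def mixup(img_a, label_a, img_b, label_b, lam):
    """MixUp [8]: blend two images and their one-hot labels with weight lam."""
    mixed_img = lam * img_a + (1 - lam) * img_b
    mixed_label = lam * label_a + (1 - lam) * label_b
    return mixed_img, mixed_label
```

CutMix replaces the pixel blending with a pasted rectangle, but the label arithmetic is the same: the mixing weight becomes the fraction of area taken from each image.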

In addition to the above-mentioned methods, a style transfer GAN [15] can also be used for data augmentation, which effectively reduces the texture bias learned by CNNs.

Semantic Distribution Bias

What is semantic distribution bias problem?

Inherent bias is a big concern when it comes to the diversity of the datasets on which models are trained. If the distribution of the data is biased, it can skew training towards a non-optimal convergence point, failing to generalize.

The two main issues that lead to semantic distribution bias in datasets are as follows.

Data Imbalance across different classes

  • This problem is addressed in two-stage object detectors like Faster R-CNN and Libra R-CNN using hard negative example mining or online hard example mining.
  • The same is not applicable to one-stage object detectors because of their dense prediction architecture, as opposed to the sparse prediction architecture of two-stage detectors.
  • For one-stage object detectors like YOLOv3, SSD, and RetinaNet, focal loss [6] can be used to deal with the problem of class imbalance in the data.
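For reference, a minimal NumPy version of the binary focal loss from [6], using the paper's default alpha = 0.25 and gamma = 2, could look like:

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss [6]: cross-entropy scaled by (1 - p_t)^gamma so that
    easy, well-classified examples contribute little to the total loss."""
    p_t = np.where(y == 1, p, 1 - p)          # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(np.clip(p_t, 1e-8, 1.0))
```

The `(1 - p_t) ** gamma` factor is what tames the flood of easy background anchors in dense one-stage detectors.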

Failure of the one-hot hard representation to express the degree of association between different categories

  • This representation scheme is often used when labeling the data.
  • The label smoothing proposed in [17] converts the hard one-hot label into a soft label for training, which makes the model more robust.
  • It also helps training cope with mislabeled data, adding to the robustness and performance of the model. [19]
  • To obtain a better soft label, Islam et al. [18] introduced the concept of knowledge distillation to design a label refinement network.
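The hard-to-soft label conversion is a one-liner; the sketch below assumes the smoothing mass eps is spread uniformly over the k classes:

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    """Label smoothing [17]: mix a one-hot target with the uniform distribution,
    turning the hard label into a soft label."""
    k = one_hot.shape[-1]
    return one_hot * (1 - eps) + eps / k
```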

Objective Function of BBox Regression

The last bag of freebies mentioned in the paper is the objective function of Bounding Box (BBox) regression.

What is objective function?

In object detectors, the objective function, also called the loss function, is used to penalize errors and direct the model towards better convergence at each training step.

Mean Square Error (MSE): Traditional object detectors (an approach one should no longer use) perform MSE regression directly on the center-point coordinates and the height and width of the BBox, i.e. (xc, yc, w, h), or on the top-left and bottom-right coordinates (tlx, tly, brx, bry).

Anchor-based approaches instead regress offsets to these points.

When we regress on individual points, the loss does not take the integrity or coverage area of the object into account.

Furthermore, the loss grows with the scale of the object, which is not ideal.

Researchers therefore proposed the IoU loss [16], which considers the coverage area of the predicted box and the ground truth box.

Source : PyImageSearch

IoU is scale-invariant, so unlike MSE it does not increase with the scale of the box.
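A minimal IoU implementation for corner-format (x1, y1, x2, y2) boxes is shown below (illustrative only; the corresponding IoU loss is simply 1 − IoU):

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)
```

Because both intersection and union scale by the same factor, doubling the size of both boxes leaves the IoU unchanged, which is the scale-invariance property noted above.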

In recent years, improved versions of IoU-based loss functions have been proposed.

GIoU (Generalized Intersection over Union)

  • The original IoU loss only addresses overlapping bounding boxes and provides no learning signal for non-overlapping cases. To address this issue, GIoU takes the non-overlapping cases into account:

GIoU = IoU − |C \ (A ∪ B)| / |C|

  • Here A and B are the predicted and ground truth bounding boxes, and C is the smallest convex shape that encloses both A and B.
  • The idea of GIoU is to move the predicted box closer to the ground truth even when there is no overlap. With plain IoU, such a case would simply score 0, yielding no improvement in the position of the predicted box.
  • As the IoU component increases, the value of GIoU converges to IoU.[20]
GIoU loss (Equation 3 from [4]): L_GIoU = 1 − IoU + |C \ (B ∪ Bgt)| / |C|

B: predicted box; Bgt: ground truth box; C: smallest enclosing convex shape covering B and Bgt
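Putting GIoU into code, a hypothetical `giou` function (box format (x1, y1, x2, y2); not the authors' implementation, and using the smallest enclosing axis-aligned box for C) could be:

```python
def giou(a, b):
    """GIoU [20]: IoU minus the fraction of the smallest enclosing box C
    that is covered by neither input box."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # Smallest axis-aligned box enclosing both a and b.
    c_area = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (c_area - union) / c_area
```

Unlike IoU, GIoU keeps decreasing as two disjoint boxes move further apart, so the loss 1 − GIoU still provides a gradient pulling them together.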

DIoU(Distance-IoU)

  • The DIoU loss [4] additionally considers the distance between the centers of the boxes as an extra penalty term.
  • The paper introduces a penalty term on the IoU loss to directly minimize the normalized distance between the central points of the two bounding boxes, leading to much faster convergence than the GIoU loss, as shown in the figure below.
Source : Distance-IoU Loss Paper[4]
  • The penalty term directly minimizes the distance between b, the center point of the predicted bounding box, and bgt, the center point of the ground truth. The denominator c is the diagonal length of the smallest enclosing box covering the two boxes. [21]
DIoU loss (from [4]): L_DIoU = 1 − IoU + ρ²(b, bgt) / c²
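The DIoU loss can be sketched as follows (a simplified stand-alone version, assuming corner-format (x1, y1, x2, y2) boxes; not the authors' code):

```python
def diou_loss(a, b):
    """DIoU loss [4]: 1 - IoU plus the squared centre distance rho^2 divided
    by the squared diagonal c^2 of the smallest enclosing box."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    i = inter / union
    # Squared distance between the two box centres.
    rho2 = ((a[0] + a[2]) / 2 - (b[0] + b[2]) / 2) ** 2 + \
           ((a[1] + a[3]) / 2 - (b[1] + b[3]) / 2) ** 2
    # Squared diagonal of the smallest enclosing box.
    c2 = (max(a[2], b[2]) - min(a[0], b[0])) ** 2 + \
         (max(a[3], b[3]) - min(a[1], b[1])) ** 2
    return 1 - i + rho2 / c2
```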

CIoU(Complete IOU)

  • The CIoU loss [4] is an extension of the DIoU loss that simultaneously considers three metrics: the overlap area, the distance between center points, and the consistency of the aspect ratios.
CIoU loss (from [4]): L_CIoU = 1 − IoU + ρ²(b, bgt) / c² + αv

Here α is a positive trade-off parameter by which the overlap area factor is given higher priority, especially in non-overlapping cases, and v measures the consistency of the aspect ratios. [4]

From Equation 9 in [4]: v = (4/π²) (arctan(wgt/hgt) − arctan(w/h))² and α = v / ((1 − IoU) + v)

The CIoU loss achieves better convergence speed and accuracy on the BBox regression problem. [1]
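Combining the pieces, an illustrative stand-alone CIoU loss might look like the sketch below. The small `eps` guarding the α division when the boxes coincide is my addition, not part of the paper's formula:

```python
import math

def ciou_loss(a, b, eps=1e-9):
    """CIoU loss [4]: the DIoU loss plus an aspect-ratio term alpha * v."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    i = inter / union
    rho2 = ((a[0] + a[2]) / 2 - (b[0] + b[2]) / 2) ** 2 + \
           ((a[1] + a[3]) / 2 - (b[1] + b[3]) / 2) ** 2
    c2 = (max(a[2], b[2]) - min(a[0], b[0])) ** 2 + \
         (max(a[3], b[3]) - min(a[1], b[1])) ** 2
    # v measures aspect-ratio consistency; alpha is the trade-off weight.
    v = (4 / math.pi ** 2) * (math.atan((b[2] - b[0]) / (b[3] - b[1]))
                              - math.atan((a[2] - a[0]) / (a[3] - a[1]))) ** 2
    alpha = v / ((1 - i) + v + eps)
    return 1 - i + rho2 / c2 + alpha * v
```

When the two boxes share the same aspect ratio, v = 0 and CIoU reduces to DIoU, which makes the extra term easy to sanity-check.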

Source : Fig 3 from Paper[4]

Next Article: YOLOv4 — Version 2: Bag of Specials.

Stay tuned :)

Please check out the other parts of our series on YOLOv4 on our page VisionWizard.

If you have read this far, you clearly have a real interest in quality research work. If you like our content and want more of it, please follow us at VisionWizard.

Thanks for your time :)

References

[1] YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv preprint arXiv:2004.10934, 2020

[2] Random Erasing Data Augmentation. arXiv preprint arXiv:1708.04896, 2017

[3] CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features. IEEE/CVF International Conference on Computer Vision (ICCV), 2019

[4] Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression. AAAI 2020

[5] DropBlock: A Regularization Method for Convolutional Networks. NeurIPS 2018

[6] Focal Loss for Dense Object Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017

[7] Gaussian YOLOv3: An Accurate and Fast Object Detector Using Localization Uncertainty for Autonomous Driving. IEEE/CVF International Conference on Computer Vision (ICCV), 2019

[8] mixup: Beyond Empirical Risk Minimization. ICLR 2018

[9] Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 2014

[10] Improved Regularization of Convolutional Neural Networks with Cutout. arXiv preprint arXiv:1708.04552, 2017

[11] Hide-and-Seek: A Data Augmentation Technique for Weakly-Supervised Localization and Beyond. arXiv preprint arXiv:1811.02545, 2018

[12] GridMask Data Augmentation. arXiv preprint arXiv:2001.04086, 2020

[13] Regularization of Neural Networks using DropConnect. In Proceedings of the International Conference on Machine Learning (ICML), pages 1058–1066, 2013

[14] CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 6023–6032, 2019

[15] ImageNet-trained CNNs are Biased Towards Texture; Increasing Shape Bias Improves Accuracy and Robustness. International Conference on Learning Representations (ICLR), 2019

[16] UnitBox: An Advanced Object Detection Network. In Proceedings of the 24th ACM International Conference on Multimedia, pages 516–520, 2016

[17] Rethinking the Inception Architecture for Computer Vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826, 2016

[18] Label Refinement Network for Coarse-to-Fine Semantic Segmentation. arXiv preprint arXiv:1703.00551, 2017

[19] Article on smooth labeling

[20] Article on GIoU

[21] Article on DIoU
