From YOLO to YOLOv4

Pedro Azevedo
8 min read · Jun 8, 2022


YOLO Object detection explained

YOLO object detectors have become the state of the art for real-time detection. There are multiple YOLOs, and all their names can get a little confusing. This blog post aims to give the reader a brief introduction to the YOLO series, and it is a continuation of the previous blog post on object detection SOTA: https://medium.com/@pedroazevedo6/object-detection-state-of-the-art-2022-ad750e0f6003

The YOLO Series

We will start by introducing the original YOLO and go from there. This article is based on my master's thesis, so I will leave all the reference numbers as they are; you can check out each reference in the thesis linked below!

The Original YOLO

The authors of YOLO [46] proposed a different approach from the object detectors that existed at the time. Although models were increasing in accuracy each year, object detectors lacked the speed needed to perform in real time. Instead of re-purposing a classifier to perform object detection, YOLO [46] framed object detection as a regression problem to spatially separated bounding boxes with associated class probabilities. By doing so, the entire object detection pipeline became a single network that could be optimized end-to-end directly on detection performance (instead of training a classifier separately and then re-purposing it for detection). YOLO, however, suffered from several limitations, especially with small objects: due to their size, it is hard to separate them when they appear in groups. YOLO works on a grid, and each grid cell can only detect a single object, so when objects come in groups it becomes harder to detect them all. It also sacrifices accuracy for inference speed, so it is not as accurate as some of the other models of the existing SOTA. YOLO divides the image into a regular grid, as seen in Figure 2.11, and performs detection and localization within those same cells. Each cell returns three things: the bounding box coordinates with respect to the cell, the object label, and the probability that an object is present in the cell [46].

Figure 2.11: YOLO [46] divides the image into an S × S grid and for each grid cell predicts B bounding boxes.
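To make the grid concrete, here is a minimal sketch of how the original YOLO's output tensor is laid out, assuming the paper's PASCAL VOC configuration (S = 7, B = 2, C = 20, i.e. a 7 × 7 × 30 tensor); the random tensor below just stands in for a real network output:

```python
import numpy as np

S, B, C = 7, 2, 20                        # grid size, boxes per cell, classes
pred = np.random.rand(S, S, B * 5 + C)    # stand-in for the 7 x 7 x 30 output

for row in range(S):
    for col in range(S):
        cell = pred[row, col]
        class_probs = cell[B * 5:]        # P(class | object), 20 values
        for b in range(B):
            # each box carries x, y, w, h plus an objectness confidence
            x, y, w, h, conf = cell[b * 5 : b * 5 + 5]
            class_scores = class_probs * conf   # class-specific confidence
```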

This makes the algorithm fast and lowers computation cost, since detection and recognition happen in a single pass over the cells. One drawback of this approach is that it creates duplicate predictions: because each cell predicts its own bounding boxes, the same object is often predicted with multiple different boxes. All this "noise" is passed through a Non-Maximum Suppression (NMS) algorithm, which suppresses the bounding boxes with lower probability scores. In summary, YOLO divides the image into a grid of equal-sized cells, performs object detection and classification, and eliminates duplicates with Non-Maximum Suppression [4]: choose the box with the highest probability score, suppress all boxes that have a high IoU with it, and repeat until the final bounding boxes are obtained, as seen in Figure 2.12.

Figure 2.12: Example of Non-Maximum Suppression [4].
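As an illustration of the procedure described above, here is a minimal greedy NMS sketch in NumPy (the (x1, y1, x2, y2) box format and the 0.5 IoU threshold are my assumptions, not values from the paper):

```python
import numpy as np

def iou(box, boxes):
    """IoU of one box (x1, y1, x2, y2) against an (N, 4) array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_box = (box[2] - box[0]) * (box[3] - box[1])
    area_boxes = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_box + area_boxes - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the best box, drop boxes that overlap it, repeat."""
    order = np.argsort(scores)[::-1]      # indices sorted by descending score
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(best)
        order = rest[iou(boxes[best], boxes[rest]) <= iou_thresh]
    return keep
```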

The architecture consists of three key components: the backbone, the neck, and the head. The backbone is the part of the network made of convolutional layers that extract key features from an image; these first layers are typically pre-trained for classification on a large dataset such as ImageNet, at a lower resolution than the detector itself. The neck combines the features coming from different stages of the backbone, and the head makes the final predictions of class probabilities and bounding boxes. The head can be interchanged with other layers with the same input shape for transfer learning.

YOLOv2

Batch normalization is a technique for training neural networks that keeps the weights from becoming imbalanced with extremely high or extremely low values, since it adds normalization into the gradient process itself. This helps with the exploding and vanishing gradient problems, speeds up training, and reduces the influence of outlying large weights on the training process. To fight the original YOLO's low performance on small objects in groups, YOLOv2 introduces batch normalization, which led to improvements in convergence while eliminating the need for other forms of regularization. On top of this, it also introduces anchors. In the original YOLO, if more than one object is located within a cell, YOLO cannot classify them all, since one cell is only able to perform one classification. In YOLOv2 [47] a single cell can make multiple predictions, since it predicts 5 bounding boxes per cell.
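The anchor mechanism can be sketched from the decoding equations in the YOLOv2 paper: the network predicts offsets (tx, ty, tw, th) that are decoded against an anchor prior (pw, ph) and the grid-cell offset (cx, cy). A minimal version (function and variable names are mine):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_yolov2_box(t, anchor, cell, grid_size):
    """Decode one prediction (tx, ty, tw, th) into a box, per the paper:
    bx = sigmoid(tx) + cx, by = sigmoid(ty) + cy,
    bw = pw * exp(tw),     bh = ph * exp(th),  all in grid-cell units."""
    tx, ty, tw, th = t
    pw, ph = anchor
    cx, cy = cell
    bx = (sigmoid(tx) + cx) / grid_size   # divide to normalize to [0, 1]
    by = (sigmoid(ty) + cy) / grid_size
    bw = pw * np.exp(tw) / grid_size
    bh = ph * np.exp(th) / grid_size
    return bx, by, bw, bh
```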

YOLO9000

YOLOv2 was trained on COCO, a dataset with 80 classes and very diverse scenarios that became the standard benchmark for comparing object detection models. In order to expand the number of classes YOLOv2 could detect, the authors of YOLO9000 [47] used labels from both ImageNet and COCO, merging the classification and detection tasks into a single detector. It makes use of hierarchical classification, where classes and their sub-classes are represented in a tree-based format (WordTree). It achieves a lower mAP than YOLOv2, but it can detect more than 9000 classes, making it a powerful algorithm.
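The hierarchical idea can be sketched as follows: the absolute probability of a class is the product of conditional probabilities along its path to the root of the tree. The toy tree and probabilities below are illustrative values of mine, not numbers from the paper:

```python
# toy WordTree: each node maps to (parent, conditional probability)
tree = {
    "physical object": (None, 1.0),
    "animal":          ("physical object", 0.7),
    "dog":             ("animal", 0.6),
    "terrier":         ("dog", 0.5),
}

def absolute_prob(label):
    """Multiply conditional probabilities along the path to the root."""
    p = 1.0
    while label is not None:
        parent, cond = tree[label]
        p *= cond
        label = parent
    return p

print(absolute_prob("terrier"))  # P(terrier|dog) * P(dog|animal) * P(animal)
```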

YOLOv3

YOLOv3 [48] seeks to improve on YOLOv2 by adopting modern CNN practice: residual networks and skip connections. It uses Darknet-53 instead of Darknet-19 as the backbone. This architecture allows it to predict at three different scales, with feature maps extracted from different layers. This further improves YOLOv2's ability to detect smaller objects. YOLOv3 predicts 3 bounding boxes per cell (compared to YOLOv2's 5), but these are made at 3 different scales, so multiplying, it adds up to a total of 9 anchor boxes.
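For intuition, here is what the three scales look like for a 416 × 416 input on COCO: strides of 32, 16, and 8 give 13 × 13, 26 × 26, and 52 × 52 grids, each predicting 3 anchor boxes with 5 box values plus 80 class scores:

```python
num_classes = 80                     # COCO
anchors_per_scale = 3
channels = anchors_per_scale * (5 + num_classes)   # 3 * 85 = 255

for stride in (32, 16, 8):           # coarse -> fine detection scales
    grid = 416 // stride
    print(f"stride {stride}: {grid}x{grid}x{channels}")
# 13x13x255, 26x26x255, 52x52x255 -> 3 anchors at 3 scales = 9 anchor boxes
```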

YOLOv4

YOLOv4 adds weighted residual connections, Cross mini-Batch Normalization, cross-stage partial connections, self-adversarial training, and the Mish activation function to modern methods of regularization and data augmentation. The YOLOv4 authors initially considered the following backbones:

  • CSPResNext50
  • CSPDarknet53
  • EfficientNet-B3
Table 2.1: Initial YOLOv4 backbone considerations [3].

The first two are Cross Stage Partial (CSP) networks based on DenseNet [24]. DenseNet aimed to solve the vanishing gradient problem by establishing extra connections between layers, as seen in Figure 2.13, to bolster feature propagation and to encourage the network to reuse features (reducing the number of parameters).

Figure 2.13: DenseNet Architecture [24].
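The dense-connectivity pattern can be sketched as follows: each layer receives the channel-wise concatenation of all earlier feature maps. The `layers` callables standing in for convolutional layers here are hypothetical:

```python
import numpy as np

def dense_block(x, layers):
    """DenseNet-style block: each layer sees the concatenation of every
    feature map produced before it (axis 0 taken as the channel axis)."""
    features = [x]
    for layer in layers:
        out = layer(np.concatenate(features, axis=0))
        features.append(out)
    return np.concatenate(features, axis=0)
```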

EfficientNet [52] was created by Google Brain to study the scaling of ConvNets (depth, input size, width, etc.), as seen in Figure 2.14. This network managed to outperform other networks of comparable size on image classification at the time. For object detection, however, the authors of YOLOv4 opted for CSPDarknet53 as the backbone network. To use the features provided by the backbone, the YOLOv4 neck employs PAN for feature aggregation plus an additional SPP block to increase the receptive field (the region of the input that a neuron or unit is exposed to) and separate out the most important features from the backbone. A similar mechanism can be seen in Figure 2.15, and a sketch of the SPP block follows below it.

Figure 2.14: EfficientNet Scaling [52].

Figure 2.15: EfficientDet FPN. In the case of YOLOv4, each of those entries P′ would refer to one layer of the neural network [37].

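The SPP block itself is simple to sketch: stride-1 max-pooling at several kernel sizes (YOLOv4 uses 5, 9, and 13), concatenated with the input along the channel axis. Here SciPy's `maximum_filter` stands in for the max-pooling layers:

```python
import numpy as np
from scipy.ndimage import maximum_filter

def spp_block(feature_map, kernel_sizes=(5, 9, 13)):
    """YOLOv4-style SPP: stride-1 max-pools at several kernel sizes,
    concatenated with the input. feature_map has shape (C, H, W)."""
    pooled = [feature_map]
    for k in kernel_sizes:
        # size=(1, k, k) pools spatially, per channel; edges are replicated
        pooled.append(maximum_filter(feature_map, size=(1, k, k), mode="nearest"))
    return np.concatenate(pooled, axis=0)   # output shape (4C, H, W)
```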

The YOLOv4 head is the same as YOLOv3's, with anchor-based detection steps and three levels of detection granularity. YOLOv4 adopts what the authors call a "Bag of Freebies": a set of changes that directly improve the performance of the network without adding inference time in production. Most of these changes have to do with data augmentation. Since many of these techniques were already well known to the computer vision community, the main contribution here is mosaic data augmentation, as seen in Figure 2.16. This type of augmentation tiles four images together, improving the model's ability to detect smaller objects (a sketch is given below the figure). Aside from this augmentation, the authors also added Self-Adversarial Training (SAT), which seeks to find the portions of the image that the network most relies on during training.

Figure 2.16: YOLOv4 Image augmentation examples [3].
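A minimal sketch of the mosaic idea follows. The real implementation picks a random centre point and shifts the bounding-box labels along with the pixels; here the four quadrants are fixed and labels are omitted for brevity:

```python
import numpy as np

def mosaic(images, out_size=608):
    """Tile four HxWx3 uint8 images into one mosaic training image."""
    assert len(images) == 4
    half = out_size // 2
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    corners = [(0, 0), (0, half), (half, 0), (half, half)]
    for img, (y, x) in zip(images, corners):
        # naive nearest-neighbour resize of each image into its quadrant
        ys = np.arange(half) * img.shape[0] // half
        xs = np.arange(half) * img.shape[1] // half
        canvas[y:y + half, x:x + half] = img[ys][:, xs]
    return canvas
```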

When designing a neural network (or any object detector), a compromise between inference speed and model complexity has to be made. With this in mind, the authors also provide what they call a "Bag of Specials": changes that significantly increase performance while adding only a marginal increase to inference time, so they were deemed worth the trade-off. One of these changes concerns the activation functions. Due to the nature of the YOLOv4 architecture, the vanishing gradient problem becomes relevant as features are passed from one layer to the next.

Figure 2.17: Mish vs ReLU [38].

This makes it harder for useful features to propagate through the network. As an alternative, the authors suggested the Mish activation function, which has a smoother profile than ReLU, as seen in Figure 2.17 and Figure 2.18 [38]. It is given by the equation: mish(x) = x · tanh(ln(1 + eˣ)).

Figure 2.18: Comparison between Mish and various activation functions [38].
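In code, the function is a one-liner; using `logaddexp` for a numerically stable softplus is an implementation choice of mine:

```python
import numpy as np

def mish(x):
    """mish(x) = x * tanh(softplus(x)), where softplus(x) = ln(1 + e^x)."""
    return x * np.tanh(np.logaddexp(0.0, x))   # logaddexp(0, x) = softplus(x)

def relu(x):
    return np.maximum(0.0, x)

x = np.linspace(-4.0, 4.0, 9)
print(np.round(mish(x), 3))  # unlike ReLU, small negative inputs keep a
                             # small, smooth response instead of a hard zero
```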

To filter the predicted bounding boxes, DIoU NMS is used. This differs from the original NMS in that it considers not only the overlap area but also the distance between the central points of the boxes. This improves performance in cases with occlusion, since using the overlap area alone produces false suppressions in such cases [68]. For batch normalization the authors use Cross mini-Batch Normalization, and finally they use DropBlock regularization: with DropBlock, some sections of the image are hidden from the first layer, forcing the network to learn additional features.
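The DIoU criterion can be sketched by adding a centre-distance penalty to the IoU test inside the greedy NMS loop shown earlier (same (x1, y1, x2, y2) box format, reusing that `iou` helper):

```python
import numpy as np

def diou_penalty(box, boxes):
    """Squared centre distance divided by the squared diagonal of the
    smallest box enclosing both, as in the DIoU formulation [68]."""
    c_box = (box[:2] + box[2:]) / 2
    c_boxes = (boxes[:, :2] + boxes[:, 2:]) / 2
    d2 = ((c_box - c_boxes) ** 2).sum(axis=1)
    tl = np.minimum(box[:2], boxes[:, :2])   # enclosing box top-left
    br = np.maximum(box[2:], boxes[:, 2:])   # enclosing box bottom-right
    c2 = ((br - tl) ** 2).sum(axis=1)
    return d2 / c2

# Inside the greedy NMS loop, the suppression test becomes:
#   suppress if iou(best, rest) - diou_penalty(best, rest) >= threshold
# so two overlapping boxes with distant centres (e.g. under occlusion)
# can both survive instead of one being falsely suppressed.
```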

Now… the next stage is to talk about YOLOv5 and the rest of the YOLO series from 2021 onward. This will be discussed in the next blog post, linked below.

As said before, this blog post is taken from my master's thesis, so all the references are linked to the numbers used there. I will leave a link to the thesis once it is published.

Continuation:

This article is part of a 3-part series I'm writing on object detection and the YOLO series:

Part 1: Object Detection State of the Art 2022

https://medium.com/@pedroazevedo6/object-detection-state-of-the-art-2022-ad750e0f6003

Part 2: From YOLO to YOLOv4

https://medium.com/@pedroazevedo6/from-yolo-to-yolov4-3dcba691d96a

Part 3: What is the Best YOLO?

https://medium.com/@pedroazevedo6/what-is-the-best-yolo-8526b53414af


Pedro Azevedo

Master's student at the University of Aveiro, Portugal, focused on Deep Learning and Computer Vision for Autonomous Driving.