The evolution of the YOLO neural networks family from v1 to v7.

Maxim Ivanov
Deelvin Machine Learning
8 min read · Oct 18, 2022

In the previous part we covered the three oldest architectures: YOLO, YOLOv2, and YOLOv3. Today we deal with the next six architectures.

YOLOv4, Scaled YOLOv4

Authors

Alexey Bochkovskiy, Chien-Yao Wang, Hong-Yuan Mark Liao

Joseph Redmon stepped down from further development of YOLO for ethical reasons.

Main articles

“YOLOv4: Optimal Speed and Accuracy of Object Detection”, https://arxiv.org/pdf/2004.10934.pdf, publication date 2020/04

“Scaled-YOLOv4: Scaling Cross Stage Partial Network”, https://arxiv.org/pdf/2011.08036.pdf, publication date 2020/11

Repositories

1. https://github.com/AlexeyAB/darknet, 20.4k/19.6k, all-permissive license

2. https://github.com/Tianxiaomo/pytorch-YOLOv4, 1.4k/4.1k, Apache-2.0 license

3. https://github.com/WongKinYiu/ScaledYOLOv4, 549/1.9k, GPL-3.0 license

Performance Comparison

Comparison of the proposed YOLOv4 and other state-of-the-art object detectors. YOLOv4 runs twice as fast as EfficientDet with comparable performance, and improves YOLOv3's AP and FPS by 10% and 12%, respectively.

Architectural Features

Let’s take a closer look at the parts that make up v4.

Backbone

In the 4th version, a more powerful backbone than in v3, CSPDarknet53, was adopted. CSP stands for Cross Stage Partial connections — a type of connection between non-adjacent layers of the network. The number of layers stayed the same, while an SPP module was added. A minimal sketch of the CSP idea follows the figure below.

The structure of CSPDarknet53 (a) and CSPDarknet53-tiny (b)
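To make the CSP idea concrete, here is a minimal PyTorch sketch of a cross-stage partial block: half of the channels bypass the computation-heavy path and are concatenated back at the end. This illustrates the principle only; it is not the exact darknet layer layout.

```python
import torch
import torch.nn as nn

class CSPBlock(nn.Module):
    """Illustrative cross-stage partial block (not the exact darknet layout)."""

    def __init__(self, channels: int, num_blocks: int = 2):
        super().__init__()
        half = channels // 2
        # Two 1x1 convs split the input into a "heavy" path and a bypass.
        self.split_main = nn.Conv2d(channels, half, kernel_size=1)
        self.split_bypass = nn.Conv2d(channels, half, kernel_size=1)
        self.blocks = nn.Sequential(*[
            nn.Sequential(
                nn.Conv2d(half, half, kernel_size=3, padding=1),
                nn.BatchNorm2d(half),
                nn.SiLU(),
            )
            for _ in range(num_blocks)
        ])
        self.merge = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        main = self.blocks(self.split_main(x))
        bypass = self.split_bypass(x)  # the cross-stage partial connection
        return self.merge(torch.cat([main, bypass], dim=1))
```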

Neck

The neck consists of a PANet module. Used in place of FPN, it performs path aggregation, namely concatenation (instead of summation) of activations from different scales.
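The difference between the two fusion styles is easy to show in a toy sketch (the feature shapes here are made up for illustration):

```python
import torch
import torch.nn.functional as F

# Two feature maps from adjacent scales (shapes chosen for illustration).
p4 = torch.randn(1, 256, 26, 26)
p5 = torch.randn(1, 256, 13, 13)

up = F.interpolate(p5, scale_factor=2, mode="nearest")
fpn_fused = p4 + up                      # FPN-style fusion: summation
pan_fused = torch.cat([p4, up], dim=1)   # PAN-style fusion: concatenation
print(fpn_fused.shape)  # torch.Size([1, 256, 26, 26])
print(pan_fused.shape)  # torch.Size([1, 512, 26, 26])
```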

Heads

Here the concept remains the same: anchor-based heads.

In addition to architectural changes, a number of improvements have been made to the learning procedure.

SAT (Self-Adversarial Training) was applied — an augmentation method consisting of two stages. In the first stage, instead of updating the network weights, the image is modified until the network believes the desired object is no longer in it (an adversarial attack). In the second stage, the network is trained to detect the object in the image altered during the first stage.
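A rough sketch of one SAT step in PyTorch; `detection_loss` is a hypothetical placeholder for the full YOLO loss, and the FGSM-style perturbation stands in for whatever attack the darknet implementation actually uses:

```python
import torch

def sat_step(model, images, targets, detection_loss, optimizer, eps=0.05):
    # Stage 1: keep the weights fixed and perturb the image in the
    # direction that increases the loss, so the network stops
    # "seeing" the objects (an adversarial attack on the input).
    images = images.clone().detach().requires_grad_(True)
    detection_loss(model(images), targets).backward()
    attacked = (images + eps * images.grad.sign()).detach()

    # Stage 2: train the weights to detect objects in the attacked image.
    optimizer.zero_grad()
    loss = detection_loss(model(attacked), targets)
    loss.backward()
    optimizer.step()
    return loss.item()
```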

The receptive field was increased, and attention mechanisms were used.

Many additional types of augmentation and class balancing were applied.

Backbone improvements:

  • for training: CutMix + Mosaic augmentation, DropBlock regularization, class label smoothing (see the snippet after this list)
  • for inference: Mish activation, Cross-stage partial connections (CSP), Multi-input weighted residual connections (MiWRC)
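Class label smoothing, for instance, is one line in modern PyTorch (a sketch; the darknet implementation differs in its details):

```python
import torch
import torch.nn as nn

# Label smoothing softens the one-hot targets so the network
# becomes less over-confident in its class predictions.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

logits = torch.randn(4, 80)             # batch of 4, 80 COCO classes
targets = torch.tensor([3, 17, 0, 42])  # ground-truth class indices
loss = criterion(logits, targets)
```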

Detector improvements:

  • for training: CIoU-loss, CmBN, DropBlock, Mosaic, SAT, eliminating grid sensitivity, multiple anchors for a single ground truth, a cosine annealing scheduler (sketched after this list), optimal hyperparameters, random shapes during training
  • for inference: Mish, SPP-block (spatial pyramid-pooling), SAM-block (spatial-attention module), PAN, DIoU-NMS
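The cosine annealing scheduler mentioned above is available off the shelf in PyTorch; a minimal sketch:

```python
import torch

model = torch.nn.Conv2d(3, 16, 3)  # stand-in for the detector
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# The LR follows a cosine curve from 0.01 down to eta_min over T_max epochs.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=100, eta_min=1e-5)

for epoch in range(100):
    # ... run one training epoch, calling optimizer.step() per batch ...
    scheduler.step()
    print(epoch, scheduler.get_last_lr())
```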

The figures below show updates to network training that do not affect FPS but improve accuracy.

Mosaic represents a new method of data augmentation
Types of applied augmentations
Modified SAM
Modified PAN
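As a toy illustration of the mosaic idea shown above: four training images are tiled onto one canvas. A real implementation also jitters the center point and remaps the bounding boxes of each image; that bookkeeping is omitted here.

```python
import cv2
import numpy as np

def naive_mosaic(images, size=608):
    """Tile four images onto one canvas (bounding-box remapping omitted)."""
    half = size // 2
    canvas = np.zeros((size, size, 3), dtype=np.uint8)
    corners = [(0, 0), (0, half), (half, 0), (half, half)]
    for img, (y, x) in zip(images, corners):
        canvas[y:y + half, x:x + half] = cv2.resize(img, (half, half))
    return canvas
```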

Scaled YOLOv4

Six months after the first article on v4, the authors published another one, introducing a mechanism for scaling the network architecture. This mechanism covers not only the input resolution, network width, and depth, but also the structure of the network itself.

Comparison of the proposed YOLOv4 and other state-of-the-art object detectors. The dashed line shows model-inference latency only, while the solid line includes inference and post-processing.

The architecture of YOLOv4-large, including YOLOv4-P5, YOLOv4-P6, and YOLOv4-P7. The dashed arrow means replacing the corresponding CSPUp block with a CSPSPP block.

Advantages

  • v4 is not only faster and more accurate than its competitors, but it can also be trained on relatively weak hardware (such as a single 1080 Ti). For comparison, EfficientDet needs a v3-32 TPU slice (32 TPU v3 cores, 512 GiB total TPU memory) to achieve acceptable accuracy.
  • v4 is built into OpenCV, so it can be called directly without darknet (see the sketch after this list)
  • license allows any use without restrictions
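Calling YOLOv4 through OpenCV's DNN module looks roughly like this (OpenCV 4.4+; the file paths are placeholders for the config and weights from the darknet repository):

```python
import cv2

net = cv2.dnn.readNetFromDarknet("yolov4.cfg", "yolov4.weights")
model = cv2.dnn_DetectionModel(net)
model.setInputParams(size=(416, 416), scale=1 / 255, swapRB=True)

image = cv2.imread("example.jpg")
class_ids, scores, boxes = model.detect(
    image, confThreshold=0.5, nmsThreshold=0.4)
for class_id, score, (x, y, w, h) in zip(class_ids, scores, boxes):
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
```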

Limitations

not found

YOLOv5

Authors

Glenn Jocher

Since this author did not take part in designing the architectures of the previous YOLO versions, only in their implementation, the legitimacy of the name "YOLOv5" seems ethically questionable. There has been plenty of discussion about this on the Internet, but by now the name has stuck.

Main article

There is still no official article on arxiv.org.

Repositories

1. https://github.com/ultralytics/yolov5, 10.7k/29.8k, GPL-3.0 license

Performance Comparison

Architectural Features

It is a development of v3 (not v4), published almost 2 months after the release of v4.

Performance is better than v3, but worse than v4.

The network architecture of YOLOv5. It consists of three parts: (1) Backbone: CSPDarknet, (2) Neck: PANet, and (3) Head: YOLO layer. The data are first input to CSPDarknet for feature extraction and then fed to PANet for feature fusion. Finally, the YOLO layer outputs the detection results (class, score, location, size).

Augmentation: scaling, color space adjustments, mosaic.

v5, like v4, implements:

  • CSP bottleneck for features
  • PANet for feature aggregation

Advantages

  • a well-designed repository with the ability to deploy to mobile and low-power devices (see the usage sketch after this list)
  • fast training
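For example, a pretrained model can be pulled straight from the repository's torch.hub entry point (requires an internet connection on the first call):

```python
import torch

# 'yolov5s' is the smallest pretrained variant.
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)

results = model('example.jpg')  # path, URL, numpy array or PIL image
results.print()                 # per-class counts and confidences
detections = results.xyxy[0]    # tensor: [x1, y1, x2, y2, conf, class]
```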

Limitations

  • on some tests worse than v4
  • the GPL-3.0 license obliges you to disclose source code

YOLOX

Authors

Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, Jian Sun

Megvii Technology, Ltd., China

Main article

“YOLOX: Exceeding YOLO Series in 2021”, https://arxiv.org/pdf/2107.08430.pdf, publication date 2021/07

Repositories

https://github.com/Megvii-BaseDetection/YOLOX, 1.7k/7.2k, Apache-2.0 license

Performance Comparison

Speed-accuracy trade-off of accurate models (left) and size-accuracy curve of lite models on mobile devices (right) against other state-of-the-art object detectors.
Comparison of the speed and accuracy of different object detectors on COCO 2017 test-dev. All models are trained for 300 epochs for a fair comparison.

Architectural features

Just like v5, it is not an official development of the architecture.

The model is based on YOLOv3-Darknet53.

From innovations:

  • decoupled head: the conflict between the classification and regression tasks is resolved by splitting them into separate branches (a sketch follows this list)
Illustration of the difference between YOLOv3 head and the proposed decoupled head. For each level of FPN feature, we first adopt a 1x1 conv layer to reduce the feature channel to 256 and then add two parallel branches with two 3x3 conv layers each for classification and regression tasks respectively. The IoU branch is added to the regression branch.
  • Augmentation: mosaic, mixup, random horizontal flip, color jitter.
  • It turned out that ImageNet pretraining gave no advantage, so all models were trained from scratch.
  • Anchorless detector. Anchors have their own problems — e.g., the need for preliminary cluster analysis to determine optimal anchor shapes. Anchors also increase the complexity of the detection head and the number of predictions per image. Getting rid of anchors lowered GFLOPs and raised mAP.
  • Multi positives. Without anchors, only one positive sample would be selected per ground-truth object, which would ignore other high-quality predictions. Yet such predictions can produce useful gradients that reduce the imbalance of positive and negative samples during training. Therefore YOLOX treats a 3x3 area around the object center as positive samples, which also improves the accuracy of the network.
  • SimOTA. An advanced label-assignment scheme (label assignment defines the positive and negative samples for each ground-truth object). A dedicated algorithm for selecting sample pairs speeds up training.
  • Other features: an exponential moving average for updating weights, a cosine LR schedule, IoU loss for the regression branch, BCE loss for the classification branch, the SGD optimizer.
Roadmap of YOLOX-Darknet53 in terms of AP(%) on COCO validation. All the models are tested at 640x640 resolution with FP16-precision and batch=1 on a Tesla V100. The latency and FPS in this table are measured without post-processing.
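The decoupled head described above can be sketched in PyTorch as follows (channel sizes follow the paper's description; normalization details are simplified):

```python
import torch
import torch.nn as nn

def _branch(hidden: int) -> nn.Sequential:
    # Two 3x3 conv layers per branch, as in the paper's description.
    return nn.Sequential(
        nn.Conv2d(hidden, hidden, 3, padding=1), nn.SiLU(),
        nn.Conv2d(hidden, hidden, 3, padding=1), nn.SiLU(),
    )

class DecoupledHead(nn.Module):
    """Sketch of a YOLOX-style decoupled head for one FPN level."""

    def __init__(self, in_channels: int, num_classes: int, hidden: int = 256):
        super().__init__()
        self.stem = nn.Conv2d(in_channels, hidden, kernel_size=1)  # 1x1 reduce
        self.cls_branch = _branch(hidden)
        self.reg_branch = _branch(hidden)
        self.cls_out = nn.Conv2d(hidden, num_classes, 1)  # classification
        self.reg_out = nn.Conv2d(hidden, 4, 1)            # box regression
        self.iou_out = nn.Conv2d(hidden, 1, 1)            # IoU branch (reg side)

    def forward(self, x: torch.Tensor):
        x = self.stem(x)
        cls_feat = self.cls_branch(x)
        reg_feat = self.reg_branch(x)
        return (self.cls_out(cls_feat),
                self.reg_out(reg_feat),
                self.iou_out(reg_feat))
```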

Advantages

  • detection accuracy at the time of release is higher than that of competitors
  • detection speed at the time of release is higher than that of competitors
  • Apache-2.0 open license

Limitations

not found

PP-YOLOv1/v2/E

Authors

Large list of authors from Baidu Inc.

Main articles

  1. “PP-YOLO: An Effective and Efficient Implementation of Object Detector”, https://arxiv.org/pdf/2007.12099.pdf, publication date 2020/07
  2. “PP-YOLOv2: A Practical Object Detector”, https://arxiv.org/pdf/2104.10419.pdf, publication date 2021/04
  3. “PP-YOLOE: An evolved version of YOLO”, https://arxiv.org/pdf/2203.16250.pdf, publication date 2022/03

Repository

https://github.com/PaddlePaddle/PaddleDetection, 2.1k/8.3k, Apache-2.0 license

The model is an unofficial development of the family from the Chinese company Baidu, written in its PaddlePaddle (PArallel Distributed Deep LEarning) framework.

Performance Comparison

Comparison of PP-YOLOE and other state-of-the-art models. PP-YOLOE-l achieves 51.4 mAP on COCO test-dev and 78.1 FPS on a Tesla V100, a 1.9 AP and 9.2 FPS improvement over PP-YOLOv2.

Architectural Features

PP-YOLO

The authors did not search for new backbones or augmentations, nor did they optimize hyperparameters via NAS.

Instead of Darknet-53, they took a standard ResNet50-vd as the backbone and replaced some of its convolutional layers with deformable convolutions (a usage sketch follows the figure below). For augmentation, basic MixUp was used.

The network architecture of YOLOv3 and inject points for PP-YOLO. Activation layers are omitted for brevity.
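In PyTorch, a deformable convolution of the kind PP-YOLO swaps in is available in torchvision; a sketch of its usage, where a plain conv predicts the per-position sampling offsets:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

# The offset conv predicts 2 coordinates per kernel position:
# 2 * 3 * 3 = 18 channels for a 3x3 kernel.
offset_conv = nn.Conv2d(256, 18, kernel_size=3, padding=1)
deform_conv = DeformConv2d(256, 256, kernel_size=3, padding=1)

x = torch.randn(1, 256, 32, 32)
offsets = offset_conv(x)
y = deform_conv(x, offsets)
print(y.shape)  # torch.Size([1, 256, 32, 32])
```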

Tricks:

  • increased the batch size from 64 to 192, with the LR corrected accordingly;
  • exponential moving average (EMA) of the network weights with decay factor λ = 0.9998 (see the sketch after this list).
  • DropBlock — a dropout variant in which whole feature-map regions are dropped together. Applied only to the FPN, because applying it to the backbone degraded performance.
  • IoU loss — YOLOv3 used an L1 loss, which is not the most effective loss for bounding boxes. The authors used the basic IoU loss.
  • IoU Aware — in YOLOv3, the confidence value was the class probability multiplied by the objectness score, which said nothing about localization accuracy. To fix this, an IoU prediction branch was added to estimate localization quality. During training, an extra loss term trains this branch; at inference, the predicted IoU is multiplied by the class probability and objectness, which yields better localization accuracy at almost no extra computational cost.
  • Grid sensitive
  • Matrix NMS
  • CoordConv — extra coordinate channels appended to convolution inputs
  • SPP (Spatial Pyramid Pooling)
  • ImageNet pretrained distilled model
The ablation study of tricks on the MS-COCO minival split.
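The EMA trick from the list above fits in a few lines; a minimal sketch with PP-YOLO's decay factor (buffers such as BN statistics are ignored here):

```python
import copy
import torch

class ModelEMA:
    """Shadow copy of the model, updated after every optimizer step as
    ema = decay * ema + (1 - decay) * current."""

    def __init__(self, model: torch.nn.Module, decay: float = 0.9998):
        self.ema = copy.deepcopy(model).eval()
        self.decay = decay
        for p in self.ema.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        for ema_p, p in zip(self.ema.parameters(), model.parameters()):
            ema_p.mul_(self.decay).add_(p, alpha=1.0 - self.decay)
```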

PP-YOLOv2

Changes compared to PP-YOLO:

  • FPN replaced with PANet
  • Mish activation
  • increased the input image size
  • a modified IoU-aware loss
The ablation study of refinements on the MS-COCO minival split. ‘+’ indicates the result includes bounding box decode time (1–2ms)
The architecture of PP-YOLOv2’s detection neck.

Tried but didn’t work:

  • Cosine learning rate decay
  • freezing the backbone weights during fine-tuning gives a lower mAP

PP-YOLOE

Improvements:

  • anchorless
  • CSPRepResNet backbone
  • Task Alignment Learning (TAL) — an algorithm for efficient selection of training samples
  • Efficient Task-aligned Head (ET-head) — an alternative to the decoupled head. VFL = varifocal loss, DFL = distribution focal loss
Ablation study of PP-YOLOE-l on COCO val. The authors used 640x640 resolution as input with FP32-precision, and tested on Tesla V100 without post-processing.
The model architecture of PP-YOLOE. The backbone is CSPRepResNet, the neck is Path Aggregation Network (PAN), and the head is Efficient Task-aligned Head (ET-head)

Advantages

  • the ability to deploy on TensorRT
  • good performance
  • Apache-2.0 open license

Limitations

  • non-standard framework
  • proper training requires many GPUs (PP-YOLO: 8x V100)

In the next part we will consider YOLOR, YOLOv6, and finally YOLOv7. Stay tuned!
