The evolution of the YOLO neural network family from v1 to v7.

Maxim Ivanov
Deelvin Machine Learning
7 min read · Nov 14, 2022

In the previous parts of this article (part 1, part 2), we reviewed the first nine architectures of the YOLO family. In this final part, we will look at the three most recent architectures at the moment.

YOLOR

Authors

Chien-Yao Wang, I-Hau Yeh, Hong-Yuan Mark Liao (Taiwan).

Main article

“You Only Learn One Representation: Unified Network for Multiple Tasks”, https://arxiv.org/pdf/2105.04206.pdf, publication date 2021/05.

Repository

https://github.com/WongKinYiu/yolor, 502/1.8k, GPL-3.0 license.

Performance Comparison

This time the abbreviation stands for something different: You Only Learn One Representation. YOLOR is not positioned as a direct successor in the YOLO family, and its concept also differs somewhat from YOLO.

Humans rely on implicit knowledge (a generalization of previous experience) and explicit knowledge (what is perceived directly through the senses). This is why a person who understands what is shown in a picture processes it far better than an ordinary neural network, which has no such understanding.

Convolutional neural networks are usually built to perform one specific task, even though they can be trained to solve several tasks at once, and that is precisely the goal of YOLOR. While a conventional network simply learns to map the input to an output, YOLOR tries to make the convolutional network do two things:

  1. learn how to produce the output for the task at hand
  2. learn what all the other possible outputs might be.

Instead of one output, it may have many.

YOLOR tries to combine explicit and implicit knowledge. In terms of neural networks, explicit knowledge is stored in the layers close to the input, while implicit knowledge resides in the deeper layers. In this way, YOLOR becomes a unifying neural network.

Architectural Features

Architecture of YOLOR

The article describes the key points of integrating implicit and explicit knowledge in neural networks.

  1. Kernel space alignment, prediction refinement, and multi-task learning are introduced into the process of learning implicit knowledge.
  2. Vector, neural network, and matrix factorization are the methods used to model implicit knowledge and to analyze its effectiveness (the vector variant is sketched below).
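
As an illustration of the vector approach, here is a minimal PyTorch sketch (not the official YOLOR code; the layer sizes are assumptions) of implicit knowledge modeled as a learnable vector that refines an explicit feature map by addition and by channel-wise multiplication:

import torch
import torch.nn as nn

# Minimal sketch, not the official YOLOR code: implicit knowledge is modeled as
# a small learnable vector that is combined with explicit feature maps by
# addition and by channel-wise multiplication.
class ImplicitAdd(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # one learnable value per channel, broadcast over the spatial dimensions
        self.implicit = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, x):
        return x + self.implicit

class ImplicitMul(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.implicit = nn.Parameter(torch.ones(1, channels, 1, 1))

    def forward(self, x):
        return x * self.implicit

# usage: refine an explicit feature map produced by the backbone or neck
features = torch.randn(1, 256, 80, 80)                   # hypothetical feature map
refined = ImplicitMul(256)(ImplicitAdd(256)(features))
print(refined.shape)                                     # torch.Size([1, 256, 80, 80])

In the paper, such implicit representations are attached to selected points of the network (for example, for feature alignment and prediction refinement) and trained jointly with the explicit part of the model.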

Advantages

  • detection accuracy at the time of release was higher than that of competing models
  • detection speed at the time of release was higher than that of competing models

Limitations

  • the GPL-3.0 license obliges derived projects to disclose their source code

YOLOv6 aka MT-YOLOv6

Authors

A team of authors from Meituan, China.

Like YOLOR, it is not an official development of the family.

Main article

Originally, the only official write-up was a post on the blog of the Chinese company Meituan:

https://tech.meituan.com/2022/06/23/yolov6-a-fast-and-accurate-target-detection-framework-is-opening-source.html, publication date 2022/06

“YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications”, https://arxiv.org/pdf/2209.02976.pdf, publication date 2022/09.

Repository

https://github.com/meituan/YOLOv6, 550/3.8k, GPL-3.0 license.

Performance Comparison

Architectural Features

The improvements in v6 focus on three main areas:

  1. the design of the backbone and neck is optimized for the hardware
  2. a decoupled head for greater accuracy
  3. more effective training strategies

Backbone and neck design

The idea is to take advantage of hardware characteristics such as the computational features of processor cores, memory bandwidth, etc., to make inference efficient.

EfficientRep Backbone
Rep-Pan used in the neck

To do this, the authors redesigned the neck and the backbone of the architecture using the Rep-PAN and EfficientRep blocks, respectively.

Experiments carried out by the Meituan team showed a significant reduction in inference latency together with an increase in detection accuracy. In particular, compared to the YOLOv5-nano model, YOLOv6-nano was 21% faster and 3.6% more accurate.

Decoupled head

The decoupled head first appeared in YOLOX. It computes the classification part of the network and the regression part separately. In v6 this approach has been further refined.

Efficient Decoupled Head
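
To make the idea concrete, here is a minimal PyTorch sketch of a decoupled head (simplified; the channel counts and layer layout are assumptions, not the actual YOLOv6 implementation), with separate branches for classification and for box regression plus objectness:

import torch
import torch.nn as nn

# Minimal sketch of a decoupled head: one shared stem, then separate branches
# for classification and for box regression plus objectness.
class DecoupledHead(nn.Module):
    def __init__(self, in_channels, num_classes, num_anchors=1):
        super().__init__()
        self.stem = nn.Conv2d(in_channels, in_channels, 1)
        # classification branch
        self.cls_conv = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        self.cls_pred = nn.Conv2d(in_channels, num_anchors * num_classes, 1)
        # regression branch: 4 box coordinates and 1 objectness score per anchor
        self.reg_conv = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        self.reg_pred = nn.Conv2d(in_channels, num_anchors * 4, 1)
        self.obj_pred = nn.Conv2d(in_channels, num_anchors * 1, 1)

    def forward(self, x):
        x = self.stem(x)
        cls_out = self.cls_pred(self.cls_conv(x))
        reg_feat = self.reg_conv(x)
        return cls_out, self.reg_pred(reg_feat), self.obj_pred(reg_feat)

head = DecoupledHead(in_channels=256, num_classes=80)
cls, box, obj = head(torch.randn(1, 256, 40, 40))
print(cls.shape, box.shape, obj.shape)   # 80, 4 and 1 output channels respectively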

Effective training strategies

These strategies include:

  • the anchor-free paradigm (sketched below)
  • the SimOTA label assignment policy
  • the SIoU loss for bounding-box regression

Ablation study
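
As a sketch of the anchor-free idea, the following simplified snippet (FCOS-style, not the exact YOLOv6 code) shows how boxes can be decoded from per-cell distance predictions, so no anchor priors are needed:

import torch

# Simplified, FCOS-style sketch of anchor-free decoding: each grid cell predicts
# the distances (l, t, r, b) from its own center to the four box sides, so no
# anchor box priors are required.
def decode_anchor_free(ltrb, stride):
    h, w, _ = ltrb.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    cx = (xs + 0.5) * stride                       # cell centers in pixels
    cy = (ys + 0.5) * stride
    l, t, r, b = ltrb.unbind(-1)
    return torch.stack([cx - l, cy - t, cx + r, cy + b], dim=-1)  # (h, w, 4), xyxy

boxes = decode_anchor_free(torch.rand(20, 20, 4) * 32, stride=8)
print(boxes.shape)                                 # torch.Size([20, 20, 4])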

Advantages

  • detection accuracy at the time of release was higher than that of competing models
  • detection speed at the time of release was higher than that of competing models
  • uses the standard PyTorch framework

Limitations

  • the GPL-3.0 license obliges derived projects to disclose their source code

YOLOv7

Authors

Chien-Yao Wang, Alexey Bochkovskiy, Hong-Yuan Mark Liao.

The team of authors is the same as that of YOLOv4, so v7 can be considered an official continuation of the family.

Main article

“YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors”, https://arxiv.org/pdf/2207.02696.pdf, publication date 2022/07.

Repository

https://github.com/wongkinyiu/yolov7, 870/4.6k, GPL-3.0 license.

Performance Comparison

Comparison with other real-time object detectors: the proposed methods achieve state-of-the-art performance.

Architectural Features

The backbone’s main computing unit is E-ELAN (Extended Efficient Layer Aggregation Network).

It was designed with the following factors in mind, all of which affect the accuracy and speed of computation:

  • memory access cost
  • input/output channel ratio
  • element-wise operations
  • activations
  • gradient path

Model scaling. Different applications require different models. In some cases detection accuracy is more important, and then the model should have more trainable parameters. In other cases speed is more important, and then the model should be smaller so that inference runs faster.

When scaling v7, the following hyperparameters are considered:

  • input resolution
  • width (number of channels)
  • depth (number of layers)
  • cascades (number of feature pyramids)

The figure below shows an example of compound (synchronous) model scaling.
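
To illustrate how such compound scaling might look in code, here is a minimal sketch (the multipliers and divisor are illustrative assumptions, not the official YOLOv7 configuration values):

import math

# Illustrative sketch of compound scaling: a model variant is derived by jointly
# scaling depth (layer repeats) and width (channel counts), rounding the channel
# counts to hardware-friendly multiples.
def scale_depth(base_repeats, depth_multiple):
    return max(round(base_repeats * depth_multiple), 1)

def scale_width(base_channels, width_multiple, divisor=8):
    # keep channel counts divisible by the divisor so layers map well onto hardware
    return int(math.ceil(base_channels * width_multiple / divisor) * divisor)

# hypothetical "small" variant of a block with 3 repeats and 256 channels
print(scale_depth(3, depth_multiple=0.33))    # -> 1
print(scale_width(256, width_multiple=0.5))   # -> 128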

The nuances of training

The article discusses a set of methods that can improve the performance of the model without increasing the cost of its training.

Re-parameterization is a technique applied after training to improve the model. It increases training time but improves inference results. There are two types of re-parameterization: model-level and module-level.

Model-level re-parameterization can be done in two ways (see the sketch below):

  • train multiple models on different training data but with the same settings, then average their weights to get the final model;
  • average the weights of one model taken at different epochs.
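
A minimal sketch of the weight-averaging approach (the checkpoint names are hypothetical; this is not the YOLOv7 training code):

import torch

# Minimal sketch: model-level re-parameterization by averaging the weights of
# several trained models (or of the same model saved at different epochs).
def average_state_dicts(state_dicts):
    averaged = {}
    for key in state_dicts[0]:
        stacked = torch.stack([sd[key].float() for sd in state_dicts], dim=0)
        averaged[key] = stacked.mean(dim=0)
    return averaged

# hypothetical usage: load checkpoints saved at different epochs and average them
# checkpoints = [torch.load(p, map_location="cpu") for p in ["ep80.pt", "ep90.pt", "ep100.pt"]]
# model.load_state_dict(average_state_dicts(checkpoints))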

Module-level re-parameterization is more commonly used in research. In this approach, a module is trained as several parallel branches whose outputs are ensembled, and the branches are later merged into a single equivalent module for inference (sketched below).
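
Below is a minimal sketch of the RepVGG/RepConv-style idea that module-level re-parameterization builds on, simplified by omitting batch normalization: during training the block has parallel 3x3, 1x1, and identity branches, which are fused into a single 3x3 convolution for inference.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Simplified RepVGG-style block without batch norm: trained with parallel 3x3,
# 1x1 and identity branches, then fused into one 3x3 convolution for inference.
class RepBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv3x3 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.conv1x1 = nn.Conv2d(channels, channels, 1, bias=False)

    def forward(self, x):
        return self.conv3x3(x) + self.conv1x1(x) + x

    def fuse(self):
        channels = self.conv3x3.out_channels
        # pad the 1x1 kernel to 3x3 and build an identity 3x3 kernel
        k1x1 = F.pad(self.conv1x1.weight, [1, 1, 1, 1])
        k_id = torch.zeros_like(self.conv3x3.weight)
        for c in range(channels):
            k_id[c, c, 1, 1] = 1.0
        fused = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        fused.weight.data = self.conv3x3.weight + k1x1 + k_id
        return fused

block = RepBlock(8).eval()
x = torch.randn(1, 8, 16, 16)
print(torch.allclose(block(x), block.fuse()(x), atol=1e-5))   # True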

A fine loss for the lead head and a coarse one for the auxiliary head.

In v7 architectures there can be several heads performing different tasks, and each head has its own loss. A label assigner is a mechanism that considers the network's predictions together with the ground truth and assigns soft labels. It generates soft and coarse labels instead of hard ones.

Lead guided assigner (left) and Coarse-to-fine lead guided assigner (right)
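
As a simplified illustration (not the actual YOLOv7 assigner), a soft classification target can be derived from the quality of the prediction, for example from its IoU with the ground-truth box:

import torch

# Simplified illustration: the classification target for a positive prediction is
# softened by its IoU with the ground-truth box instead of being a hard 1.0.
def soft_label(pred_box, gt_box):
    # boxes in (x1, y1, x2, y2) format
    lt = torch.maximum(pred_box[:2], gt_box[:2])
    rb = torch.minimum(pred_box[2:], gt_box[2:])
    inter = (rb - lt).clamp(min=0).prod()
    area_pred = (pred_box[2:] - pred_box[:2]).prod()
    area_gt = (gt_box[2:] - gt_box[:2]).prod()
    return inter / (area_pred + area_gt - inter)   # used as the target instead of 1.0

print(soft_label(torch.tensor([0., 0., 10., 10.]), torch.tensor([5., 5., 15., 15.])))  # about 0.14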

Advantages

  • detection accuracy at the time of release was higher than that of competing models
  • detection speed at the time of release was higher than that of competing models
  • uses the standard PyTorch framework

Limitations

  • the GPL-3.0 license obliges derived projects to disclose their source code

Conclusions

If we condense the evolution of the family into one table, we get the following:

Of course, the table does not cover every improvement and finding that boosts performance. However, as the family has developed, some patterns can be seen.

The backbone initially consisted of a single branch (GoogLeNet, VGG, Darknet); later the family moved to architectures with skip connections (Cross-Stage Partial connections: CSPDarknet, CSPRepResNet, Extended-ELAN). Evidently, such connections provide an advantage over their absence.

The neck also started as a single branch and then branched out into various modifications of the Feature Pyramid Network, which helps maintain detection accuracy across object scales.

The head: earlier versions had a single head that produced all output parameters (class, objectness, bounding-box coordinates) in one branch of the network. It later turned out to be more efficient to split them into separate heads. There was also a shift from the anchor-based paradigm to anchor-free (with the exception of v7, which for some reason still uses anchors; it would be interesting to implement an anchor-free v7 and compare it with the anchored one). Dropping anchors is arguably preferable, if only because anchors are a form of fitting to the training dataset.

Augmentation: early augmentations such as affine transforms, HSV jitter, and exposure changes are fairly simple and do not alter the background or the object's surroundings. More recent ones (MixUp, Mosaic, CutOut, etc.) are more sophisticated, because they change the content of the image, not just its form. It seems that a balanced mix of both classical and modern augmentations is important for training neural networks effectively.

In conclusion, I would like to give a complete diagram of the evolution of the family:

Thanks for reading all of this!
