What is the best YOLO?

Pedro Azevedo
6 min read · Jun 8, 2022


YOLOR vs YOLOv5 vs YOLOX vs Scaled-YOLOv4

What is the best YOLO object detector? The YOLO series state of the art in 2022, explained.

This article is a continuation of my previous two articles, “Object Detection State of the Art 2022” and “From YOLO to YOLOv4”; be sure to check those out for full context. It is also taken from a chapter of my master's thesis, which is why the references have different numbers. I link the thesis and the articles below.

YOLOv5

YOLOv5 is a family of object detection architectures and models pre-trained on the COCO dataset. Its architecture is similar to YOLOv4's, consisting of a backbone, a neck and a head.

Figure 2.20: YOLOR comparison on MS-COCO Dataset [56].

For the backbone, Cross Stage Partial networks (CSP) are used (CSPDarknet53) [55]. For the neck, a version of PANet [29] is used to extract feature pyramids, which help the model generalize across object scales (big and small); in this case, a CSP-PAN and an SPPF block are used. The model head is identical to YOLOv3's. YOLOv5 uses similar image augmentation techniques, including mosaic, copy-paste, random affine transforms, etc. The authors chose the Leaky ReLU and Sigmoid activation functions for the middle/hidden layers.

There is no official paper for YOLOv5; however, some authors have compared YOLOv4 and YOLOv5, and the conclusions are close. Within the scope of this dissertation, inference speed is a big factor, and it is highly dependent on hardware. That is why authors like [39] opted for YOLOv5 for their autonomous aerial vehicle, since it provided a better mAP at a similar inference speed. The work of [19] directly compared YOLOv4, YOLOv5 and YOLOX; there, different results were obtained, with YOLOv5 outperforming YOLOv4-CSP in terms of accuracy with the larger model at 640x640 resolution. In the context of this dissertation, it is perhaps fairer to compare the lighter models, where YOLOv4 achieved higher accuracy but YOLOv5 a higher inference speed. The literature is not clear on which one is better, as the answer seems to depend on the dataset, hardware, model size and inference speed. A further discussion of these topics is made in the experiment section of this thesis.

With the creation of YOLOv5, significant value was added by translating the Darknet research framework to the PyTorch framework. This, combined with its open-source development, has made YOLOv5 extremely easy to test, train and deploy, making it very friendly to use compared to other versions of YOLO.
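That ease of use is simple to demonstrate. A minimal sketch of the documented PyTorch Hub workflow (the image URL is just the example from the Ultralytics docs):

```python
import torch

# Load a pre-trained YOLOv5 model from PyTorch Hub.
# 'yolov5s' is the small variant; 'yolov5m/l/x' are the larger ones.
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)

# Inference accepts a file path, URL, PIL image or numpy array.
results = model('https://ultralytics.com/images/zidane.jpg')

# One row per detection: xmin, ymin, xmax, ymax, confidence, class, name.
print(results.pandas().xyxy[0])
```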

YOLOR

With the huge success of YOLOv4, more papers followed, most notably YOLOR [56]. The goal of this network is to perform multiple tasks, such as object detection, instance segmentation and panoptic segmentation, with a single network. To achieve this, the authors relied on what they call explicit and implicit knowledge, fusing them in a unified model, as seen in Figure 2.19. Explicit knowledge is connected to the shallow layers of the neural network and is directly correlated with the observations that are made. Implicit knowledge is obtained from features in the deep layers. For explicit learning, the authors used DETR (Detection Transformer), non-local networks and a kernel selection method. For implicit learning, they opted for techniques such as:

  • Kernel Alignment
  • Manifold Space Reduction
  • Offset refinement
  • Anchor refinement

With these techniques, YOLOR achieved similar accuracy to the other SOTA models but an outstanding increase in inference speed, as seen in Figure 2.20 and Table 2.2.

Table 2.2: Comparison between State-Of-The-Art models and YOLOR on COCO test dataset [56].
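A piece of implicit knowledge can be as simple as a learned per-channel vector fused with the explicit features by addition or multiplication. A minimal PyTorch sketch of the additive variant (shapes and names are illustrative, not the authors' official code):

```python
import torch
import torch.nn as nn

class ImplicitAdd(nn.Module):
    """Learned implicit-knowledge vector added to explicit features
    (a sketch of the additive fusion idea in YOLOR)."""
    def __init__(self, channels):
        super().__init__()
        # One learnable value per channel, broadcast over the feature map.
        self.implicit = nn.Parameter(torch.zeros(1, channels, 1, 1))
        nn.init.normal_(self.implicit, std=0.02)

    def forward(self, x):
        return x + self.implicit

# Example: refine a hypothetical 256-channel feature map from the neck.
features = torch.randn(1, 256, 40, 40)
refined = ImplicitAdd(256)(features)
print(refined.shape)  # torch.Size([1, 256, 40, 40])
```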

YOLOX

YOLOX is an anchor-free version of YOLO with a simpler design but better performance. The main difference between this model and traditional YOLO is the anchor-free design, combined with advanced detection techniques, i.e., a decoupled head and the SimOTA label assignment strategy. This model achieves higher performance than YOLOv4/v5. A comparison between YOLOX and other algorithms can be seen in Figure 2.21.

Figure 2.21: Comparison of YOLOX with different SOTA algorithms [19].
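The decoupled head splits classification from box regression and objectness instead of predicting everything from one shared convolution. A minimal PyTorch sketch of the idea (channel widths and layer counts are illustrative, not the paper's exact head):

```python
import torch
import torch.nn as nn

class DecoupledHead(nn.Module):
    """Sketch of a YOLOX-style decoupled head: classification and
    box regression/objectness run through separate branches."""
    def __init__(self, in_channels, num_classes, width=256):
        super().__init__()
        self.stem = nn.Conv2d(in_channels, width, 1)  # unify channel width
        self.cls_branch = nn.Sequential(
            nn.Conv2d(width, width, 3, padding=1), nn.SiLU(),
            nn.Conv2d(width, num_classes, 1))         # per-class scores
        self.reg_branch = nn.Sequential(
            nn.Conv2d(width, width, 3, padding=1), nn.SiLU())
        self.box_pred = nn.Conv2d(width, 4, 1)        # anchor-free box offsets
        self.obj_pred = nn.Conv2d(width, 1, 1)        # objectness score

    def forward(self, x):
        x = self.stem(x)
        reg = self.reg_branch(x)
        return self.cls_branch(x), self.box_pred(reg), self.obj_pred(reg)

# Example: one 20x20 feature level, 80 COCO classes.
cls, box, obj = DecoupledHead(512, 80)(torch.randn(1, 512, 20, 20))
print(cls.shape, box.shape, obj.shape)
```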

Currently, additions of an anchor-free system are being made to YOLOR [56] to create an even better model. Since the YOLOX results were published at almost the same time as YOLOR, a fair comparison is not established by the authors in the papers. However, by analyzing the mAP values in the published papers, YOLOR offers higher values for the larger models, while YOLOX seems to perform very well on edge devices with its smaller models, like YOLOX-Tiny and YOLOX-Nano [19]. This version of YOLO keeps improving, and the authors are currently working on a version with a Swin Transformer-based backbone; another recent paper has shown that this is possible with great results. At the time of writing this dissertation, the only transformer-based YOLO version is VitYOLO [66], but its performance is very limited and does not compare to the other YOLO versions. However, both the YOLOX and YOLOR authors are running experiments with these transformer-based systems, as that seems to be the direction where the future of computer vision is headed.

Scaled-YOLOv4

Scaled-YOLOv4 [54] offers an extension to the traditional YOLOv4, allowing it to scale effectively while maintaining good performance. First, the authors redesigned YOLOv4, proposing YOLOv4-CSP, and later developed the scaled version. This scaling allows it to be used in a range of different applications with a good compromise between speed and accuracy, operating in real time as well as on embedded devices. The authors designed a powerful scaling method for small models that can systematically balance the computation cost and memory bandwidth of a shallow CNN, as well as a simple yet effective strategy for scaling a large object detector, analyzing the relations among all model scaling factors.
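To give a feel for what model scaling means in practice, here is a minimal sketch of compound width/depth scaling (the rounding rules are illustrative of the general idea, not the paper's exact formulas):

```python
import math

def scale_stage(base_depth, base_width, depth_mult, width_mult):
    """Sketch of compound scaling: multiply the number of blocks per
    stage (depth) and the channel count (width), rounding channels to
    a multiple of 8 for hardware-friendly tensor shapes."""
    depth = max(1, round(base_depth * depth_mult))           # blocks per stage
    width = int(math.ceil(base_width * width_mult / 8) * 8)  # channels
    return depth, width

# Example: scale a stage of 3 blocks / 256 channels up by 1.33x / 1.25x.
print(scale_stage(3, 256, 1.33, 1.25))  # (4, 320)
```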

YOLO Series Summary

All these models and improvements can be hard to grasp and compare, so a small, simplified summary is shown.

Table 2.3: Comparison between different 2D-3D models of object detection [1].

A table comparing some of the models described in this chapter is provided by [1] and shown in Table 2.3. The important thing to keep in mind is that these inference speed tests were not all made on the same hardware, and the author gives little description of how they were measured.

Articles:

Part 1: Object Detection State of the Art 2022

https://medium.com/@pedroazevedo6/object-detection-state-of-the-art-2022-ad750e0f6003

Part 2: From YOLO to YOLOv4

Thesis: To be added


Pedro Azevedo

Master's student at the University of Aveiro, Portugal. Focused on Deep Learning and Computer Vision for Autonomous Driving.