The evolution of the YOLO neural network family from v1 to v7.

Maxim Ivanov
Deelvin Machine Learning
11 min read · Oct 4, 2022

If you need a fast object detector, the YOLO family of neural network models is the de facto standard today.

There are also many other excellent models for object detection, but we will not cover them in this review.

Quite a few articles have already been written analyzing individual YOLO versions. The purpose of this article is a comparative analysis of the entire family: we want to trace how the architecture has evolved, understand which changes improved performance, and perhaps guess where it is heading next.

Before the advent of YOLO, the main approach to detecting objects in an image was to slide windows of various sizes over the original image and have a classifier say which part of the image contains which object. The approach is logical but very slow.

A little later, a dedicated stage appeared that proposed regions of interest: guesses about where something interesting might be in the image. But there were still too many of them, thousands per image. The fastest of those algorithms, Faster R-CNN, processed one image on average hardware in 0.2 seconds, which gives 5 frames per second. In general, things were rather bleak until a fundamentally new approach appeared.

What was the novelty?

In previous approaches, each pixel of the original image could be processed by the neural network hundreds or even thousands of times, and each time those pixels were passed through the same network and the same computations. Is it possible to avoid repeating the same calculations?

It turned out that it is. But for this the problem had to be reformulated slightly: what used to be a classification task became a regression task.

YOLO aka YOLOv1

Let us consider the very first YOLO model, also known as YOLOv1.

Authors

Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi

Main article

“You Only Look Once: Unified, Real-Time Object Detection”, publication date 2015/06

Repositories

In addition to the official repository based on the darknet framework, there are many implementations of varying popularity built on other common frameworks.

  1. https://pjreddie.com/darknet/yolov1/
  2. https://github.com/thtrieu/darkflow, 2.1k forks / 6k stars, GPL-3.0 license
  3. https://github.com/gliese581gg/YOLO_tensorflow, 670 forks / 1.7k stars, non-commercial license
  4. https://github.com/hizhangp/yolo_tensorflow, 455 forks / 784 stars
  5. https://github.com/nilboy/tensorflow-yolo, 331 forks / 780 stars
  6. https://github.com/abeardear/pytorch-YOLO-v1, 214 forks / 473 stars, MIT license
  7. https://github.com/dshahrokhian/YOLO_tensorflow, 22 forks / 42 stars

Performance comparison

Real-Time Systems on Pascal VOC 2007. Comparing the performance and speed of fast detectors. Fast YOLO is the fastest detector on record for Pascal VOC detection and is still twice as accurate as any other real-time detector. YOLO is 10 mAP more accurate than the fast version while still well above real-time in speed.

Architectural features

Structurally, YOLO models consist of the following parts:

  1. Input: the layer that receives the input image.
  2. Backbone: the part that encodes the input image into feature maps.
  3. Neck: additional layers that further process the features produced by the backbone.
  4. Head(s): one or more output layers that produce the model's predictions.
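A purely illustrative sketch of this decomposition in PyTorch-style pseudocode (the class and module names here are placeholders, not any particular YOLO implementation):

```python
import torch.nn as nn

class Detector(nn.Module):
    """Generic YOLO-style layout: input image -> backbone -> neck -> head(s)."""
    def __init__(self, backbone: nn.Module, neck: nn.Module, head: nn.Module):
        super().__init__()
        self.backbone = backbone  # encodes the image into feature maps
        self.neck = neck          # further processes / fuses the feature maps
        self.head = head          # produces the detection tensors

    def forward(self, image):
        features = self.backbone(image)
        features = self.neck(features)
        return self.head(features)
```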

The first version of the network is based on the GoogLeNet architecture: a cascade of convolutional layers interleaved with max-pooling layers, ending with two fully connected layers.

The authors also trained a faster version, Fast YOLO, with fewer convolutional layers (9 instead of 24). The detection input resolution of both models was 448x448, but the main part of the network was pre-trained as a classifier at a resolution of 224x224.

In this architecture, the original image is divided into S x S cells (7 x 7 in the original), and each cell predicts B bounding boxes, a confidence score for the presence of an object in each of those bboxes, and probabilities for C classes. The number of cells per side is odd so that there is a single cell at the center of the image. This has an advantage over an even number: a photo often has one main subject near the center, in which case the main predictions are made by the central cell. With an even number of cells, the center would fall somewhere among the four central cells, which would lower the network's confidence.

The confidence value reflects how sure the model is that the given bbox contains an object and how accurately, in its opinion, the bbox localizes it. Formally, it is the product of the probability that an object is present and IoU(truth, pred). If there is no object in the cell, the confidence should be zero.

Each bbox consists of 5 numbers: x, y, w, h, and confidence. (x, y) are the coordinates of the bbox center within the cell; w and h are the width and height of the bbox relative to the dimensions of the whole image, i.e. normalized to values from 0 to 1. Confidence is the IoU between the predicted bbox and the ground truth one. Each cell also predicts C conditional class probabilities; only one set of class probabilities is predicted per cell, regardless of the number of bboxes B.

Thus, S*S*B bounding boxes are predicted in a single pass. Most of them have low confidence, and by setting a suitable threshold we can get rid of a significant portion of them. Most importantly, the detection speed (compared to the competitors) increased by an order of magnitude, which is quite logical: all bboxes for all classes are now predicted in a single pass, hence the name You Only Look Once. For different implementations, the original article reports from 45 to 155 (!) FPS on a Titan X GPU.
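To make these numbers concrete, a quick calculation for the original configuration (S = 7, B = 2, C = 20 classes for Pascal VOC):

```python
S, B, C = 7, 2, 20
values_per_cell = B * 5 + C                   # 2 * 5 + 20 = 30
output_tensor_size = S * S * values_per_cell  # 7 * 7 * 30 = 1470 numbers per image
total_boxes = S * S * B                       # 98 candidate boxes per pass
print(output_tensor_size, total_boxes)
```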

And although mAP dropped slightly compared to previous algorithms, in some applications real-time detection is more important.

Getting bboxes.

Since cells adjacent to the one containing the object's center can also produce bboxes, there is an excess of them, and the best ones have to be selected. For this, non-maximum suppression (NMS) is used, which works as follows. All bboxes of a given class are taken from the image, and those whose confidence is below a given threshold are discarded. The remaining ones are compared pairwise by IoU: if the IoU of two boxes is greater than 0.5, the box with the lower confidence is discarded; otherwise both stay on the list. In this way, near-duplicate bboxes are thinned out.
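A minimal Python sketch of this greedy per-class procedure (an illustrative reimplementation, not the original darknet code; boxes are assumed to be in (x1, y1, x2, y2) format, and the threshold defaults are just examples):

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, conf_thresh=0.25, iou_thresh=0.5):
    """Greedy per-class NMS: drop low-confidence boxes, then suppress
    boxes whose IoU with a higher-scoring kept box exceeds iou_thresh."""
    # keep only boxes above the confidence threshold, sorted by score (descending)
    idxs = [i for i in np.argsort(scores)[::-1] if scores[i] >= conf_thresh]
    keep = []
    while idxs:
        best = idxs.pop(0)
        keep.append(best)
        idxs = [i for i in idxs if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep  # indices of the surviving boxes
```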

The loss function is composite and has the following form:
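Written out (this is the composite loss from the original paper; the indicator functions and lambda coefficients are explained just below):

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
  \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
 &+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
  \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
 &+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left(C_i - \hat{C}_i\right)^2
  + \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} \left(C_i - \hat{C}_i\right)^2 \\
 &+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in \text{classes}} \left(p_i(c) - \hat{p}_i(c)\right)^2
\end{aligned}
```

with lambda_coord = 5 and lambda_noobj = 0.5 in the paper.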

The first term is the loss for the coordinates of the object's center, the second for the dimensions of the bbox, the third for the confidence when an object is present in the cell, the fourth for the confidence when the object is absent, and the fifth for the class probabilities of the object.

The lambda coefficients are needed to keep the confidence from being pushed to zero, since most cells contain no objects. 1(obj, i) indicates whether the center of an object falls in cell i, and 1(obj, i, j) indicates that the j-th bbox in cell i is responsible for that prediction.

Advantages

  • High speed
  • Better generalization than its competitors at the time: testing on a different domain (artwork, after training on natural images) showed better performance.
  • Fewer false positives on the background part of the image.

Limitations

  • Limits of 2 bboxes and a single object class per cell, which means that groups of small objects (such as flocks of birds) are detected poorly.
  • Several successive downsamplings of the original image result in relatively coarse features, which limits accuracy.
  • The loss penalizes errors on large and small bboxes equally. The authors tried to compensate by predicting the square root of the width and height, but this did not eliminate the effect completely.

YOLOv2 / YOLO9000

Authors

Joseph Redmon, Ali Farhadi

Main article

“YOLO9000: Better, Faster, Stronger”, publication date 2016/12

Repositories

  1. https://pjreddie.com/darknet/yolov2/
  2. https://github.com/experiencor/keras-yolo2, 795 forks / 1.7k stars, MIT license
  3. https://github.com/longcw/yolo2-pytorch, 417 forks / 1.5k stars
  4. https://github.com/philipperemy/yolo-9000, 309 forks / 1.1k stars, Apache-2.0 license

Performance Comparison

Detection frameworks on Pascal VOC 2007. YOLOv2 is faster and more accurate than prior detection methods. It can also run at different resolutions for an easy tradeoff between speed and accuracy. Each YOLOv2 entry is actually the same trained model with the same weights, just evaluated at a different size. All timing information is on a Geforce GTX Titan X (original, not Pascal model).

Architectural Features

The authors made a number of improvements over the first version of the model.

  1. Removed dropout and added batch normalization to all convolutional layers.
  2. Pre-trained the classifier at 448x448 resolution (YOLOv1 pre-trained at 224x224), then shrank the detection input to 416x416 to get an odd number of cells, 13x13.
  3. Removed the fully connected layers. Instead, fully convolutional layers and anchors are used to predict bboxes (as in Faster R-CNN). This preserves more spatial information than the fully connected layers in v1.
  4. Removed one maxpool to increase the resolution of the features. In v1 there were only 98 bboxes per image; with anchors in v2 there are more than a thousand. mAP dipped slightly, but recall increased significantly, which leaves room to improve overall accuracy.
  5. Dimension priors. The anchor box sizes are not hand-picked, as in Faster R-CNN, but chosen automatically by k-means clustering over the training-set bboxes. With standard k-means and Euclidean distance, the detection error on small bboxes was higher, so a different distance metric was used: 1 - IoU(box, centroid). Five clusters were chosen as a compromise; testing showed that 5 centroids selected this way give roughly the same average IoU as 9 hand-picked anchors. A sketch of such clustering is given after the figure caption below.
  6. Direct location prediction. With anchors, training was initially unstable when predicting the center coordinates (x, y): the network weights are initialized randomly, and the coordinate prediction was linear with unbounded coefficients. Instead of predicting an offset relative to the anchor center, where the valid range of the coefficient is [-1, 1], the bbox is predicted relative to the cell, with the coefficient constrained to [0, 1] by a sigmoid. The network predicts 5 bboxes for each cell, and for each bbox 5 numbers: tx, ty, tw, th, to. The predicted bbox parameters are computed as follows:
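In the paper's notation, (c_x, c_y) is the offset of the cell from the top-left corner of the image, (p_w, p_h) are the width and height of the prior, and sigma is the sigmoid:

```latex
\begin{aligned}
b_x &= \sigma(t_x) + c_x \\
b_y &= \sigma(t_y) + c_y \\
b_w &= p_w \, e^{t_w} \\
b_h &= p_h \, e^{t_h} \\
\Pr(\text{object}) \cdot \text{IoU}(b, \text{object}) &= \sigma(t_o)
\end{aligned}
```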

Bounding boxes with dimension priors and location prediction. We predict the width and height of the box as offsets from cluster centroids. We predict the center coordinates of the box relative to the location of the filter application using a sigmoid function.
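As referenced in the dimension priors item above, here is a minimal illustrative sketch of clustering box sizes with the 1 - IoU distance (this is not the authors' code; box sizes are assumed to be an (N, 2) array of widths and heights, and centroids are updated with the per-cluster median):

```python
import numpy as np

def wh_iou(wh, centroids):
    """IoU between boxes and centroids when both are anchored at the origin,
    so only width and height matter. wh: (N, 2), centroids: (k, 2)."""
    inter = np.minimum(wh[:, None, 0], centroids[None, :, 0]) * \
            np.minimum(wh[:, None, 1], centroids[None, :, 1])
    union = wh[:, None, 0] * wh[:, None, 1] + \
            centroids[None, :, 0] * centroids[None, :, 1] - inter
    return inter / union  # shape (N, k)

def kmeans_priors(wh, k=5, iters=100, seed=0):
    """k-means over box widths/heights with distance d = 1 - IoU(box, centroid).
    Assumes no cluster ever becomes empty (fine for a sketch)."""
    rng = np.random.default_rng(seed)
    centroids = wh[rng.choice(len(wh), size=k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(1.0 - wh_iou(wh, centroids), axis=1)
        new_centroids = np.array([np.median(wh[assign == j], axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids

# usage: wh is an (N, 2) array of ground-truth box sizes, e.g. normalized to [0, 1]
# priors = kmeans_priors(wh, k=5)
```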

7. Fine-grained features. The final feature map is now 13x13. To better detect small objects, a passthrough layer is added that concatenates features from an earlier 26x26 layer with the low-resolution 13x13 features.

8. Multi-scale training. Since the network is fully convolutional, its resolution can be changed on the fly simply by changing the resolution of the input image. To make the network more robust, the input resolution was changed every 10 batches. Because the network downsamples by a factor of 32, the input resolution is drawn from the set {320, 352, ..., 608}; the network was resized to sizes from 320x320 to 608x608 and training continued.
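A minimal sketch of this schedule (illustrative only, not the actual darknet training loop): every 10 batches a new input size is drawn from the multiples of 32 between 320 and 608.

```python
import random

STRIDE = 32
SIZES = list(range(320, 608 + STRIDE, STRIDE))  # [320, 352, ..., 608]

def pick_input_size(batch_idx, current_size):
    """Draw a new training resolution every 10 batches, otherwise keep the current one."""
    if batch_idx % 10 == 0:
        return random.choice(SIZES)
    return current_size
```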

9. Acceleration. VGG-16, the typical backbone of detectors at that time, was too heavy (v1 itself used a custom GoogLeNet-based network), so in the second version Darknet-19 was used instead:

After training the classifier, the last convolutional layer was removed from the network, and three 3x3 convolutional layers with 1024 filters plus a final 1x1 convolution with the number of outputs required for detection were added. For VOC this is 5 bboxes with 5 coordinates each and 20 classes per bbox, 125 filters in total.
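The filter count of that final 1x1 convolution is simply anchors x (box parameters + classes); a quick check for the VOC setting:

```python
num_anchors = 5   # priors per cell
box_params = 5    # x, y, w, h, confidence
num_classes = 20  # Pascal VOC
filters = num_anchors * (box_params + num_classes)
print(filters)    # 5 * (5 + 20) = 125
```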

10. Hierarchical classification. While in v1 the classes belonged to the same category of objects and were mutually exclusive, in v2 a tree structure (WordTree) built from the WordNet graph was introduced. The classes within each category are mutually exclusive and have their own softmax. Thus, if the picture shows a dog of a breed known to the network, the network returns a class for both the dog and the specific breed; if the breed is unknown to the network, only the dog class is returned. In this way YOLO9000 was trained, which is v2 with 3 priors instead of 5 and 9418 object classes.

Prediction on ImageNet vs WordTree. Most ImageNet models use one large softmax to predict a probability distribution. Using WordTree, we perform multiple softmax operations over co-hyponyms.
Combining datasets using WordTree hierarchy. Using the WordNet concept graph, we build a hierarchical tree of visual concepts. Then we can merge datasets together by mapping the classes in the dataset to synsets in the tree. This is a simplified view of WordTree for illustration purposes.
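A toy illustration of how a leaf probability is obtained under WordTree (the tree and the numbers here are hypothetical, not the real hierarchy): the probability of a class is the product of conditional probabilities along the path from that node to the root.

```python
# parent links of a tiny hypothetical tree: breed -> dog -> animal -> (root)
parent = {"norfolk_terrier": "dog", "dog": "animal", "animal": None}
# conditional probabilities P(node | its parent) from the per-level softmaxes
cond_prob = {"norfolk_terrier": 0.6, "dog": 0.9, "animal": 0.95}

def class_probability(node):
    """Multiply the conditional probabilities along the path from the node to the root."""
    p = 1.0
    while node is not None:
        p *= cond_prob[node]
        node = parent[node]
    return p

print(class_probability("norfolk_terrier"))  # 0.6 * 0.9 * 0.95 = 0.513
```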

Advantages

  • Now SotA not only in terms of speed, but also in terms of mAP
  • Small objects are detected better than in v1

Limitations

not found

YOLOv3

Authors

Joseph Redmon, Ali Farhadi

Main article

“YOLOv3: An Incremental Improvement”, publication date 2018/04

Repositories

  1. https://pjreddie.com/darknet/yolo/, all-permissive license
  2. https://github.com/ultralytics/yolov3, 3.3k forks / 8.9k stars, GPL-3.0 license
  3. https://github.com/eriklindernoren/PyTorch-YOLOv3, 2.6k forks / 6.8k stars, GPL-3.0 license

Performance Comparison

YOLOv3 runs significantly faster than other detection methods with comparable performance. Times from either an M40 or Titan X, they are basically the same GPU.

Architectural Features

This is an incremental update of the model: there are no radical changes, just a set of small improvements.

  1. The objectness score, i.e. the probability that a given bbox contains an object, is now also computed with a sigmoid for each bbox.
  2. The authors switched from multiclass to multilabel classification, so softmax was dropped in favor of independent binary cross-entropy per class.
  3. Predictions are made for bboxes at three scales; the output tensor at each scale has size N * N * (3 * (4 + 1 + num_classes)) (see the sketch below).
  4. The authors recalculated the priors using k-means and got 9 bboxes spread across the three scales.
  5. New, deeper, and more accurate backbone / feature extractor, Darknet-53.

6. In terms of accuracy, it is comparable to ResNet-152, but requires almost 1.5 times fewer operations and produces 2 times higher FPS, due to more efficient use of the GPU.
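As a quick sanity check of the output tensor sizes mentioned in the list above, for the common 416x416 COCO configuration (80 classes, 3 anchors per scale, strides 32, 16, and 8):

```python
input_size, num_classes, anchors_per_scale = 416, 80, 3
channels = anchors_per_scale * (4 + 1 + num_classes)  # 3 * 85 = 255
for stride in (32, 16, 8):
    n = input_size // stride                           # 13, 26, 52
    print(f"stride {stride}: output {n} x {n} x {channels}")
```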

General architecture:

Approaches that didn’t work

  • Predicting bbox coordinate offsets with a linear activation instead of a logistic one.
  • Focal loss: mAP fell by 2 points.
  • Dual IoU thresholds for assigning ground truth, as in Faster R-CNN (>0.7 positive, 0.3–0.7 ignored, <0.3 negative).

Advantages

  • Detection accuracy at the time of release was higher than that of competitors
  • Detection speed at the time of release was higher than that of competitors

Limitations

not found

In the next part we will consider v4, v5, PP-YOLOs, and YOLOX. Stay tuned!
