YOLO v3 for object detection

A gentle approach

pierluigidibari
Data Reply IT | DataTech
Jul 28, 2022


Example of object detection. Original photo by Heather M. Edwards on Unsplash

What is YOLO v3?

YOLO v3 is a popular Convolutional Neural Network (CNN) for real-time object detection, published in 2018 by J. Redmon et al. At release time it represented the state of the art for this task, combining accuracy and speed like never before, and even today it remains a more than valid tool. On a Titan X GPU, it processes 320x320 images at about 45 fps (22 ms per inference). Given an input image, the goal of object detection is not only to recognize the subjects within it, but also to locate them through bounding boxes. Object detection has found many uses over time, heavily affecting several industries: think of vehicle or pedestrian detection for autonomous driving, animal detection for forest monitoring, or the detection of specific components in manufacturing.

You Only Look Once.

The name YOLO stands for “You Only Look Once” and refers to the monolithic architecture of the model. The YOLO family introduced the first end-to-end network of its kind: input images are consumed in a single stage, during which object locations and the related classes are detected at once. This guarantees enormous advantages in terms of training and inference speed. Before YOLO, competing models processed their inputs heavily, taking close to a minute to elaborate a single image and making real-time applications impossible. For instance, given an input image, R-CNN (Region-based CNN) applies a selective search that progressively merges similar regions to extract about 2000 candidate region proposals. Only at this point does each of the selected regions pass through a CNN to classify the objects within it.

YOLO v4, YOLO v5 or PP-YOLO?

Actually, YOLO v3 is not the last product of the YOLO family. Several versions were published later, among the best known YOLO v4, YOLO v5 and PP-YOLO, all by different authors and all released in 2020 within a few months of each other. In fact, the story behind them is rather controversial. YOLO v4 is the closest to YOLO v3, building on its DarkNet-53 backbone. PP-YOLO is a re-implementation of YOLO v3 based on the deep learning framework PaddlePaddle. YOLO v5, instead, is the most debated of the three in the research community, since it was published without peer review and because of its bold performance claims. YOLO v3 is the last work handled directly by J. Redmon, who can be considered the father of the YOLO family. He is said to have left research in this field after seeing his work used for military purposes he did not approve of, and he is reported to have recognized only YOLO v4 as the worthy successor of YOLO v3. So why delve into a model that is no longer state of the art? Because of the impact it had on Computer Vision, and because it laid the foundations for all subsequent architectures: understanding YOLO v3 makes it easier to comprehend, and choose between, the latest object detection models.

Timeline of the YOLO publications

Architecture.

Despite its simple structure, there is much to say about the YOLO v3 architecture and how it works. The following sections show how an input image is fully processed by YOLO v3 at inference time, focusing on some of the model's key points.

YOLO v3 architecture

CNN.

Let’s start from the basics: YOLO v3 is a CNN. Oversimplifying, a CNN is a deep neural network designed primarily to process images. It differs from a more classical neural network in the presence of convolutional layers. Each convolutional layer can be thought of as a set of filters (such as a Gaussian blur or an edge detector) applied to the image to extract useful features; unlike hand-crafted filters, these are learned during training. A filter consists of a matrix called a kernel, usually of size 3x3 or 1x1 as in YOLO v3, which is applied to the image through a convolution operation. The kernel slides over the whole image with a given stride (the number of pixels it shifts at each step), processing the underlying content at each position. Kernel size and stride determine the dimensions of the filter output.

Structure of a CNN
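Purely for illustration (this snippet is not part of YOLO v3), a minimal NumPy sketch of a single-channel convolution shows how kernel size and stride determine the output size:

import numpy as np

def conv2d(image, kernel, stride=1):
    """Valid convolution of a single-channel image with a square kernel."""
    k = kernel.shape[0]
    h, w = image.shape
    out_h = (h - k) // stride + 1            # output height
    out_w = (w - k) // stride + 1            # output width
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = np.sum(patch * kernel)   # element-wise product, then sum
    return out

edge_kernel = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]])       # a simple hand-crafted edge-detector kernel
image = np.random.rand(64, 64)               # small dummy grayscale image
print(conv2d(image, edge_kernel, stride=1).shape)   # (62, 62): (64 - 3) // 1 + 1 per side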

DarkNet-53.

DarkNet-53 is the YOLO v3 backbone, responsible for feature extraction. It is a CNN made up of 52 convolutional layers (53 layers in total) and many skip connections. Skip connections are another key concept of YOLO v3. It is empirically known that deeper networks struggle to learn even simple functions such as the identity: these functions are well approximated in the very first layers, but their quality degrades as the network gets deeper, and skip connections solve this problem. Usually, in a neural network, the output of a layer constitutes the input of the next one. Skip connections allow the output of a layer to bypass the immediately following layers, so that it can be considered again later. With this approach, the simpler information is preserved by skipping the layers, while more complex information is extracted by passing through them as usual; the two are then combined with a sum operation. The group of bypassed layers, together with this sum, is referred to as a residual block.

Structure of a Residual block
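Purely as an illustration (the functions below are toy stand-ins, not the real DarkNet-53 layers), a residual block boils down to adding the unchanged input back to the output of the bypassed layers:

import numpy as np

def residual_block(x, block_layers):
    """Skip connection: x bypasses the inner layers and is added back to their output."""
    out = x
    for layer in block_layers:
        out = layer(out)       # "complex" information, processed by the layers
    return x + out             # "simple" information, carried over unchanged

# toy stand-ins for the 1x1 and 3x3 convolutions of a residual block;
# the real layers are learned convolutions that preserve the spatial shape
toy_layers = [lambda x: np.maximum(0, 0.5 * x),
              lambda x: np.maximum(0, 2.0 * x)]

x = np.random.rand(19, 19, 1024)
y = residual_block(x, toy_layers)
print(y.shape)   # (19, 19, 1024): same shape as the input, so the sum is well defined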

Going back to YOLO v3, it is made up of DarkNet-53 plus 53 additional layers, for a total of 106 layers.

Input.

Now, let’s talk about the network input. In the case we will examine, YOLO v3 takes input images of size 608x608 times the number of channels (3 for RGB images, 1 for grayscale ones). As we will see, these dimensions are not arbitrary. Of course, the network can process images of any original size, as long as they are first resized to 608x608; you can either simply stretch the entire image, or resize and pad (or crop) it in order to keep the original aspect ratio.

Passing through DarkNet-53, the extracted feature maps are scaled down to 76x76, 38x38 and 19x19, corresponding to strides of 8, 16 and 32 respectively. Note that the input size of 608 is perfectly divisible by 32, which yields exactly the scales just described; in fact, the network can accept any input whose dimensions are a multiple of 32.
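A quick sanity check of these numbers in Python:

input_size = 608
strides = [8, 16, 32]                        # down-sampling factors of the three scales
print([input_size // s for s in strides])    # [76, 38, 19]
print(input_size % 32 == 0)                  # True: any multiple of 32 is a valid input size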

We will feed YOLO v3 with the image of a nice orange-chinned parakeet. 🐦

Anchor boxes.

Anchor boxes are another key point of YOLO v3. They represent ideal bounding boxes and are obtained by clustering the ground-truth boxes of the training examples. Anchor boxes therefore describe the most recurring widths and heights in the ground truth. For each detection scale (76x76, 38x38 and 19x19), three anchor boxes are extracted, for a total of nine boxes. Anchor boxes are used at inference time to guide the definition of the candidate bounding box size: in particular, the size of a bounding box is expressed relative to the width and height of the most suitable anchor box. Indeed, it has been observed that it is easier for the network to predict box sizes relative to anchor boxes than to predict them completely from scratch. A minimal sketch of the clustering idea is shown below.
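Here is a minimal NumPy sketch of this clustering idea, run on dummy (width, height) pairs with a plain Euclidean k-means; the original YOLO papers actually use a k-means variant with an IoU-based distance, but the principle is the same:

import numpy as np

def kmeans_anchors(wh, k=9, iters=100, seed=0):
    """Cluster ground-truth box sizes (width, height) into k anchor boxes."""
    rng = np.random.default_rng(seed)
    anchors = wh[rng.choice(len(wh), size=k, replace=False)]   # random initial anchors
    for _ in range(iters):
        # assign each box to its nearest anchor (Euclidean distance on (w, h))
        dists = np.linalg.norm(wh[:, None, :] - anchors[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # move each anchor to the mean size of the boxes assigned to it
        for j in range(k):
            if np.any(labels == j):
                anchors[j] = wh[labels == j].mean(axis=0)
    return anchors

wh = np.random.rand(1000, 2) * 300 + 20      # dummy ground-truth box sizes, in pixels
print(kmeans_anchors(wh).round(1))           # nine (w, h) pairs, three per detection scale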

1st object detection.

Let’s move to the tastiest section: the object detection steps. YOLO v3 performs three stages of object detection at three distinct scales, in order to recognize objects of different dimensions. The first detection occurs at the 82nd layer and is responsible for recognizing the largest subjects within the image. Here, after passing through DarkNet-53, the feature map has a size of 19x19x1024. For the sake of simplicity, we will now refer to the feature map as an image of size 19x19, ignoring its depth. This image undergoes a convolution with a kernel size of 1x1, whose output has dimensions 19 x 19 x (B x (5 + C)), where 19x19 refers to the current width and height of the image. Let’s examine in more detail what (B x (5 + C)) means:

  • B stands for the number of bounding boxes predicted for each image cell. In other terms, for each cell of the image YOLO v3 predicts B candidate bounding boxes. In YOLO v3, B is equal to 3.
  • 5 is the number of attributes describing each candidate bounding box. They are:
    - (tx, ty): the offset of the centre of the bounding box with respect to the cell of interest.
    - (tw, th): the size of the bounding box, expressed relative to the most suitable anchor box (for instance, if the anchor box has size 100x100 and the bounding box has size 90x80, the width and height scale factors are 0.9 and 0.8; strictly speaking, the network predicts the logarithms of these scale factors).
    - p₀: the objectness score. It describes the probability that the cell of interest actually contains the centre of an object, i.e. how significant the candidate box is.
  • C stands for the number of recognizable classes. The C values {p₁, p₂, …, p_C} describe the probability that the bounding box belongs to each of the known classes. Since most YOLO v3 distributions are pre-trained on the COCO dataset, which contains 80 different classes, in our case C is equal to 80. With B = 3 and C = 80, the output of the detection layer therefore has depth 3 x (5 + 80) = 255.

To recap, three bounding boxes are predicted for each image cell, for a total of 19x19x3 = 1083 bounding boxes at the first detection stage. Each bounding box is described by the values (tx, ty, tw, th, p₀, {p₁, p₂, …, p_C}). The class score of each bounding box is given by multiplying the objectness score by the highest class probability, p₀ · max({p₁, p₂, …, p_C}). A sketch of this decoding is shown below.
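To make this concrete, here is a small sketch of how one cell's raw outputs could be decoded into a pixel-space box and a class score, following the formulas of the YOLO v3 paper (sigmoid on the centre offsets and objectness, exponential scaling of the anchor size); the cell position, anchor and stride used in the example are arbitrary illustrative values:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_box(raw, cell_xy, anchor_wh, stride):
    """Decode one candidate box from the 85 raw values of a cell:
    (tx, ty, tw, th, objectness, 80 class scores)."""
    tx, ty, tw, th, t0 = raw[:5]
    class_logits = raw[5:]
    bx = (sigmoid(tx) + cell_xy[0]) * stride     # box centre x, in pixels
    by = (sigmoid(ty) + cell_xy[1]) * stride     # box centre y, in pixels
    bw = anchor_wh[0] * np.exp(tw)               # box width: scaled anchor width
    bh = anchor_wh[1] * np.exp(th)               # box height: scaled anchor height
    objectness = sigmoid(t0)
    class_probs = sigmoid(class_logits)          # independent sigmoids, one per class
    score = objectness * class_probs.max()       # class score of this bounding box
    return (bx, by, bw, bh), score, class_probs.argmax()

raw = np.random.randn(85)                        # dummy raw outputs for one box
box, score, cls = decode_box(raw, cell_xy=(10, 7), anchor_wh=(116, 90), stride=32)
print(box, score, cls)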

Result of the first object detection stage

2nd and 3rd object detection.

Once the first detection is over, let’s take a step back to just before it. The 19x19 feature map is up-sampled to 38x38 and passes through several convolutional layers and skip connections. Before the second detection, it is concatenated with the output of the 61st layer, which has the same size of 38x38. The reason is that the current feature map can be thought of as a low-resolution image whose features have been heavily processed, while the output of the 61st layer has a higher resolution but less processed features; combining the two can improve the detection results. The second object detection occurs at the 94th layer and is responsible for recognizing the medium-sized subjects within the image. It works exactly like the first detection, and this time 38x38x3 = 4332 candidate bounding boxes are extracted. A shape sketch of the up-sampling and concatenation step is shown below.
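A rough, shape-only sketch of this up-sample-and-concatenate step (the channel counts below are illustrative, not the exact ones used in the network):

import numpy as np

coarse = np.random.rand(19, 19, 256)   # deep, heavily processed, low-resolution features
fine = np.random.rand(38, 38, 512)     # layer-61 output: higher resolution, less processed

# nearest-neighbour 2x up-sampling of the coarse feature map
upsampled = coarse.repeat(2, axis=0).repeat(2, axis=1)   # now 38x38x256

# concatenation along the channel axis combines the two kinds of information
merged = np.concatenate([upsampled, fine], axis=-1)
print(merged.shape)   # (38, 38, 768)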

In the same way, before the third detection, the 38x38 feature map is up-sampled to 76x76, passes through further convolutional layers and skip connections, and is concatenated with the output of the 36th layer, which has the same size of 76x76. Finally, the third and last object detection occurs at the 106th layer and is responsible for recognizing the smallest subjects within the image; 76x76x3 = 17328 bounding boxes are extracted at this stage.

Non-maximum suppression.

Only one step away, we are nearly there! After the three stages of object detection are completed, we have a total of 1083+4332+17328 = 22743 candidate bounding boxes, each with its own class score. Of course, not all of them represent a good result, so we need to filter them. The non-maximum suppression algorithm considers a set B of all the candidate bounding boxes and a set D of definitive bounding boxes, initially empty. The bounding box with the greatest score is moved from B to D, then it is compared to all the other boxes left in B in terms of IoU. The IoU (Intersection over Union) is a metric adopted in detection and segmentation tasks that describes the overlap between two bounding boxes (usually the ground truth and the predicted one). If the IoU is greater than a certain threshold, the box in B is discarded, since it is too similar to the one just extracted and has a lower score. Once the comparison is over, the procedure is repeated, moving the next best bounding box from B to D, until B is empty. Finally, you can further filter the bounding boxes in D by rejecting those with a score lower than a fixed threshold. A minimal Python implementation is sketched after the pseudocode below.

B: set of candidate bounding boxes, D: set of definitive bounding boxes
1. Consider the bounding box b with maximum score in B and move it to D
2. Compare b with all the other bounding boxes bi in B
— If bi is too similar to b in terms of IoU, remove bi from B
3. Repeat from 1. until B is empty
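A minimal NumPy sketch of this procedure, assuming the candidate boxes have already been converted to (x1, y1, x2, y2) corner coordinates and paired with their class scores:

import numpy as np

def iou(box, boxes):
    """Intersection over Union between one box and an array of boxes,
    all in (x1, y1, x2, y2) corner format."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the best-scoring boxes (set D), discarding overlapping lower-scoring ones."""
    order = scores.argsort()[::-1]        # B, indices sorted by descending score
    keep = []                             # D, the definitive boxes
    while order.size > 0:
        best = order[0]
        keep.append(best)                 # move the best box from B to D
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]   # drop boxes too similar to the best
    return keep

boxes = np.array([[10, 10, 100, 100], [12, 12, 98, 99], [200, 200, 300, 320]], dtype=float)
scores = np.array([0.90, 0.80, 0.75])
print(non_max_suppression(boxes, scores))   # [0, 2]: the second box is suppressed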

Here we go! The prediction is now complete. 😎

Final result of the object detection. Original photo by Zdeněk Macháček on Unsplash

Demo.

As DarkNet is open source, there are many ready-to-use versions of YOLO v3 for different use cases. For instance, you can easily find Python notebooks for performing object detection on images, videos or even in real time through a webcam. Besides this, there are also tons of guides and tutorials explaining how to use YOLO v3, re-train it, perform transfer learning and even build it from scratch. Given the amount of material available, in this section we will simply show how easily a pre-trained model can be used to perform object detection on images, with very few lines of code. In this demo we will use the original DarkNet framework (https://github.com/pjreddie/darknet); a widely used, actively maintained fork is also available at https://github.com/AlexeyAB/darknet. You can test the following code in a Google Colab notebook.

First, you need to clone the GitHub repository and move into it. You can enable the use of GPU and OpenCV by editing the Makefile; otherwise, the model will run on the CPU. After that, run make to compile the framework, which is mainly written in C.

! git clone https://github.com/pjreddie/darknet
% cd darknet
! sed -i 's/GPU=0/GPU=1/g' Makefile
! sed -i 's/OPENCV=0/OPENCV=1/g' Makefile
! make

After that, you need to download the weights of the pre-trained model. In particular, the weights we will use are the result of training on COCO.

! wget https://pjreddie.com/media/files/yolov3.weights

You're almost done! Now you only have to run the model. You can choose one of the images already available in darknet/data or upload your own.

! ./darknet detect cfg/yolov3.cfg yolov3.weights data/giraffe.jpg

That’s all! You performed object detection with YOLO v3 in just 7 lines of code. The output shows some information such as the prediction time and the confidence scores of the recognized objects; the prediction time changes considerably depending on whether you use the CPU or the GPU. Finally, you can visually appreciate the result of the prediction by opening the image darknet/predictions.jpg.

Examples of object prediction performed on some images in darknet/data.

Conclusions.

Here we are at the end of this journey with YOLO v3! As we have seen, YOLO v3 is a powerful CNN for real-time object detection. Its family introduced the first end-to-end network of its kind, and it laid the foundations for all subsequent architectures. We analysed how it works by following the full processing of an input image, from its acquisition to the final prediction. This gave us an overview of the YOLO v3 architecture and let us delve into some key features, such as skip connections, anchor boxes and the non-maximum suppression algorithm. This was a “gentle approach” to YOLO v3, but there is still much to be said. For instance, given its simple structure, much of the impressive predictive power of YOLO v3 comes from the loss function used during training: a composite function that combines coordinate, objectness and classification losses, all of which are minimized together. Finally, we have seen how easy it is to use a pre-trained model for prediction and how many solutions are available online for the most diverse use cases. In conclusion, despite its four years of age, YOLO v3 is still a more than valid tool for object detection tasks. Of course, it no longer represents the state of the art: to date, there are many solutions able to satisfy different requirements in terms of accuracy, fps, computational cost, portability and so on. However, the goal of this article was to provide a comprehensible description of how YOLO v3 works, so that anyone approaching object detection can understand the models based on it and choose the best one for their needs.

Thanks for reading, I hope it was useful! Have fun with YOLO v3! 🎉
