YOLO: You Only Look Once.

Shiv Vignesh
Published in Analytics Vidhya · Jun 10, 2020 · 12 min read

YOLO is an object detection algorithm that performs classification and object localization (detection) at the same time, looking at the image just once. Hence the name: You Only Look Once.

1. Introduction

Over the years, the field of computer vision has grown alongside us: Instagram filters, Google Lens and Tesla cars are all products built on computer vision algorithms. In this article, I will explain the working principle behind the most popular object detection algorithm, YOLOv3. But before that, let me explain the difference between the classification task and object detection.

Note: All three versions of YOLO share a similar working principle, with minor changes in the network architecture that improve overall performance.

Prerequisites

To thoroughly understand this article, you should:

  • Understand the working behind Convolutional Neural Networks.
  • Be able to construct simple neural networks with ease.
  • Most importantly, have the desire to learn.

2. Classification vs Object Detection

Classification is the task of predicting the class of an object in an image. For example, an image classification network might be trained to classify a deck of cards, or to differentiate between an image of a cat and an image of a dog.

Object detection is the task of identifying both the location of an object in the image and its class. The object is enclosed in a rectangular box that also indicates its class.
For example, if one wants to count the number of cars at a traffic junction or count the number of faces while clicking a selfie, object detection is employed. Classification cannot be applied directly to a raw image feed in real-world scenarios; it is always accompanied by detection or segmentation.

3. Working Principle of YOLO

YOLO uses a single CNN to predict the classes of objects as well as to detect the location of the objects by looking at the image just once. Let us first look at the network architecture of YOLO.

fig 1: YOLO network architecture

Let me describe the above diagram (fig 1) in words. The YOLO network accepts an image of fixed input dimension. Theoretically, YOLO is invariant (flexible) to the size of the input image. However, in practice, we resize our input to a fixed dimension of 416x416 or 608x608. Doing this allows us to process images in batches (images in a batch can be processed in parallel by GPUs), which helps us train the network faster.

Multiple convolutions are applied to the image as it is propagated forward through the network, learning the features, colour, shape and many other aspects of the object. At each layer, we obtain a convolved image, aka the feature map of that layer. The output of a CNN layer is a 3D feature map, where each depth channel encodes a feature of the image or object.


After a certain number of convolutions, when the network stride reaches 32, we get the output feature map; the layer we obtain it from is called a detection layer.
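As a quick sanity check, here is a minimal sketch (plain Python, not tied to any real network) of how the stride determines the feature-map size described above:

```python
def feature_map_size(input_size: int, stride: int) -> int:
    """Spatial size of the feature map produced at a given network stride."""
    return input_size // stride

print(feature_map_size(416, 32))  # 13 -> the 13x13 detection feature map
print(feature_map_size(608, 32))  # 19 -> a 608x608 input yields a 19x19 map
```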

3.1 How is this output feature map interpreted?

The output feature map is the resulting tensor representing the features learned by all the preceding convolutional layers as the image passes from the input layer to the detection/output layer. Here, the output feature map has a shape of 13x13x125.

3.2 Let’s break it down to see what’s inside.

13 is the width and height of the output feature map. Each cell (square) in the 13x13 map can see a region/portion of the input image as a result of the convolutions that took place; this region is called its receptive field. The receptive field of the network is the region of the input image that is visible to a cell (neuron) in the output feature map.

Also, each cell in the 13x13x125 feature map has 5 bounding boxes to detect objects in the image. A cell can detect an object via one of its 5 bounding boxes only if the object falls within the receptive field of that cell (the region/portion of the input image visible to that cell).

fig 2: the input image divided into a 13x13 grid

To do that, YOLO divides the input image into a 13x13 grid. Each cell in the 13x13x125 output feature map corresponds to one of the 13x13 grids of the input image (say, the red cell of the feature map represents the red grid on the dog’s image). Each square on the dog’s image is referred to as a grid, and each neuron in the 13x13 feature map is called a cell.

Now as each cell in 13x13x125 has 5 bounding boxes, these bounding boxes can be localised (used to locate objects) using this 13x13 grid on the input.

If the centre/midpoint of the object falls within a particular grid (the red grid contains the midpoint of the dog), that grid is responsible for detecting the object.
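As a small illustration (a sketch with made-up numbers, not part of the original article), here is how the responsible grid can be found from an object's centre:

```python
def responsible_cell(cx: float, cy: float, img_size: int = 416, grid: int = 13):
    """Return the (row, col) of the grid cell containing the object's centre."""
    cell_size = img_size / grid          # 416 / 13 = 32 pixels per grid
    return int(cy // cell_size), int(cx // cell_size)

# e.g. an object centred at (x=200, y=250) on a 416x416 image
print(responsible_cell(200, 250))  # (7, 6) -> this cell predicts the object
```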

In simpler terms, the bounding boxes which enclose an object or a part of an object are the boxes that will be used to detect/localise the objects in the image. These boxes have a higher confidence score than others.

3.3 Now coming to the 3rd dimension in the tensor, 125. What does this represent?

This is where the actual output predictions are enclosed. As discussed each cell in the 13x13x125 feature map has 5 bounding boxes to make predictions.

Each bounding box is represented by its:

  • centroid-x (tx), centroid-y (ty), bounding-box width (tw), bounding-box height (th).
  • confidence/objectness score (Po), i.e. the probability that the bounding box contains an object.
  • class probabilities (P1, P2, …), i.e. which class the object belongs to (softmax values over each class in the dataset).

In total, this amounts to 4 values for the bounding box coordinates, 1 value for the objectness score, and N class probabilities, where N is the number of classes being trained on (here, N = 20). That adds up to 25 values per bounding box.

Each cell has 5 bounding boxes (YOLOv2), thus that makes 125 values per cell in the feature map.

The depthwise entries d (in the 13x13xd feature map) are governed by this formula:

d = B x (5 + C)

5 — x, y, w, h, objectness score.

C — number of class probabilities (C = 20).

B — number of bounding boxes per cell (B = 5).

Putting these values into the formula gives 5 x (5 + 20) = 125 depthwise entries, i.e. the 13x13x125 tensor shown in the figure.
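A one-line check of this arithmetic in Python (a trivial sketch, using the values assumed above):

```python
B, C = 5, 20            # bounding boxes per cell and number of classes (values assumed above)
depth = B * (5 + C)     # 5 = x, y, w, h, objectness score
print(depth)            # 125, giving the 13x13x125 output tensor
```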

3.4 How many bounding boxes are produced by YOLO?

Each of the 13x13 cells detects objects in the input image via a fixed number of bounding boxes, B. In YOLOv2, B = 5, giving a total of (13x13)x5 = 845 bounding boxes.

In YOLOv3, each cell has 3 bounding boxes. So the total number of bounding boxes from the 13x13 feature map alone would be:

(13x13)x3 = 507 bounding boxes.

From the above diagram, the bounding boxes which contain the dog or part of the dog will be the ones used to detect the dog in this picture. The remaining bounding boxes are discarded as they don’t localise the dog in the picture.

3.5 Prediction at different scales

YOLOv3 produces more than one output feature map for object detection. After reaching a stride of 32, the network produces a 13x13 feature map for an input image of size 416x416. YOLOv3 also produces feature maps at strides of 16 and 8: the layer at stride 16 produces a 26x26 feature map, and the layer at stride 8 a 52x52 feature map (for a 416x416 input). Although the width and height of the feature map vary across stride values, the number of depthwise entries, which encode the bounding box coordinates, confidence score and class probabilities, remains the same.

As the network propagates the image forward, we obtain a 13x13 output feature map at the first detection layer, where the stride is 32. The feature maps after the first detection layer are upsampled by a factor of 2 and concatenated with feature maps of earlier layers having the same spatial size. We obtain a 26x26 output feature map at another detection layer where the stride is 16, and a 52x52 feature map at the detection layer where the stride is 8. Thus, the total number of bounding boxes produced by YOLOv3 for a 416x416 input image is

((13x13)+(26x26)+(52x52))x3 = 10647 bounding boxes per image.
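The same count, computed from the strides (a small sketch of the arithmetic above):

```python
input_size = 416
strides = [32, 16, 8]             # the three YOLOv3 detection layers
boxes_per_cell = 3

total = sum((input_size // s) ** 2 * boxes_per_cell for s in strides)
print(total)                      # (13*13 + 26*26 + 52*52) * 3 = 10647
```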

Now, that is a lot of bounding boxes to detect objects in the image.

But do we need 10647 bounding boxes to detect one single dog in the image?
Before I attempt to answer this question, let us see how YOLO network predicts the dimensions & coordinates of the bounding boxes.

4. Bounding Box Predictions

In the first version of YOLO, the bounding box coordinates were predicted directly as regression values from the output feature map.

The YOLOv1 network attempted to predict the bounding box coordinates and dimensions directly, without any assumption about the shape of the target object. For example, humans in an image fit within a tall rectangular box rather than a square one, yet the network could fail to output a rectangular bounding box for humans in some scenarios. YOLOv1 failed to capture the general aspect ratios and sizes of objects in the data.

When the network was given full responsibility for predicting the bounding box coordinates and dimensions, it produced localization errors, i.e. mistakes in the precise bounding box dimensions. In simple words, the network found it hard to output bounding boxes that were precisely placed on the object.

However, this drawback has been resolved in YOLOv2 by using something called Anchors.

4.1 Anchors

Anchors are pre-defined box sizes derived from the training dataset. Since directly predicted bounding box dimensions (the raw output of YOLO) tend to be unstable, the network instead predicts offsets: log-space transforms that are applied to pre-defined bounding boxes called anchors.

Anchors are prior bounding boxes that capture the typical aspect ratios and sizes of objects in the training data. For example, a car seen from the side has an aspect ratio of roughly 2:1 (w = 2 x h), while viewed from the front it is closer to 1:1 (square); a standing person has an aspect ratio of about 1:3. Similarly, objects in the foreground will have larger bounding boxes, while objects in the background will have smaller ones. In short, anchors are assumptions about the shapes and sizes of objects in the training data.

Anchors are computed via K-Means clustering of all the bounding boxes in the training data. This procedure groups similar bounding boxes into clusters and takes the centroid of each cluster to represent its dimensions.

YOLOv3 generates 9 anchor boxes from k-means clustering. (3 bounding boxes per cell, predictions across 3 different scales 13x13,26x26 & 52x52. Total 3x3 = 9 anchors )
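Below is a minimal sketch of this clustering step. The YOLO authors cluster with an IOU-based distance (d = 1 − IOU) rather than plain Euclidean distance; the sketch follows that idea, and the function names and sample data are made up for illustration:

```python
import numpy as np

def iou_wh(boxes, anchors):
    """IOU between boxes and anchors using width/height only (boxes share a corner)."""
    inter = np.minimum(boxes[:, None, 0], anchors[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], anchors[None, :, 1])
    union = boxes[:, 0:1] * boxes[:, 1:2] + anchors[:, 0] * anchors[:, 1] - inter
    return inter / union

def kmeans_anchors(boxes, k=9, iters=100):
    """Cluster (w, h) pairs with the 1 - IOU distance used for YOLO anchors."""
    anchors = boxes[np.random.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_wh(boxes, anchors), axis=1)    # nearest anchor = highest IOU
        for i in range(k):
            if np.any(assign == i):
                anchors[i] = boxes[assign == i].mean(axis=0)  # update cluster centre
    return anchors

# toy example: (width, height) of ground-truth boxes from a hypothetical dataset
boxes = np.random.uniform(10, 300, size=(500, 2))
print(kmeans_anchors(boxes, k=9))
```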

Anchors or prior bounding boxes are useful because YOLO can learn to make small adjustments to these anchors to correctly detect the box on the object.

4.2 How are these anchor adjustments made?

tx, ty, tw, th are the raw regression values for a bounding box from the detection layer. The sigmoid function is applied to tx and ty, restricting the values to between 0 and 1 and making sure that the bounding box centroid remains within its grid. Cx and Cy are the top-left coordinates of the grid on the image.

pw and ph are the anchor dimensions for that box. By applying the YOLO outputs to anchors rather than predicting boxes from scratch, YOLOv3 reduces localization errors and increases the precision of its predictions.

Note: Anchors can be any size, so they can extend beyond the boundaries of the 13x13 grid cells to detect large objects.
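These adjustments are the bounding-box equations from the YOLOv2/YOLOv3 papers: bx = σ(tx) + Cx, by = σ(ty) + Cy, bw = pw·e^tw, bh = ph·e^th. Here is a small sketch of the decoding step (the grid-cell indices, anchor size and stride handling are illustrative):

```python
import math

def decode_box(tx, ty, tw, th, cx, cy, pw, ph, stride=32):
    """Turn raw detection-layer outputs into box centre and size in pixels."""
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    bx = (sigmoid(tx) + cx) * stride      # centre x: offset within grid (cx, cy), scaled to pixels
    by = (sigmoid(ty) + cy) * stride      # centre y
    bw = pw * math.exp(tw)                # width  = anchor width  scaled by e^tw
    bh = ph * math.exp(th)                # height = anchor height scaled by e^th
    return bx, by, bw, bh

# e.g. raw outputs for the grid at column 6, row 7 with a 116x90 anchor
print(decode_box(0.2, -0.1, 0.3, 0.1, cx=6, cy=7, pw=116, ph=90))
```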

4.3 How to reduce 10647 boxes to a smaller number?

We don’t need multiple boxes to detect a single object in an image. A single bounding box looks elegant. There are two ways to limit the number of predicted boxes.

  1. Thresholding the confidence/objectness score
    The objectness score indicates the probability that the region enclosed by the bounding box contains an object, in other words, the network’s “confidence that the bounding box contains an object”.
    By keeping only the boxes whose confidence/objectness score is, say, 0.5 (50%) or higher, we limit the number of bounding boxes.
  2. Non-Maximum Suppression
    Even after thresholding the confidence/objectness score, many boxes will persist near the region where the object is located. To reduce them to a single box, we apply a process called Non-Maximum Suppression.

Non-Maximum Suppression is a widely used strategy for arriving at a single entity out of multiple overlapping entities. In YOLOv3, the IOU metric is used in NMS.

The IOU (Intersection over Union) metric is the ratio of the area of overlap to the area of union between two bounding boxes.

Starting from the box with the highest confidence score, any other box whose IOU with it is equal to or greater than the NMS threshold is suppressed; the process is repeated with the remaining boxes until only non-overlapping boxes are left.
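A compact sketch of score thresholding followed by NMS, using boxes in (x1, y1, x2, y2) format (the helper names, thresholds and sample boxes are illustrative, not from the article):

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, conf_thresh=0.5, iou_thresh=0.45):
    """Keep the highest-scoring box, drop boxes that overlap it too much, repeat."""
    candidates = [(s, b) for s, b in zip(scores, boxes) if s >= conf_thresh]
    candidates.sort(key=lambda x: x[0], reverse=True)
    kept = []
    while candidates:
        best_score, best_box = candidates.pop(0)
        kept.append(best_box)
        candidates = [(s, b) for s, b in candidates if iou(best_box, b) < iou_thresh]
    return kept

# two overlapping detections of the same dog plus one low-confidence box
boxes = [(50, 60, 200, 220), (55, 65, 205, 225), (300, 300, 350, 350)]
scores = [0.9, 0.8, 0.3]
print(nms(boxes, scores))  # only the 0.9 box survives
```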

The final output of a trained YOLOv3 model.

5. YOLOv3 Loss Function

YOLOv3 uses sum-squared error between the predictions and the ground truth to calculate the loss.

The loss function is a combination of

  1. Classification loss.
  2. Localization loss. (error between the predicted bounding box and ground truth box).
  3. Confidence loss.
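Written out, this is the sum-squared-error loss from the original YOLO paper, reconstructed here in LaTeX (the notation is explained below):

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}
      \left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right] \\
 &+ \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}
      \left[\left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2+\left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2\right] \\
 &+ \sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left(C_i-\hat{C}_i\right)^2
  + \lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}\left(C_i-\hat{C}_i\right)^2 \\
 &+ \sum_{i=0}^{S^2}\mathbb{1}_{i}^{obj}\sum_{c \in classes}\left(p_i(c)-\hat{p}_i(c)\right)^2
\end{aligned}
```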

∑^B represents the sum of the loss values over all B bounding boxes in a cell, w.r.t. their centroid-x, centroid-y, width, height and confidence score.

∑^S² represents the sum of the loss values over all the cells in the output feature map (for the 13x13 map, S = 13, giving 169 cells).

1^obj is 1 when an object is present in the cell and 0 when there is no object.

1^noobj is 1 when there is no object in the cell and 0 when there is an object in the cell.

The λs are constant weights; λ_coord (the weight on the coordinate terms) is the largest, so that the loss focuses more on accurate localization.

5.1 Classification Loss

If the cell contains an object, the classification loss is the squared error of the conditional class probabilities for each class.

5.2 Localization loss

Localization loss is calculated from the errors in the bounding box centroids and in their widths and heights.

We use width & height inside square-root function to penalize smaller bounding boxes, as we need precise bounding boxes for smaller objects than bigger (author’s call).

5.3 Confidence loss

The confidence loss measures the error in the objectness score, i.e. the confidence that an object is present within the bounding box.

We calculate confidence loss twice, once with 1^obj & the other with 1^noobj. This is done to make sure we reduce the confidence when no object is detected & increase confidence when an object is detected.

The total loss value from the network is the sum of classification loss, localization loss & confidence loss.

6. Conclusion

That’s a wrap for the theory of the YOLO algorithm. For further reading, see the original YOLO papers and Darknet, the most popular YOLO framework.
