YOLO for Object Detection, Architecture Explained!

Sairaj Neelam · Published in Analytics Vidhya · 11 min read · Aug 29, 2021
Detections using YOLOv3

In the previous article, Introduction to Object Detection with RCNN Family Models, we looked at the R-CNN family of models, which paved the way for single-stage object detectors:

  1. YOLO(You Only Look Once)
  2. Single Shot Multibox Detector

After reading this article you will know:

  • How YOLO works
  • Challenges in YOLO
  • Limitations in YOLO
  • YOLOv3 architecture
  • How to implement YOLOv3 using OpenCV in python

Let’s start.

You Only Look Once (YOLO):

YOLO was introduced in the paper You Only Look Once: Unified, Real-Time Object Detection by Joseph Redmon, Santosh Divvala, Ross Girshick and Ali Farhadi in 2016.

Previous methods for object detection, like R-CNN and its variations, used a pipeline to perform this task in multiple steps. This can be slow to run and also hard to optimize, because each individual component must be trained separately.

In the case of YOLO, the name itself says a lot: the network looks at the entire image just once. Let us explore how YOLO works.

Let us first understand how YOLO encodes its output,

1. The input image is divided into an NxN grid of cells. For each object present in the image, one grid cell is responsible for predicting that object.

2. Each grid cell predicts ‘B’ bounding boxes and ‘C’ class probabilities. Each bounding box consists of 5 components: (x, y, w, h, confidence)

(x, y) = coordinates of the center of the box

(w, h) = width and height of the box

confidence = a score reflecting the presence or absence of an object inside the box

Let us see this with an example.

The grid cell that contains the center of an object is the cell responsible for detecting that object.
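As a rough sketch of this encoding (using the original YOLO setup on PASCAL VOC, i.e. a 7x7 grid with B=2 boxes and C=20 classes; the filled-in values below are purely hypothetical):

import numpy as np

S, B, C = 7, 2, 20                     # grid size, boxes per cell, number of classes
output = np.zeros((S, S, B * 5 + C))   # 7 x 7 x 30 prediction tensor

# For the cell (row, col) that contains an object's center, the target encodes
# (x, y, w, h, confidence) for each box, followed by the C class probabilities.
row, col = 3, 4                                      # hypothetical cell holding the object center
output[row, col, :5] = [0.5, 0.5, 0.2, 0.3, 1.0]     # first box: x, y, w, h, confidence
output[row, col, B * 5 + 7] = 1.0                    # one of the C class probabilities set to 1
print(output.shape)                                  # (7, 7, 30)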

Challenges in YOLO:

Question 1. How do we tell if the object detection algorithm is working well?

Solution: We already saw this in the previous article on the R-CNN family of models: object localization is evaluated with a metric called Intersection over Union (IOU)

IOU = Area of Intersection / Area of Union
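A small Python helper makes this metric concrete (a minimal sketch, with boxes given as corner coordinates):

def iou(box_a, box_b):
    # Intersection over Union for boxes given as (x1, y1, x2, y2)
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 25 / 175 ≈ 0.14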

Question 2. There can be multiple bounding boxes for each object in an image, so how to deal with it?

Solution: We use Non-Max Suppression (NMS), which is the way to make sure that the algorithm detects each object only once.

We will see how this works,

Because we run the classification and localization step on every grid cell, it is possible that many cells report a high ‘Pc’, the probability that an object is present in that cell.

So when we run the algorithm, we might end up with multiple detections for the same object.

What NMS does is clean up the unwanted detections so that we end up with one detection per object.

How does this NMS work?

1. First, it looks at the probabilities (Pc) associated with each detection of a particular object

2. It takes the largest ‘Pc’, which is the most confident detection of that object

3. Having done that, NMS looks at all the remaining bounding boxes, picks those that have a high Intersection over Union (IOU) with the highest-‘Pc’ box, and suppresses them

4. Then we look at the remaining bounding boxes, find the one with the highest ‘Pc’, and NMS again suppresses the remaining boxes that have a high IOU with it

By doing this for every object, we end up with exactly one bounding box per object.
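Here is a minimal sketch of this greedy procedure in Python (it reuses the iou() helper from the earlier sketch; the threshold value is an assumption):

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    # boxes: list of (x1, y1, x2, y2), scores: the Pc confidence of each box
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # most confident remaining detection
        keep.append(best)
        # suppress remaining boxes that overlap the chosen box too much
        order = [i for i in order if iou(boxes[i], boxes[best]) < iou_threshold]
    return keep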

So for this example:

1. It takes largest Pc which is 0.9 in this case

2. It checks the IOU of all the remaining bounding boxes (i.e. 0.6 and 0.7 for Car 1, and 0.8 and 0.7 for Car 2)

3. NMS suppresses 0.6 and 0.7 for Car 1 because they have a high IOU with the Pc=0.9 box, so we are left with only one bounding box for Car 1, highlighted in the image.

4. Next, among the remaining bounding boxes the highest Pc is 0.8 for Car 2, and we again check the IOU of the remaining boxes (i.e. 0.9 for Car 1 and 0.7 for Car 2)

5. NMS suppresses 0.7 because it has a high IOU with the Pc=0.8 box, and we get only one bounding box for Car 2 as well.

Question 3. What if we have multiple objects in a single cell? i.e. what if we have overlapping objects and midpoint of both objects lie in single grid cell?

Solution: We use multiple anchor boxes for solving this,

Each cell outputs the vector (Pc, x, y, h, w, c1, c2, c3), which has shape (8, 1), i.e. 8 rows and 1 column. Here c1, c2, c3 are the different classes, say person, car and bike.

So the length of this output vector changes depending on the number of classes.

Without anchor boxes, a cell cannot output 2 detections, so it would have to pick one of the two to output.

The idea of anchor boxes is to predefine 2 different shapes, called Anchor Box 1 and Anchor Box 2. With 2 anchor boxes, each cell can make two predictions.

In general we can use more anchor boxes, to capture the variety of shapes the objects have.

So, for two anchor boxes and 3 classes, what does the output look like?

The output is a vector of size (16, 1): (Pc1, x1, y1, h1, w1, c1, c2, c3, Pc2, x2, y2, h2, w2, c1, c2, c3)

Let’s say c1=person, c2=car, c3=bike

In our example we have a person and a car, so the output will look like the sketch below.
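A minimal NumPy sketch of this layout (the numbers are purely hypothetical; only the positions in the vector matter):

import numpy as np

# One cell's output with 2 anchor boxes and 3 classes (c1=person, c2=car, c3=bike):
# [Pc1, x1, y1, h1, w1, c1, c2, c3, Pc2, x2, y2, h2, w2, c1, c2, c3]
cell_output = np.zeros((16, 1))

# Anchor box 1 (a tall, person-like shape) detects the person:
cell_output[0:8, 0] = [1, 0.4, 0.6, 0.9, 0.3, 1, 0, 0]
# Anchor box 2 (a wide, car-like shape) detects the car:
cell_output[8:16, 0] = [1, 0.5, 0.7, 0.4, 0.8, 0, 1, 0]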

Combining all these steps we get our YOLO algorithm:

Limitations of YOLO:

YOLO can only predict a limited number of bounding boxes per grid cell, 2 in the original research paper. And though that number can be increased, only one class prediction can be made per cell, limiting the detections when multiple objects appear in a single grid cell. Thus, it struggles with groups of small objects, such as flocks of birds, or with multiple small objects of different classes.

Architecture of YOLOv3:

YOLOv3 uses a variant of Darknet, which is originally a 53-layer network trained on ImageNet. For the detection task, 53 more layers are stacked on top of it, giving YOLOv3 a 106-layer fully convolutional architecture.

Detections are made at three layers: the 82nd, 94th and 106th.

Convolution layers in YOLOv3

  • It contains 53 convolutional layers, each followed by a batch normalization layer and a Leaky ReLU activation.
  • A convolutional layer convolves multiple filters over the image and produces multiple feature maps.
  • No form of pooling is used; a convolutional layer with stride 2 downsamples the feature maps instead, as sketched below.
  • This helps prevent the loss of low-level features often attributed to pooling.
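As an illustration of that Conv → BatchNorm → LeakyReLU building block, here is a minimal sketch using PyTorch (PyTorch is not used elsewhere in this article; this is purely illustrative):

import torch.nn as nn

def darknet_conv(in_ch, out_ch, kernel_size=3, stride=1):
    # Conv -> BatchNorm -> LeakyReLU, the basic building block of Darknet-53.
    # A stride of 2 is used instead of pooling to downsample the feature maps.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding=kernel_size // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1, inplace=True),
    )

downsample = darknet_conv(32, 64, stride=2)   # halves the spatial resolution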

Now let’s look at what the input looks like.

The input is batch of images of shape (n, 416, 416, 3) where,

n=number of images, (416,416) = (width, height) and 3 channels (RGB).

The width and height can be changed to any number divisible by 32. These numbers (width, height) are also called the input network size.

Why divisible by 32?

We will see this in a moment.

Increasing the input resolution might increase accuracy after training. Input images can be of any size; we don’t need to resize them before feeding them to the network, as they are all resized to the input network size.

How does the network detect objects?

YOLOv3 makes detections at 3 different places in the network: the 82nd, 94th and 106th layers. The network downsamples the input image by factors of 32, 16 and 8 at the 82nd, 94th and 106th layers respectively. These factors are called the strides of the network, and they show how much smaller the outputs at these 3 places are than the network input.

For a network input of (416, 416):

At the 82nd layer the stride is 32, the output size is 13x13, and it is responsible for detecting large objects

At the 94th layer the stride is 16, the output size is 26x26, and it is responsible for detecting medium objects

At the 106th layer the stride is 8, the output size is 52x52, and it is responsible for detecting small objects

This is the reason why the network input must be divisible by 32: if it is divisible by 32, then it is also divisible by 16 and 8.
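A quick check of the arithmetic:

net_size = 416
for stride in (32, 16, 8):
    grid = net_size // stride
    print(f"stride {stride}: {grid} x {grid} grid")
# stride 32: 13 x 13 grid  -> large objects
# stride 16: 26 x 26 grid  -> medium objects
# stride 8:  52 x 52 grid  -> small objects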

Here we can see why the 13x13, 26x26 and 52x52 feature maps detect large, medium and small objects respectively.

Now let us see what detection kernels are.

To produce its outputs, YOLOv3 applies 1x1 kernels (filters) at the three output layers of the network. The 1x1 kernels are applied to the downsampled feature maps, so the outputs keep the same spatial dimensions: 13x13, 26x26 and 52x52.

The depth of the detection kernels is calculated by the following formula:

(b*(5+c)) where b = number of bounding boxes and c = number of classes, 80 for the COCO dataset

Each bounding box has (5 + c) attributes

YOLOv3 predicts 3 bounding boxes for every cell of each of these 3 feature maps (i.e. the 13x13, 26x26 and 52x52 feature maps).

With b = 3 and c = 80, we get (3*(5+80)) = 255 attributes.

So each feature map produced by the detection kernels at the 3 output layers of the network has one more dimension, a depth that holds the 255 bounding box attributes for the COCO dataset. The shapes of these feature maps are (13x13x255), (26x26x255) and (52x52x255), as computed below.
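In code form:

b, c = 3, 80                 # bounding boxes per cell, COCO classes
depth = b * (5 + c)          # 3 * 85 = 255
for grid in (13, 26, 52):
    print((grid, grid, depth))   # (13, 13, 255), (26, 26, 255), (52, 52, 255)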

Let us now look at grid cells; simply put, they are the detection cells.

We already know that YOLOv3 predicts 3 bounding boxes for every cell of the feature maps. The task of YOLOv3 is to identify the cell that contains the center of the object.

Training YOLOv3,

During training there is one ground-truth bounding box responsible for each object, so we need to know which cell that bounding box belongs to.

For this YOLOv3, makes predictions at 3 scales, for strides 32, 16 and 8.

So, the cell which has center of the object is responsible for detecting that object.

Anchor boxes are used to predict bounding boxes. YOLOv3 uses predefined bounding boxes called anchors (or priors), and these anchors are also used to calculate the real width and real height of the predicted bounding boxes.

In total 9 anchor boxes are used, 3 for each scale: the three biggest anchors for the first scale, the next three for the second scale, and the last three for the third. This means that at each output layer, every grid cell of the feature map can predict 3 bounding boxes using its 3 anchor boxes.

These anchors are calculated by applying K-Means clustering to the bounding boxes of the training set.

The widths and heights of the anchors are:

For, Scale 1: (116x90), (156x198), (373x326)

Scale 2: (30x61), (62x45), (59x119)

Scale 3: (10x13), (16x30), (33x23)

So, for Scale 1 we have 13x13x3 = 507 bounding boxes

For Scale 2 we have 26x26x3 = 2,028 bounding boxes

For Scale 3 we have 52x52x3 = 8,112 bounding boxes

In total, YOLOv3 predicts 10,647 boxes.
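A quick check of that total:

total = sum(grid * grid * 3 for grid in (13, 26, 52))
print(total)   # 507 + 2028 + 8112 = 10647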

To predict the real height and width of a bounding box, YOLOv3 computes offsets, also called the log-space transform. Let’s see that now.

To predict the center coordinates of a bounding box (bx, by), YOLOv3 passes the raw outputs (tx, ty) through a sigmoid function.

Based on these equations we get the center coordinates, width and height of the bounding boxes.
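A sketch of this standard YOLOv3 decoding in Python (here cx, cy are the grid-cell offsets, pw, ph are the anchor width and height, and stride maps grid units back to input pixels):

import math

def decode_box(tx, ty, tw, th, cx, cy, pw, ph, stride):
    # Decode raw network outputs into a box center and size (in input-image pixels)
    sigmoid = lambda v: 1 / (1 + math.exp(-v))
    bx = (sigmoid(tx) + cx) * stride     # center x
    by = (sigmoid(ty) + cy) * stride     # center y
    bw = pw * math.exp(tw)               # width from the log-space offset
    bh = ph * math.exp(th)               # height from the log-space offset
    return bx, by, bw, bh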

All the redundant bounding boxes among these 10,647 are then suppressed using NMS.

Implementing YOLOv3 for Object Detection

We will be using the Open Source Computer Vision Library (OpenCV).

# We will first import required libraries

# (Line 06) We use the function “cv.dnn.readNet()” to read our weights file and configuration file (the weights file contains pretrained weights trained on the COCO dataset, and the cfg file, known as the configuration file, describes the YOLOv3 network architecture)

# (Line 08, 09) We store the confidence threshold and non-max suppression threshold as constants, and (Line 11–13) read the coco.names file to extract the object names into a list; we have 80 classes in total

# (Line 16) Using “cv.VideoCapture()” we read our video

# (Line 18) Now we loop over the captured video and do the object detection task

# (Line 19) “cap.read()” reads the video frame by frame and returns two outputs: a boolean (True/False, depending on whether a frame was read) and the frame itself

# (Line 23) We cannot pass the frames directly to the network: we need to normalize them and resize them to (416, 416), the input size YOLOv3 expects. OpenCV reads images in BGR format, so we also need to swap the Blue and Red channels, and then pass the frames to the network

# (Line 26–27) We also need to get the output layer names and pass them to the forward pass

# But we are not done yet: to visualize the result we need to extract the bounding box, the confidence with which the object was predicted, and the object’s class

# (Line 34–35) We use two for loops to extract the bounding boxes, confidences and class ids. (Line 35–37) Each detection contains 85 parameters: the first 5 are the bounding box and confidence, followed by the probabilities for the 80 classes

# The first loop iterates over the 3 output layers, and in the second loop we find the objects detected at each output layer

# (Line 38–49) Based on the confidence threshold, we extract the center of the bounding box and its width and height and append them to the empty lists we created

# (Line 44–45) We extract the top-left coordinates of the bounding box

# Now we have all the bounding boxes, but we need only one bounding box per object, so we pass the bounding boxes to the NMS function from OpenCV

# (Line 51) Based on the confidence threshold and the NMS threshold, the redundant boxes are suppressed and we keep the box that detects the object correctly

# (Line 52–58) We fetch the required coordinates of the bounding boxes, their confidences and labels, and assign a color to each

# (Line 60–61) We draw the rectangular bounding boxes for the detected objects and put the label text above each box

# Finally, we show the detected objects using cv.imshow()
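The code listing that the line numbers above refer to is not reproduced here, so below is a minimal sketch that follows the same steps. It assumes yolov3.weights, yolov3.cfg, coco.names and a video file named input.mp4 are in the working directory (the file names are placeholders), and its line numbers will not match the ones mentioned above.

import cv2 as cv
import numpy as np

CONF_THRESHOLD = 0.5   # assumed confidence threshold
NMS_THRESHOLD = 0.4    # assumed non-max suppression threshold

# Read the 80 COCO class names into a list
with open("coco.names") as f:
    classes = [line.strip() for line in f]

# Load the network from the configuration file and the pretrained weights
net = cv.dnn.readNet("yolov3.weights", "yolov3.cfg")
output_layers = net.getUnconnectedOutLayersNames()

cap = cv.VideoCapture("input.mp4")   # or 0 for a webcam

while True:
    ret, frame = cap.read()          # ret is False when no frame could be read
    if not ret:
        break
    h, w = frame.shape[:2]

    # Normalize, resize to the network input size and swap BGR -> RGB
    blob = cv.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    outputs = net.forward(output_layers)

    boxes, confidences, class_ids = [], [], []
    for output in outputs:           # one output per detection layer
        for detection in output:     # each detection holds 85 values
            scores = detection[5:]
            class_id = int(np.argmax(scores))
            confidence = float(scores[class_id])
            if confidence > CONF_THRESHOLD:
                cx, cy, bw, bh = (detection[:4] * np.array([w, h, w, h])).astype(int)
                x, y = int(cx - bw / 2), int(cy - bh / 2)   # top-left corner
                boxes.append([x, y, int(bw), int(bh)])
                confidences.append(confidence)
                class_ids.append(class_id)

    # Suppress redundant, overlapping boxes
    indices = cv.dnn.NMSBoxes(boxes, confidences, CONF_THRESHOLD, NMS_THRESHOLD)
    for i in np.array(indices).flatten():
        x, y, bw, bh = boxes[i]
        label = f"{classes[class_ids[i]]}: {confidences[i]:.2f}"
        cv.rectangle(frame, (x, y), (x + bw, y + bh), (0, 255, 0), 2)
        cv.putText(frame, label, (x, y - 5), cv.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)

    cv.imshow("YOLOv3 detections", frame)
    if cv.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv.destroyAllWindows()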

Summary

In this post, you got a gentle introduction to YOLO and saw how to implement YOLOv3 for object detection.

Specifically, you learned:

  • How YOLO works, how to deal with the challenges in YOLO, and its limitations.
  • The architecture of YOLOv3.
  • The code for the object detection task using the OpenCV library.

References

https://arxiv.org/abs/1506.02640 -> paper for YOLO (You Only Look Once: Unified, Real-Time Object Detection)

Andrew Ng’s YouTube video on YOLO object detection
