YOLO v1: Part 2

Divakar Kapil
5 min read · May 10, 2018


This is the second post on YOLO v1, following YOLO v1: Part 1. In this post I will cover the working of the neural network designed to accomplish the task of object detection. First, a quick recap of the previous post, which covered the architecture and the benefits of the network:

  1. YOLO consists of a single CNN which makes it very fast during inference
  2. It treats object detection as a regression problem instead of a classification problem
  3. It looks at the entire input image to learn globally rather than locally
  4. The input image needs to have dimensions of 448x448

With these main points in mind, let’s dive into the working of the network.

WORKING

The single neural network unifies all the components of object detection. It uses features from the entire image to predict each bounding box, and it predicts all bounding boxes across all classes simultaneously since it looks at the input picture only once. YOLO has been trained to identify 20 classes of objects.

The input image fed to the network is divided into a grid of dimension S x S. Each grid cell is responsible for identifying whether it contains the center of an object belonging to any of the 20 classes. If a grid cell contains the center of an object, it predicts B bounding boxes to enclose that object. In addition to producing B bounding boxes, every grid cell is also responsible for producing a confidence score for each of those boxes. Bounding boxes are a mechanism for localising objects in an image or video. Note that a bounding box can enclose only 1 object at a time.
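To make the grid-cell assignment concrete, here is a minimal Python sketch (my own illustration, not code from the paper) that maps a normalized object center to the cell responsible for it, assuming S = 7 and center coordinates normalized to [0, 1]:

```python
S = 7  # grid size

def responsible_cell(cx, cy, s=S):
    """Return the (row, col) of the grid cell containing the object center.

    cx, cy are the object's center coordinates, normalized by the
    image width and height respectively.
    """
    col = min(int(cx * s), s - 1)  # clamp so cx == 1.0 stays inside the grid
    row = min(int(cy * s), s - 1)
    return row, col

# Example: an object centered at (0.52, 0.31) falls into cell (2, 3).
print(responsible_cell(0.52, 0.31))  # -> (2, 3)
```

Only this one cell is responsible for detecting the object, which is what lets the network predict every box in a single forward pass.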

A bounding box is described by 4 values: the x and y coordinates of the center of the object, measured relative to the grid cell (whose origin is typically defined as its top left corner), and the height and width of the box, which are predicted relative to the whole image. Together with a confidence score, each bounding box therefore consists of 5 parameters:

y = [pc, bx, by, bh, bw]

a) pc = Po x IOU : Confidence score of the box

The confidence score of a bounding box tells us how confident the model is that the box contains an object and how accurate it thinks the box it created is. ‘pc’ is computed by multiplying the probability of an object being present in the box (Po) by the intersection over union (IOU) between the predicted box and the ground truth.

b) bx and by : coordinates of the centre of the object

The center of the object detected by the grid cell has an x and y coordinate measured from the top left corner of the grid cell, which is the origin.

c) bh and bw : height and width of the bounding box

The height and width of the bounding box predicted by the grid cell, measured relative to the whole image.
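Since both the confidence score above and the anchor-box assignment discussed later hinge on IOU, here is a minimal Python sketch of pc = Po x IOU. The corner-style (x1, y1, x2, y2) box format and the function names are my own choices for illustration, not the paper’s:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def confidence(p_object, pred_box, gt_box):
    """pc = Po x IOU(predicted box, ground truth)."""
    return p_object * iou(pred_box, gt_box)

# Example: a prediction that overlaps the ground truth fairly well.
pred = (0.2, 0.2, 0.6, 0.6)
gt = (0.25, 0.25, 0.65, 0.65)
print(confidence(1.0, pred, gt))  # IOU ~ 0.62, so pc ~ 0.62
```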

Apart from the bounding boxes and their confidence scores, each grid cell also produces ‘C’ class probabilities, which are essentially the conditional probabilities of the type of object present in the bounding box, denoted as Pr(class|object). Note that a grid cell produces ‘C’ class probabilities because the network is trained to identify these ‘C’ classes; however, only 1 set of class probabilities is predicted per cell. That is, a grid cell in YOLO v1 is capable of predicting only 1 object, e.g. a cat or a dog or a car. It cannot predict multiple objects, e.g. a grid cell cannot predict that it contains both a cat and a dog at the same time. This is one major limitation of YOLO v1: it is unable to localise and identify more than 1 object per grid cell.

So, the dimension of the output produced by the SxS grid is:

S*S*(B*5 + C)

For example, when the network is evaluated on the PASCAL VOC dataset, it creates a grid of size 7x7. With each grid cell predicting 2 bounding boxes and producing 20 class probabilities, the output is a tensor of dimension 7x7x30 [1].
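As a quick sanity check on that arithmetic, plugging S = 7, B = 2 and C = 20 into S*S*(B*5 + C):

```python
# The paper's PASCAL VOC settings: grid size, boxes per cell, classes.
S, B, C = 7, 2, 20

per_cell = B * 5 + C     # 2 boxes x 5 parameters + 20 class probabilities
print(per_cell)          # -> 30
print((S, S, per_cell))  # -> (7, 7, 30), the 7x7x30 output tensor [1]
```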

Note: The reason that a grid cell predicts only 1 set of class probabilities is that fully connected layers are used to regress the bounding boxes. One regression head can only regress a bounding box for 1 class of object at a time. This means that a model using only 1 fully connected layer (1 regression head) will be unable to produce accurate results if there exist multiple objects of different classes anywhere in the image. The single layer will struggle to regress the bounding boxes due to interference between the distinct class objects trying to use the same layer.

This problem can be partially fixed by training 1 regressor head for every class, that is, 20 regressor heads for 20 classes. This directs each class of object to its own fully connected layer and thus solves the problem of interference faced by a single fully connected layer due to distinct class objects. However, there still exists the problem of interference due to multiple instances of the same class, for example, if the image has many dogs. Since the class dog is assigned only 1 fully connected layer to regress the bounding box, it will face problems as many dogs will try to share the same layer.

Hence, this still leaves us with the problem of spatial interference of boxes for multiple objects of the same class, which reduces localisation accuracy. The concept of anchor boxes is used to predict multiple objects per grid cell; this is used by YOLO v2.

In short, the solution is to train each fully connected layer (regression head) to consider only a limited region of the image rather than the entire image (which is the cause of the interference). To achieve this, anchor boxes are used. When there is a high IOU between the ground truth bounding box and an anchor box, the regression head associated with that anchor box is given the responsibility of regressing the final bounding box. Hence, the anchor box limits the region considered by a regression head, which results in less interference from objects present outside that region. There still remains the problem of detecting multiple objects of the same class with very high overlap with each other.
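Below is a hedged Python sketch of that assignment idea: a small set of made-up anchor shapes is compared against a ground truth box by IOU, and the anchor with the highest overlap wins responsibility for the object. This is YOLO v2 behaviour, shown here only for intuition; the shape-only matching (comparing width/height pairs as if the boxes shared a center) is a common simplification, not necessarily the exact procedure:

```python
def best_anchor(gt_wh, anchors):
    """Pick the index of the anchor whose shape best matches the ground truth.

    Boxes are compared as (width, height) pairs centered at the origin,
    so only shape, not position, drives the assignment.
    """
    def shape_iou(a, b):
        inter = min(a[0], b[0]) * min(a[1], b[1])
        union = a[0] * a[1] + b[0] * b[1] - inter
        return inter / union

    return max(range(len(anchors)), key=lambda i: shape_iou(gt_wh, anchors[i]))

anchors = [(0.1, 0.1), (0.3, 0.6), (0.8, 0.4)]  # hypothetical anchor shapes
print(best_anchor((0.28, 0.55), anchors))       # -> 1 (the tall, thin anchor)
```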

The following image summarizes the working of the network.

Working of YOLO v1 [2]

The next part of the series, YOLO v1: Part 3, will conclude with an explanation of the cost function and the limitations of the network. Stay tuned for the final part of the series :)

If you like this post or found it useful please leave a clap!

If you see any errors or issues in this post, please contact me at divakar239@icloud.com and I will rectify them.

References:

[1] J. Redmon, S. Divvala, R. Girshick and A. Farhadi, “You Only Look Once: Unified, Real-Time Object Detection”, https://arxiv.org/pdf/1506.02640.pdf

[2] T. Jeon, “PR-12: You Only Look Once (YOLO): Unified Real-Time Object Detection”, https://www.slideshare.net/TaegyunJeon1/pr12-you-only-look-once-yolo-unified-realtime-object-detection

