YOLO v1: Part 3

Divakar Kapil
Escapades in Machine Learning
4 min read · May 23, 2018

Introduction

This is the final part in the YOLOv1 series, following YOLOv1: Part 2. The previous part covered the working of the neural network, detailing the architecture and the convolution process. This part covers the details of the loss function and some limitations of the YOLO architecture. A general heuristic is to opt for MSE (mean squared error) for regression models and cross-entropy loss for classification models. YOLO treats object detection as a regression problem, hence it uses an SSE (sum squared error).

Loss Function

The function is a composition of multiple SSE terms. During training, this loss function is optimized to improve the predictions of the network. SSE has a benefit over other loss functions in that it is easy to use and optimize. It is noteworthy that loss functions are usually chosen or designed with ease of optimization in mind. For example, the cross-entropy loss is a negative logarithmic function, which is smooth and convex; both properties make it easier and quicker to optimize, improving training time and results. Following is the formula for the SSE-based loss used by YOLO:

Fig 1. YOLO loss function [1]
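Since the figure may not render here, the loss as written in the paper [1] is:

\begin{aligned}
\mathcal{L} ={} & \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right] \\
& + \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(\sqrt{w_i}-\sqrt{\hat{w}_i})^2+(\sqrt{h_i}-\sqrt{\hat{h}_i})^2\right] \\
& + \sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}(C_i-\hat{C}_i)^2
  + \lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}(C_i-\hat{C}_i)^2 \\
& + \sum_{i=0}^{S^2}\mathbb{1}_{i}^{obj}\sum_{c\in classes}(p_i(c)-\hat{p}_i(c))^2
\end{aligned}

Here \mathbb{1}_{ij}^{obj} is 1 if box j of cell i is responsible for an object (0 otherwise), \mathbb{1}_{i}^{obj} is 1 if any object appears in cell i, and the paper sets \lambda_{coord} = 5 and \lambda_{noobj} = 0.5 [1].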

There are two points worth noting in the above formula:

  1. Differential weights are used during training for confidence predictions from boxes that contain objects and from boxes that are empty: the confidence loss for empty boxes is scaled down by λ_noobj. In addition, classification error is only penalized for grid cells that contain an object; empty cells contribute no classification loss. Thus, whether an object is present in a cell plays an important role in deciding which predictions factor into the loss function.
  2. The square roots of the predicted boxes’ width and height are used in the loss rather than the raw dimensions. Without this, the loss would treat a deviation in a large box and the same deviation in a small box identically, which is not ideal [1].

Also, it is important to note that the loss function only penalizes classification error if an object is present in the grid cell, and it only penalizes the bounding box coordinate error if that predictor is “responsible” for the ground truth label, that is, it has the highest IOU of any predictor in that grid cell [1]. A sketch of this responsibility rule follows.
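As an illustration, here is a minimal NumPy sketch of that rule, assuming boxes are given as (x_center, y_center, w, h) in a common scale; the helper names are my own, not the paper’s.

import numpy as np

def iou(box_a, box_b):
    # Intersection over union of two (x_center, y_center, w, h) boxes.
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))   # overlap width
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))   # overlap height
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def responsible_box(pred_boxes, gt_box):
    # Index of the predictor with the highest IOU against the ground truth.
    return int(np.argmax([iou(p, gt_box) for p in pred_boxes]))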

One can see from the two points mentioned above that the loss function does indeed have a few issues. The issues and the suggested fixes are listed as follows:

  1. The loss function weighs localization and classification errors equally, which is not ideal. This can be demonstrated as follows: consider an image or video frame in which most grid cells contain no object. The confidence scores for those cells are pushed towards zero, and their gradient can overpower the gradient from the few cells that do contain objects. This can lead to model instability. The fix used in the paper is to increase the loss from bounding box coordinate predictions (λ_coord = 5) and decrease the loss from confidence predictions for boxes that don’t contain any object (λ_noobj = 0.5), that is, to weigh the localization loss more heavily than the no-object confidence loss [1].
  2. The loss function equally weighs errors in large and small bounding boxes, because the sum of squared errors penalizes a deviation independent of the size of the predicted box. This is not ideal: a small deviation in a large box affects the prediction far less than the same deviation in a small box. The paper partially addresses this by predicting the square root of the bounding box width and height instead of the raw dimensions, so the same absolute error produces a smaller loss for larger boxes [1]. A sketch of the full loss with both fixes follows this list.
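To make the pieces concrete, here is a minimal NumPy sketch of the loss terms described above. The tensor layout, helper names, and mask construction are my own illustrative assumptions; only the five terms and the λ weights come from the paper [1].

import numpy as np

S, B, C = 7, 2, 20                       # grid size, boxes per cell, classes [1]
LAMBDA_COORD, LAMBDA_NOOBJ = 5.0, 0.5    # loss weights from the paper [1]

def yolo_v1_loss(pred_boxes, true_boxes, pred_conf, true_conf,
                 pred_cls, true_cls, obj_mask, resp_mask):
    # pred_boxes, true_boxes: (S, S, B, 4) as (x, y, w, h), with w, h >= 0
    # pred_conf, true_conf:   (S, S, B) box confidence scores
    # pred_cls, true_cls:     (S, S, C) class probabilities per cell
    # obj_mask:  (S, S)    1 if an object's centre falls in the cell
    # resp_mask: (S, S, B) 1 for the highest-IOU box in an object cell
    noobj_mask = 1.0 - resp_mask

    # Localization: only the responsible box in each object cell is penalized.
    xy_err = np.sum(resp_mask[..., None] *
                    (pred_boxes[..., :2] - true_boxes[..., :2]) ** 2)
    # Square roots soften the penalty for the same deviation in large boxes.
    wh_err = np.sum(resp_mask[..., None] *
                    (np.sqrt(pred_boxes[..., 2:]) - np.sqrt(true_boxes[..., 2:])) ** 2)

    # Confidence: empty boxes are down-weighted by LAMBDA_NOOBJ.
    conf_obj = np.sum(resp_mask * (pred_conf - true_conf) ** 2)
    conf_noobj = np.sum(noobj_mask * (pred_conf - true_conf) ** 2)

    # Classification: only cells that contain an object contribute.
    cls_err = np.sum(obj_mask[..., None] * (pred_cls - true_cls) ** 2)

    return (LAMBDA_COORD * (xy_err + wh_err)
            + conf_obj + LAMBDA_NOOBJ * conf_noobj + cls_err)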

Limitations

The first version of YOLO suffers from some limitations: the design choices that make it fast also cost it some accuracy. The main source of errors is localization. Following are the limitations:

  1. The model imposes strong spatial constraints on the bounding box predictions, because the bounding boxes are regressed by a fully connected layer. As mentioned in YOLO v1: Part 2, each grid cell predicts two bounding boxes but only one class, so a cell can effectively represent only one object. When multiple objects fall into the same cell, their bounding boxes share the same prediction slots, which confuses the fully connected layer. Thus, the model is limited in the number of nearby objects it can predict.
  2. The model downsamples the input image to an S×S grid in which every grid cell is responsible for making bounding box predictions. Due to this downsampling, the model uses rather coarse features to predict the bounding boxes; for the paper’s 448×448 input and S = 7, each cell covers a 64×64 pixel patch [1]. A small sketch of this cell assignment follows the list.
  3. The model finds it difficult to localize small objects or groups of small objects, such as flocks of birds. Hence, as noted above, the main source of errors is localization.
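For intuition on the cell assignment mentioned in point 2, here is a tiny hypothetical helper that maps an object’s centre to its grid cell, assuming the paper’s 448×448 input and S = 7 [1]; the function name is my own.

def grid_cell(x_center, y_center, image_size=448, S=7):
    # Each cell covers image_size / S = 64 pixels per side.
    stride = image_size / S
    return int(x_center // stride), int(y_center // stride)

# e.g. grid_cell(100, 300) -> (1, 4): that cell is responsible for the object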

This concludes the multi-part series on YOLOv1. I hope this series helps explain the concepts of the paper in an accessible manner. I will be writing a post on the features of YOLOv2 and how they improve on YOLOv1. So, stay tuned for the next chapter on YOLO :)

If you like this post or found it useful please leave a clap!

If you see any errors or issues in this post, please contact me at divakar239@icloud.com and I will rectify them.

References

[1] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, “You Only Look Once: Unified, Real-Time Object Detection”, https://arxiv.org/pdf/1506.02640.pdf
