Yolo Object Detectors: Final Layers and Loss Functions

Carol Hsin
Published in Oracle Developers
Nov 10, 2018

1.1 Motivation

Most deep object detectors consist of a feature-extraction CNN (usually pre-trained on Imagenet and fine-tuned for detection) connected to a final layer that reshapes the features into the detector-specific output tensor. Swapping the feature CNN changes both speed and accuracy, as noted in [1], [2], [3]. But in many cases it is impractical, in terms of memory and compute, to train an Imagenet CNN from scratch. Often, we use an open-sourced, prebuilt model, adjusting the last layers and the loss functions to accomplish our task. The loss functions of one-stage object detectors, where one CNN produces the bounding box and class predictions, can be somewhat unusual because the prediction tensors are used to construct the truth tensor.

As part of the Oracle Machine Learning team, we have been reading the literature on such object detectors and producing explanations in the mathematical language that we prefer, in addition to creating diagrams, pseudo-code and mathematical formulas of our interpretation of what the authors meant but left out. This is the write-up of the presentation we gave at an Oracle ML reading group. We originally wrote this in LaTeX, but have converted our figures and equations to images to distribute on Medium.

1.2 Object Detection and PascalVOC

Given an image, the task of an object detector is to return the bounding box coordinates and the name (class) of each object that we care about in the image. Since it is difficult to talk about algorithms without concrete inputs, we take the PascalVOC dataset [4] as an example.

As shown in Figure 1, for each image, PascalVOC provides an annotation file containing the bounding box coordinates of objects in one of 20 classes. PascalVOC encodes bounding boxes by the top-left (x_min, y_min) and bottom-right (x_max, y_max) corner coordinates, but some object detection algorithms encode boxes using the center xy-coordinates with the width and height.

To feed an image into a convolutional neural network, the image is resized to be square. Since the PascalVOC bounding box coordinates depend on the image width W and height H, we normalize the box coordinates. Since the box encoding provided by PascalVOC is not the only way to encode bounding boxes, the normalization for both the corner-style and center-style encoding is shown in Equation 1.

Optimization requires numbers, so the class name of each bounding box becomes an integer 𝕔 ∈ {1, …, 20}, since there are 20 classes in PascalVOC. For our purposes, to get from a class name to 𝕔, we find the index of the name in the list of PascalVOC classes following this order: [aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, diningtable, dog, horse, motorbike, person, pottedplant, sheep, sofa, train, tvmonitor].

Thus, for each image, the label is a list of objects represented by some normalized 4-dimensional bounding box b and an integer class id 𝕔.
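To make this concrete, here is a minimal Python sketch of the label conversion just described; the class list order is from above, and the helper name is ours for illustration:

```python
# A minimal sketch of the label conversion described above; the helper name is
# ours for illustration. W and H are the original image width and height.
VOC_CLASSES = [
    "aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat",
    "chair", "cow", "diningtable", "dog", "horse", "motorbike", "person",
    "pottedplant", "sheep", "sofa", "train", "tvmonitor",
]

def voc_to_normalized(x_min, y_min, x_max, y_max, class_name, W, H):
    """Corner-style PascalVOC box -> center-style normalized box plus class id."""
    x_c = (x_min + x_max) / 2.0 / W  # box center x, normalized to [0, 1]
    y_c = (y_min + y_max) / 2.0 / H  # box center y, normalized to [0, 1]
    w = (x_max - x_min) / W          # box width,  normalized to [0, 1]
    h = (y_max - y_min) / H          # box height, normalized to [0, 1]
    c = VOC_CLASSES.index(class_name) + 1  # integer class id in {1, ..., 20}
    return (x_c, y_c, w, h), c
```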

2.1 Yolo v1

Yolo was one of the first deep, one-stage detectors, and since the first paper was published at CVPR 2016, each year has brought a new Yolo paper or tech report. We begin with Yolo v1 [1], but since we are primarily interested in analyzing loss functions, all we really need to know about the Yolo v1 CNN (Figure 2a) is that it takes an RGB image (448×448×3) and returns a cube (7×7×30), interpreted as shown in Figure 2b.

2.2 Yolo v1 bounding box encoding

To begin understanding the interpretation of the 7×7×30 output, we need to construct the Yolo-style label. Recall that the PascalVOC label for one image is a list of objects, each object being represented by a bounding box and a classification. The goal now is to convert the PascalVOC labels for one image into a form equivalent to the 7×7×30 tensor Yolo outputs. First, we need to convert from the center-normalized PascalVOC bounding box encoding to the Yolo bounding box encoding.

Instead of predicting the width and height directly, Yolo predicts their square roots, to account for the fact that a deviation in a small box matters more than the same deviation in a large box. The square root mapping expands smaller numbers, e.g. any number in [0, 0.25] gets mapped to [0, 0.5].

Instead of predicting the center of the bounding box normalized by the width and height of the image, Yolo predicts xy-offsets relative to a cell in a 7×7 grid. Once the image is divided into a 7×7 grid, for each object, we locate the grid cell (gx,gy) containing the object’s center. Having assigned the “responsibility” of predicting the object to a grid cell, we describe the center of the bounding box as offsets from the cell as shown in Figure 3, thus completing the construction of the Yolo-style bounding box.
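As a sketch of our interpretation (the exact conventions are in Equation 2 and Figure 3; the helper name and the choice to express offsets in units of one grid cell are our assumptions), the conversion from a center-normalized box to the Yolo v1 encoding might look like:

```python
import math

S = 7  # Yolo v1 grid size

def to_yolo_v1_box(x_c, y_c, w, h):
    """Center-normalized box -> responsible grid cell plus Yolo-encoded box."""
    gx = min(int(x_c * S), S - 1)  # grid column containing the box center
    gy = min(int(y_c * S), S - 1)  # grid row containing the box center
    x_off = x_c * S - gx           # x offset of the center within the cell
    y_off = y_c * S - gy           # y offset of the center within the cell
    # Square roots of width and height, as discussed above.
    return (gx, gy), (x_off, y_off, math.sqrt(w), math.sqrt(h))
```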

2.3 Assign truth box to predicted box by max IoU

Having assigned the object to a grid cell, we can now construct the truth vector y_(gx,gy) ∈ [0,1]³⁰, which requires the prediction ŷ_(gx,gy) located at that grid cell in the 7×7×30 tensor output by the Yolo CNN. As seen in Figure 2, each grid cell predicts two bounding boxes with their respective object-existence probabilities P(Object) and one class probability distribution, so each cell can only predict one object and, at prediction time, we select the bounding box with the highest value of P(Object), which is the probability the box contains an object.

To make explanations clearer, we denote b as the true object bounding box and b̂₁ and b̂₂ as the predicted bounding boxes, all of which are in the Yolo encoding style described in Equation 2. We use the object class 𝕔 to construct the true class probability vector p ∈ [0,1]²⁰, in which all elements are zero except at index 𝕔, so p[𝕔] = 1. We define p̂ to be the predicted class probability vector.

We denote by ĉ₁ and ĉ₂ the “confidence” that box1 and box2, respectively, contain an object (P(Object) for the respective boxes). We assign b to one of box1 or box2 based on which predicted bounding box has the highest Intersection over Union (IoU), aka Jaccard Index, with b. For reference, we define the procedure to compute the IoU of two rectangles in Algorithm 1. We set the truth confidence c to be the maximum IoU, effectively using the IoU as a proxy for the confidence of assigning the object to the predicted box. This process results in the truth vector y_(gx,gy), an example of which is depicted in Figure 4.
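A sketch in the spirit of Algorithm 1, assuming boxes given as (x_min, y_min, x_max, y_max) corners (one would convert the Yolo-encoded boxes to corners first); the function names are ours:

```python
def iou(box_a, box_b):
    """Intersection over Union of two rectangles (x_min, y_min, x_max, y_max)."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    inter = max(ix_max - ix_min, 0.0) * max(iy_max - iy_min, 0.0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def assign_truth_box(b, b_hat_1, b_hat_2):
    """Assign truth box b to the predicted box with the larger IoU.

    Returns the index of the assigned box and the truth confidence c.
    """
    iou_1, iou_2 = iou(b, b_hat_1), iou(b, b_hat_2)
    return (1, iou_1) if iou_1 >= iou_2 else (2, iou_2)
```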

2.4 Yolo v1 loss function

Having used the object encoded as (b, 𝕔) and the prediction from grid cell (gx,gy) to construct y_(gx,gy), we can now formulate the loss L_(gx,gy) for the grid cell responsible for predicting the object. Though not optimal for classification problems, the Yolo v1 loss is basically a weighted sum-of-squared-errors (linear regression) loss.

Following [1], we denote the weight on the bounding box coordinates as λ_coord, which is set to 5 in [1], and the weight on ĉ₁ and ĉ₂ when the corresponding box does not contain an object as λ_noobj, which is set to 0.5 in [1]. To make the equations easier, let Λ_coord be the 4×4 matrix of all zeros except for λ_coord repeated on the diagonal. Then we can construct the Yolo v1 loss for the grid cell in Figure 3, with the object assigned to box 1, as in Figure 4.

When a grid cell has no object assigned, we only have the no-object loss for both bounding boxes.

With L_(gx,gy) defined for all cases of grid cells, to get the loss L for the whole image, we sum the losses from all the grid cells.
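Putting the pieces together, a rough Python sketch of the per-image loss follows; the per-cell memory layout and helper names are our assumptions for illustration, not the authors' implementation:

```python
import numpy as np

LAMBDA_COORD, LAMBDA_NOOBJ = 5.0, 0.5
S = 7

def yolo_v1_loss(y, y_hat, responsible):
    """Sum-of-squared-errors loss over all grid cells for one image.

    y, y_hat    : (S, S, 30) truth and prediction tensors. We assume each cell is
                  laid out as [x1, y1, sqrt(w1), sqrt(h1), c1,
                               x2, y2, sqrt(w2), sqrt(h2), c2, p(1..20)].
    responsible : dict mapping (gx, gy) -> 0 or 1, the index of the predicted box
                  the truth object was assigned to by max IoU; grid cells with no
                  object are absent from the dict.
    """
    loss = 0.0
    for gx in range(S):
        for gy in range(S):
            t, p = y[gy, gx], y_hat[gy, gx]
            if (gx, gy) not in responsible:
                # No object in this cell: push both predicted confidences toward 0.
                loss += LAMBDA_NOOBJ * (p[4] ** 2 + p[9] ** 2)
                continue
            k = responsible[(gx, gy)]      # assigned box index (0 or 1)
            o, other = 5 * k, 5 * (1 - k)  # offsets of assigned / other box
            # Coordinate loss for the assigned box, weighted by lambda_coord.
            loss += LAMBDA_COORD * np.sum((t[o:o + 4] - p[o:o + 4]) ** 2)
            # Confidence loss: assigned box toward the truth IoU, other box toward 0.
            loss += (t[o + 4] - p[o + 4]) ** 2
            loss += LAMBDA_NOOBJ * p[other + 4] ** 2
            # Class probability loss.
            loss += np.sum((t[10:] - p[10:]) ** 2)
    return loss
```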

We can see that this is equivalent to the formula of the loss for one image as presented in [1] and shown in Figure 1.
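For reference, the per-image loss as written in [1] (with S = 7 grid cells per side, B = 2 boxes per cell, and 𝟙_ij^obj indicating that box j of cell i is responsible for an object) is:

```latex
\begin{aligned}
L ={}& \lambda_{\mathrm{coord}} \sum_{i=1}^{S^2} \sum_{j=1}^{B}
       \mathbb{1}_{ij}^{\mathrm{obj}}
       \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
  &+ \lambda_{\mathrm{coord}} \sum_{i=1}^{S^2} \sum_{j=1}^{B}
       \mathbb{1}_{ij}^{\mathrm{obj}}
       \left[ (\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2 \right] \\
  &+ \sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbb{1}_{ij}^{\mathrm{obj}} (C_i - \hat{C}_i)^2
   + \lambda_{\mathrm{noobj}} \sum_{i=1}^{S^2} \sum_{j=1}^{B}
       \mathbb{1}_{ij}^{\mathrm{noobj}} (C_i - \hat{C}_i)^2 \\
  &+ \sum_{i=1}^{S^2} \mathbb{1}_{i}^{\mathrm{obj}}
       \sum_{c \,\in\, \mathrm{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```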

2.5 Yolo v2 final layer and loss function

The main changes to the last layer and loss function in Yolo v2 [2] are the introduction of “prior boxes” and multi-object prediction per grid cell.

The Yolo v2 prior boxes were inspired by the anchor boxes used in Faster RCNN [6] (a multi-stage, deep object detector), but use a different anchor box encoding, which is probably why [2] called them prior boxes instead. A prior box is a width and height that [2] chose by running k-means clustering on all truth bounding boxes from the PascalVOC (and COCO) datasets. Instead of predicting the width and height of a bounding box directly, Yolo v2 predicts width and height offsets relative to a prior box. The center coordinates for each bounding box prediction remain the same as in Yolo v1. Yolo v2 uses 5 priors, but that makes creating diagrams and notation painful, so we limit it to 2 in our discussion, and we also use a 7×7 grid for the same reason.
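For reference, [2] parameterizes each predicted box relative to its prior width and height (p_w, p_h) and the top-left corner (c_x, c_y) of its grid cell, where σ is the logistic sigmoid and (t_x, t_y, t_w, t_h, t_o) are the raw network outputs:

```latex
b_x = \sigma(t_x) + c_x, \qquad
b_y = \sigma(t_y) + c_y, \qquad
b_w = p_w\, e^{t_w}, \qquad
b_h = p_h\, e^{t_h}, \qquad
\Pr(\mathrm{object}) \cdot \mathrm{IoU}(b, \mathrm{object}) = \sigma(t_o)
```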

In Yolo v1, each grid cell in the last layer can predict just one object because, while each grid cell gives us a choice between two bounding boxes, we only have one class probability vector. In Yolo v2, each grid cell predicts a bounding box and class probability vector for each prior box. Suppose we have two prior boxes; then ŷ_(gx,gy) will be a 50-dimensional vector since, for each prior, we predict 25 numbers: the probability that the box contains an object, four numbers to represent the bounding box coordinates in relation to the prior box, and the 20-dimensional class probability vector for PascalVOC.
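As a small illustration, the 7×7×50 output can be viewed as 25 numbers per prior per cell; the ordering of those 25 numbers below is our assumption:

```python
import numpy as np

S, NUM_PRIORS, C = 7, 2, 20  # grid size, priors per cell, PascalVOC classes

# Suppose the Yolo v2 output for one image is a (S, S, NUM_PRIORS * (5 + C)) tensor.
output = np.random.rand(S, S, NUM_PRIORS * (5 + C))  # here 7 x 7 x 50

# View the last axis as (prior, 25) and slice out the pieces per prior.
per_prior = output.reshape(S, S, NUM_PRIORS, 5 + C)
objectness  = per_prior[..., 0]    # (7, 7, 2)     P(Object) per prior
boxes       = per_prior[..., 1:5]  # (7, 7, 2, 4)  box numbers relative to the prior
class_probs = per_prior[..., 5:]   # (7, 7, 2, 20) class probability vectors
```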

The Yolo v2 loss function is not explicitly described in [2], but we can infer it from the Yolo v1 loss function. While there are now multiple object predictions per grid cell, Yolo v2 still performs a max-IoU matching of the truth to a predicted bounding box. We expect the bounding box coordinate loss to still be a weighted linear regression loss. However, the Yolo v3 tech report mentions using binary cross-entropy loss for the class predictions, and Yolo v2 mentions a classification loss, which we infer to mean not a regression loss, so Yolo v2 probably uses binary cross-entropy. Since the loss for when there is no object assigned to the grid cell is the same as in Equation 5, we only show the Yolo v2 loss for the grid cell in Figure 3, where z is the same as in Equation 4 except with the new box encoding.

2.6 Yolo v3 final layer

The main change in the Yolo v3 tech report [8] is to the final layer, which was inspired by Feature Pyramid Networks (FPNs) [7]. The Yolo v3 final layer consists of three detection tensors, each with its own prior boxes and each twice the resolution of the previous, e.g. if each detection tensor has two prior boxes and the dataset is PascalVOC, then a possible first tensor size is 7×7×50, which means the second is 14×14×50 and the third is 28×28×50 — which is partly why the authors called this “Prediction Across Scales”. Each detection tensor is organized like the Yolo v2 final layer, so the concepts of grid cells and object assignment apply, and while not explicitly mentioned, we can infer that, like all the previous Yolo incarnations, each object is still only assigned to one grid cell in one detection tensor.

While the details of how to get from the feature extractor to the three detection tensors are a bit lacking, we came up with a diagram of our interpretation of section 2.3 of [8] and, through this, we can see the other reason the Yolo authors called this “Prediction Across Scales”. Yolo v3 merges earlier layers in the feature extractor network with later layers (the extra CNN layers), which is essentially what FPNs do. Intuitively, small objects are more easily detected in high-resolution early layers than in the significantly subsampled, low-resolution later layers, but the early layers of a CNN contain semantically weak features, so rather than use them directly, FPNs merge them with upsampled later layers that contain semantically strong features.

We can see from Figure 6 that Yolo v3, unlike FPNs, uses concatenation instead of summation to merge layers and, while not mentioned, Yolo v3 probably upsamples the same way as FPNs (using nearest neighbor). In addition, the Yolo v3 structure isn’t quite the same as the FPN in [7], since Yolo v3 doesn’t use the result of previous merges to produce the next detection tensor. To make the differences clearer, we produced a diagram of our interpretation of what Yolo v3 would have looked like if it followed the FPN structure more closely.
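A rough numpy sketch of this merge step, with illustrative shapes (the channel counts are made up, and nearest-neighbor upsampling is our assumption as noted above):

```python
import numpy as np

def upsample_nearest_2x(x):
    """Nearest-neighbor 2x upsampling of an (H, W, C) feature map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

# Illustrative shapes only: a coarse map from a late layer and a higher-resolution
# map from an earlier layer of the feature extractor.
late  = np.random.rand(7, 7, 256)
early = np.random.rand(14, 14, 128)

# Yolo v3-style merge: upsample the late map and concatenate along the channel axis
# (an FPN would instead sum, after 1x1 convolutions to match channel counts).
merged = np.concatenate([early, upsample_nearest_2x(late)], axis=-1)
print(merged.shape)  # (14, 14, 384)
```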

The Yolo v3 paper reported experimenting with the loss function, such as using Focal Loss [9], which, when combined with a Single Shot Detector [10] (a one-stage detector like Yolo) and an FPN, resulted in a fast and accurate detector that [9] called RetinaNet. However, none of those experiments improved detection, so the loss is essentially the same as in Yolo v2.

References

  1. Joseph Redmon, Santosh Kumar Divvala, Ross B. Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. CoRR, abs/1506.02640, 2015.
  2. Joseph Redmon and Ali Farhadi. YOLO9000: better, faster, stronger. CoRR, abs/1612.08242, 2016.
  3. Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, and Kevin Murphy. Speed/accuracy trade-offs for modern convolutional object detectors. CoRR, abs/1611.10012, 2016.
  4. Mark Everingham, Luc Van Gool, C. K. I. Williams, J. Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge, 2010.
  5. Joseph Redmon. YOLO CVPR 2016 talk and slides. Google slides: https://docs.google.com/presentation/d/1kAa7NOamBt4calBU9iHgT8a86RRHz9Yz2oh4-GTdX6M/edit#slide=id.p, Youtube: https://www.youtube.com/watch?v=NM6lrxy0bxs. Accessed: 2018-10-12.
  6. Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Neural Information Processing Systems (NIPS), 2015.
  7. Tsung-Yi Lin, Piotr Dollar, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
  8. Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. CoRR, abs/1804.02767, 2018.
  9. Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollar. Focal Loss for Dense Object Detection. In Proceedings of the International Conference on Computer Vision (ICCV), 2017.
  10. Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott E. Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: single shot multibox detector. CoRR, abs/1512.02325, 2015.


Carol Hsin is a Machine Learning Researcher and Engineer at Oracle (Oracle Advanced Analytics, Oracle Machine Learning).