YOLO v3 Explained
Introduction
An object detector is a combination of an object locator and an object recognizer. In traditional computer vision approaches, a sliding window was used to look for objects at different locations and scales. Because this was such an expensive operation, the object's aspect ratio was usually assumed to be fixed. Early Deep Learning based object detection algorithms like R-CNN and Fast R-CNN used a selective search method to narrow down the number of bounding boxes that the algorithm had to test. Another approach, called OverFeat, scanned the image at multiple scales using a sliding-window-like mechanism implemented convolutionally. It was followed by Faster R-CNN, which used a Region Proposal Network (RPN) to identify the bounding boxes that needed to be tested. By clever design, the features extracted for recognizing objects were also used by the RPN for proposing potential bounding boxes, saving a lot of computation. YOLO, on the other hand, approaches the object detection problem in a completely different way: it forwards the whole image through the network only once. SSD is another object detection algorithm that forwards the image once through a deep learning network, but YOLO v3 is much faster than SSD while achieving very comparable accuracy.
Discussion
Darknet-53 network architecture
YOLO v3 uses a network called Darknet-53, also referred to as the backbone network of YOLO v3. Its primary job is feature extraction. It has 53 convolutional layers, as shown in figure 1, and it contains residual (skip) connections, a design borrowed from ResNets: the input of a block is added back to its output. In very deep neural networks, residual connections help avoid the vanishing-gradient and degradation problems that make training difficult. Downsampling of the feature maps is done by convolutional layers with stride 2, and each convolution is followed by a batch normalization layer and a Leaky ReLU activation. There is no pooling, except for an average pooling layer before the fully connected layer at the end of the classifier; that part is not included in the YOLO v3 architecture, as mentioned in the YOLO v3 paper.
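As a concrete illustration, here is a minimal PyTorch sketch of the convolution + batch norm + Leaky ReLU unit and of one residual block; the function and class names are our own, not taken from the official Darknet code.

    import torch
    import torch.nn as nn

    def conv_bn_leaky(in_ch, out_ch, kernel_size, stride=1):
        # Every Darknet-53 convolution is followed by batch normalization
        # and a Leaky ReLU; padding preserves the spatial size at stride 1.
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size, stride,
                      padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.1),
        )

    class ResidualBlock(nn.Module):
        # A 1x1 bottleneck followed by a 3x3 convolution, with the block
        # input added back: the residual (skip) connection.
        def __init__(self, channels):
            super().__init__()
            self.conv1 = conv_bn_leaky(channels, channels // 2, kernel_size=1)
            self.conv2 = conv_bn_leaky(channels // 2, channels, kernel_size=3)

        def forward(self, x):
            return x + self.conv2(self.conv1(x))

Downsampling between stages is a stride-2 3 x 3 convolution, e.g. conv_bn_leaky(64, 128, 3, stride=2), rather than a pooling layer.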
Output Format
First of all, YOLO v3 divides the input image into a grid whose dimensions equal those of the final feature map. A key property of YOLO v3 is that it makes detections at three different scales. Detecting at different layers helps address the issue of small objects, since feature maps are concatenated with earlier layers to preserve fine-grained features. Detection is done by applying 1 x 1 detection kernels to feature maps of three different sizes at three different places in the network, using the corresponding anchors. The detection layer predicts n_anchors * (5 + n_classes) values for each cell in the feature map using a 1 x 1 kernel. YOLO v3 has three anchors for each scale, so it predicts three bounding boxes per cell (n_anchors = 3). The (5 + n_classes) term means that, for each of the three anchors, we predict the four coordinates of the box, its objectness score (the probability that the box contains an object), and the class probabilities. Here there are 80 classes, as in the COCO dataset used to train the network, as shown in figure 2. Each anchor box is therefore represented by 5 numbers plus the number of classes (pc, bx, by, bh, bw, n_class values). So the shape of the detection kernel is 1 x 1 x 255, where 255 = (4 bounding box coordinates (bx, by, bh, bw) + 1 objectness probability (pc) + 80 class probabilities for the COCO dataset (n_classes)) x 3 anchor boxes.
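The arithmetic behind the kernel depth can be checked in a few lines of Python:

    n_anchors, n_classes = 3, 80            # COCO setup used in the paper
    depth = n_anchors * (5 + n_classes)     # 4 box coords + 1 objectness + 80 classes
    print(depth)                            # 255 output channels per 1x1 detection kernel

    for grid in (13, 26, 52):               # the three detection scales for a 416x416 input
        print((grid, grid, depth))          # (13, 13, 255), (26, 26, 255), (52, 52, 255)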
Bounding box
When predicting the bounding box's width and height, most modern object detectors predict log-space transforms, which are then applied to the anchor boxes to obtain the final prediction. Anchors are bounding box priors that were calculated on the COCO dataset using k-means clustering. YOLO v3 predicts the width and height of the box as offsets from these cluster centroids, and predicts the center coordinates of the box relative to the grid cell using a sigmoid function.
The following formulas describe how the network output is transformed to obtain the bounding box predictions:
bx = σ(tx) + cx
by = σ(ty) + cy
bw = pw · e^tw
bh = ph · e^th
Here, bx and by are the x, y center coordinates of the box, and bw and bh are its width and height, as shown in figure 3. The network outputs tx, ty, tw, and th. The top-left coordinates of the grid cell are cx and cy, while pw and ph are the anchor dimensions for the box. We expect each cell of the feature map to predict an object through one of its bounding boxes if the object's center falls in the receptive field of that cell; this has to do with how YOLO is trained, where only one bounding box is responsible for detecting any given object.
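A minimal NumPy sketch of these transforms, using the same symbols as the formulas above (the function name is illustrative):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
        # (cx, cy) is the top-left corner of the grid cell and (pw, ph) are
        # the anchor's width and height. Coordinates come out in grid-cell
        # units; implementations typically multiply by the layer's stride
        # to map them back to input-image pixels.
        bx = sigmoid(tx) + cx        # box center x
        by = sigmoid(ty) + cy        # box center y
        bw = pw * np.exp(tw)         # box width, log-space offset from the anchor
        bh = ph * np.exp(th)         # box height, log-space offset from the anchor
        return bx, by, bw, bh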
Detection Layers
The first detection is made by the 82nd layer, which is responsible for detecting large objects. The network downsamples the input image up to this point, where detection is made using the feature map of a layer with stride 32. For example, for an image of 416 x 416, the resulting feature map is of size 13 x 13. One detection is made here using the 1 x 1 detection kernel, giving a detection feature map of 13 x 13 x 255, as shown in figure 4. Before detecting at the next scale, the feature map is upsampled by a factor of 2 and concatenated with a feature map from an earlier Darknet-53 layer of identical spatial size, which preserves the fine-grained features. A second detection is then made at the layer with stride 16, which is responsible for detecting medium objects, giving a detection feature map of 26 x 26 x 255. The same upsampling procedure is repeated, and a final detection is made at the layer with stride 8, which is responsible for detecting small objects, giving a detection feature map of 52 x 52 x 255.
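The upsample-and-concatenate step can be sketched in PyTorch as follows; the channel counts are illustrative assumptions for the 26 x 26 scale, not values stated in this text.

    import torch
    import torch.nn.functional as F

    # Stand-ins for two intermediate feature maps (batch, channels, height, width).
    coarse = torch.randn(1, 256, 13, 13)   # feature map at stride 32
    skip   = torch.randn(1, 512, 26, 26)   # earlier Darknet-53 feature map at stride 16

    up = F.interpolate(coarse, scale_factor=2, mode="nearest")  # 13x13 -> 26x26
    merged = torch.cat([up, skip], dim=1)   # concatenate along the channel axis
    print(merged.shape)                     # torch.Size([1, 768, 26, 26])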
Thresholding & NMS
For an image of size 416 x 416, YOLO predicts ((52 x 52) + (26 x 26) + (13 x 13)) x 3 = 10647 bounding boxes, as shown in figure 5 (only a few boxes are drawn for clarity). To reduce the detections from 10647 to the actual number of objects, we apply two steps: thresholding on the objectness score, followed by Non-maximum Suppression (NMS) based on Intersection over Union (IoU).
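This count is easy to verify:

    n_anchors = 3
    grids = (52, 26, 13)                      # detection grid sizes for a 416x416 input
    total = sum(g * g for g in grids) * n_anchors
    print(total)                              # 10647 candidate boxes before filtering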
First, we filter boxes based on their objectness score: any box whose score falls below a threshold (for example, 0.7) is ignored, as shown in figure 6. The objectness score represents the probability that an object is contained inside a bounding box. It should be nearly 1 for the red grid cell and its neighbors, and almost 0 for the grid cells at the corners. The objectness score is also passed through a sigmoid, since it is to be interpreted as a probability.
Still, the image has some overlapping boxes. Which one should we keep, and which should we reject? Non-maximum Suppression (NMS) cures the problem of multiple detections of the same object by using Intersection over Union (IoU). The boxes are considered in order of decreasing score: the box with the highest score is kept, and every remaining box whose IoU with it exceeds a threshold is suppressed, since a box that overlaps heavily with the kept box captures much of the same information, as shown in figure 7. In the end, only the highest-scoring box among each group of overlapping boxes survives.
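Putting the two steps together, here is a minimal NumPy sketch of objectness thresholding followed by greedy NMS; the 0.7 score threshold matches the example above, while the 0.5 IoU threshold is a common default and an assumption here.

    import numpy as np

    def iou(box, boxes):
        # IoU between one box and an array of boxes, all as (x1, y1, x2, y2).
        x1 = np.maximum(box[0], boxes[:, 0])
        y1 = np.maximum(box[1], boxes[:, 1])
        x2 = np.minimum(box[2], boxes[:, 2])
        y2 = np.minimum(box[3], boxes[:, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
        return inter / (area(box) + area(boxes) - inter)

    def filter_and_nms(boxes, scores, score_thresh=0.7, iou_thresh=0.5):
        # Step 1: thresholding on the objectness score.
        keep_mask = scores > score_thresh
        boxes, scores = boxes[keep_mask], scores[keep_mask]
        # Step 2: greedy NMS, highest score first.
        order = np.argsort(scores)[::-1]
        kept = []
        while order.size > 0:
            best = order[0]
            kept.append(best)
            rest = order[1:]
            overlaps = iou(boxes[best], boxes[rest])
            order = rest[overlaps <= iou_thresh]   # suppress heavily overlapping boxes
        return boxes[kept], scores[kept]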
Finally, after applying thresholding and Non-maximum Suppression with Intersection over Union, all the extra boxes are gone. We are left with a single bounding box corresponding to the single object in the image, the car, as shown in figure 8.
Conclusion
YOLO v3 is a good detector: it is fast and it is accurate, although it is not the best detection algorithm out there. It uses Darknet-53, a network with 53 convolutional layers, as its backbone. It divides the image into grids at three scales and predicts three bounding boxes for each grid cell. Therefore, the output tensor at each scale is N×N×[3×(4+1+80)]: four bounding box offsets, one objectness prediction, and 80 class predictions for the COCO dataset. k-means clustering is used to find better bounding box priors on the COCO dataset; the anchor boxes (10×13), (16×30), (33×23), (30×61), (62×45), (59×119), (116×90), (156×198), and (373×326) are used. Next, the feature map from two layers back is upsampled by 2×, and a feature map from earlier in the network is merged with the upsampled features using concatenation. A few more convolutional layers process this combined feature map and eventually predict a similar tensor, now twice the size. Finally, thresholding is used to reduce the number of predicted bounding boxes, followed by Non-maximum Suppression with Intersection over Union.
References
[1] Joseph Redmon and Ali Farhadi (2018), YOLOv3: An Incremental Improvement. Retrieved December 19, 2020, from https://arxiv.org/pdf/1804.02767v1.pdf
[2] Rosina De Palma (2018), YOLOv3 Architecture: Best Model in Object Detection. Retrieved December 19, 2020, from https://bestinau.com.au/yolov3-architecture-best-model-in-object-detection/
[3] Rokas Balsys (2019), YOLO v3 theory explained. Retrieved December 19, 2020, from https://pylessons.com/YOLOv3-introduction/
[4] Ayoosh Kathuria (2018), What's new in YOLO v3. Retrieved December 19, 2020, from https://towardsdatascience.com/yolo-v3-object-detection-53fb7d3bfe6b
[5] Sik-Ho Tsang (2019), Review: YOLOv3. Retrieved December 23, 2020, from https://towardsdatascience.com/review-yolov3-you-only-look-once-object-detection-eab75d7a1b