# [Review] 6. YOLO ver 2

# 1. Contribution

- proposed a method to jointly train *YOLO2* simultaneously on the *COCO* detection dataset and the ImageNet classification dataset, which enables the model to predict object classes that don’t have labeled detection data.
- instead of simply scaling up or ensembling multiple models, *YOLO2* focuses on simplifying the network while keeping the accuracy.
- applied various ideas from previous works to improve *YOLO*.
- introduced an anchor-box-based prediction mechanism into *YOLO2*, inspired by Faster R-CNN.
- used the *k*-means algorithm to find optimal anchor shapes given the cluster number *k*, which corresponds to the number of anchor shapes.
- introduced DarkNet-19.

# 2. Limitations of YOLO

- produces a notable number of localization errors compared to Fast R-CNN.
- has a relatively low recall in comparison to two-stage (region-proposal-based) methods.
- has a discrepancy in image resolution between classifier training (*224⨯224*) and detection training (*448⨯448*).

# 3. Methods applied in YOLO2 to improve on YOLO

In order to improve YOLO1, the authors leveraged ideas from previous research.

## 3.1 Batch Normalization

By adding batch normalization to all of the convolutional layers, YOLO2’s mAP improved by 2%, according to the authors. Further, it significantly accelerated convergence and allowed other forms of regularization, such as dropout, to be removed without overfitting.
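As a rough numpy sketch of what batch normalization computes at training time (the learnable scale and shift parameters are omitted for brevity):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Normalize each channel over the batch and spatial dimensions (NCHW layout),
    # i.e. training-time batch norm with gamma = 1 and beta = 0.
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
# activations with a shifted mean and large scale, as a raw conv layer might emit
x = rng.normal(loc=3.0, scale=5.0, size=(8, 16, 13, 13))
y = batch_norm(x)
# after normalization, each channel has mean ~ 0 and std ~ 1
```

Keeping activations in this normalized range is what stabilizes and accelerates training.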

## 3.2 Training Classifier with High-resolution Input

The original classifier network of YOLO is trained on ImageNet with *224⨯224* images, while *448⨯448* inputs are used for the object detection task. This means YOLO has to learn to detect objects while simultaneously adapting its parameters to the new image resolution.

In order to fix this resolution discrepancy, the *YOLO2* classifier is additionally fine-tuned on *448⨯448* images before detection training, yielding an mAP increase of almost 4%.

## 3.3 Convolutional with Anchor Boxes

YOLO1 directly predicts the coordinates of bounding boxes using FC layers preceded by convolutional layers. However, as is observed in Fast R-CNN and Faster R-CNN, predicting offsets and confidences for anchor boxes, instead of directly predicting the coordinates, simplifies the task and stabilizes the training process.

Inspired by this, YOLO2 also introduced an anchor-box-based approach.

- Firstly, they removed the two FC layers on top of the convolutional layers from YOLO1.
- Secondly, in order to maintain the high resolution of feature maps at the last feature layer, they eliminated one pooling layer at the end.
- Lastly, they shrank the network input to *416⨯416*, which leads to the odd feature map size of *13⨯13*.

YOLO2 without anchor boxes reported an mAP of 69.5% with a recall of 81%; with anchor boxes, it showed 69.2% mAP and 88% recall. Although mAP decreases slightly with anchor boxes, **recall increases remarkably from 81% to 88%.**

**3.3.1 Why an odd number?**

Since YOLO2 removed the FC layers from YOLO1, the receptive field of a grid cell at the last feature layer no longer covers the entire image. By making the feature map size odd at the last layer, the single cell at the center of the feature map becomes responsible for, and able to detect, large objects, which are often located near the image center.

Further, reducing the input image size from 448⨯448 to 416⨯416 in the process of obtaining the odd feature map size also decreases the amount of computation, resulting in faster inference.
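The arithmetic behind the odd grid size is simple; `stride = 32` below is Darknet-19’s total downsampling factor:

```python
stride = 32  # total downsampling factor of the Darknet-19 backbone

grid_448 = 448 // stride  # 14: even, so four cells meet at the image center
grid_416 = 416 // stride  # 13: odd, so one cell sits exactly on the center
center_cell = grid_416 // 2  # cell index 6 (of 0..12) covers the image center
```

With a 14⨯14 grid, a centered object would straddle four cells; with 13⨯13 there is one unambiguous "owner" cell.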

## 3.4 Anchor box selection with K-means

When using anchor boxes for object detection, one issue we often encounter is how to find a suitable set of anchor shapes (sizes, aspect ratios, etc.). In many cases, the anchor shapes are designed by hand, and such a design choice frequently ends up with 9 shapes (3 scales ⨯ 3 aspect ratios).

Instead of choosing priors by hand, *YOLO2* runs k-means clustering on the training-set bounding boxes to automatically search for appropriate anchor shapes for a given *k*.

In order to run the k-means algorithm, a distance metric is required. Thus, the authors defined their own metric, d(box, centroid) = 1 − IoU(box, centroid), so that the found anchor shapes maximize the average IoU between each GT box and its nearest anchor shape.

The following Fig 1 shows the average IoU of GT bounding boxes to their best-matching anchor with respect to *k*. Considering the tradeoff between recall and model complexity, the number of clusters **k = 5** is chosen.

Surprisingly, the anchor shapes found by k-means with k = 5 showed a slightly better average IoU than the 9 anchor shapes used in Faster R-CNN. Moreover, with k = 9, the k-means anchors completely outperformed them, with an average IoU higher by 7%, as shown in the following Table.
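The clustering above can be sketched in a few lines, assuming boxes are given as normalized (w, h) pairs and using d = 1 − IoU as the distance, where IoU is computed as if the two boxes shared the same center (the toy `gt` data and the evenly-spaced initialization are illustrative, not from the paper):

```python
import numpy as np

def iou_wh(boxes, centroids):
    # IoU between (w, h) shapes, treating all boxes as concentric,
    # so only width and height matter, not position.
    inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0])
             * np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
    areas = boxes[:, 0] * boxes[:, 1]
    c_areas = centroids[:, 0] * centroids[:, 1]
    return inter / (areas[:, None] + c_areas[None, :] - inter)

def kmeans_anchors(boxes, k, iters=100):
    # k-means with distance d = 1 - IoU; returns k anchor (w, h) shapes.
    centroids = boxes[np.linspace(0, len(boxes) - 1, k).astype(int)].copy()
    for _ in range(iters):
        assign = np.argmax(iou_wh(boxes, centroids), axis=1)  # min d == max IoU
        new = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i)
                        else centroids[i] for i in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids

# toy GT shapes (normalized w, h): small squares, wide boxes, tall boxes
gt = np.array([[0.10, 0.10], [0.12, 0.11],
               [0.50, 0.20], [0.55, 0.22],
               [0.30, 0.60], [0.28, 0.62]])
anchors = kmeans_anchors(gt, k=3)
avg_iou = iou_wh(gt, anchors).max(axis=1).mean()  # the quantity plotted in Fig 1
```

`avg_iou` is exactly the "average IoU to the best-matching anchor" curve that motivates the choice of k.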

## 3.5 Location Prediction

Instead of predicting offsets of the GT box relative to an anchor box, *YOLO2* directly predicts box locations relative to the grid cell, inheriting the approach of *YOLO1*.

Mathematically, this can be formulated as below:

    b_x = σ(t_x) + c_x
    b_y = σ(t_y) + c_y
    b_w = p_w · e^(t_w)
    b_h = p_h · e^(t_h)
    Pr(object) ⨯ IoU(b, object) = σ(t_o)

Note that YOLO2 predicts 5 values for each bounding box, where (*t_x*, *t_y*, *t_w*, *t_h*) encode the predicted bounding box location and *t_o* is the objectness score. In order to prevent instability in the early training phase, which mostly occurs due to unbounded predicted box centers, the sigmoid activation σ is applied so that σ(*t_x*) and σ(*t_y*) fall between 0 and 1.

Here, *c_x* and *c_y* are the offsets of the grid cell from the top-left corner of the image, and *p_w*, *p_h* are the anchor width and height, respectively.

Lastly, (*b_x*, *b_y*, *b_w*, *b_h*) are the center coordinates and size of the resulting predicted bounding box.
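The decoding step described above can be sketched as follows, working in grid-cell units (`decode_box` is a hypothetical helper for illustration, not from the paper’s code):

```python
import math

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    # Decode raw network outputs into box center/size in grid-cell units.
    # (cx, cy): offset of the responsible cell; (pw, ph): anchor width/height.
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    bx = sigmoid(tx) + cx   # sigmoid keeps the center inside the cell
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)  # anchor scaled by an (unbounded) exponential
    bh = ph * math.exp(th)
    return bx, by, bw, bh

# all-zero outputs -> box centered in cell (3, 4) with exactly the anchor's shape
bx, by, bw, bh = decode_box(0, 0, 0, 0, cx=3, cy=4, pw=2.0, ph=1.5)
# bx = 3.5, by = 4.5, bw = 2.0, bh = 1.5
```

This makes the "ideal zero output" behavior discussed in 3.5.1 concrete: zero raw outputs decode to the anchor box centered in its cell.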

**3.5.1 personal opinion on YOLO2’s location prediction**

For predicting (*b_x*, *b_y*), an ideal prediction is made when σ(*t_x*) and σ(*t_y*) are 0.5 (since *b_x* = 0.5 + *c_x*, and likewise for *b_y*). In other words, *t_x* and *t_y* should ideally be zero. This can be achieved relatively easily by initializing the last convolutional layer before the sigmoid σ with zero-mean weights and zero bias, ensuring training stability since σ(*t_x*) and σ(*t_y*) fall between 0 and 1.

Similarly, the perfect prediction for (*b_w*, *b_h*) is made when *t_w* and *t_h* are zero (due to the exponent). However, there is no sigmoid activation for (*t_w*, *t_h*), so the exponential term is unbounded, which might lead to training instability. Therefore, applying a sigmoid (with a Conv layer of zero-mean weights and negative bias) or a tanh activation (with zero-mean weights and zero bias) before the exponent might be helpful.
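A quick numeric illustration of this point: for a large raw output, the plain exponential explodes, while a tanh-squashed exponent stays bounded (this parameterization is the suggestion above, not what YOLO2 actually uses):

```python
import math

t = 10.0  # a large raw output, as might occur early in training
unbounded = math.exp(t)           # ~2.2e4: the decoded width explodes
bounded = math.exp(math.tanh(t))  # stays within [1/e, e] for any real t
```

Bounding the scale factor this way would cap how far a predicted box can deviate from its anchor in a single step.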

## 3.6 Finer-grained Features

As mentioned before, *YOLO2* makes predictions on *13⨯13* feature maps, which is sufficient for detecting large objects but often not enough for localizing small objects.

While other networks, like *SSD*, predict at multiple feature layers (e.g. *52⨯52, 26⨯26, 13⨯13*), which enables them to detect small objects better, *YOLO2* took another approach: **adding a passthrough layer that brings feature information from an earlier, higher-resolution layer into the final layer.**

In other words, the feature maps of *26⨯26* resolution are concatenated with the next *13⨯13* feature layer.

Of course, due to the size difference they cannot be directly stacked. Therefore, the authors rearrange adjacent features of the *26⨯26* maps into different channels (turning *26⨯26⨯512* into *13⨯13⨯2048*) and then concatenate them with the feature maps of the next layer. Eventually, the YOLO2 architecture with the passthrough layer becomes like the figure below.
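One way to realize this rearrangement is a space-to-depth reshape, sketched here in numpy (the actual passthrough in Darknet may differ in channel-ordering details):

```python
import numpy as np

def space_to_depth(x, block=2):
    # Move each block x block spatial neighborhood into the channel axis,
    # e.g. (512, 26, 26) -> (2048, 13, 13), without losing any values.
    c, h, w = x.shape
    x = x.reshape(c, h // block, block, w // block, block)
    x = x.transpose(0, 2, 4, 1, 3)  # (C, block, block, H/block, W/block)
    return x.reshape(c * block * block, h // block, w // block)

feat = np.arange(512 * 26 * 26, dtype=np.float32).reshape(512, 26, 26)
out = space_to_depth(feat)  # (2048, 13, 13): ready to concat with the 13x13 maps
```

After this, `out` can be concatenated channel-wise with the *13⨯13* feature maps, so the detector sees both coarse and finer-grained features.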

# 4. Conclusion

As is shown in the table below, YOLO2 achieves mAP comparable to (or even exceeding) other one-stage and two-stage detection networks.

Moreover, it is a lot faster (2x ~ 10x) than them.

# 5. Summary

# 6. Reference

[1] J. Redmon and A. Farhadi, “YOLO9000: Better, Faster, Stronger”, https://arxiv.org/abs/1612.08242