YOLO_v2 and YOLO9000 Part 1
This is a two-part series on the successors to YOLO_v1, the first version of the YOLO object detection neural network. Both networks improve on the first version: they retain most of the fundamentals of their predecessor while making major improvements in some areas. I therefore highly encourage you to go through the posts on YOLO_v1 before you proceed with this series.
Introduction
Object detection has always been a tougher problem than image classification. The biggest challenge lies in accurately localizing multiple objects in an image while remaining fast enough for real-time inference. YOLO_v1 solved the speed issue that plagued previous object detection models by using a single convolutional network to carry out the detection rather than bundling multiple models together.
Creating large, well-labelled datasets for image classification is easier than creating them for detection. This is due to the difference in the nature of the two problems. Classification datasets only need to be labelled with the correct class before they are ready for training (after the necessary pre-processing, of course), whereas detection datasets need class labels along with bounding box outlines and co-ordinates. As a result, there are far more classification datasets available today than detection datasets.
In the YOLO9000 paper, the authors designed a special training algorithm that makes use of large classification datasets to train the model. It essentially implements the following techniques:
- A hierarchical view of object classification, built by creating tree-like data structures that combine distinct datasets
- Training object detectors on both detection and classification datasets; that is, using labelled detection data to learn to localize objects precisely and classification data to increase vocabulary and robustness
However, before this algorithm is used, several other improvements are made to YOLO_v1. The accuracy and speed improvements made to YOLO_v1 yield the model called YOLO_v2, and YOLO9000 is obtained by applying the special training algorithm to YOLO_v2. The improvements made to YOLO_v1 fall into two categories: (a) accuracy and (b) speed.
Accuracy Improvements
The first version of YOLO suffered from localization errors and low recall. Recall is the fraction of all relevant instances (true positives + false negatives) that the model actually retrieves (true positives). Both issues were addressed, while maintaining classification accuracy, by the following:
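As a quick illustration of the recall definition above, here is a minimal sketch. The function name and the counts are made up for the example, and returning 0 when there are no ground-truth objects is just a convention chosen here.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Fraction of ground-truth objects that the detector actually found."""
    total_relevant = true_positives + false_negatives
    if total_relevant == 0:
        return 0.0  # convention for this sketch: no ground truth means recall 0
    return true_positives / total_relevant


# Hypothetical counts: 80 objects found, 20 objects missed.
print(recall(80, 20))  # 0.8
```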
Batch Normalization
This technique increases the stability of a neural network by keeping all features on a uniform scale, which prevents features with very large values from dominating (swaying) the weights in their favor. The output of a layer is normalized by subtracting the batch mean from every example in the batch and then dividing by the batch standard deviation.
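The normalization step described above can be sketched in a few lines of plain Python for a single feature. This is only an illustration, not the network's actual implementation: the function name is hypothetical, eps is a small constant assumed here to avoid division by zero, and gamma and beta stand in for the learnable scale and shift that batch normalization applies after normalizing.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch of scalar activations belonging to one feature."""
    n = len(batch)
    mean = sum(batch) / n
    # Population variance of the batch (divide by n, not n - 1).
    variance = sum((x - mean) ** 2 for x in batch) / n
    std = math.sqrt(variance + eps)
    # Subtract the batch mean, divide by the batch std, then scale and shift.
    return [gamma * (x - mean) / std + beta for x in batch]


activations = [2.0, 4.0, 6.0, 8.0]  # made-up activations for one feature
print(batch_norm(activations))
```

After this step the activations have roughly zero mean and unit variance, whatever their original scale was.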
In YOLO_v2, batch normalization is applied to every convolutional layer. This allows each layer of the network to learn a little more independently of the other layers [2]. It reduces the amount by which the weights of the hidden layers have to shift around to accommodate different distributions of data. Batch normalization also helps with regularization by adding some noise. For YOLO, batch normalization improved the mAP score by more than 2%, and it allowed dropout to be removed from the network without overfitting. The figure below shows the batch normalization process.
Hence, batch normalization benefits the model in the following ways:
- It allows a higher learning rate, which speeds up learning, as it ensures that no activation is too high or too low
- It reduces overfitting through a slight regularization effect, since it adds some noise to the activations
Higher Resolution
The first version trained the classifier part of the model on 224x224 inputs and then increased the resolution to 448x448 for detection, so the network had to learn detection and adjust to the new resolution at the same time. YOLO_v2 first fine-tunes the classifier on 448x448 inputs for 10 epochs, which gives the model time to adjust its filters to higher resolution samples. The model is then fine-tuned for detection. This increased the mAP score by almost 4%.
Convolution with Anchor Boxes
The first version of YOLO predicts bounding boxes using fully connected layers on top of the convolutional layers, which act as feature extractors. This limited the number of objects it could detect per grid cell to one. It also reduced localization accuracy for small objects and groups of small objects. YOLO_v2 borrows the idea of anchor boxes from an earlier detector, Faster-RCNN, to improve localization. Faster-RCNN is an improvement over the object detection architecture F-RCNN (Fast RCNN). F-RCNN used a separate algorithm called Selective Search to identify potential areas of interest to be scanned for objects. This increased speed, as the network didn’t need to scan the entire image to look for objects. However, it was still too slow for real-time detection. Faster-RCNN improved upon this by replacing Selective Search with a Region Proposal Network (RPN), which uses convolutional layers to predict these areas of interest.
Anchor boxes are predefined bounding boxes that the network uses as references: rather than predicting a box from scratch, it predicts offsets relative to the anchor box under consideration. The RPN in Faster-RCNN was the first to propose and use anchor boxes. In the default configuration of Faster-RCNN, 9 anchor boxes are defined at every position in the image, as shown in the image below. In Faster-RCNN, these are handpicked to suit the dataset the model trains on, which means that the number and dimensions of the anchor boxes depend on the dataset being used.
YOLO_v2 removes the fully connected layers and uses anchor boxes for bounding box regression. The network is also shrunk to operate on 416x416 inputs instead of 448x448; since the downsampling factor is 32, this yields a 13x13 output feature map (YOLO_v1 used a 7x7 grid). It uses convolutional layers to predict offsets at every location in the feature map. However, there are two main problems with using anchor boxes in YOLO.
Dimension Clusters
As stated above, anchor boxes are the initial sizes (height, width) that the model uses as guidelines when predicting bounding box offsets. The model gradually learns to predict accurate offsets as training progresses. It is important to note that the model doesn’t predict the final size of the object directly. It learns to adjust the size of the nearest anchor box to the size of the object.
It was observed that when the anchor boxes are handpicked, as in Faster-RCNN, the model takes longer to learn accurate boxes. To speed this up, the anchor boxes are instead chosen by running the k-means clustering algorithm on the bounding boxes of the training set. This gives YOLO_v2 better starting points to learn from. Running the clustering for different numbers of clusters also helps choose how many anchor boxes to use: 5 was found to give a good trade-off between model complexity and recall. Hence, YOLO_v2 uses 5 anchor boxes.
Standard k-means uses Euclidean distance as its distance metric, which is not ideal for choosing the starting anchor boxes, because larger boxes would generate larger errors than smaller boxes. This is undesirable, since what we care about is the IOU (Intersection over Union) with the ground truth boxes, which does not depend on the size of the box. Hence, the k-means algorithm used to compute the dimensions of the anchor boxes uses a distance based on IOU scores, as shown:
distanceMetric(box, centroid) = 1 - IOU(box, centroid)
This proves to be a better metric and yields better starting points for the model to begin with.
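To make the clustering step more concrete, here is a rough sketch of k-means using the 1 - IOU distance. It is an illustration under several assumptions rather than the authors' code: boxes are (width, height) pairs treated as if they shared the same center, centroids are updated with the mean of their cluster, the function names are hypothetical, and the box sizes are made up.

```python
import random


def iou_wh(box, centroid):
    """IOU of two (width, height) boxes, assuming they share the same center."""
    intersection = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - intersection
    return intersection / union


def kmeans_anchors(boxes, k, iterations=100, seed=0):
    """Cluster (width, height) pairs with distance = 1 - IOU."""
    rng = random.Random(seed)  # seeded only so the sketch is reproducible
    centroids = rng.sample(boxes, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Assign each box to the centroid with the smallest 1 - IOU.
            distances = [1.0 - iou_wh(box, c) for c in centroids]
            clusters[distances.index(min(distances))].append(box)
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                # Assumption: move the centroid to the mean width and height.
                mean_w = sum(b[0] for b in cluster) / len(cluster)
                mean_h = sum(b[1] for b in cluster) / len(cluster)
                new_centroids.append((mean_w, mean_h))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments have stopped changing
        centroids = new_centroids
    return centroids


# Made-up box sizes: a group of small boxes and a group of large ones.
boxes = [(10, 10), (11, 10), (10, 11), (100, 100), (101, 100), (100, 101)]
print(kmeans_anchors(boxes, k=2))
```

In practice this would be run on the (width, height) of every ground-truth box in the training set with k = 5, and the resulting centroids would serve as the anchor boxes.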
Model Instability
The anchor boxes are used to predict offsets of the bounding boxes. So, a model predicts quantities like tx and ty, which are the translational shifts with respect to the anchor box with dimensions (wa, ha) and center co-ordinates (xa, ya). Following the Faster-RCNN parameterization, in which tx = (x - xa) / wa and ty = (y - ya) / ha, the center co-ordinates (x, y) of the box are computed as follows:
x = (tx * wa) + xa
y = (ty * ha) + ya
The problem with this formulation is that it is unconstrained, which means that any anchor box can end up at any point in the image, regardless of the location that predicted it [1]. It is important to note that offsets are predicted with respect to the anchor boxes. With random initialization, the model therefore takes a long time to stabilize and predict sensible bounding box offsets, especially in the initial iterations.
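A small numeric sketch shows why this is a problem. It assumes the Faster-RCNN-style parameterization x = tx * wa + xa (and likewise for y); the anchor values are made up and expressed in image-normalized co-ordinates, so anything outside [0, 1] lies outside the image.

```python
def rpn_center(tx, ty, anchor):
    """Box center from unconstrained offsets, relative to an anchor box."""
    xa, ya, wa, ha = anchor
    return tx * wa + xa, ty * ha + ya


# Hypothetical anchor: center (0.5, 0.5), width 0.2, height 0.2.
anchor = (0.5, 0.5, 0.2, 0.2)

for t in (0.0, 1.0, 5.0, -5.0):
    # Nothing bounds t, so large values push the center far from the anchor.
    print(t, rpn_center(t, t, anchor))
```

An offset of 1 shifts the box by one full anchor width, and an early, randomly initialized prediction such as 5 or -5 already places the center outside the image.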
So, instead of predicting offsets relative to the anchor boxes, YOLO_v2 follows the approach of YOLO_v1 and predicts location co-ordinates relative to the grid cell. A logistic activation (sigma) constrains the predictions to fall between 0 and 1, which bounds the ground truth to the same range [1]. This is a constrained formulation which provides better localization. The network predicts 5 boxes at each grid cell in the output feature map. Five values are predicted for each box: tx, ty, tw, th and to.
If the cell is offset from the top left corner of the image by (cx, cy) and the predefined anchor box has width pw and height ph, then the predictions are
bx = sigma(tx) + cx
by = sigma(ty) + cy
This is the center of the bounding box being predicted.
bw = pw * exp(tw)
bh = ph * exp(th)
These are the dimensions of the box.
Pr(obj) * IOU(b, obj) = sigma(to)
This is the confidence (objectness) score of the box ‘b’.
Thus, dimension clusters along with directly predicting the bounding box center location improve the accuracy by almost 5% over the version with hand-picked anchor boxes.
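The constrained formulation (sigma(tx) + cx, pw * exp(tw) and so on) can be put together in a short decoding sketch. The function names, the example cell and the anchor size are hypothetical, and all values are expressed in grid-cell units.

```python
import math


def sigmoid(value):
    """Logistic activation, squashing any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))


def decode_box(prediction, cell, prior):
    """Turn raw outputs (tx, ty, tw, th, to) into a box in grid-cell units."""
    tx, ty, tw, th, to = prediction
    cx, cy = cell    # offset of the cell from the top left corner
    pw, ph = prior   # width and height of the anchor box
    bx = sigmoid(tx) + cx      # center x, always inside the predicting cell
    by = sigmoid(ty) + cy      # center y, always inside the predicting cell
    bw = pw * math.exp(tw)     # width, a rescaling of the anchor width
    bh = ph * math.exp(th)     # height, a rescaling of the anchor height
    confidence = sigmoid(to)   # Pr(obj) * IOU(b, obj)
    return bx, by, bw, bh, confidence


# Hypothetical cell (6, 6) on the 13x13 grid with a 3 x 4 anchor box.
print(decode_box((0.0, 0.0, 0.0, 0.0, 0.0), (6, 6), (3.0, 4.0)))
# (6.5, 6.5, 3.0, 4.0, 0.5)

# Even large raw values cannot move the center out of its cell.
print(decode_box((10.0, -10.0, 0.0, 0.0, 0.0), (6, 6), (3.0, 4.0))[:2])
```

With all raw outputs at zero, the box sits at the center of its cell with exactly the anchor's dimensions, which is what makes the early stages of training more stable.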
Fine Grained Features
The 13x13 grid is sufficient for larger objects but not for the accurate detection of smaller ones. To rectify this, YOLO_v2 adds a passthrough layer that brings in features from an earlier layer at 26x26 resolution. The higher and lower resolution features are combined by stacking adjacent features into different channels instead of spatial locations, which turns the 26x26x512 feature map into a 13x13x2048 one that can be concatenated with the original 13x13 features. This gives the detector access to finer grained features and improves accuracy on smaller objects.
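The stacking of adjacent features into channels can be sketched as a space-to-depth rearrangement on nested lists. This is only an illustration: the function name is hypothetical, and the order in which the new channels are emitted is an assumption that may differ from the ordering used by the actual network.

```python
def space_to_depth(feature_map, block=2):
    """Rearrange a [C][H][W] nested list into [C * block * block][H / block][W / block]."""
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    output = []
    for dy in range(block):
        for dx in range(block):
            for c in range(channels):
                # Take every block-th pixel, starting at offset (dy, dx).
                output.append([
                    [feature_map[c][y][x] for x in range(dx, width, block)]
                    for y in range(dy, height, block)
                ])
    return output


# One 4x4 channel becomes four 2x2 channels; no values are lost.
feature_map = [[[0, 1, 2, 3],
                [4, 5, 6, 7],
                [8, 9, 10, 11],
                [12, 13, 14, 15]]]
for channel in space_to_depth(feature_map):
    print(channel)
```

Applied to a 26x26 feature map, the same rearrangement halves each spatial dimension and multiplies the channel count by four, so the result lines up with the 13x13 feature map and can be concatenated with it along the channel axis.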
Multi Scale Training
As described above, YOLO_v2 fine-tunes its classifier on 448x448 inputs and performs detection on 416x416 inputs, which yields a 13x13 grid. Because the model uses only convolutional and pooling layers, it can be resized on the fly. To increase the robustness of the network, the input size is not fixed; instead, it is changed every few iterations. Specifically, after every 10 batches a new input dimension is chosen at random. Since the network reduces the size of the input by a factor of 32, the sizes are chosen from the multiples of 32 in the set {320, 352, … , 608}.
This forces the network to learn to predict well across different resolutions, so it becomes robust to different input image sizes. YOLO_v2 runs faster but less accurately at smaller resolutions, and slower but more accurately at larger ones.
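The resizing schedule can be sketched as follows. This is a simplified illustration, not the actual training loop: the function name is hypothetical, the random generator is seeded only to make the sketch reproducible, and the grid size is simply the input size divided by the downsampling factor of 32.

```python
import random

STRIDE = 32  # total downsampling factor of the network
SIZES = list(range(320, 608 + 1, STRIDE))  # 320, 352, ..., 608


def input_size_schedule(num_batches, every=10, seed=0):
    """Return the input size used for each batch, resampled every `every` batches."""
    rng = random.Random(seed)
    schedule = []
    size = None
    for batch in range(num_batches):
        if batch % every == 0:
            size = rng.choice(SIZES)  # pick a new random input dimension
        schedule.append(size)
    return schedule


schedule = input_size_schedule(30)
for start in range(0, 30, 10):
    size = schedule[start]
    grid = size // STRIDE
    print(f"batches {start}-{start + 9}: input {size}x{size}, grid {grid}x{grid}")
```

The smallest option (320) gives a 10x10 grid and the largest (608) gives a 19x19 grid, which is where the speed versus accuracy trade-off comes from.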
This concludes part 1 of the series. I will cover the speed improvements made to YOLO and the creation of YOLO9000 in the next part.
If you like this post or found it useful please leave a clap!
If you see any errors or issues in this post, please contact me at divakar239@icloud.com and I will rectify them.
References
[1] https://arxiv.org/pdf/1502.03167v3.pdf
[2] https://towardsdatascience.com/batch-normalization-in-neural-networks-1ac91516821c
[3] https://medium.com/@smallfishbigsea/faster-r-cnn-explained-864d4fb7e3f8