YOLO_v2 and YOLO9000 Part 2

Divakar Kapil
Escapades in Machine Learning
5 min read · Jul 26, 2018

In Part 1 of the series, I explained how the accuracy of YOLO_v1 was improved by multiple tweaks to the architecture. The most important change was the use of convolutional anchor boxes for predicting bounding boxes, which improved accuracy dramatically. In this part I will discuss the improvements made to the initial architecture to increase speed. I will also go over the special training method and the data grouping used to train the YOLO9000 model.

Speed Improvements

Most detection frameworks use VGG-16 as the base feature extractor. This means they build on the pretrained weights of VGG-16, which specialise in extracting various object features. The problem with VGG-16 is that, though accurate, it is very complex: it requires approximately 30.69 billion floating point operations for a single forward pass over a 224x224 input image [1]. YOLO_v1 is instead based on GoogLeNet, which offers more speed by reducing the number of floating point operations for a single forward pass to 8.52 billion, at the cost of some accuracy.

YOLO_v2 uses a completely different network as its base feature extractor, called Darknet-19, a custom classifier made by the creators of YOLO. Like VGG-16, it uses 3x3 filters and doubles the number of channels after every pooling step. Like Network in Network (NIN), it uses global average pooling and 1x1 compression filters between the 3x3 convolutions. Darknet-19 has 19 convolutional layers and 5 max-pooling layers, and performs only 5.58 billion floating point operations for a single forward pass.
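To make that pattern concrete, here is a minimal PyTorch sketch of the repeating Darknet-19 building blocks: 3x3 convolutions, channel doubling after each pooling step, a 1x1 compression layer, and a global-average-pooled classifier head. This is an illustrative fragment under my own naming, covering only the first few of the 19 layers, not the paper's actual implementation.

```python
import torch.nn as nn

def conv_bn_leaky(in_ch, out_ch, kernel_size):
    """3x3 feature conv or 1x1 compression conv, as used throughout Darknet-19."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size, padding=kernel_size // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1, inplace=True),
    )

class Darknet19Stem(nn.Module):
    """First few stages of Darknet-19: 3x3 convs, channel doubling after
    each max-pool, and a 1x1 bottleneck between 3x3 layers (NIN-style)."""
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            conv_bn_leaky(3, 32, 3), nn.MaxPool2d(2, 2),    # 32 channels
            conv_bn_leaky(32, 64, 3), nn.MaxPool2d(2, 2),   # doubled to 64
            conv_bn_leaky(64, 128, 3),
            conv_bn_leaky(128, 64, 1),                      # 1x1 compression
            conv_bn_leaky(64, 128, 3), nn.MaxPool2d(2, 2),
        )
        # Classifier head: 1x1 conv to class scores + global average pooling
        self.head = nn.Conv2d(128, num_classes, 1)
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):
        x = self.head(self.features(x))
        return self.pool(x).flatten(1)      # (N, num_classes) logits
```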

Joint Training on Classification and Detection Data

This procedure involves two datasets: one with images labelled for detection and the other labelled for classification. The detection data is used to learn bounding boxes and objectness (the probability that an image window contains an object of any class, as opposed to background). The classification data is used to expand the number of classes the model can detect.

In YOLO_v2, a mix of both datasets is used as follows (a sketch of the loss routing follows the list):

  1. For detection-labelled images, backpropagation is based on the full YOLO_v2 loss function
  2. For classification-labelled images, backpropagation is based only on the classification part of the loss function
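Below is a minimal sketch of what that routing might look like in a single training step. `full_yolo_loss` and `classification_loss` are hypothetical stand-ins for the two parts of the loss, and the `source` tag marking which dataset a batch came from is my own device; the actual Darknet training code is structured differently.

```python
def joint_training_step(model, optimizer, images, labels, source,
                        full_yolo_loss, classification_loss):
    """One step of YOLO_v2 joint training (hypothetical helper names).
    `source` marks which dataset the batch came from."""
    predictions = model(images)
    if source == "detection":
        # Detection image: backprop the full YOLO_v2 loss
        # (localization + objectness + classification terms).
        loss = full_yolo_loss(predictions, labels)
    else:
        # Classification image: backprop only the classification term.
        loss = classification_loss(predictions, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```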

Challenges of mixing datasets

Detection labels are generally more generic than classification labels. For example, detection labels might distinguish between different objects such as dogs and cats, whereas classification labels are more specific, like different breeds of dogs, say Norfolk Terrier or Golden Retriever.

The softmax used for classification assumes all classes are mutually exclusive, which presents a problem when the detection and classification datasets are mixed. Softmax fails to produce accurate results because the classification labels are not mutually exclusive with the detection labels; for example, Norfolk Terrier is a subset of dog, so the intersection of these labels is not a null set. Instead of a multi-class model (which uses softmax and assumes mutual exclusion amongst classes), a multi-label model is used to combine the datasets, as it doesn't assume mutual exclusion. It discards the mutually exclusive class structure of the classification dataset before combining it with the detection dataset.

Hierarchical Classification

YOLO_v2 uses the ImageNet dataset for classification training. The labels in this dataset come from WordNet, a language database structured as a directed acyclic graph (DAG), in which labels are organised by how concepts relate to each other. Most classification approaches use a flat label structure, which works perfectly for a single dataset composed of mutually exclusive classes: a single softmax operation is sufficient. This is shown in the figure below.

Fig1: Most ImageNet models use a single, large softmax for different class predictions [1]

To merge different datasets, richer structures like DAGs are required, as simple flat ones fail to combine them effectively. For the YOLO_v2 model, instead of using the entire DAG from WordNet, a simpler WordTree structure is created from the concepts in ImageNet. The tree is built as follows (a code sketch follows the list):

  1. Choose a visual noun from the ImageNet labels and locate it in WordNet
  2. Trace the path from the noun to the root of the tree in WordNet; there is usually one root, called "physical object"
  3. Most nouns have a single path to the root. If there is more than one, the shortest is chosen; the idea is to grow the tree as little as possible
  4. All the single-path nouns, especially the ones that belong to the same synonym sets (synsets), are added to the WordTree being constructed
  5. The remaining nouns are then iteratively searched in the WordNet DAG, and their paths are added to the tree
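As a rough illustration of steps 1 to 3, the sketch below uses NLTK's WordNet interface to build a parent-to-children map from shortest hypernym paths. The labels and the tree representation are illustrative; the paper works from ImageNet's own WordNet IDs and roots the tree at "physical object" rather than WordNet's top-level root.

```python
from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet')

def add_label(tree, label):
    """Add one label to the WordTree via its shortest WordNet hypernym path."""
    synset = wn.synsets(label, pos=wn.NOUN)[0]
    # hypernym_paths() returns every root-to-synset path in the WordNet DAG;
    # taking the shortest one grows the tree as little as possible (step 3).
    path = min(synset.hypernym_paths(), key=len)
    parent = None                            # None stands for the tree root
    for node in path:                        # walk root -> leaf
        tree.setdefault(parent, set()).add(node.name())
        parent = node.name()

wordtree = {}                                # parent name -> set of children
for label in ["norfolk_terrier", "bedlington_terrier", "golden_retriever"]:
    add_label(wordtree, label)
```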

The final result of the steps above is a WordTree, a hierarchical model of the labels. For classification, conditional probabilities are computed at every node (label) of the tree: the probability of each hyponym (a more specific, closely related synset) given that synset. The figure below demonstrates this.

Fig2: WordTree example of dog breeds

At the Terrier node in the above tree, the prediction for each child label is a conditional probability:

Pr(Norfolk | Terrier) and Pr(Bedlington | Terrier)

To compute the absolute probability for a particular node, the conditional probabilities at each node on the path from that node to the root are multiplied together. Hence,

Pr(Norfolk Terrier) = Pr(Norfolk Terrier | Terrier) * Pr(Terrier | Dog) * Pr(Dog | Mammal) * … * Pr(Physical Object)

For classification, Pr(Physical Object) = 1: the assumption is that every image contains an object. The following figure depicts how multiple softmax operations are used over the entire structure for prediction.

Fig3: Multiple softmax operations on co-hyponyms [1]
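Here is a minimal sketch of this scheme on a toy tree with two sibling groups: one softmax is applied per group of co-hyponyms, and the absolute probability of a label is the product of conditionals along its path, with Pr(Physical Object) = 1 for classification. The groups and index layout are illustrative, not the real WordTree.

```python
import torch
import torch.nn.functional as F

# Toy layout: logits[0], logits[1] are Dog's children (Terrier, Retriever);
# logits[2], logits[3] are Terrier's children (Norfolk, Bedlington).
sibling_groups = {"dog": [0, 1], "terrier": [2, 3]}

def conditional_probs(logits, groups):
    """One softmax per group of co-hyponyms, as in Fig3."""
    probs = torch.empty_like(logits)
    for idx in groups.values():
        probs[idx] = F.softmax(logits[idx], dim=0)
    return probs

def absolute_prob(probs, path, p_physical_object=1.0):
    """Multiply conditionals along the root-to-leaf path."""
    p = p_physical_object                    # = 1 for classification
    for i in path:
        p = p * probs[i]
    return p

logits = torch.randn(4)                      # raw network outputs (toy)
probs = conditional_probs(logits, sibling_groups)
# Pr(Norfolk) = Pr(Terrier | Dog) * Pr(Norfolk | Terrier), with anything
# above Dog conditioned further up the tree in the real WordTree.
p_norfolk = absolute_prob(probs, [0, 2])
```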

For detection, the assumption Pr(Physical Object) = 1 is discarded. Instead, the objectness score from the YOLO detector is used as Pr(Physical Object). The detector predicts a bounding box and the tree of probabilities. The tree is traversed downwards from the root, choosing the highest-confidence path at every split, until the confidence falls below a specified threshold, at which point that object class is predicted (see the sketch after the figure). The figure below shows how the detection and classification datasets are merged using the tree structure.

Fig4: Merging classification dataset(ImageNet) and detection dataset(COCO) using WordTrees [1]
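A sketch of that greedy traversal, under an assumed representation where `children` maps each node to its child nodes and `cond_prob[n]` holds Pr(n | parent of n):

```python
def predict_class(cond_prob, children, objectness, root="physical object",
                  threshold=0.5):
    """Greedy descent of the WordTree at detection time; the objectness
    score plays the role of Pr(physical object)."""
    node, p = root, objectness
    while children.get(node):
        best = max(children[node], key=lambda c: cond_prob[c])
        if p * cond_prob[best] < threshold:
            break              # stop at the most specific confident label
        node, p = best, p * cond_prob[best]
    return node, p

# Toy usage with the dog-breed example: the traversal stops at "terrier"
# because descending to "Norfolk" would drop below the threshold.
children = {"physical object": ["dog"], "dog": ["terrier"],
            "terrier": ["Norfolk", "Bedlington"]}
cond_prob = {"dog": 0.9, "terrier": 0.8, "Norfolk": 0.7, "Bedlington": 0.3}
print(predict_class(cond_prob, children, objectness=0.95))
```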

YOLO9000

It uses the base of YOLO_v2 but with 3 anchor boxes instead of 5. It is trained over the combined WordTree obtained after merging the classification and detection datasets: the COCO detection dataset and 9000 classes from ImageNet. Since the number of labels in ImageNet is much larger than in the COCO dataset, the COCO dataset is oversampled to keep the two in balance.

For classification-labelled images, only the classification loss is backpropagated: the bounding box that predicts the highest probability for the ground-truth class is found, and the loss is computed on that box's predicted tree (a sketch follows).
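A sketch of that selection step, assuming `box_class_probs` is a (num_boxes, num_classes) tensor holding each box's WordTree class probabilities; the helper name and the log-loss form are mine, not the paper's:

```python
import torch

def classification_loss_for_image(box_class_probs, class_index):
    """Pick the box most confident in the ground-truth class and compute a
    log-loss on that box's predicted tree only."""
    best_box = box_class_probs[:, class_index].argmax()
    return -torch.log(box_class_probs[best_box, class_index])
```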

This concludes this series.

If you like this post or found it useful please leave a clap!

If you see any errors or issues in this post, please contact me at divakar239@icloud.com and I will rectify them.

References

[1] Joseph Redmon and Ali Farhadi, "YOLO9000: Better, Faster, Stronger." https://arxiv.org/pdf/1612.08242.pdf
