Review of YOLOv2

Arun Mohan
Jun 2

In this article I will give a short review of the YOLO9000 paper by Joseph Redmon and Ali Farhadi, which introduced YOLOv2 and YOLO9000, two real-time detection systems. YOLO9000 can detect over 9000 object categories. At 67 FPS, YOLOv2 gets 76.8 mAP on VOC 2007; at 40 FPS, it gets 78.6 mAP, outperforming state-of-the-art methods like Faster R-CNN with ResNet and SSD while still running significantly faster. I recommend reading my review of YOLOv1 before reading this.

I will divide the article into 3 sections:

  1. Improvements over YOLOv1
  2. Darknet-19
  3. YOLO9000

1. Improvements over YOLOv1

  • Batch normalization is used on all the convolutional layers in YOLOv2, which improved mAP by 2%.

Original YOLO was trained as follows:

The classifier was trained on 224×224 images. Then the resolution was increased to 448 for detection.

For YOLOv2, on the other hand:

They initially trained the model on 224×224 images, then fine-tuned the classification network at the full 448×448 resolution for 10 epochs on ImageNet before training for detection. This gave the network time to adjust its filters to work better on higher-resolution input, and it improved mAP by 4%.

[Image: a grid of cells over a photo, with a yellow cell containing the center points of both a car and a girl]

YOLOv1 assigns each object to the grid cell that contains the middle of the object. In the image above, the yellow grid cell contains the center points of both the car and the girl. But since a grid cell can detect only one object, a problem arises. To solve this, the authors (in YOLOv2) allow each grid cell to detect more than one object using k bounding boxes.


To predict k bounding boxes, YOLOv2 uses the idea of anchor boxes.

  • YOLOv1 predicts the coordinates of bounding boxes directly, using fully connected layers on top of the convolutional feature extractor. In YOLOv2, all fully connected layers are removed and anchor boxes are used to predict bounding boxes.
  • One pooling layer is removed to increase the resolution of the output.
  • 416×416 images are used to train the detection network, yielding a 13×13 feature map; i.e., the input is downsampled by a factor of 32.
  • We then predict coordinates and a confidence score (objectness prediction) for each anchor box. Following YOLO, the objectness prediction still predicts the IOU of the ground truth and the proposed box, and the class predictions predict the conditional probability of each class given that there is an object.
  • Using anchor boxes causes a small decrease in accuracy: without anchor boxes the intermediate model gets 69.5 mAP with a recall of 81%; with anchor boxes it gets 69.2 mAP with a recall of 88%.

They ran k-means clustering on the training-set bounding boxes for various values of k and plotted the average IOU with the closest centroid. The important point is that instead of Euclidean distance they used the IOU between each bounding box and the centroid. Standard Euclidean-distance k-means is not good enough here, because larger boxes generate more error than smaller boxes.

They got the best results at k = 5. They used the following formula for the distance between a bounding box and a centroid:

d(box,centroid) = 1−IOU(box,centroid)
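
Below is a minimal sketch of this clustering, assuming (as in the paper) that each box is reduced to a (width, height) pair so that position plays no role; the function names and the simple averaging used to update the centroids are my own choices, not the paper's exact procedure.

```python
import numpy as np

def iou_wh(box, centroids):
    # IOU between one (w, h) box and an array of (w, h) centroids,
    # assuming all boxes share the same center point
    inter_w = np.minimum(box[0], centroids[:, 0])
    inter_h = np.minimum(box[1], centroids[:, 1])
    intersection = inter_w * inter_h
    union = box[0] * box[1] + centroids[:, 0] * centroids[:, 1] - intersection
    return intersection / union

def kmeans_anchors(boxes, k=5, iters=100, seed=0):
    # k-means over (w, h) boxes with d(box, centroid) = 1 - IOU(box, centroid)
    boxes = np.asarray(boxes, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        # assign each box to its closest centroid under the 1 - IOU distance
        assign = np.array([np.argmin(1.0 - iou_wh(b, centroids)) for b in boxes])
        # move each centroid to the mean (w, h) of the boxes assigned to it
        for c in range(k):
            if np.any(assign == c):
                centroids[c] = boxes[assign == c].mean(axis=0)
    return centroids
```

The k = 5 centroids returned by such a procedure become the anchor box priors (pw, ph) used below.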


In the paper's corresponding figure, the left plot shows the average IOU as k varies, and the right plot shows the relative dimensions of the chosen cluster centroids for VOC and COCO.

  • YOLOv1 has no constraints on its location predictions, which makes the model unstable during early iterations.
  • YOLOv2 bounds the location using a logistic activation σ, which forces the value to fall between 0 and 1.
  • The network predicts 5 bounding boxes for each cell, and 5 values for each bounding box: tx, ty, tw, th, and to. If the cell is offset from the top-left corner of the image by (cx, cy) and the anchor box prior has width and height pw, ph, then the predictions correspond to:
bx = σ(tx) + cx
by = σ(ty) + cy
bw = pw · e^tw
bh = ph · e^th
Pr(object) · IOU(b, object) = σ(to)

For example, if we use 2 anchor boxes on a particular grid cell, it will output two boxes (say a blue one and a red one). Take the case of the blue box: we assign this box not only to the grid cell, but also to the anchor box that has the maximum IOU with it.
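
As a concrete illustration, here is a minimal sketch of decoding one anchor's raw predictions into a box using the equations above (the names are mine, not the paper's; coordinates are in units of grid cells):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_box(tx, ty, tw, th, to, cx, cy, pw, ph):
    # (cx, cy): offset of the cell from the image's top-left corner, in cells
    # (pw, ph): width and height of the anchor box prior
    bx = sigmoid(tx) + cx      # box center x, constrained to stay inside the cell
    by = sigmoid(ty) + cy      # box center y
    bw = pw * np.exp(tw)       # box width, scaled from the anchor prior
    bh = ph * np.exp(th)       # box height
    objectness = sigmoid(to)   # predicted IOU with the ground-truth box
    return bx, by, bw, bh, objectness
```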

  • A 13×13 feature map is sufficient for detecting larger objects.
  • But in order to detect smaller objects, the 26×26×512 feature map from an earlier layer is reorganized into a 13×13×2048 feature map and concatenated with the original 13×13 feature maps for detection (see the sketch after this list).
  • This improves mAP by 1%.
  • YOLOv1 uses an input resolution of 448×448 for detection training.
  • However, since YOLOv2 uses only convolutional and pooling layers, it can be resized on the fly.
  • Every 10 batches, new image dimensions are randomly chosen from {320, 352, …, 608} (a sketch of this schedule also follows).
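
The passthrough layer mentioned above is essentially a space-to-depth reorganization: each 2×2 spatial block of the 26×26×512 map becomes 4 channels, giving 13×13×2048. A minimal numpy sketch, assuming a channels-last layout (Darknet's actual reorg may order the channels differently, but the shapes match):

```python
import numpy as np

def reorg(x, stride=2):
    # space-to-depth: (H, W, C) -> (H/stride, W/stride, C * stride * stride)
    h, w, c = x.shape
    x = x.reshape(h // stride, stride, w // stride, stride, c)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(h // stride, w // stride, c * stride * stride)

fine = np.zeros((26, 26, 512))
print(reorg(fine).shape)  # (13, 13, 2048), ready to concatenate with the 13x13 map
```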
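And the multi-scale training schedule can be sketched as follows (the function name is mine):

```python
import random

# candidate input sizes {320, 352, ..., 608}: all multiples of 32,
# since the network downsamples by a factor of 32
SIZES = list(range(320, 609, 32))

def pick_input_size(batch_idx, current_size):
    # every 10 batches, switch to a new randomly chosen input resolution
    return random.choice(SIZES) if batch_idx % 10 == 0 else current_size
```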

2. Darknet-19

To balance accuracy against model complexity, the authors propose a new classification model, Darknet-19, as the backbone for YOLOv2.

Darknet-19 has 19 convolutional layers and 5 max-pooling layers. It achieves 91.2% top-5 accuracy on ImageNet, which is better than VGG-16 (90%) and the original YOLO network (88%).

[Image: the Darknet-19 architecture]

YOLOv2's output shape is 13×13×(k×(1+4+20)), where k is the number of anchor boxes and 20 is the number of classes (for VOC). With k = 5, the output shape is 13×13×125.
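
A quick sanity check of that arithmetic, along with one way (my own, not necessarily Darknet's memory layout) to view the flat output as a per-anchor prediction vector:

```python
import numpy as np

num_anchors, num_coords, num_classes = 5, 4, 20
channels = num_anchors * (1 + num_coords + num_classes)
print(channels)  # 125

# view a dummy 13x13x125 output as one 25-vector (1 + 4 + 20) per anchor per cell
output = np.zeros((13, 13, channels))
per_anchor = output.reshape(13, 13, num_anchors, 1 + num_coords + num_classes)
print(per_anchor.shape)  # (13, 13, 5, 25)
```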

The model was trained for classification first and then for detection.

Classification training:

  • First they trained the Darknet-19 network on the ImageNet 1000-class dataset for 160 epochs using stochastic gradient descent with weight decay.
  • During training, the authors used standard data augmentation tricks, including random crops, rotations, and hue, saturation, and exposure shifts.
  • After the initial training on 224×224 images, they fine-tuned the Darknet network on larger 448×448 images for 10 epochs.

Detection training:

  • After training for classification, the last layer of Darknet-19 is removed and replaced with a 3×3 convolution with 1024 filters, followed by a 1×1 convolution producing the number of outputs needed for detection (125 per cell, i.e. 13×13×125). A passthrough layer was also added so that the model can use fine-grained features from earlier layers (a sketch of this head follows this list).
  • Then they trained the network for 160 epochs on detection datasets (VOC and COCO).
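
Here is a minimal PyTorch sketch of such a replacement head, under my own assumptions: a 1024-channel 13×13 backbone feature map, leaky ReLU activations as used in Darknet, and the passthrough concatenation omitted for brevity:

```python
import torch
import torch.nn as nn

# 3x3 conv with 1024 filters, then a 1x1 conv down to 5 * (1 + 4 + 20) = 125 outputs
detection_head = nn.Sequential(
    nn.Conv2d(1024, 1024, kernel_size=3, padding=1),
    nn.LeakyReLU(0.1),
    nn.Conv2d(1024, 125, kernel_size=1),
)

features = torch.zeros(1, 1024, 13, 13)  # dummy backbone output (N, C, H, W)
print(detection_head(features).shape)    # torch.Size([1, 125, 13, 13])
```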

YOLOv2 is faster than other object detection algorithms. It can also be run on images of different sizes, providing a smooth trade-off between speed and accuracy.


3. YOLO9000 using WordTree (Stronger)

  • The paper proposes a mechanism for joint training on classification and detection data; i.e., during training they mix images from both detection and classification datasets. When the network sees an image labelled for detection, the full YOLOv2 loss is backpropagated; when it sees a classification image, only the classification part of the loss is backpropagated.
  • Detection datasets normally have general classes like ‘dog’, ‘cat’, and ‘boat’, while classification datasets have fine-grained classes like ‘Norfolk terrier’ and ‘Yorkshire terrier’. To train on both kinds of datasets, we need a way to merge these label sets.
  • Normally we would apply a softmax directly over all the classes. But we can’t do that here: softmax assumes mutually exclusive classes, and ‘dog’ and ‘Norfolk terrier’ are not mutually exclusive.
  • The solution is to use hierarchical classification for labelling.
  • The ImageNet labels are pulled from WordNet. In WordNet, ‘Norfolk terrier’ and ‘Yorkshire terrier’ are both hyponyms of ‘terrier’, which is a type of ‘hunting dog’, which is a type of ‘dog’, which is a ‘canine’, and so on.
  • They build a WordTree using the ImageNet labels and WordNet.

They combined the following datasets for training:

i) Microsoft COCO: contains 100k images and 80 classes with detection labels; classes are general, like “dog” or “boat”.

ii) ImageNet: 13 million images and 22k classes with classification labels; classes are more specific, like “Norfolk terrier”, “Yorkshire terrier”, or “Bedlington terrier”.

  • They created a combined dataset using the COCO dataset and the top 9000 classes from ImageNet. The corresponding WordTree contains 9418 classes; the extra classes come from mapping the original classes to synsets in the tree.
  • During training, ground-truth labels are propagated up the tree; i.e., if an image is labelled ‘Norfolk terrier’ it is also labelled ‘dog’. Using this dataset they trained YOLO9000, with 3 anchor boxes instead of 5 (a small sketch of the propagation follows).
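
A minimal sketch of that upward label propagation, using a hypothetical child-to-parent map in place of the real WordTree:

```python
# hypothetical fragment of the WordTree: child -> parent
parent = {
    "Norfolk terrier": "terrier",
    "terrier": "hunting dog",
    "hunting dog": "dog",
    "dog": "canine",
}

def propagate_labels(label):
    # a leaf label implies every one of its ancestors in the tree
    labels = [label]
    while label in parent:
        label = parent[label]
        labels.append(label)
    return labels

print(propagate_labels("Norfolk terrier"))
# ['Norfolk terrier', 'terrier', 'hunting dog', 'dog', 'canine']
```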

To compute the conditional probabilities, the model predicts a vector of 9418 values and we compute a softmax over all synsets that are hyponyms of the same concept. The output conditional probabilities look like this:

Pr(Norfolk terrier | terrier)
Pr(Yorkshire terrier | terrier)
Pr(Bedlington terrier | terrier)
…

If we need the absolute probability of a node, we multiply the conditional probabilities along the path from the root of the tree down to that node:

Pr(Norfolk terrier) = Pr(Norfolk terrier | terrier) · Pr(terrier | hunting dog) · … · Pr(mammal | animal) · Pr(animal | physical object)

For classification purposes we assume that the image contains an object, i.e. Pr(physical object) = 1.
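
Putting the two formulas together, here is a minimal sketch with a tiny hypothetical tree fragment (‘dog’ stands in for the root ‘physical object’; all names are mine): each group of sibling synsets gets its own softmax, and multiplying conditionals along the path to the root gives the absolute probability.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# hypothetical sibling groups: each node maps to its children,
# and each group of siblings is softmaxed independently
groups = {
    "dog": ["hunting dog", "toy dog"],
    "hunting dog": ["terrier", "hound"],
    "terrier": ["Norfolk terrier", "Yorkshire terrier"],
}
parent = {child: p for p, children in groups.items() for child in children}

def conditional_probs(logits):
    # softmax within each sibling group gives Pr(child | parent)
    cond = {}
    for children in groups.values():
        probs = softmax(np.array([logits[c] for c in children]))
        cond.update(zip(children, probs))
    return cond

def absolute_prob(name, cond, root_prob=1.0):
    # multiply conditionals along the path up to the root ("dog" here)
    p = root_prob
    while name in parent:
        p *= cond[name]
        name = parent[name]
    return p

logits = {c: np.random.randn() for cs in groups.values() for c in cs}
cond = conditional_probs(logits)
print(absolute_prob("Norfolk terrier", cond))  # Pr(Norfolk terrier) with Pr(dog) = 1
```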

  • When the network sees a classification image, only the classification loss is backpropagated. To do this, we simply find the bounding box that predicts the highest probability for that class and compute the loss on just its predicted tree.
  • When it sees detection images, the entire YOLOv2 loss is backpropagated. We also assume that the predicted box overlaps what would be the ground-truth box by at least 0.3 IOU, and we backpropagate the objectness loss based on this assumption.
[Figure: a prediction experiment using 1369 classes (the 1000 ImageNet classes plus 369 intermediate WordTree nodes)]
  • Thus, using this joint training, YOLO9000 learns to find objects in images using the detection data in COCO, and learns to classify a wide variety of these objects using the data from ImageNet.
