Review On YOLOv1

Arun Mohan
Jun 1 · 7 min read

YOLO stands for You Only Look Once. As the name suggests, the network looks at the image only once to detect multiple objects. The main improvement in this paper is detection speed: 45 fps with YOLO and 155 fps with Fast YOLO. It is a state-of-the-art deep learning approach to object detection, published at CVPR 2016 by Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi.

This article is divided into the following sections:

  1. Unified Detection
  2. Network Design
  3. Training, Loss Function And Inference
  4. Limitations Of YOLO
  5. Results

1. Unified Detection

The traditional approach in many earlier object detection algorithms is to slide a window over the image and classify each window, which is computationally expensive. YOLO overcomes this by dividing the image into grid cells and predicting all boxes in a single pass.

  • YOLO divides the image into an S×S grid (here S = 7). If the center of an object falls into a grid cell, that grid cell is responsible for detecting the object.
  • Each grid cell predicts B bounding boxes (here B = 2) and a confidence score (objectness score) for each box. The confidence score reflects how confident the model is that the box contains an object and how accurate it thinks the box is.

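confidence score = Pr(Object) × IOU^{truth}_{pred}

Here IOU^{truth}_{pred} is the IOU (intersection over union) between the predicted box and the ground truth box. If no object exists in the cell, the confidence score should be zero.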
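As a rough illustration of the confidence target, here is a minimal Python sketch. The corner format `(x_min, y_min, x_max, y_max)` and the helper names `iou` and `confidence_target` are assumptions made for this example, not something taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the overlapping rectangle (if any)
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def confidence_target(object_present, pred_box, truth_box):
    """Target confidence: Pr(Object) * IOU, with Pr(Object) taken as 0 or 1."""
    return (1.0 if object_present else 0.0) * iou(pred_box, truth_box)
```

For two 2×2 boxes offset by one unit in each direction, the overlap is 1 and the union is 7, so the target confidence is 1/7 when an object is present and 0 otherwise.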

  • Each bounding box prediction consists of five values: x, y, w, h and a confidence score. The (x, y) coordinates give the center of the box relative to the bounds of the grid cell. The width and height (w, h) are predicted relative to the whole image.
  • Each grid cell also predicts C conditional class probabilities, Pr(Class_i | Object), one set per cell regardless of the number of boxes B. Here the total number of classes is C = 20 (PASCAL VOC).
Figure: output tensor (src: https://medium.com/@amrokamal_47691/yolo-yolov2-and-yolov3-all-you-want-to-know-7e3e92dc4899)

Thus in total the output is a 7×7×(2×5+20) = 7×7×30 tensor, i.e. 1470 predicted values.
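The box parameterization and the size of the output tensor can be sketched in a few lines of Python. The function name `encode_box` and the pixel-based input format are assumptions chosen for this illustration; the paper only specifies that (x, y) are relative to the grid cell and (w, h) relative to the image.

```python
S, B, C = 7, 2, 20  # grid size, boxes per cell, number of classes


def encode_box(cx, cy, w, h, img_w, img_h, S=7):
    """Map a box (center and size in pixels) to YOLO-style targets.

    Returns (row, col, x, y, w_rel, h_rel), where (row, col) is the responsible
    grid cell, (x, y) is the center offset inside that cell in [0, 1), and
    (w_rel, h_rel) is the size relative to the whole image.
    """
    gx = cx / img_w * S          # center in grid units
    gy = cy / img_h * S
    col = min(int(gx), S - 1)    # clamp so a center on the far edge stays in range
    row = min(int(gy), S - 1)
    return row, col, gx - col, gy - row, w / img_w, h / img_h


output_shape = (S, S, B * 5 + C)   # (7, 7, 30)
num_values = S * S * (B * 5 + C)   # 1470
```

For example, a box centered at (224, 224) in a 448×448 image falls in cell (3, 3) with an offset of (0.5, 0.5).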

  • At training time we want only one bounding box predictor to be responsible for each object. The predictor whose prediction has the highest current IOU with the ground truth is assigned as “responsible” for that object.
  • At test time, the class-specific confidence score of each box is obtained by multiplying the conditional class probabilities by the box confidence score:

Pr(Class_i | Object) × Pr(Object) × IOU^{truth}_{pred} = Pr(Class_i) × IOU^{truth}_{pred}
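The test-time multiplication above can be sketched for a single grid cell as follows. The function name `class_specific_scores` and the toy numbers (3 classes, 2 boxes) are made up for illustration only.

```python
def class_specific_scores(class_probs, box_confidences):
    """Class-specific confidence scores for one grid cell.

    class_probs:     C conditional probabilities Pr(Class_i | Object) for the cell
    box_confidences: B box confidences Pr(Object) * IOU for the same cell
    Returns a list of B lists, each holding C scores.
    """
    return [[p * conf for p in class_probs] for conf in box_confidences]


# Toy cell with 3 classes and 2 boxes (hypothetical values)
scores = class_specific_scores([0.7, 0.2, 0.1], [0.9, 0.3])
```

With S = 7, B = 2 and C = 20 this yields 98 boxes per image, each with 20 class scores.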

2. Network Design


The YOLO network architecture is inspired by the GoogLeNet architecture. The network has 24 convolutional layers followed by 2 fully connected layers. Instead of the inception modules used by GoogLeNet, the authors simply use 1×1 reduction layers followed by 3×3 convolutional layers.

Fast YOLO uses fewer convolutional layers (9 instead of 24) and fewer filters in those layers. Other than the size of the network, all training and testing parameters are the same between YOLO and Fast YOLO.

3. Training, Loss Function And Inference

3.1 Training

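The first 20 convolutional layers, followed by an average-pooling layer and a fully connected layer, are first pretrained on the ImageNet 1000-class classification task at 224×224 input resolution. For detection training, the authors then removed the 1000-way fully connected classification layer and added four convolutional layers and two fully connected layers with randomly initialized weights. They also increased the input resolution from 224×224 to 448×448, since detection benefits from fine-grained visual information, which helps in detecting smaller objects.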

The final layer predicts both class probabilities and bounding box coordinates. The (x, y) coordinates of each bounding box are parameterized as offsets within the grid cell, so their values lie between 0 and 1. The width and height are normalized by the width and height of the image, so they also lie between 0 and 1.

The final layer uses a linear activation and all other layers use leaky ReLU.

Figure: leaky ReLU
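The leaky ReLU used in the paper passes positive inputs through unchanged and scales negative inputs by 0.1. A minimal scalar sketch (the function name and the `slope` parameter are illustrative choices):

```python
def leaky_relu(x, slope=0.1):
    """Leaky ReLU: x for x > 0, otherwise slope * x (the YOLO paper uses 0.1)."""
    return x if x > 0 else slope * x
```

Unlike a plain ReLU, negative inputs still produce a small non-zero output and gradient.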

3.2 Loss function

Figure: loss function
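Written out, the loss from the YOLO paper is a sum of five squared-error terms over the S×S grid cells (index i) and the B box predictors per cell (index j). In each pair such as x_i and \hat{x}_i, one value is the prediction and the other is its ground-truth target; C_i is the confidence and p_i(c) the class probability for class c.

```latex
\begin{aligned}
\mathcal{L} ={} & \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
      \mathbb{1}_{ij}^{\text{obj}}
      \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
  & + \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
      \mathbb{1}_{ij}^{\text{obj}}
      \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2
           + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
  & + \sum_{i=0}^{S^2} \sum_{j=0}^{B}
      \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
  & + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
      \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
  & + \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}}
      \sum_{c \,\in\, \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```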

Note: 1^{obj}_{ij} is 1 when the j-th bounding box predictor in cell i is responsible for an object and 0 otherwise. Similarly, 1^{noobj}_{ij} is 1 when there is no object and 0 when there is one.

λ_coord = 5 and λ_noobj = 0.5.

There are 5 terms in the loss function:

  • 1st term: The x, y coordinates are parameterized as offsets within a grid cell, with values between 0 and 1, so the sum of squared errors is used directly.
  • 2nd term: The width and height are relative to the whole image. Plain squared error would weight a deviation in a large box the same as in a small box, although a small deviation matters much more for a small box. This is partially solved by taking the square roots of the width and height and then computing the squared error.
  • 3rd and 4th terms: These are the confidence errors. The target confidence is the IOU between the ground truth and the predicted bounding box when an object is present (3rd term) and zero otherwise (4th term). In many grid cells there is no object, which pushes their confidence scores toward zero. These gradients can overpower those from cells that do contain objects during backpropagation and make training unstable. This is partially solved by weighting the 4th term with λ_noobj = 0.5.
  • 5th term: The squared error of the class probabilities, applied only to grid cells that contain an object.
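The five terms can be sketched for a single grid cell in plain Python. This is a simplified illustration, not the authors' implementation: the function name `cell_loss` and the `target` dictionary layout are hypothetical, the index of the responsible predictor and its IOU are assumed to be computed beforehand, and predicted widths and heights are assumed to be non-negative so the square root is defined.

```python
import math

LAMBDA_COORD = 5.0
LAMBDA_NOOBJ = 0.5


def cell_loss(pred_boxes, pred_classes, target=None):
    """Loss contribution of one grid cell.

    pred_boxes:   list of B predictions, each (x, y, w, h, confidence)
    pred_classes: list of C predicted class probabilities
    target:       None if no object is centered in the cell, otherwise a dict with
                  'box' (x, y, w, h), 'class_id', 'responsible' (index j of the
                  responsible predictor) and 'iou' (its IOU with the ground truth)
    """
    loss = 0.0
    for j, (x, y, w, h, conf) in enumerate(pred_boxes):
        if target is not None and j == target["responsible"]:
            tx, ty, tw, th = target["box"]
            # 1st term: squared error on the box center
            loss += LAMBDA_COORD * ((x - tx) ** 2 + (y - ty) ** 2)
            # 2nd term: squared error on the square roots of width and height
            loss += LAMBDA_COORD * ((math.sqrt(w) - math.sqrt(tw)) ** 2
                                    + (math.sqrt(h) - math.sqrt(th)) ** 2)
            # 3rd term: confidence should match the IOU with the ground truth
            loss += (conf - target["iou"]) ** 2
        else:
            # 4th term: predictors without an object are pulled toward zero confidence
            loss += LAMBDA_NOOBJ * conf ** 2
    if target is not None:
        # 5th term: class probabilities, only for cells that contain an object
        for c, p in enumerate(pred_classes):
            loss += (p - (1.0 if c == target["class_id"] else 0.0)) ** 2
    return loss
```

The total loss for an image would be the sum of `cell_loss` over all S×S cells. Note that in a cell containing an object, the non-responsible predictor still contributes through the 4th term.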

3.3 Inference

Figure: detection system

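A single object can be predicted by several boxes, so non-max suppression (NMS) is used to remove duplicate detections.

How does Non-Max Suppression work?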

  1. Start with the list A of all proposals (predicted boxes) and sort it by confidence score in decreasing order. The final proposal list B is initially empty.
  2. Select the proposal with the highest confidence score, remove it from A and add it to B.
  3. Compute the IOU (Intersection over Union) of this proposal with every proposal remaining in A. If the IOU is greater than the threshold N, remove that proposal from A.
  4. From the remaining proposals in A, again take the one with the highest confidence score, remove it from A, add it to B, and compare it with the rest as in step 3.
  • Steps 2 to 4 are repeated until there are no more proposals left in A.
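The steps above can be sketched as a short greedy loop in Python. The corner box format, the function names and the default threshold of 0.5 (the threshold N in the steps above) are assumptions for this example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(proposals, iou_threshold=0.5):
    """Greedy NMS over a list of (box, score) pairs; returns the kept pairs."""
    # List A: all proposals, highest confidence first
    remaining = sorted(proposals, key=lambda p: p[1], reverse=True)
    kept = []  # list B
    while remaining:
        best = remaining.pop(0)   # highest-scoring proposal left in A
        kept.append(best)
        # Drop every remaining proposal that overlaps the selected one too much
        remaining = [p for p in remaining if iou(best[0], p[0]) <= iou_threshold]
    return kept


proposals = [((0, 0, 10, 10), 0.9), ((1, 1, 11, 11), 0.8), ((20, 20, 30, 30), 0.7)]
kept = non_max_suppression(proposals)
```

Because filtering preserves the order of the list, it stays sorted and does not need to be re-sorted on each pass. In this example the second box overlaps the first with an IOU of about 0.68 and is suppressed, while the third box is kept. Detectors commonly run this procedure separately for each class.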

The network is trained for about 135 epochs on the training and validation data sets from PASCAL VOC 2007 and 2012. When testing on VOC 2012, the VOC 2007 test data is also included for training.

4. Limitations of YOLO

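  • The model struggles to detect nearby objects, especially small or overlapping ones, since each grid cell predicts only two boxes and one class.
  • The loss treats errors the same in small and large bounding boxes, but a small error in a small box has a much greater effect on IOU.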
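The second limitation is easy to check numerically. The sketch below shifts a predicted box by 2 pixels relative to its ground truth and compares the resulting IOU for a small and a large box. The helper names and box sizes are chosen for illustration only.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def iou_after_shift(size, shift):
    """IOU between a size x size box and the same box shifted right by `shift` pixels."""
    truth = (0, 0, size, size)
    pred = (shift, 0, size + shift, size)
    return iou(pred, truth)


small = iou_after_shift(10, 2)    # 10x10 box, 2 px error
large = iou_after_shift(100, 2)   # 100x100 box, same 2 px error
```

The same 2-pixel error drops the IOU of the 10×10 box to about 0.67, while the 100×100 box stays at about 0.96.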

5. Results

5.1 VOC 2007

  • YOLO performs very well on real-time detection. It outperforms Fast R-CNN when we consider the trade-off between speed and mAP.
  • Fast YOLO also performs well, with 52.7% mAP at 155 fps.
  • YOLO VGG-16 performs better than YOLO in terms of mAP but is slower, due to the absence of 1×1 convolutions to reduce model size.

5.2 Comparing Fast RCNN and YOLO


YOLO makes fewer background mistakes than Fast R-CNN, but it struggles to localize objects accurately compared to Fast R-CNN.

Figure: error rates

5.3 VOC 2012

Figure: YOLO + Fast R-CNN

On the VOC 2012 test set, YOLO scores 57.9% mAP, which is lower than the state of the art at the time. The combined Fast R-CNN + YOLO model, however, reaches 70.7% mAP, making it one of the highest-performing detection methods.

5.4 Model Precision vs Recall

Figure: Picasso Dataset precision-recall curves

We can see that YOLO outperforms the other detection methods when the precision-recall curves are plotted. This comparison is based on detections on the Picasso Dataset.

5.5 Some Visualizations

