YOLO stands for You Only Look Once. As the name says, the network looks at the image only once to detect multiple objects. The main improvement in this paper is detection speed (45 fps with YOLO and 155 fps with Fast YOLO). It is another state-of-the-art deep learning approach to object detection, published at CVPR 2016 by Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi.
This article is divided into the following sections:
- Unified Detection
- Network Design
- Training, Loss Function And Inference
- Limitations Of YOLO
1. Unified Detection
In the prior state-of-the-art approach, R-CNN, region proposals are generated first and objects are then detected using those proposals. This is a two-step process and hence slower. YOLO addresses this with a unified detection network: it uses features from the entire image to predict bounding boxes, and it predicts all bounding boxes and class probabilities simultaneously.
A traditional approach to object detection is the sliding window, but it is computationally expensive. YOLO avoids it by using grid cells.
- YOLO divides the image into an S×S grid (here S = 7). If the center of an object falls into a grid cell, that grid cell is responsible for detecting the object.
- Each grid cell predicts B bounding boxes (here B = 2) and a confidence score (objectness score) for each box. The confidence score reflects how confident the model is that the box contains an object.
$$\text{confidence score} = \Pr(\text{Object}) \times \text{IOU}^{\text{truth}}_{\text{pred}}$$
Here $\text{IOU}^{\text{truth}}_{\text{pred}}$ is the IOU of the predicted box with the ground-truth box.
- Each bounding box prediction consists of x, y, w, h and a confidence score. The (x, y) coordinates give the center of the box relative to the bounds of the grid cell. The (w, h) values give the width and height relative to the whole image.
- Each grid cell also predicts a conditional probability for each class, Pr(Class_i|Object). Here the total number of classes is C = 20.
Thus in total the network outputs S×S×(B×5+C) = 7×7×(2×5+20) = 1470 values per image.
- At training time we want only one bounding box predictor to be responsible for each object. The predictor whose prediction has the highest current IOU with the ground truth is assigned to be “responsible” for predicting that object.
- At test time, the class-specific confidence of each box is obtained by multiplying the conditional class probability by the box's confidence score.
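$$\Pr(\text{Class}_i \mid \text{Object}) \times \Pr(\text{Object}) \times \text{IOU}^{\text{truth}}_{\text{pred}} = \Pr(\text{Class}_i) \times \text{IOU}^{\text{truth}}_{\text{pred}}$$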
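The quantities in this section can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `iou` and `class_specific_scores` and the corner-based box format `(x_min, y_min, x_max, y_max)` are assumptions made for the example.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max).

    The corner format is an assumption for this sketch; YOLO itself predicts
    center coordinates plus width and height.
    """
    # Corners of the overlapping rectangle (if the boxes overlap at all).
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def class_specific_scores(class_probs, box_confidence):
    """Multiply Pr(Class_i | Object) by the box confidence Pr(Object) * IOU."""
    return [p * box_confidence for p in class_probs]


# Size of the output for the settings used in this article.
S, B, C = 7, 2, 20
values_per_cell = B * 5 + C          # 2 boxes * (x, y, w, h, confidence) + 20 classes
total_values = S * S * values_per_cell

overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))          # intersection 1, union 7
scores = class_specific_scores([0.5, 0.25], 0.8)   # one score per class
```

Here `overlap` is the IOU of two 2×2 boxes that share a 1×1 corner, and `scores` holds the class-specific confidences for a single box with two hypothetical class probabilities.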
2. Network Design
The YOLO network architecture is inspired by the GoogLeNet architecture. The network has 24 convolutional layers followed by 2 fully connected layers. Instead of the inception modules used by GoogLeNet, it simply uses 1×1 reduction layers followed by 3×3 convolutional layers.
Fast YOLO uses fewer convolutional layers (9 instead of 24) and fewer filters in those layers. Other than the size of the network, all training and testing parameters are the same between YOLO and Fast YOLO.
3. Training, Loss Function And Inference
The convolutional layers are pretrained on the ImageNet 1000-class dataset. For pretraining, the authors used the first 20 convolutional layers of the network described earlier, followed by an average pooling layer and a fully connected layer (1x1000).
For detection training, they then removed the 1x1000 fully connected layer and added four convolutional layers and two fully connected layers with randomly initialized weights. They also increased the input resolution from 224x224 to 448x448, since this helps in detecting smaller objects.
The final layer predicts both class probabilities and bounding boxes. The (x, y) coordinates of a bounding box are parameterized as offsets within the grid cell, so their values lie between 0 and 1. The width and height are normalized by the width and height of the image, so they also lie between 0 and 1.
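As a rough illustration of this parameterization, the sketch below converts a ground-truth box given in pixels into the grid cell that owns it and the normalized (x, y, w, h) target. The function name `encode_box` and its argument layout are hypothetical, chosen only for this example.

```python
def encode_box(cx, cy, w, h, img_w, img_h, S=7):
    """Convert a pixel-space box (center cx, cy and size w, h) into a
    YOLO-style target: (row, col, x, y, w_norm, h_norm).

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    # Position of the box center measured in grid-cell units.
    gx = cx / img_w * S
    gy = cy / img_h * S

    # The cell containing the center is responsible for the object.
    # Clamp so a center exactly on the right/bottom edge stays in the grid.
    col = min(int(gx), S - 1)
    row = min(int(gy), S - 1)

    # Offsets of the center inside that cell, between 0 and 1.
    x = gx - col
    y = gy - row

    # Width and height relative to the whole image, between 0 and 1.
    return row, col, x, y, w / img_w, h / img_h


# A 100x200 pixel box centered at (224, 112) in a 448x448 image.
target = encode_box(224, 112, 100, 200, 448, 448)
```

In this example the center lies halfway across cell column 3 and three quarters of the way down cell row 1, so the encoded offsets are x = 0.5 and y = 0.75.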
The final layer uses a linear activation, and all other layers use leaky ReLU.
3.2 Loss function
Note: 1obj_ij is 1 when the j-th bounding box predictor in grid cell i is responsible for an object and 0 otherwise. Similarly, 1noobj_ij is 1 when that predictor is not responsible for any object and 0 otherwise.
λcoord = 5 and λnoobj = 0.5.
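For reference, the multi-part loss from the YOLO paper that this note refers to can be written as follows. This is a transcription into LaTeX, so it is worth checking against the paper; each squared difference compares a predicted value with its ground-truth target, $C_i$ is the box confidence, and $p_i(c)$ is the class probability for class $c$ in cell $i$.

```latex
\begin{aligned}
\mathcal{L} ={} & \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
      \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
  & + \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
      \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
  & + \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
  & + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
  & + \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```

In the last line, $\mathbb{1}_{i}^{\text{obj}}$ is 1 when an object appears in cell $i$, so the class term is only applied to cells that contain an object.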
There are 5 terms in the loss function:
- 1st term: The x, y coordinates are parameterized as offsets within a grid cell, so their values lie between 0 and 1. Sum-squared error is used on them, weighted by λcoord.
- 2nd term: The width and height are relative to the whole image. Plain squared error would weight an error in a large box the same as an equal error in a small box, even though a small deviation matters much more for a small box. This is partially addressed by taking the square root of the width and height before computing the squared error.
- 3rd and 4th terms: These are the confidence terms. For a box that contains an object, the target confidence is the IOU between the predicted and ground-truth boxes. Many grid cells contain no object, and the loss pushes their confidence scores toward zero. The gradients from these cells can overpower those from cells that do contain objects and make training unstable. This is partially addressed by down-weighting the no-object term with λnoobj = 0.5.
- 5th term: The squared error on the class probabilities, applied only to cells that contain an object.
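The effect of the square root in the 2nd term can be checked with a small numeric example. The sketch below is only an illustration with made-up widths; `wh_error` is a hypothetical helper, not part of any YOLO implementation.

```python
import math


def wh_error(w_true, w_pred, use_sqrt):
    """Squared error on a width (or height), with or without the square root."""
    if use_sqrt:
        return (math.sqrt(w_true) - math.sqrt(w_pred)) ** 2
    return (w_true - w_pred) ** 2


# The same absolute error (0.05 of the image width) on a small and a large box.
small_plain = wh_error(0.10, 0.15, use_sqrt=False)
large_plain = wh_error(0.80, 0.85, use_sqrt=False)
small_sqrt = wh_error(0.10, 0.15, use_sqrt=True)
large_sqrt = wh_error(0.80, 0.85, use_sqrt=True)

print(f"plain squared error: small={small_plain:.6f} large={large_plain:.6f}")
print(f"with square root:    small={small_sqrt:.6f} large={large_sqrt:.6f}")
```

Without the square root both boxes receive essentially the same penalty. With it, the small box is penalized several times more than the large one, which is the behavior the 2nd term is after.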
YOLO is extremely fast at test time since it requires only a single network evaluation, unlike classifier-based methods. During detection there can be multiple detections of the same object. This is handled with Non-Max Suppression.
How does Non-Max Suppression work?
Suppose N is the IOU threshold and we have two lists, A and B, where A holds the input proposals and B the output. Then, for each class detected in the image:
- Sort the proposals in A by confidence score in decreasing order.
- Select the proposal with the highest confidence score, remove it from A and add it to the final proposal list B. (Initially B is empty.)
- Compare this proposal with every proposal remaining in A by calculating their IOU (Intersection over Union). If the IOU is greater than the threshold N, remove that proposal from A.
- Sort the remaining proposals in A by confidence score in decreasing order again. Take the proposal with the highest confidence from A, remove it from A and add it to B. Then compare it with the remaining proposals as in the third step and remove those that overlap too much.
- Repeat this process (the first three steps) until there are no more proposals left in A.
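The steps above can be sketched as a short greedy loop. This is a minimal, unoptimized illustration for a single class; the names `non_max_suppression` and `iou`, and the representation of a proposal as a `(box, score)` pair with corner-format boxes, are assumptions made for the example.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(proposals, iou_threshold):
    """Greedy NMS for one class.

    proposals: list of (box, score) pairs (list A in the text).
    Returns the kept proposals (list B), highest score first.
    """
    # Step 1: sort A by confidence, highest first.
    remaining = sorted(proposals, key=lambda p: p[1], reverse=True)
    kept = []
    while remaining:
        # Step 2: move the most confident proposal from A to B.
        best = remaining.pop(0)
        kept.append(best)
        # Step 3: drop every proposal whose IOU with it exceeds the threshold.
        # Filtering preserves the sort order, so no re-sorting is needed.
        remaining = [p for p in remaining if iou(best[0], p[0]) <= iou_threshold]
    return kept


detections = [
    ((0, 0, 10, 10), 0.9),    # strong detection
    ((1, 1, 11, 11), 0.8),    # near-duplicate of the first box
    ((20, 20, 30, 30), 0.7),  # separate object
]
kept = non_max_suppression(detections, iou_threshold=0.5)
```

With a threshold of 0.5, the second box overlaps the first too much and is suppressed, so `kept` contains the first and third detections.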
The network is trained for about 135 epochs on the training and validation data sets from PASCAL VOC 2007 and 2012. When testing on VOC 2012, the VOC 2007 test data is also included for training.
4. Limitations of YOLO
- The model struggles to detect small objects.
- The model struggles to detect objects that are close together, including ones that overlap.
- A small error in a small bounding box affects IOU much more than the same error in a large box.
5.1 VOC 2007
- YOLO performs very well for real-time detection. It outperforms Fast R-CNN when we consider the trade-off between speed and mAP.
- Fast YOLO also performs well, with 52.7% mAP at 155 fps.
- YOLO VGG-16 performs better than YOLO in terms of mAP but is slower, since it lacks the 1x1 convolutions that reduce model size.
5.2 Comparing Fast R-CNN and YOLO
YOLO makes fewer background mistakes than Fast R-CNN, but it struggles to localize objects as accurately as Fast R-CNN does.
5.3 VOC 2012
On the VOC 2012 test set, YOLO scores 57.9% mAP, which is lower than the state of the art at the time. However, the Fast R-CNN + YOLO combination reaches 70.7% mAP, making it one of the highest-performing detection methods.
5.4 Model Precision vs Recall
On the Picasso Dataset, the precision-recall curve shows YOLO outperforming the other detection methods it is compared with.