YOLO Object Detection

James Han
Published in Analytics Vidhya
5 min read · Sep 19, 2021

Introduction

YOLO (You Only Look Once) is a widely used object detection system, best suited for real-time applications because of its speed. Like the Single Shot MultiBox Detector (SSD), it predicts bounding boxes and class probabilities in a single pass of a convolutional network. This sets it apart from two-stage systems such as Faster R-CNN, which rely on a separate region proposal network.

The main advantage of the YOLO series is speed: these systems run much faster, though sometimes at the expense of a lower mean average precision (mAP) compared to the R-CNN series.

This article summarizes each version of the YOLO series and shows how it has evolved over the years to remain one of the state-of-the-art object detection systems as of September 2021.

YOLOv1

The object detection problem is framed as a single regression problem, and a single convolutional network is used to simultaneously predict multiple bounding boxes and class probabilities. YOLO sees the entire image during training and test time, so it uses features extracted from the objects in the bounding boxes as well as the background.

The image is divided into an S × S grid, and each grid cell is responsible for detecting objects whose centers fall inside it. Each cell predicts B bounding boxes along with a confidence score for how likely each box is to contain an object. Each bounding box produces 5 outputs: the (x, y) coordinates of the box center, the width and height of the box, and the confidence score. Each grid cell also predicts C conditional class probabilities: the probability of each class given that an object is present in the cell. Overall, the YOLO model outputs an S × S × (5B + C) tensor.
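The output tensor's shape follows directly from these quantities. As a small sketch, using the original paper's PASCAL VOC settings (S = 7, B = 2, C = 20):

```python
# Illustration of the YOLOv1 output tensor shape using the paper's
# PASCAL VOC configuration: a 7x7 grid, 2 boxes per cell, 20 classes.
S, B, C = 7, 2, 20
depth_per_cell = 5 * B + C           # (x, y, w, h, confidence) per box, plus C class probs
output_shape = (S, S, depth_per_cell)
print(output_shape)                   # (7, 7, 30)
```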

The network architecture consists of 24 convolutional layers followed by 2 fully connected layers, with the convolutional layers alternating between 1 × 1 and 3 × 3 filters. A smaller variant called Fast YOLO uses 9 convolutional layers instead of 24 but is otherwise identical to YOLO.

YOLOv2 (YOLO9000)

YOLOv2 is also known as YOLO9000 because it can detect over 9,000 object classes. The biggest hurdle to achieving this is obtaining a large enough labeled dataset, which would be very expensive for 9,000 classes. Therefore, a hierarchical classification structure is used, which enables training on multiple distinct datasets (so no single dataset needs to contain all 9,000 classes), including object classification datasets, which are much cheaper to label than object detection datasets.

The focus of YOLOv2 is to improve the recall and localization of YOLO while maintaining classification accuracy. The following approaches are introduced in YOLOv2:

Batch Normalization. Adding batch normalization to all the convolutional layers regularizes the model and allows dropout to be removed without overfitting.
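The normalization itself is standard: activations are shifted to zero mean and unit variance, then rescaled by learnable parameters. A minimal NumPy sketch (illustrative, not YOLOv2's actual Darknet implementation):

```python
import numpy as np

# Batch normalization over a batch of activations: normalize each feature
# across the batch, then apply a learnable scale (gamma) and shift (beta).
def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * x_hat + beta

acts = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
normed = batch_norm(acts)
```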

High Resolution Classifier. The original YOLO pretrains its classification network on 224 × 224 images, then jumps to 448 × 448 for detection. YOLOv2 instead fine-tunes the classification network at 448 × 448 before detection training, so the network does not have to adapt to the higher resolution and learn detection at the same time.

Anchor Boxes. Instead of using fully connected layers to predict bounding box coordinates directly, YOLOv2 removes the fully connected layers and uses anchor boxes, predicting offsets from the anchors rather than raw coordinates, which simplifies the problem and makes it easier for the network to learn. Anchor boxes also let the network predict many more boxes, which greatly increases recall with only a small decrease in precision.

Dimension Clusters. Instead of picking anchor boxes manually, k-means clustering is run on the training set bounding boxes to automatically find good priors.
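The key twist over vanilla k-means is the distance metric: d(box, centroid) = 1 − IoU(box, centroid), computed on (width, height) pairs with boxes aligned at a shared point so only shape matters. A minimal sketch with made-up box data and k = 2:

```python
import numpy as np

# k-means over bounding-box (width, height) pairs using 1 - IoU as the
# distance, as in YOLOv2's dimension clustering. Box sizes are illustrative.
def iou_wh(box, clusters):
    # IoU between one (w, h) box and k cluster priors, centers aligned
    inter = np.minimum(box[0], clusters[:, 0]) * np.minimum(box[1], clusters[:, 1])
    union = box[0] * box[1] + clusters[:, 0] * clusters[:, 1] - inter
    return inter / union

def kmeans_boxes(boxes, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    clusters = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # assign each box to the prior with the highest IoU (lowest 1 - IoU)
        assign = np.array([np.argmax(iou_wh(b, clusters)) for b in boxes])
        for j in range(k):
            if np.any(assign == j):
                clusters[j] = boxes[assign == j].mean(axis=0)
    return clusters

boxes = np.array([[10., 12.], [11., 13.], [50., 60.], [55., 65.]])
priors = kmeans_boxes(boxes, k=2)   # one small prior, one large prior
```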

Direct Location Prediction. Using anchor boxes introduces a stability problem: because the predicted boxes are unconstrained, the predicted offset values would fluctuate between iterations and take a long time to stabilize. Therefore, constraints are added to the location predictions to make the network more stable and learn faster.
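Concretely, YOLOv2 constrains the predicted center with a sigmoid so it cannot leave its grid cell, and scales the anchor prior exponentially for width and height. A small sketch of that decoding (cx, cy are the cell offsets, pw, ph the anchor prior's size):

```python
import math

# YOLOv2-style constrained box decoding: raw network outputs (tx, ty, tw, th)
# are squashed so the box center stays inside its grid cell, which stabilizes
# training compared to unconstrained offsets.
def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    bx = cx + sigmoid(tx)        # center x, constrained to [cx, cx + 1)
    by = cy + sigmoid(ty)        # center y, constrained to [cy, cy + 1)
    bw = pw * math.exp(tw)       # width as a scaled anchor prior
    bh = ph * math.exp(th)       # height as a scaled anchor prior
    return bx, by, bw, bh

# raw outputs of zero place the center in the middle of cell (3, 4)
bx, by, bw, bh = decode_box(0.0, 0.0, 0.0, 0.0, cx=3, cy=4, pw=2.0, ph=3.0)
```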

Fine-Grained Features. A passthrough layer is added to concatenate higher resolution features with lower resolution features by stacking adjacent features into different channels instead of spatial locations. This gives the network access to finer-grained features to capture smaller details.
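This stacking is a pure reshaping operation, often called a "reorg" layer: each 2 × 2 spatial block becomes 4 channels, turning a 26 × 26 × 512 map into 13 × 13 × 2048. A NumPy sketch:

```python
import numpy as np

# YOLOv2-style passthrough (reorg): stack each stride x stride block of
# spatial positions into the channel dimension, so high-resolution features
# can be concatenated with the coarser detection features.
def passthrough(x, stride=2):
    h, w, c = x.shape
    x = x.reshape(h // stride, stride, w // stride, stride, c)
    x = x.transpose(0, 2, 1, 3, 4)                    # group the 2x2 blocks
    return x.reshape(h // stride, w // stride, c * stride * stride)

features = np.zeros((26, 26, 512))
reorg = passthrough(features)    # shape becomes (13, 13, 2048)
```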

Multi-Scale Training. Instead of fixing the input image size, the network randomly resizes the input image dimensions every few batches. This technique allows the network to become more robust in predicting across different input image dimensions.
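In the paper, a new size is drawn every 10 batches from multiples of 32 between 320 and 608; since the network is fully convolutional, it accepts any of them. A sketch of that schedule:

```python
import random

# Multi-scale training schedule: every few batches, draw a new input size
# from multiples of 32 in [320, 608], per the YOLOv2 paper's range.
random.seed(0)
sizes = list(range(320, 608 + 1, 32))    # [320, 352, ..., 608]

def pick_input_size():
    return random.choice(sizes)

schedule = [pick_input_size() for _ in range(5)]   # one new size per 10 batches
```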

Darknet-19. This new network architecture, consisting of 19 convolutional layers and 5 max pooling layers, is introduced to improve prediction speed.

Hierarchical Classification. Some datasets have broad classes like “dog” and “human”, while others have much more specific classes like different breeds of dogs. A hierarchical tree of classes is used to describe all of the classes in the available labeled datasets so that multiple datasets can be used to train the same classification model. This is why YOLOv2 is able to detect over 9000 classes.
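With such a tree (the paper's WordTree), a specific class's score is the product of conditional probabilities along its path to the root, e.g. P(Norfolk terrier) = P(Norfolk terrier | terrier) · P(terrier | dog) · P(dog). A sketch with a made-up tree and made-up probabilities:

```python
# Hierarchical (WordTree-style) classification: the absolute probability of a
# leaf class is the product of conditional probabilities along its path to the
# root. The tree and probability values below are illustrative, not real model output.
tree = {"Norfolk terrier": "terrier", "terrier": "dog", "dog": None}
cond_prob = {"Norfolk terrier": 0.6, "terrier": 0.8, "dog": 0.9}

def absolute_prob(label):
    p = 1.0
    while label is not None:
        p *= cond_prob[label]    # P(node | parent)
        label = tree[label]
    return p

p = absolute_prob("Norfolk terrier")   # 0.6 * 0.8 * 0.9
```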

YOLOv3

YOLOv3 offers an incremental improvement over YOLOv2. The improvements include:

Multi-label Prediction. A box may have multiple overlapping labels, such as “dog” and “Corgi”. Instead of a softmax, which assumes mutually exclusive classes, multiple independent logistic classifiers are used.

Predictions Across Scales. YOLOv3 predicts boxes at 3 different scales, so it is better at finding small objects, which had been a major drawback of YOLOv2.

Darknet-53. This new backbone has 53 convolutional layers, compared to the 19 in YOLOv2’s Darknet-19. The change increases prediction accuracy at the expense of some speed.

YOLOv4

YOLOv4 was not developed by Joseph Redmon, the original author of the first three YOLO versions, who stopped his computer vision research over ethical concerns about military applications and data protection issues.

YOLOv4, developed by Bochkovskiy, Wang, and Liao, offers additional improvements over YOLOv3, including:

CSPDarknet53. This new backbone is based on Darknet-53, and it uses the CSPNet (Cross Stage Partial Network) strategy to partition the feature map of the base layer into two parts. One part goes through a dense block and a transition layer, and then it gets merged with the other part to form the next layer.

Spatial Pyramid Pooling. By applying a pyramid of large and small filter sizes, this new layer helps to increase the receptive field of the features.
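In the SPP block used here, the same feature map is max-pooled with several kernel sizes at stride 1 and "same" padding, and the results are concatenated along the channel axis. A pure-NumPy sketch (illustrative and slow; kernel sizes 5/9/13 follow the common YOLOv4 configuration):

```python
import numpy as np

# SPP-style block: max pooling with several kernel sizes over the same
# feature map, stride 1 with "same" padding, concatenated along channels.
def max_pool_same(x, k):
    pad = k // 2
    h, w, c = x.shape
    padded = np.pad(x, ((pad, pad), (pad, pad), (0, 0)), constant_values=-np.inf)
    out = np.empty_like(x)
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + k, j:j + k].max(axis=(0, 1))
    return out

def spp(x, kernels=(5, 9, 13)):
    # concatenate the input with each pooled version along the channel axis
    return np.concatenate([x] + [max_pool_same(x, k) for k in kernels], axis=2)

features = np.random.rand(13, 13, 8)
out = spp(features)    # channels grow from 8 to 8 * 4 = 32
```

Each larger kernel lets a position "see" a wider neighborhood, which is how the block enlarges the receptive field without changing the spatial resolution.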

PANet. YOLOv4 uses the path aggregation technique from PANet (Path Aggregation Network), which enhances the feature hierarchy with accurate localization signals from lower layers through bottom-up path augmentation, shortening the information path between layers.
