Yolo V4 Object Detection
How Yolo V4 object detection delivers higher mAP and shorter inference time
Enhanced Features of Yolo v4
- Faster inference speed than comparable detectors, making it practical for production systems
- Optimized for parallel computation
- Efficient and powerful enough to be trained on a single conventional GPU while still delivering an accurate, fast object detector
Object detector models are composed of:
- A pre-trained Backbone
- A Head that predicts the classes and bounding boxes of objects
The backbone of the object detector is typically a neural network pre-trained on a large classification dataset such as ImageNet.
Examples: VGG16, ResNet-50, SpineNet, EfficientNet-B0/B7, CSPResNeXt50, CSPDarknet53, or, for CPU targets, ShuffleNet.
Object detector models insert additional layers between the backbone and head which are referred to as the Neck of the object detector. Neck layers collect feature maps from different stages and are composed of several bottom-up paths and several top-down paths.
Examples: FPN, Path Aggregation Network (PAN), BiFPN, and NAS-FPN
The object detector head predicts the classes and bounding boxes of objects and can be a one-stage detector or a two-stage detector.
One-stage detectors have a straightforward, efficient, and elegant architecture and the outputs of the network are classification probabilities and box offsets at each spatial position.
Examples: YOLO, SSD, RetinaNet, CenterNet, etc.
Two-stage detectors have a more complicated pipeline. The first stage uses a region proposal network (RPN) to filter out the regions of the image that have a high probability of containing an object. These region proposals are then fed into the second stage, where a region-based convolutional network (R-CNN) produces the classification scores and spatial offsets.
Examples: R-CNN, Fast R-CNN, Faster R-CNN, R-FCN, and Libra R-CNN
Both one-stage and two-stage detectors can also be made anchor-free.
YOLOv4 consists of:
- Backbone: CSPDarknet53
- Neck: SPP, PAN
- Head: YOLOv3
Bag of Freebies
A Bag of Freebies is a set of methods that change the training strategy or increase the training cost without affecting the inference cost.
A few of the training strategies in the Bag of Freebies are:
- Data augmentation increases the variability of the input images. It uses photometric distortions (brightness, contrast, hue, saturation, and noise) and geometric distortions (random scaling, cropping, flipping, and rotating). Yolo v4 introduces the Mosaic data augmentation method, which mixes 4 training images into one. This helps the model learn to localize objects of different types in different portions of the frame.
- Self-Adversarial Training (SAT) is a new data augmentation technique that operates in 2 forward-backward stages. Stage 1 executes an adversarial attack on itself by altering the original images to create a deception that there is no desired object in the image. In the 2nd stage, the neural network is trained to detect an object on this modified image in the normal way.
- Solving the semantic distribution problem caused by dataset imbalance using focal loss. The focal loss function is a dynamically scaled cross-entropy loss, where the scaling factor decays to zero as confidence in the correct class increases. Class imbalance causes two issues for a one-stage object detector: (1) training is inefficient, as most locations are easy negatives that contribute no useful learning signal; and (2) easy negatives can overwhelm training and lead to degenerate models. Focal loss applies a modulating term to the cross-entropy loss to focus learning on hard negative examples.
- Knowledge distillation to design the label refinement network. Knowledge distillation compresses a large pre-trained (teacher) model into a small (student) model. Knowledge is transferred from the teacher to the student by minimizing a loss function that matches the softened teacher logits as well as the ground-truth labels. The logits are softened by applying a temperature scaling in the softmax, which smoothes out the probability distribution and reveals inter-class relationships learned by the teacher.
- Bounding box (BBox) regression is a crucial step in object detection. Traditional object detectors use an L1 or L2 norm loss for bounding box regression, treating the box coordinates as independent variables and ignoring the integrity of the object itself. Yolo v4 instead recommends an IoU-based loss for bounding box regression, such as the Distance-IoU (DIoU) or Complete-IoU (CIoU) loss, leading to faster convergence and better performance.
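As an illustrative sketch of the Mosaic idea (not the paper's implementation; a real pipeline also rescales each source image and remaps its bounding boxes onto the combined canvas), four same-sized images can be tiled around a random center point:

```python
import numpy as np

def mosaic(images, out_size=8, rng=None):
    """Tile 4 equally sized square images around a random center point.

    Illustrative sketch of Mosaic augmentation; real pipelines also
    rescale each image and remap its bounding boxes onto the canvas.
    """
    rng = rng or np.random.default_rng(0)
    canvas = np.zeros((out_size, out_size, 3), dtype=images[0].dtype)
    # A random center splits the canvas into 4 quadrants.
    cx = rng.integers(1, out_size)
    cy = rng.integers(1, out_size)
    regions = [(slice(0, cy), slice(0, cx)),                    # top-left
               (slice(0, cy), slice(cx, out_size)),             # top-right
               (slice(cy, out_size), slice(0, cx)),             # bottom-left
               (slice(cy, out_size), slice(cx, out_size))]      # bottom-right
    for img, (rows, cols) in zip(images, regions):
        h = rows.stop - rows.start
        w = cols.stop - cols.start
        canvas[rows, cols] = img[:h, :w]  # crop each source image to fit
    return canvas
```

Because every output image then contains crops of four different scenes, objects appear at unusual positions and scales, which is what drives the localization benefit described above.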
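The focal loss mentioned above can be written as FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t): when the true-class probability p_t is high (an easy example), the modulating factor (1 - p_t)^gamma drives the loss toward zero. A minimal NumPy sketch of the binary form:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: cross-entropy scaled by (1 - p_t)**gamma so
    that well-classified (easy) examples contribute almost nothing and
    training focuses on hard examples.

    p: predicted probability of the positive class, y: label in {0, 1}.
    """
    p_t = np.where(y == 1, p, 1.0 - p)            # prob. of the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)  # class-balance weight
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)
```

Setting gamma = 0 recovers plain (alpha-weighted) cross-entropy, which makes the role of the modulating term easy to verify.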
Bag of Specials
Bag of Specials is a set of plugin modules and post-processing methods that increase the inference cost by only a small amount but improve object detection accuracy significantly.
- Enlarging the receptive field using Spatial Pyramid Pooling (SPP), which integrates SPM (Spatial Pyramid Matching) into the CNN and uses max-pooling operations
- Attention mechanisms used in object detection include channel-wise attention with Squeeze-and-Excitation (SE) and point-wise attention with the Spatial Attention Module (SAM). SE improves channel interdependencies at almost no additional computational cost.
- Strengthening feature integration capability using skip connections and hypercolumns to integrate low-level physical features with high-level semantic features
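To make the SPP idea concrete, here is a NumPy sketch of the classic SPP/SPM formulation the block is based on: max-pool a feature map over grids of several sizes and concatenate the results into a fixed-length vector. (YOLOv4's own SPP block is a variant that concatenates max-pooling outputs of different kernel sizes at the same resolution; this sketch shows the original pyramid form.)

```python
import numpy as np

def spp(feature_map, levels=(1, 2, 4)):
    """Spatial Pyramid Pooling sketch for a C x H x W feature map.

    Max-pools over n x n grids for each level n and concatenates the
    results, giving a fixed-length vector regardless of H and W.
    Assumes H and W are at least as large as the largest grid.
    """
    c, h, w = feature_map.shape
    pooled = []
    for n in levels:                                 # n x n pooling grid
        ys = np.linspace(0, h, n + 1, dtype=int)
        xs = np.linspace(0, w, n + 1, dtype=int)
        for i in range(n):
            for j in range(n):
                cell = feature_map[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
                pooled.append(cell.max(axis=(1, 2)))  # max over each cell
    return np.concatenate(pooled)  # length = C * sum(n * n for n in levels)
```

With levels (1, 2, 4) the output always has C * 21 entries, which is what lets SPP feed variable-sized inputs into fixed-size layers.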
Activation functions in YOLO v4
Activation functions play a crucial role in the performance and training dynamics of neural networks. They are non-linear point-wise functions responsible for introducing nonlinearity to the linearly transformed input of a neural network layer.
ReLU6 and hard-Swish are specially designed for quantized networks. Both Swish and Mish are continuously differentiable activation functions.
Yolo V4 uses Mish, a novel self-regularized non-monotonic activation function inspired by the self-gating property of Swish.
Mish tends to match or improve the performance of neural network architectures compared to that of Swish, ReLU, and Leaky ReLU across different Computer Vision tasks.
Mish eliminates the Dying ReLU phenomenon, which helps with expressivity and information flow. Mish also avoids saturation, which generally slows training drastically due to near-zero gradients.
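Mish is defined as x * tanh(softplus(x)); a small NumPy sketch (a stable implementation would guard softplus against overflow for large inputs):

```python
import numpy as np

def mish(x):
    """Mish activation: x * tanh(softplus(x)).

    Smooth, non-monotonic, unbounded above and bounded below, so large
    positive inputs pass through almost unchanged while small negative
    inputs are damped rather than zeroed out as with ReLU.
    """
    softplus = np.log1p(np.exp(x))  # log(1 + e^x); fine for moderate x
    return x * np.tanh(softplus)
```

For large positive x, softplus(x) is close to x and tanh saturates at 1, so mish(x) approaches x; for negative x it stays in a shallow negative trough instead of flattening to zero.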
Yolo V4 Architecture
An optimal object detection algorithm requires the following features:
- Larger input network size for detecting multiple small-sized objects
- More layers, for a higher receptive field that allows viewing the entire object and the context around it, and increases the number of connections between image points and the final activation
- More parameters — for greater capacity of a model to detect multiple objects of different sizes in a single image
The CSPDarknet53 neural network is the optimal backbone model for such a detector, with 29 3 × 3 convolutional layers, a 725 × 725 receptive field, and 27.6 M parameters.
Adding SPP block over the CSPDarknet53 significantly increases the receptive field to separate the most significant context features and causes almost no reduction of the network operation speed.
Yolo v4 uses DropBlock, a regularization technique similar to dropout. DropBlock drops contiguous regions from a feature map instead of the independent random units dropped by standard dropout.
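A minimal sketch of DropBlock on a single H x W feature map (illustrative; the published method handles batches and channels, restricts seeds to a valid region, and schedules the drop rate during training):

```python
import numpy as np

def drop_block(x, block_size=2, drop_prob=0.1, rng=None):
    """DropBlock sketch: sample seed points, then zero out a
    block_size x block_size square around each seed, so dropped units
    come in contiguous regions rather than isolated points as in
    standard dropout. The output is rescaled to keep the expected sum.
    """
    rng = rng or np.random.default_rng(0)
    h, w = x.shape
    # Convert the desired drop probability into a per-unit seed rate.
    gamma = drop_prob / (block_size ** 2)
    seeds = rng.random((h, w)) < gamma
    mask = np.ones((h, w))
    for i, j in zip(*np.nonzero(seeds)):
        mask[i:i + block_size, j:j + block_size] = 0.0  # zero whole block
    kept = mask.mean()
    return x * mask / max(kept, 1e-8), mask
```

Zeroing whole squares forces the network to look at surrounding context rather than re-using a neighboring unit, which is why DropBlock regularizes convolutional features better than unit-wise dropout.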
Yolo v4 adds PANet as the Neck of the object detection model to aggregate parameters from different backbone levels for different detector levels.
Additional Improvements in YoloV4
- Yolo v4 also uses a genetic algorithm for selecting optimal hyperparameters during network training on the first 10% of time periods
- Cross mini-Batch Normalization (CmBN) collects statistics over the entire batch instead of a single mini-batch, effectively aggregating statistics across multiple training iterations
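The genetic-algorithm idea above can be sketched with a hypothetical toy example: evolve a population of candidate values for a single hyperparameter, keep the fittest half each generation, and refill with mutated copies. (YOLOv4's actual search tunes many training hyperparameters at once; `fitness` here is a stand-in for validation performance.)

```python
import numpy as np

def genetic_search(fitness, lo, hi, pop_size=20, generations=30,
                   mutate_sigma=0.1, rng=None):
    """Toy genetic search over one scalar hyperparameter in [lo, hi].

    Each generation keeps the fittest half of the population (elitism)
    and replaces the rest with Gaussian-mutated copies of the elite.
    """
    rng = rng or np.random.default_rng(0)
    pop = rng.uniform(lo, hi, pop_size)
    for _ in range(generations):
        scores = np.array([fitness(p) for p in pop])
        elite = pop[np.argsort(scores)[-pop_size // 2:]]     # fittest half
        children = elite + rng.normal(0, mutate_sigma, elite.size)
        pop = np.clip(np.concatenate([elite, children]), lo, hi)
    return pop[np.argmax([fitness(p) for p in pop])]
```

With a toy fitness peaked at 0.3, e.g. `genetic_search(lambda lr: -(lr - 0.3) ** 2, 0.0, 1.0)`, the search converges near the optimum without any gradient information.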
Performance of YoloV4
YOLOv4 runs twice as fast as EfficientDet with comparable performance. It improves YOLOv3's AP and FPS by 10% and 12%, respectively, and outperforms the fastest and most accurate existing detectors in both speed and accuracy.