Yolo V4 Object Detection

How Yolo V4 object detection delivers higher mAP and shorter inference time

Renu Khandelwal
May 16 · 6 min read

Enhanced Features of Yolo v4

  • Yolo v4 delivers faster inference, making it practical for object detectors in production systems.
  • It is optimized for parallel computation.
  • Yolo v4 is an efficient and powerful object detection model that can be trained on a single GPU to deliver an accurate object detector quickly.

Object detector models are composed of a backbone, a neck, and a head.

The backbone of the object detector can be a pre-trained neural network.

One-stage detectors predict bounding boxes and class probabilities in a single pass, while two-stage detectors have a more complicated pipeline: they first propose candidate regions and then classify them.

Both one-stage and two-stage detectors can be made anchor-free object detectors.

Source: YOLOv4: Optimal Speed and Accuracy of Object Detection

YOLOv4 consists of:

  • Backbone: CSPDarknet53
  • Neck: SPP, PAN
  • Head: YOLOv3

Bag of Freebies

Bag of Freebies refers to methods that only change the training strategy or only increase the training cost, improving accuracy without adding any inference cost.

A few of the training strategies in the Bag of Freebies are:

  • Data augmentation uses photometric distortions (brightness, contrast, hue, saturation, and noise) and geometric distortions (random scaling, cropping, flipping, and rotating). This data augmentation helps the model localize objects appearing in different portions of the frame.
Mosaic Data Augmentation. Source: YOLOv4: Optimal Speed and Accuracy of Object Detection
  • Self-Adversarial Training (SAT) is a new data augmentation technique that operates in two forward-backward stages. In the first stage, the neural network alters the original image instead of the network weights; in the second stage, the network is trained to detect an object in this modified image.
  • Focal loss addresses the class imbalance faced by one-stage object detectors. The focal loss function is a dynamically scaled cross-entropy loss, where the scaling factor decays to zero as confidence in the correct class increases. Class imbalance causes two issues for a one-stage detector: (1) training is inefficient, as most locations are easy negatives that contribute no useful learning signal; (2) easy negatives can overwhelm training and lead to degenerate models.
  • Knowledge distillation is used to design the label refinement network. Knowledge distillation compresses a large pre-trained model (teacher) into a small (student) model. In this technique, knowledge is transferred from the teacher model to the student model by minimizing a loss function aimed at matching softened teacher logits and ground-truth labels. The logits are softened by applying a scaling factor in the softmax, which smoothes out the probability distribution and reveals the inter-class relationships learned by the teacher.
  • IoU loss: traditional object detectors use an L1 norm loss for bounding box regression, treating the coordinates of the bounding box as independent variables and not considering the integrity of the object; IoU-based losses instead regress the box as a whole.
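
To make the focal-loss idea concrete, here is a minimal sketch of the binary focal loss in plain Python. The `alpha` and `gamma` defaults follow common usage in the focal-loss literature and are illustrative, not YOLOv4's exact settings:

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for a single prediction.

    p     -- predicted probability of the positive class
    y     -- ground-truth label, 1 (object) or 0 (background)
    alpha -- class-balancing weight (illustrative default)
    gamma -- focusing parameter; gamma = 0 recovers cross-entropy
    """
    # p_t is the probability the model assigns to the true class
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t)^gamma decays to zero as confidence in the true class grows,
    # down-weighting the many easy examples
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

Note how a confident, correct prediction contributes almost nothing to the loss, so the abundant easy negatives no longer dominate training.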

Bag of Specials

Bag of Specials consists of plugin modules and post-processing methods that increase the inference cost by only a small amount but improve object detection accuracy significantly.

  • Receptive-field enhancement using Spatial Pyramid Pooling (SPP), which integrates SPM (Spatial Pyramid Matching) into CNN and uses the max-pooling operation
  • Attention modules used in object detection include channel-wise attention with Squeeze-and-Excitation (SE) and point-wise attention with the Spatial Attention Module (SAM). SE improves channel interdependencies at almost no additional computational cost.
  • Feature integration modules such as PAN integrate low-level physical features with high-level semantic features.
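
The SPP idea can be sketched with NumPy: several stride-1 max pools with increasing kernel sizes are applied to the same feature map and concatenated along the channel axis, preserving spatial resolution while capturing multi-scale context. This is a simplified sketch of the YOLOv4-style SPP block, not the exact implementation (the kernel sizes 5/9/13 follow the paper's configuration):

```python
import numpy as np

def max_pool_same(x, k):
    """Stride-1 max pooling with 'same' padding on a (C, H, W) feature map."""
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)), constant_values=-np.inf)
    c, h, w = x.shape
    out = np.empty_like(x)
    for i in range(h):
        for j in range(w):
            out[:, i, j] = xp[:, i:i + k, j:j + k].max(axis=(1, 2))
    return out

def spp_block(x, kernels=(5, 9, 13)):
    """SPP sketch: concatenate the input with stride-1 max pools of several
    kernel sizes along the channel axis. Spatial size is unchanged; the
    channel count grows by a factor of len(kernels) + 1."""
    return np.concatenate([x] + [max_pool_same(x, k) for k in kernels], axis=0)
```

Because every branch keeps the original height and width, the block can be dropped into a backbone without changing the downstream feature-map geometry.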

Activation functions in YOLO v4

Activation functions play a crucial role in the performance and training dynamics of neural networks. They are non-linear point-wise functions that introduce nonlinearity into the linearly transformed input of a neural network layer.

ReLU6 and hard-Swish are specially designed for quantization networks. Both Swish and Mish are continuously differentiable activation functions.

Mish Activation function

Mish tends to match or exceed the performance of neural network architectures using Swish, ReLU, and Leaky ReLU across different computer vision tasks.

Mish eliminates the dying-ReLU phenomenon, which helps with expressivity and information flow. Mish also avoids saturation, which would otherwise drastically slow training due to near-zero gradients.
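
The Mish function itself is simple: x · tanh(softplus(x)). A minimal sketch in plain Python, using a numerically stable softplus:

```python
import math

def softplus(x):
    # numerically stable softplus: ln(1 + e^x)
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

def mish(x):
    """Mish activation: x * tanh(softplus(x)).
    Smooth and non-monotonic; unbounded above (no saturation for large
    positive inputs) and bounded below, with nonzero gradient for
    negative inputs (no dying-ReLU effect)."""
    return x * math.tanh(softplus(x))
```

For large positive x, mish(x) approaches x (no saturation); for negative x, the output is small but not hard-zeroed, so gradients keep flowing.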

Yolo V4 Architecture

An optimal Object Detection algorithm requires the following features

  • Larger input network size for detecting multiple small-sized objects
  • More layers, for a higher receptive field that allows viewing the entire object and the context around it, and increases the number of connections between image points and the final activation
  • More parameters — for greater capacity of a model to detect multiple objects of different sizes in a single image

Adding the SPP block over CSPDarknet53 significantly increases the receptive field and separates out the most significant context features, while causing almost no reduction in network operation speed.

Yolo V4 uses DropBlock, a regularization technique similar to dropout. DropBlock drops contiguous regions of a feature map instead of the independent random units dropped by standard dropout.
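
A minimal NumPy sketch of the DropBlock idea for a single-channel feature map (a simplified illustration, not the paper's exact implementation): block centers are sampled with a rate `gamma` derived from the target drop probability, and a square block around each sampled center is zeroed, so whole contiguous regions disappear rather than isolated units.

```python
import numpy as np

def drop_block(x, block_size=3, drop_prob=0.1, rng=None):
    """DropBlock sketch for a (H, W) feature map."""
    rng = np.random.default_rng(rng)
    h, w = x.shape
    # gamma rescales drop_prob so roughly drop_prob of units end up dropped,
    # accounting for the area each sampled center wipes out
    valid = max((h - block_size + 1) * (w - block_size + 1), 1)
    gamma = drop_prob * h * w / (block_size ** 2 * valid)
    centers = rng.random((h, w)) < gamma
    mask = np.ones_like(x)
    half = block_size // 2
    for i, j in zip(*np.nonzero(centers)):
        # zero a block_size x block_size square around each sampled center
        mask[max(i - half, 0):i + half + 1, max(j - half, 0):j + half + 1] = 0.0
    # rescale kept activations so the expected activation sum is preserved
    kept = mask.mean()
    return x * mask / kept if kept > 0 else x * mask
```

With `drop_prob=0` the input passes through unchanged, mirroring how the technique is disabled at inference time.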

Additional Improvements in YoloV4

  • Yolov4 also uses genetic algorithms for selecting optimal hyperparameters during network training on the first 10% of time periods
  • Cross mini-Batch Normalization (CmBN) collects statistics inside the entire batch instead of inside a single mini-batch, thus effectively aggregating statistics across multiple training iterations.
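
To illustrate the genetic-algorithm hyperparameter search mentioned above, here is a toy sketch of the general idea (selection, crossover, mutation); the function names, population size, and mutation scale are all illustrative and are not YOLOv4's actual implementation:

```python
import random

def genetic_search(fitness, bounds, pop_size=8, generations=5, seed=0):
    """Toy genetic algorithm for hyperparameter selection.

    fitness -- maps a dict of hyperparameters to a score (higher is better)
    bounds  -- {name: (low, high)} ranges to search over
    """
    rnd = random.Random(seed)
    sample = lambda: {k: rnd.uniform(lo, hi) for k, (lo, hi) in bounds.items()}
    pop = [sample() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]          # selection: keep the fittest half
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = rnd.sample(parents, 2)
            # crossover: each gene comes from one of the two parents
            child = {k: rnd.choice((a[k], b[k])) for k in bounds}
            # mutation: perturb one gene, clamped to its bounds
            k = rnd.choice(list(bounds))
            lo, hi = bounds[k]
            child[k] = min(hi, max(lo, child[k] + rnd.gauss(0, 0.1 * (hi - lo))))
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)
```

Because the fittest parents survive unchanged each generation, the best configuration found never regresses as the search proceeds.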

Performance of YoloV4

Source:YOLOv4: Optimal Speed and Accuracy of Object Detection

YOLOv4 runs twice as fast as EfficientDet with comparable performance. It improves YOLOv3's AP and FPS by 10% and 12%, respectively. YOLOv4 outperforms the fastest and most accurate available detectors in both speed and accuracy.

References

YOLOv4: Optimal Speed and Accuracy of Object Detection

Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression

Focal Loss for Dense Object Detection

Mish: A Self Regularized Non-Monotonic Activation Function

DropBlock: A regularization method for convolutional networks

Distilling the Knowledge in a Neural Network


Geek Culture

A new tech publication by Start it up (https://medium.com/swlh).