Object Detection Algorithm — YOLO v5 Architecture

Surya Gutta
Published in Analytics Vidhya
3 min read, Aug 2, 2021


History and architecture of YOLO v5

Yolo V5 Architecture

CNN-based object detectors are used in many applications, including recommendation systems. YOLO (You Only Look Once) models perform object detection at high speed. YOLO divides an image into a grid, and each grid cell is responsible for detecting objects within itself. Because detection happens in a single pass over the image, these models can be used for real-time object detection on data streams and need comparatively few computational resources.
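To make the grid idea concrete, here is a minimal sketch of how a box centre could be mapped to the grid cell that contains it. The function name `responsible_cell`, the 448x448 image size, and the 7x7 grid are illustrative assumptions, not values taken from this post or from the YOLOv5 code.

```python
def responsible_cell(cx, cy, img_w, img_h, grid_size):
    """Return (row, col) of the grid cell that contains the box centre (cx, cy), in pixels."""
    col = int(cx / img_w * grid_size)
    row = int(cy / img_h * grid_size)
    # Clamp so a centre lying exactly on the right or bottom edge stays inside the grid.
    return min(row, grid_size - 1), min(col, grid_size - 1)


# Hypothetical example: a 448x448 image split into a 7x7 grid (64x64 pixels per cell).
print(responsible_cell(cx=200, cy=100, img_w=448, img_h=448, grid_size=7))  # (1, 3)
```

In this simplified picture, the cell in row 1, column 3 would be the one expected to predict the object whose centre falls at pixel (200, 100).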

History of YOLO

Note: There is no paper on YOLOv5 as of Aug 01, 2021, based on the comment here. Therefore, this post will elaborate on YOLOv4 so that it’s easy to understand YOLOv5.

To understand YOLOv5's architecture and how it improves performance, let us first go through the following high-level object detection architecture:

source: Yolov4 paper

A general object detector has a backbone, usually pre-trained, that extracts features, and a head that predicts classes and bounding boxes. Backbones can be designed to run on GPU or CPU platforms. The head can be either one-stage (e.g., YOLO, SSD, RetinaNet) for dense prediction or two-stage (e.g., Faster R-CNN) for sparse prediction. Recent object detectors also insert some layers (the neck) between the backbone and the head to collect feature maps from different stages.
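To illustrate what "dense prediction" means for a one-stage head, the sketch below counts how many candidate boxes such a head emits for one image. It assumes a YOLOv3-style layout with three output scales at strides 8, 16, and 32, three anchors per scale, and 80 classes; these numbers and the function names are assumptions for illustration and are not stated in this post.

```python
def dense_prediction_shapes(img_size, strides=(8, 16, 32), anchors_per_scale=3, num_classes=80):
    """Return one (anchors, grid_h, grid_w, values_per_box) tuple per output scale."""
    shapes = []
    for stride in strides:
        grid = img_size // stride  # number of cells along each side at this scale
        # Each box carries 4 coordinates + 1 objectness score + one score per class.
        shapes.append((anchors_per_scale, grid, grid, 5 + num_classes))
    return shapes


def total_candidates(shapes):
    """Total number of candidate boxes across all scales."""
    return sum(anchors * h * w for anchors, h, w, _ in shapes)


shapes = dense_prediction_shapes(640)
print(shapes)                    # [(3, 80, 80, 85), (3, 40, 40, 85), (3, 20, 20, 85)]
print(total_candidates(shapes))  # 25200
```

Every one of those candidates is scored in a single forward pass, which is why one-stage detectors are called dense predictors; a two-stage detector would first narrow the image down to a much smaller set of region proposals.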

In YOLOv4, CSPDarknet53 is used as the backbone. An SPP block is added to increase the receptive field and separate out the most significant context features, with almost no reduction in network operation speed. PAN is used for parameter aggregation from different backbone levels. The YOLOv3 (anchor-based) head is used as the head of YOLOv4.
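As a rough picture of how an SPP block enlarges the receptive field without shrinking the feature map, the sketch below applies stride-1 max pooling with several kernel sizes to a single-channel map and stacks the results with the input. The kernel sizes 5, 9, and 13 and the helper names are assumptions for illustration; a real implementation would operate on multi-channel tensors in a deep learning framework.

```python
def max_pool_same(fmap, k):
    """Stride-1 max pooling over a k x k window; the output keeps fmap's height and width."""
    h, w = len(fmap), len(fmap[0])
    r = k // 2
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            # Cells outside the map are skipped, which behaves like -inf padding.
            window = [fmap[y][x]
                      for y in range(max(0, i - r), min(h, i + r + 1))
                      for x in range(max(0, j - r), min(w, j + r + 1))]
            row.append(max(window))
        out.append(row)
    return out


def spp(fmap, kernels=(5, 9, 13)):
    """Return the input map plus one pooled map per kernel size, stacked as channels."""
    return [fmap] + [max_pool_same(fmap, k) for k in kernels]


# A 13x13 map with a single active cell in the centre.
fmap = [[0] * 13 for _ in range(13)]
fmap[6][6] = 1
channels = spp(fmap)
print([sum(map(sum, ch)) for ch in channels])  # [1, 25, 81, 169]
```

The single active cell spreads over a larger area as the kernel grows, while every output channel stays 13x13. That is the sense in which the block widens the context each location sees at little extra cost.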

Note: Please go through the above links for more details on CSPDarknet53, SPP, PAN, and YOLOv3.

YOLOv4 introduced two new data augmentation methods: Mosaic and Self-Adversarial Training (SAT). Mosaic mixes four training images into one. Self-Adversarial Training operates in two forward-backward stages. In the first stage, the network alters the original image instead of the network weights. In the second stage, the network is trained to detect an object on the modified image.
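The Mosaic idea can be sketched in a few lines. The version below is a deliberate simplification: it tiles four equally sized images into a 2x2 canvas and shifts their bounding boxes to match, whereas practical implementations typically also pick a random split point and crop or rescale the images. The function name `mosaic` and the box format `(x_min, y_min, x_max, y_max, class_id)` are assumptions for illustration.

```python
def mosaic(images, boxes_per_image):
    """Tile four HxW images into one 2Hx2W image and shift their boxes accordingly.

    images: list of 4 images, each a list of H rows of W pixel values.
    boxes_per_image: list of 4 lists of (x_min, y_min, x_max, y_max, class_id), in pixels.
    """
    assert len(images) == 4 and len(boxes_per_image) == 4
    h, w = len(images[0]), len(images[0][0])

    # Top half: image 0 next to image 1. Bottom half: image 2 next to image 3.
    top = [images[0][r] + images[1][r] for r in range(h)]
    bottom = [images[2][r] + images[3][r] for r in range(h)]
    canvas = top + bottom

    # (dx, dy) offset of each source image inside the canvas.
    offsets = [(0, 0), (w, 0), (0, h), (w, h)]
    merged = []
    for (dx, dy), boxes in zip(offsets, boxes_per_image):
        for x1, y1, x2, y2, cls in boxes:
            merged.append((x1 + dx, y1 + dy, x2 + dx, y2 + dy, cls))
    return canvas, merged


# Four tiny 2x2 "images" filled with 1, 2, 3, 4, each with one box covering the whole image.
imgs = [[[v] * 2 for _ in range(2)] for v in (1, 2, 3, 4)]
boxes = [[(0, 0, 2, 2, v)] for v in (1, 2, 3, 4)]
canvas, merged = mosaic(imgs, boxes)
for row in canvas:
    print(row)
print(merged)
```

Because one training sample now contains objects from four different contexts, the detector sees more varied backgrounds and object scales per batch.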

source: Mosaic data augmentation
source: Various methods of data augmentation in Yolov4

Apart from the modules mentioned above, some existing methods (Spatial Attention Module [SAM], PAN, CBN) were modified to improve performance.

YOLOv5 closely resembles YOLOv4, with the following differences:

  • YOLOv4 is released in the Darknet framework, which is written in C. YOLOv5 is based on the PyTorch framework.
  • YOLOv4 uses .cfg files for configuration, whereas YOLOv5 uses .yaml files.

YOLOv5s model displayed in Netron

Yolov5s model

YOLOv5s model displayed in TensorBoard

source: yolov5

Please go through the YOLOv5 GitHub repo for additional information.

Thank you for reading! Please 👏and follow me if you liked this post, as it encourages me to write more!



Software Architect | Machine Learning | Statistics | AWS | GCP