I wish I knew this about YOLOv5

Zheng Jie
Mar 30, 2023

Using YOLOv5 seems very easy. The GitHub page provides a comprehensive explanation, and most people are able to get started immediately with a few simple lines of code to make predictions or train the model. Being a slow but thorough learner myself, I wondered what YOLOv5 was, what type of neural network it was, and wanted to learn more about the nuances of how to use it.

There are definitely much better websites to learn from, but combining an old adage with one from Sir Francis Bacon:

“Knowledge is power, and sharing is caring”

Thus, we arrive here, where I present some things I wish I had known about YOLOv5, its architecture, usage, and structure.

Overview

Logo from https://wandb.ai/fully-connected/projects/yolov5

YOLOv5 (or You Only Look Once version 5) is a descendant of a long line of object detectors, succeeding YOLO (2016) and YOLOv3 (2018) and preceding YOLOv8 (2023, ironically released a week after I started learning about YOLOv5). Developed mainly by Glenn Jocher from Ultralytics and maintained by Ayush Chaurasia and Sergiu Waxmann, it boasts fast inference and detection times while taking up minimal space. Easy to train and easy to use, YOLOv5 is a valuable tool in the average Deep Learning practitioner’s arsenal.

Architecture

This, to me, was the hardest aspect to understand. Having been a MOOC noob, I had to understand the general gist of object detectors and computer vision before taking on the monstrosity that was the architecture.

YOLOv5 is essentially a Single-Shot Detector (SSD). Where conventional neural networks used sliding windows to “take in” and convolve features, as in Convolutional Neural Networks (CNN, explained well by this article by Sumit Saha), SSDs “absorb” the image data at once (thus making them single-shot).

Architecture of SSDs from https://developers.arcgis.com/python/guide/how-ssd-works/

SSDs usually consist of a “backbone model” (white blocks above), usually VGG16 (Visual Geometry Group 16) or ResNet (Residual Network), to extract features from the image data, and an “SSD head” (blue blocks) which follows the backbone model to produce detections, classifications and confidences. YOLOv5 uses the “YOLOv5 backbone” (duh) and is unique in that it consists of a modified CSPDarknet (Cross Stage Partial Darknet) backbone and a PANet (Path Aggregation Network) “neck” model preceding the YOLOv5 SSD head (basically an additional model in between).
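To make the backbone-neck-head split concrete, here is a minimal toy sketch of that data flow in PyTorch. This is my own illustration, not the real YOLOv5 modules (those live in models/yolo.py and models/common.py and are far more involved):

import torch
import torch.nn as nn

# Toy single-shot detector illustrating the backbone -> neck -> head flow.
# YOLOv5 itself uses a CSPDarknet backbone, a PANet neck and multi-scale heads.
class ToySSD(nn.Module):
    def __init__(self, num_classes=80, num_anchors=3):
        super().__init__()
        # backbone: extracts a feature map from the raw image
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.SiLU(),
        )
        # neck: refines / aggregates backbone features
        self.neck = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.SiLU())
        # head: (4 box + 1 objectness + num_classes) predictions per anchor per cell
        self.head = nn.Conv2d(64, num_anchors * (5 + num_classes), 1)

    def forward(self, x):
        return self.head(self.neck(self.backbone(x)))

preds = ToySSD()(torch.zeros(1, 3, 640, 640))
print(preds.shape)  # torch.Size([1, 255, 160, 160]): one prediction vector per grid cell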

Grid Cells

SSDs use “grid cells” responsible for detection and classification in their respective areas of the image. From: https://developers.arcgis.com/python/guide/how-ssd-works/

The shortcomings of the sliding-window convolution model were that it was, first and foremost, computationally expensive, and that it risked missing objects depending on the size and stride of the sliding window. A new system, the “grid cell” system seen above, enables YOLOv5 to extract features in one pass of the image by splitting the workload over many grid cells.
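As a quick illustration of the grid idea (a toy example of my own, not YOLOv5 code), the cell responsible for a given pixel is just integer division by the cell size:

# Toy example: a 640x640 image split into a 20x20 grid (32 pixels per cell)
img_size, grid_cells = 640, 20
stride = img_size // grid_cells   # 32

x, y = 415, 200                   # pixel at the centre of some object
cell_col, cell_row = x // stride, y // stride
print(cell_col, cell_row)         # 12 6: this cell is responsible for detecting the object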

Anchor Boxes

Each grid cell uses the concept of anchor boxes to locate and classify objects within its domain. Anchor boxes are essentially “bounding boxes” of certain labelled objects.

Anchor boxes demonstrated on a car and person. Source: https://github.com/sarangzambare/object-detection

They are generally represented in two forms: (x, y, w, h) (the x and y coordinates of the center of the box, plus its width and height) or (xmin, ymin, xmax, ymax) (the top-left and bottom-right corners).
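Converting between the two forms is a one-liner each way. Here is a quick sketch with my own helper functions (YOLOv5 ships its own equivalents in utils/general.py):

def xywh_to_xyxy(x, y, w, h):
    # center + size -> corner coordinates
    return x - w / 2, y - h / 2, x + w / 2, y + h / 2

def xyxy_to_xywh(xmin, ymin, xmax, ymax):
    # corner coordinates -> center + size
    return (xmin + xmax) / 2, (ymin + ymax) / 2, xmax - xmin, ymax - ymin

print(xywh_to_xyxy(50, 50, 20, 10))  # (40.0, 45.0, 60.0, 55.0)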

The grid cells compare each detected feature with many of these anchor boxes and compute the Intersection-Over-Union (IOU) of these detections with the anchor boxes to determine:

  1. whether an object exists, and
  2. if so, what object it is.
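IOU itself is just the intersection area divided by the union area of the two boxes. A minimal sketch of my own (not the implementation YOLOv5 ships in its utils):

def iou(a, b):
    # a, b: boxes as (xmin, ymin, xmax, ymax)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])   # top-left corner of the overlap
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])   # bottom-right corner of the overlap
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 0.1428... (25 / 175)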

Receptive Fields

A feature that takes up a 3x3 grid area is convoluted to a 1x1 cell in the third layer. Source: https://www.researchgate.net/figure/The-receptive-field-of-each-convolution-layer-with-a-3-3-kernel-The-green-area-marks_fig4_316950618

Receptive Fields are another key component of not just SSDs, but Deep Learning models in general. A “receptive field”, in layman’s terms, refers to the region of the input the machine is “looking” at when it computes a given feature.

To elaborate, SSDs use partially-connected layers (where not all areas of the image are processed and convolved at once), which means that they “see” a part of the image at any given time (like how we process large images). Treating receptive fields as the “range of vision” of the model, we can thus reason that we need different sizes of receptive fields to see objects of varying sizes.
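For stacked convolutions, the receptive field can be computed layer by layer: each layer with kernel size k grows it by (k - 1) times the product of the strides of all earlier layers. A quick sketch of that standard formula (my own code, not from YOLOv5):

def receptive_field(layers):
    # layers: list of (kernel_size, stride) tuples, from input to output
    rf, jump = 1, 1    # jump = distance in input pixels between adjacent outputs
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# three 3x3 convs with stride 1: the receptive field grows 3 -> 5 -> 7 input pixels
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7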

Deployment

YOLOv5 can be deployed multiple ways:

  1. Cloning the GitHub repository and running it locally
  2. Using torch.hub.load() from PyTorch
  3. pip install yolov5 ( :O )

If a pretrained YOLOv5 needs to be deployed fast, option 2 may be the best, but if the architecture is needed for training and fine-tuning, I would recommend option 1.
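For reference, option 2 is essentially a one-liner, following the usage documented on the Ultralytics repository:

import torch

# Option 2: load a pretrained YOLOv5s straight from PyTorch Hub
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)

results = model('https://ultralytics.com/images/zidane.jpg')  # path, URL, PIL or numpy image
results.print()  # prints detected classes and confidences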

Spicy features

The spicy hasta la vista stuff can be found in the repository (option 1 lmao), and it allows for flexibility in modifying and defining custom architecture and parameters for specialised testing.

YOLOv5 offers a variety of pretrained weights on the COCO dataset. These generally increase in accuracy at the cost of size, and vary from YOLOv5n (nano) to YOLOv5x6 (XTRA LARGE EL PRIMOOO). More details on the mAP and accuracy-to-GPU speed graph can be found on the repository page (link in option 1).

Another spicy feature is how YOLOv5 processes its outputs. A standard YOLOv5 carries out the prediction using a DetectMultiBackend model class and processes the outputs using AutoShape. The DetectMultiBackend spits out the raw, unprocessed data of detections, confidences and bounding boxes, which AutoShape converts into a presentable Detections object that can be displayed and viewed.

Disregarding local training of the model, DetectMultiBackend enables the use of custom pretrained weights and other parameters, as opposed to the limited flexibility that torch.hub.load has, so such weights can be used without training the model yourself.

# load example VisDrone weights not available through torch.hub
# run this from inside the cloned yolov5 folder so the imports resolve

import torch
from models.common import DetectMultiBackend, AutoShape

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = DetectMultiBackend("yolov5s-visdrone.pt", device=device,
                           dnn=False, data="data/coco.yaml", fp16=False)

# AutoShape can be applied on top of DetectMultiBackend
neat_model = AutoShape(model)

As we can see, we have more control over the model (and parameters like dnn and fp16) compared to loading via torch.hub.load. Of course, using the command line would be much better, but using DetectMultiBackend like this isn’t a slouchy method either.
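Once wrapped in AutoShape, the model can be called directly on an image (a path, URL, PIL image or numpy array) and returns the same Detections object that torch.hub users get. The filename below is just a placeholder:

# neat_model is the AutoShape-wrapped DetectMultiBackend from above
results = neat_model("some_drone_image.jpg")   # hypothetical image path

results.print()                   # summary of detections per class
boxes = results.pandas().xyxy[0]  # bounding boxes, confidences and classes as a DataFrame
print(boxes.head())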

Spiciest code

Understanding the Detections class was also a huge mind-blow for me. It is apparently possible to extract the bounding boxes for any predictions, the overall objectness score and the classification scores per grid cell from the output of DetectMultiBackend (which is not a Detections object).

pred = model(image)

# Now let's extract stuff from the DetectMultiBackend output
obj = pred[0][:, :, 4]        # objectness score per grid cell
cls_m = pred[0][0, :, 5 + m]  # class score for class m (class scores start at index 5)

The output from the DetectMultiBackend is quite curious. It consists of 2 items:

  1. Mystery meat (I still have no idea what this is)
  2. The detections data of shape [N x Q x 85], where N is the batch size of the input images and Q is the number of grid cells (25500 in my case)

The detections data can be interpreted as such:

  1. The first 5 values (indices 0 to 4) hold, in order, the predicted bounding box coordinates (xmin, ymin, xmax, ymax) for each grid cell, followed by the objectness score of that grid cell.
  2. The next 80 values (indices 5 to 84) are the probabilities of each of the 80 classes existing at that grid cell.

Thus, our confidence score (if we assume conf = obj * max(cls)) for class i would be:

pred = model(image)
cls_i = torch.max(pred[0][0, :, 5 + i])  # max score for class i across all grid cells
obj = pred[0][:, :, 4]                   # objectness per grid cell, shape [1 x Q] for one image
conf = obj * cls_i

And that is all I have to offer. There is definitely more to explore, not just with YOLOv5, but with all the countless other SSDs and Deep Learning models. With research into new architectures being aided by stronger computational power and a greater understanding of neural networks, I have no doubt that a new YOLOv5 (or v10 :O) will emerge faster and better in the near future.

Just as Moore’s Law states:

“The number of transistors in an Integrated Circuit doubles every two years.”

I would like to propose a new law along the same lines:

“The performance of neural networks doubles every five years.”

Thank you.
