All about YOLOs — Part 2 — The First YOLO

Rehan Ahmad · Published in Analytics Vidhya · Jan 20, 2020 · 5 min read


Before YOLO, there were two major object detection frameworks: DPM (Deformable Parts Model) and R-CNN. Both were region-based classifiers: as a first step they would find candidate regions, and as a second step they would pass those regions to a more powerful classifier. This approach involved looking at parts of an image thousands of times to perform detection. YOLO started as a project to optimize this approach by building a single neural network that takes a single image and gives back the detections and classes in a single pass. Hence the pun “You Only Look Once.”

This 5-part series aims to explain everything there is to know about YOLO: its history, how it is versioned, its architecture, its benchmarking, its code, and how to make it work for custom objects.

Here are the links for the series.

All about YOLOs — Part 1 — a little bit of History

All about YOLOs — Part 2 — The First YOLO

All about YOLOs — Part 3 — The Better, Faster and Stronger YOLOv2

All about YOLOs — Part 4 — YOLOv3, an Incremental Improvement

All about YOLOs — Part 5 — Up and Running

Approach

Take an image and imagine a grid overlaid on top of it. Each cell in the grid is responsible for predicting a few different things.

The first thing each cell predicts is some number of bounding boxes, along with a confidence value for each box (the probability that the box contains an object).
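In the paper’s formulation, that confidence score couples objectness with localization quality, so a box is only confident if it both contains an object and fits it tightly:

```latex
\text{confidence} = \Pr(\text{Object}) \cdot \mathrm{IOU}^{\text{truth}}_{\text{pred}}
```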

Note: Some grid cells may not have any objects near them but will still predict bounding boxes; the confidence for those boxes will simply be very low.

Note: The thickness of the line indicates the confidence value

When every cell in the grid predicts its bounding boxes, we get a map of all the objects in the image, with the boxes ranked by their confidence values. This map shows where the objects are in the image but doesn’t yet tell us what the objects are.

The next step is for each cell to predict class probabilities. Note that this probability doesn’t say that the grid cell contains that object. It’s a conditional probability: if there is an object in the cell, then that object belongs to that class.

Next, we take these conditional class probabilities and multiply them by the confidence of each bounding box, which gives us all the bounding boxes weighted by their actual probability of containing each class. This map shows a bunch of detections for the classified objects, and a lot of them have pretty low confidence values.
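In the paper’s notation, this multiplication yields a class-specific score for each box that factors in both the class probability and how well the box fits the object:

```latex
\Pr(\text{Class}_i \mid \text{Object}) \cdot \Pr(\text{Object}) \cdot \mathrm{IOU}^{\text{truth}}_{\text{pred}}
  = \Pr(\text{Class}_i) \cdot \mathrm{IOU}^{\text{truth}}_{\text{pred}}
```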

To get a single best detection per object, we perform Non-Max Suppression (NMS), which, as the name suggests, suppresses the non-maximum values, i.e., the overlapping low-confidence boxes, leaving the best one as is.
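To make this concrete, here is a minimal NumPy sketch of the classic greedy NMS algorithm (a generic illustration, not the paper’s actual implementation):

```python
import numpy as np

def iou(box, boxes):
    # Intersection-over-union between one box and an array of boxes,
    # all given as [x1, y1, x2, y2].
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    # Greedy NMS: keep the highest-scoring box, drop every remaining box
    # that overlaps it beyond the threshold, and repeat on what is left.
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        overlaps = iou(boxes[best], boxes[order[1:]])
        order = order[1:][overlaps < iou_threshold]
    return keep
```

Running this on the weighted boxes from the previous step leaves one high-confidence detection per object.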

This parameterization fixes the output size of each cell’s predictions. For each bounding box, the network predicts 4 coordinates and 1 confidence value, plus some number of class probabilities per cell. This leaves a manageable number of parameters to predict, so the whole detection pipeline can be trained as one neural network.
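With the values used in the paper (a 7×7 grid, 2 boxes per cell, and the 20 PASCAL VOC classes), the size of that fixed output is easy to compute:

```python
# Values from the YOLOv1 paper: 7x7 grid, 2 boxes per cell, 20 PASCAL VOC classes.
S, B, C = 7, 2, 20
output_shape = (S, S, B * 5 + C)  # 5 = (x, y, w, h, confidence) per box
print(output_shape)               # -> (7, 7, 30)
```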

This kind of seamless single network takes about as much time as a typical classification network, which is what makes YOLO really fast and achieves the “you only look once” part of the goal.

Training process

Let’s talk a bit about training. We get an image with ground-truth labels. The first job is to match each ground-truth label with the grid cell that we want to be responsible for that detection at test time.

For that, we take the center of the ground-truth bounding box, and whichever grid cell that center falls into is responsible for that detection.
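As a minimal sketch (the helper name and signature here are hypothetical, for illustration only), this assignment looks like:

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    # Hypothetical helper: map a ground-truth box center (cx, cy), in pixels,
    # to the (row, col) of the grid cell responsible for it on an S x S grid.
    col = int(cx / img_w * S)
    row = int(cy / img_h * S)
    # Clamp, in case the center lies exactly on the right/bottom image edge.
    return min(row, S - 1), min(col, S - 1)
```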

Once we know the responsible cell, we adjust that cell’s class prediction to match the ground truth. We also have to adjust that cell’s bounding box proposals.

So, we look at that cell’s predicted boxes, figure out which one overlaps most with our ground-truth label, increase its confidence, and adjust its coordinates.

We also look at the cell’s other bounding boxes and decrease their confidence. For all the cells with no ground-truth detection overlapping them, we decrease the confidence values as well, since they don’t contain any objects.

The thing to note is that we don’t adjust the class probabilities or coordinates for these empty cells, as they don’t contain objects.
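In the paper, all of these adjustments are expressed as a single sum-of-squared-errors loss, with λ_coord = 5 up-weighting coordinate errors for responsible boxes and λ_noobj = 0.5 down-weighting confidence errors from boxes without objects:

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
      \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
  &+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
      \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
  &+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left(C_i - \hat{C}_i\right)^2
   + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left(C_i - \hat{C}_i\right)^2 \\
  &+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left(p_i(c) - \hat{p}_i(c)\right)^2
\end{aligned}
```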

Training is done with some standard computer-vision tricks, mainly pretraining on ImageNet, SGD with a decreasing learning rate, and extensive data augmentation.

This concludes the explanation of how the first YOLO works. I hope it gave you a holistic picture of the algorithm and of how training is done.

In the next blog of the series, let’s look at YOLOv2, a better, faster and stronger version.

Resources:

YOLO: https://arxiv.org/pdf/1506.02640.pdf

YOLOv2 and YOLO9000: https://arxiv.org/pdf/1612.08242.pdf

YOLOv3: https://arxiv.org/pdf/1804.02767.pdf

About me

I am a Senior AI Expert at Wavelabs.ai. At Wavelabs, we help you leverage Artificial Intelligence (AI) to revolutionize user experiences and reduce costs. We uniquely enhance your products using AI to reach your full market potential, and we try to bring cutting-edge research into your applications.

Feel free to explore more at Wavelabs.ai.

Well, that’s all in this post. Thanks for reading :)

Stay Curious!

You can reach out to me on LinkedIn.
