YOLO — You Only Look Once

Ray Wang
5 min read · Dec 29, 2021

A state-of-the-art real-time object detection algorithm


Introduction

Have you ever thought about how we perceive objects? Or how we seamlessly capture information through our eyes and extract meaningful data from it frame by frame? To us, it happens without even thinking about it. Everything we see is sent to the brain, where it all magically gets decoded. With one look, we gain a great understanding of our surroundings and the objects we are looking at.

What if I told you we can give this ability to machines? Machines with eyes that can recognize objects to better perform the tasks they were meant to do. From self-driving cars to a friendly Zoom call trying to blur your background, or even just face recognition on your phone, object detection is everywhere and is used in a variety of ways to improve the quality of our lives. But without a brain like ours, how can machines be capable of such a daunting and magnificent task?

Challenges

Before going into how YOLO works, let's explore the challenges an object detection system needs to overcome.

1. Not a set number of objects

In object detection, the system gets an image as input and tries to output what it thinks are objects in the image. What adds to the complexity is that the system does not know beforehand how many objects it should output, so some post-processing is needed to figure out which predictions are actually objects.

2. Aspect ratios and scaling

Objects in images can have different spatial scales and aspect ratios. Some can cover the whole image while others can be in a tiny corner that’s just a few pixels small. The variation in dimension and scaling in different pictures creates a lot of difficulties in detecting some objects.

3. Speed for real-time detection

Not only do object detection algorithms have to accurately detect the class and location of each object, they also have to do it extremely quickly to keep up with real-time video processing. Videos are usually filmed at 24 fps, and making an algorithm keep up with this frame rate is quite difficult.


The YOLO algorithm

There have been many object detection algorithms in the past that performed their tasks well, but the main problem with all of them was speed: they could not keep up with video and lacked the performance needed for real-time detection.

YOLO, short for You Only Look Once, performs much better and can sustain the high frame rates needed for real-time use. YOLO frames detection as a regression problem, predicting bounding boxes and classes for the whole image in a single pass of the network, which is a big reason why it is so fast.

To understand how YOLO works, we first have to understand how objects are being predicted and located. There are two main things YOLO predicts: the class of the object (what the object is), and the bounding box (where it is).

Each bounding box contains 4 main descriptors:

1. The center of the box (Bx, By)

2. The width of the box (Bw)

3. The height of the box (Bh)

4. The class of the object (c)

Along with these four descriptors is a real-valued number pc that represents the probability that there is actually an object in that specific bounding box.
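Taken together, one prediction is just a small vector of numbers. As a sketch, it could be represented like this (the field names and values are illustrative, not YOLO's actual output layout):

```python
from dataclasses import dataclass

@dataclass
class BoxPrediction:
    pc: float            # probability that the box actually contains an object
    bx: float            # x coordinate of the box center
    by: float            # y coordinate of the box center
    bw: float            # width of the box
    bh: float            # height of the box
    class_probs: list    # P(class | object), one entry per class

# a hypothetical prediction over 3 classes
pred = BoxPrediction(pc=0.9, bx=0.5, by=0.4, bw=0.3, bh=0.6,
                     class_probs=[0.10, 0.85, 0.05])

# the class of the object is the one with the highest conditional probability
best_class = max(range(len(pred.class_probs)), key=lambda i: pred.class_probs[i])
print(best_class)
```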

The first thing YOLO does when receiving an image is split it into a grid of cells that are used to predict bounding boxes, usually a 19×19 grid. An object is considered to be inside a cell if the center coordinates of its box lie in that cell.
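The cell-assignment rule can be sketched in a few lines (a simplified illustration; the image size and coordinates below are made up):

```python
GRID = 19  # YOLO commonly uses a 19x19 grid

def owning_cell(cx, cy, img_w, img_h, grid=GRID):
    """Return the (row, col) of the grid cell that contains the box center."""
    col = int(cx / img_w * grid)
    row = int(cy / img_h * grid)
    # clamp so a center exactly on the right/bottom edge still maps to a cell
    return min(row, grid - 1), min(col, grid - 1)

# a box centered at pixel (300, 150) in a 608x608 image
print(owning_cell(300, 150, 608, 608))  # -> (4, 9)
```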

Afterward, YOLO calculates a score for each class that represents the probability of the cell containing an object of that class. The score is the box confidence pc multiplied by the conditional class probability:

score(c) = pc × P(c | object)

The class with the highest score/probability is assigned to that particular cell. This will happen for every single cell in the image.
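This scoring step can be sketched as follows (a minimal illustration with made-up probabilities for three classes):

```python
def cell_class_scores(pc, class_probs):
    """Score per class: box confidence times conditional class probability."""
    return [pc * p for p in class_probs]

pc = 0.8                        # confidence that the cell's box holds an object
class_probs = [0.1, 0.7, 0.2]   # P(class | object) for 3 classes
scores = cell_class_scores(pc, class_probs)

# the cell is assigned the class with the highest score
best = max(range(len(scores)), key=scores.__getitem__)
print(best)  # class 1 has the highest score, 0.8 * 0.7 = 0.56
```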

After completing this for every single cell, the image ends up covered in candidate boxes, many of them unnecessary. This is a direct result of the first challenge: not having a set number of objects.

The next step is to remove these unnecessary boxes using non-max suppression. Non-max suppression relies on IoU (Intersection over Union), which measures the overlap between two bounding boxes as the area of their intersection divided by the area of their union.

It calculates the IoU of each box against the one with the highest class probability. If the IoU of the two boxes is greater than a set threshold, the program eliminates the box with the lower class probability, since the two boxes are covering the same area.

The Before and After result of non-max suppression
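Both steps can be sketched in a few lines of code (a simplified greedy version, not YOLO's actual implementation; boxes here are (x1, y1, x2, y2) corner coordinates):

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(boxes, scores, threshold=0.5):
    """Keep the highest-scoring boxes, dropping any box that overlaps a kept box too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= threshold for j in keep):
            keep.append(i)
    return keep

# two heavily overlapping boxes plus one separate box
boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))  # the overlapping lower-score box is dropped
```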

After this, the YOLO algorithm finally outputs the bounding boxes with their respective classes.

Applications of YOLO

The YOLO algorithm can be applied to these three main fields:

  • Self-Driving Vehicles: Used to detect other cars, pedestrians, and traffic signals to avoid collisions and drive around safely.
  • Security: Used in facial identification in phones, computers, and security cameras.
  • Wildlife: To help rangers and scientists identify creatures in videos that people may have missed.
