Yolo Algorithm (The Layman’s Approach)

May 23, 2021

What is Yolo?

Yolo stands for “You Only Look Once”. It is a state-of-the-art deep learning algorithm that is widely used for object detection. It was introduced by Joseph Redmon, Santosh Divvala, Ross Girshick and Ali Farhadi in a paper published in 2016.

What is the main difference between Yolo and the other Deep Learning Object Detection algorithms?

What sets Yolo apart from other Convolutional Neural Network (CNN) based object detection algorithms is its speed: it is fast enough to detect objects in real time. Yolo takes the whole image as input at once, and the image passes through the network only one time. That is the reason behind the name “You Only Look Once”. Many other detectors, such as the original R-CNN, instead run the CNN again and again on many candidate regions of the same image, which makes them much slower.

Imagine a self-driving car that uses one of these slower object detection algorithms. The car is supposed to brake when the algorithm notices an obstacle in front of it. Because the algorithm is slow, it recognises the obstacle quite late, which could easily lead to an accident. Now imagine the same situation with Yolo. Because Yolo detects the obstacle swiftly in real time, the car can stop well in time.
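To make the "one pass versus many passes" idea concrete, here is a toy sketch in Python. It does not run a real network: `fake_cnn` is a hypothetical stand-in that only counts how often it is called, and the four candidate regions are made up for illustration.

```python
# Toy comparison: how many times is the (expensive) network called per image?
# `fake_cnn` is a hypothetical placeholder, not a real CNN. It only counts calls.

calls = {"count": 0}


def fake_cnn(pixels):
    """Pretend to run a CNN; we only record that one forward pass happened."""
    calls["count"] += 1


def passes_with_region_crops(image, regions):
    """Region-based style: one forward pass for every candidate region."""
    calls["count"] = 0
    for top, left, bottom, right in regions:
        crop = [row[left:right] for row in image[top:bottom]]
        fake_cnn(crop)
    return calls["count"]


def passes_single_shot(image):
    """Yolo style: the whole image goes through the network exactly once."""
    calls["count"] = 0
    fake_cnn(image)
    return calls["count"]


image = [[0] * 6 for _ in range(6)]  # a blank 6x6 "image"
# Four made-up candidate regions as (top, left, bottom, right)
regions = [(0, 0, 3, 3), (0, 3, 3, 6), (3, 0, 6, 3), (3, 3, 6, 6)]

print("passes with region crops:", passes_with_region_crops(image, regions))
print("passes with a single shot:", passes_single_shot(image))
```

The number of passes in the first approach grows with the number of candidate regions, while the second approach always needs one pass, no matter how many objects are in the image.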

How does Yolo work?

The commonly used pretrained Yolo model has already been trained on a data set that contains 80 different classes of objects, including everyday things such as people, cars and airplanes.

The Yolo algorithm can detect all of these 80 kinds of objects in an image. It can also be trained on custom data to detect new kinds of objects. The data set that contains these 80 classes is well known as the “COCO” (Common Objects in Context) data set.

We will go through the working of the Yolo algorithm briefly and in a simplified manner, as this is a beginner-friendly article!

First, the image that is fed into the network is divided into a grid of cells. Here, we will take the example of a 3×3 grid.
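As a small illustration of the grid idea, the sketch below finds which cell of a 3×3 grid a given pixel falls into. The function name `grid_cell` and the 300×300 image size are assumptions made for this example only.

```python
def grid_cell(x, y, image_width, image_height, grid_size=3):
    """Return (row, col) of the grid cell that contains pixel (x, y).

    Rows and columns are counted from 0, starting at the top-left corner.
    """
    cell_width = image_width / grid_size
    cell_height = image_height / grid_size
    # min(...) keeps points on the right or bottom border inside the last cell
    col = min(int(x // cell_width), grid_size - 1)
    row = min(int(y // cell_height), grid_size - 1)
    return row, col


# Hypothetical 300x300 image split into a 3x3 grid (each cell is 100x100)
print(grid_cell(150, 150, 300, 300))  # a point in the middle of the image
print(grid_cell(10, 290, 300, 300))   # a point near the bottom-left corner
```

In this simplified picture, the cell that contains the center of an object is the one responsible for describing that object.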

So the image has been divided into a 3×3 grid, which gives a total of 9 grid cells. Each grid cell has some parameters. Suppose the total number of classes we are looking for in the picture is 3, say person, car and plane. Then each grid cell has a total of 8 parameters. Why 8? Because each grid cell has 5 more parameters in addition to the 3 class parameters. These 5 parameters are ‘po’, ‘tx’, ‘ty’, ‘tw’ and ‘th’.
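The arithmetic above can be checked with a few lines of Python. This is a sketch of the simplified picture used in this article, where each grid cell describes a single bounding box; the function names are made up for illustration.

```python
def values_per_cell(num_classes, box_params=5):
    """5 box parameters (po, tx, ty, tw, th) plus one score per class."""
    return box_params + num_classes


def total_outputs(grid_size, num_classes):
    """Total number of values for a grid_size x grid_size grid."""
    return grid_size * grid_size * values_per_cell(num_classes)


print(values_per_cell(3))    # person, car, plane: 5 + 3
print(total_outputs(3, 3))   # 9 grid cells, 8 parameters each
print(values_per_cell(80))   # the same count with 80 classes
```

With 3 classes each cell holds 8 values, so the 3×3 grid holds 72 values in total. With 80 classes the same rule gives 85 values per cell.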

We can only discuss these parameters once we know what bounding boxes are. While preparing the training data, we have to mark the objects that we want to detect in each image. We do this with the help of bounding boxes, which are simply squares or rectangles that enclose a specific part of the image. Here is an example below:

Here, we want to detect cars in the image, so we put bounding boxes around all the cars present in it. Now we need to know what those 5 parameters are that belong to each grid cell in the 3×3 grid. So, we have one more picture here, in which we look at one particular grid cell where a car is actually present.

The red dot in the middle is the center of the bounding box. The horizontal blue arrow is the parameter ‘tx’, the distance between the red dot and the left edge of this grid cell. The vertical blue arrow is ‘ty’, the distance between the red dot and the top edge of the grid cell. The horizontal white arrow is the width of the bounding box relative to the grid cell, denoted ‘tw’. The vertical white arrow is the height of the bounding box relative to the grid cell, denoted ‘th’.

The parameter ‘po’, also known as the objectness score, is the probability that the bounding box actually contains an object. And yes, you guessed it right: the value of 0.99 in the first picture of this article is the objectness score for a face being present in the bounding box surrounding it. The class scores ‘p1’, ‘p2’ and ‘p3’ give the probability that the object is a person, a car or a plane respectively.

All 9 grid cells in the 3×3 grid have these 8 parameters, and these parameters help the Yolo algorithm to detect objects accurately.
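The 8 parameters of one grid cell can be sketched in Python as follows. This is only an illustration under stated assumptions: the image is taken to be 300×300 pixels, all distances are divided by the cell size, and the helper names `encode_box` and `class_confidences` are hypothetical. Real Yolo versions differ in how exactly they scale the box values, and multiplying the objectness score with each class score is one common way to get a final confidence per class.

```python
CLASS_NAMES = ["person", "car", "plane"]


def encode_box(cx, cy, w, h, class_index, image_size=300, grid_size=3):
    """Turn a labelled box (center, width, height in pixels) into the
    8 values [po, tx, ty, tw, th, p1, p2, p3] of its grid cell."""
    cell = image_size / grid_size
    col = min(int(cx // cell), grid_size - 1)
    row = min(int(cy // cell), grid_size - 1)
    tx = (cx - col * cell) / cell  # distance from the cell's left edge
    ty = (cy - row * cell) / cell  # distance from the cell's top edge
    tw = w / cell                  # box width relative to the cell (assumed scaling)
    th = h / cell                  # box height relative to the cell (assumed scaling)
    po = 1.0                       # an object is present in this cell
    class_scores = [1.0 if i == class_index else 0.0
                    for i in range(len(CLASS_NAMES))]
    return (row, col), [po, tx, ty, tw, th] + class_scores


def class_confidences(cell_values):
    """Combine the objectness score with each class score."""
    po = cell_values[0]
    return {name: po * p for name, p in zip(CLASS_NAMES, cell_values[5:])}


# A made-up car label: center at (150, 125), 120 pixels wide, 60 pixels high
cell_index, target = encode_box(150, 125, 120, 60, class_index=1)
print(cell_index, target)

# A made-up prediction for the same cell, as a network might output it
prediction = [0.9, 0.5, 0.25, 1.2, 0.6, 0.1, 0.8, 0.1]
scores = class_confidences(prediction)
print(max(scores, key=scores.get))
```

Note that ‘tw’ can be larger than 1 in this scheme, because a bounding box may be wider than the grid cell that contains its center.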

Conclusion:

There are many more complexities involved in how the Yolo algorithm detects objects properly. Here we just wanted a brief, layman’s idea of how the algorithm works. I hope this article was helpful to you!


Written by Kartikeya Rawat