YOLO: Intuitive and Relatable Explanations

Ayush Raj
15 min read · Feb 20, 2024


This is the 3rd installment in the Object Detection Series. In case you missed the previous episodes, don’t worry, you can always come back here to map yourself: Series Introductory Blog.

Introduction

Ever wonder how self-driving cars “see” the road? Or how your phone identifies your face for unlocking? The answer lies in a revolutionary technology called YOLO.

YOLO, or You Only Look Once, is a family of real-time object detection and classification algorithms that has gained immense popularity in the field of computer vision, thanks to its ability to detect and classify multiple objects in an image or video frame in real time. Previous methods for object detection, like R-CNN and its variants, used a pipeline that performs the task in two stages. This can be slow to run and hard to optimize, because each individual component must be trained separately. Unlike them, YOLO, as the name suggests, does not use two separate stages for box and category prediction; it employs a single-stage network that addresses both problems simultaneously, achieving much higher inference speed and making it capable of solving real-world problems. For example, if you have a surveillance camera that needs to quickly identify multiple objects, a two-stage approach may struggle to provide fast results due to its sequential nature, potentially causing delays in critical applications, whereas YOLO delivers its results in real time. Recently, the latest release, YOLOv8, has taken the computer vision world by storm with its crazy speed, accuracy and multiple real-time capabilities! YOLOv8 is the eighth version of the YOLO series, and it was developed by the Ultralytics team.

Okay, I guess that’s enough talk! Let’s dive deep to unveil its groundbreaking technology and its amazing feats.

Basics of YOLO

When we refer to YOLO here, we basically mean YOLOv1, although most of the basics discussed remain the same across all the versions.

YOLO’s groundbreaking idea for object detection is to transform the image into a grid of cells, say S*S, and to make predictions for each of these S*S cells. This is a significant leap from conventional methods like Faster R-CNN, where the Region Proposal Network (RPN) generates close to 2000 proposals that all have to be predicted on. The computation needed here is far smaller, because S is generally about 7 to 10, so YOLO only has to predict for around 50 to 100 cells, and it does so in a single forward pass rather than with separate networks for proposals and for prediction, which is why it is called a Single-Stage Method. Another popular algorithm in this family is SSD, which I will explain in some other blog; for the time being, let’s focus on YOLO. This decrease in network complexity and in the number of proposals laid the ground for real-time detection, where processing speed was the major bottleneck.

One major idea behind the huge success of YOLO models is their ability to reason about proportions. We’ll look into this shortly.

So, let’s look into the finer details with an example.

Say S is 7 here, just for illustrative purposes.

Now, among these 49 cells, one cell takes responsibility for predicting an object if the midpoint of that object’s ground-truth box lies in that cell. Here, in the image above, the highlighted cell will predict the dog.
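To make this concrete, here is a minimal sketch (my own illustration, not code from the paper) of how the responsible cell can be found from an object’s midpoint, assuming box coordinates are normalized to [0, 1] and S = 7:

# Hypothetical helper: find the grid cell responsible for an object
def responsible_cell(x_center, y_center, S=7):
    """Return (row, col) of the cell containing the object's midpoint.
    Coordinates are assumed to be normalized to [0, 1]."""
    col = min(int(x_center * S), S - 1)  # clamp in case x_center == 1.0
    row = min(int(y_center * S), S - 1)
    return row, col

# Example: a dog whose ground-truth box has its midpoint at (0.45, 0.62)
print(responsible_cell(0.45, 0.62))  # -> (4, 3)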

And for doing that, YOLO uses the concept of Anchor Boxes. These are a set of predefined boxes of different shapes and sizes, often with aspect ratios of 1:1, 2:1 or 1:2, relative to which the actual bounding box is regressed in terms of displacements. When YOLO was first introduced, it used only 2 such boxes per cell, as mentioned in the research paper.

The purpose of using Anchor Boxes:

I have a question for you. What happens when the same cell contains the midpoints of 2 or more objects? Let’s look at this image. What would you say?

This is where anchor boxes come in: without them, each cell outputs only 1 prediction, and in this case YOLO would fail.

So to counter this, anchor boxes are used. The idea is to predefine 2 different shapes, called Anchor Box 1 and Anchor Box 2, so that each cell can make two predictions, one per anchor box. An object is allotted to the anchor box with which its ground-truth bounding box has the highest IoU (a threshold of 0.5 is generally used). So, here Box 1 will predict the girl and Box 2 will predict the car, as they most closely resemble the required shapes.

So, for each anchor box used in each cell, YOLO outputs (5 + ‘C’) values, where ‘C’ denotes the number of classes. Let’s look at each of these:

The first value represents the confidence score, also called the box confidence (Pc). Put simply, it tells us two things: how likely it is that an object is in that box, and if there is one, how well the box fits it. Hope you get the intuition. Mathematically, it is the product of the objectness score (the model’s certainty that the box contains an object at all) with the Intersection over Union (IoU) between the predicted bounding box and the ground truth. I assume you already have a basic understanding of IoU. For those who don’t:

How do we judge whether the predictions of an object detection algorithm are correct or not? This is where IoU plays its crucial role:

IoU = Area of Intersection / Area of Union
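To build intuition, here is a minimal IoU helper in Python (my own sketch, not code from the YOLO repository), with boxes given as (x1, y1, x2, y2) corners. It also shows how a ground-truth box could be matched to the anchor it overlaps most, as described in the anchor box section, using made-up coordinates:

def iou(box_a, box_b):
    """Intersection over Union for boxes in (x1, y1, x2, y2) corner format."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Hypothetical anchors centred on the same cell: tall (1:2) vs wide (2:1)
anchors = {"tall": (40, 20, 60, 80), "wide": (20, 40, 80, 60)}
gt_person = (42, 15, 58, 85)  # tall ground-truth box, e.g. the girl

best = max(anchors, key=lambda name: iou(anchors[name], gt_person))
print(best)  # -> 'tall', so the tall anchor box is responsible for this object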

The next four values are for the center coordinates (x,y), width & height of the predicted box.

Then the ‘C’ values are simply the probability distribution over all the classes, giving the likelihood that the object in the box belongs to each class. These are conditional probabilities: the probability of a class given that an object has been detected, P(Class_i|Object).

But here’s a catch! The highest of these ‘C’ values should not be interpreted as the output confidence score, as we usually do in the case of image classification.

In fact, the confidence score that YOLO outputs (the class confidence) is the product of this conditional class probability and the box confidence, which enables it to balance how certain it is that a box contains an object against how certain it is about which class that object belongs to.

So, the output vector for each cell, with the 2 anchor boxes mentioned above, will look something like this: [Pc1, x1, y1, h1, w1, c1, c2, c3, Pc2, x2, y2, h2, w2, c1, c2, c3], a vector of size (16, 1).

And in the context of our project, whose details I will share in due course, c1, c2 & c3 will be helmet, head & person respectively.

So, the output of YOLO will be a tensor of shape [S*S*K*(5+C)], where K is the number of boxes per cell and C is the number of classes.

Let me give you an example. Remember the girl & car example from the anchor box section? That cell’s output, assuming c1 = girl, c2 = car and c3 = bike, will be:

Here, the (Pc) value is 1 for both boxes, as is obvious from the image. Box 1 was for the girl and Box 2 for the car, and the results are as expected.
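To make that concrete, here is a small NumPy sketch (my own toy decoding with made-up numbers, not output from a real model) of one cell’s 16-value vector for K = 2 boxes and C = 3 classes, showing how the final score for each box is the product of its box confidence Pc and its class probabilities:

import numpy as np

CLASSES = ["girl", "car", "bike"]   # c1, c2, c3 for this illustrative example
K, C = 2, 3                         # boxes per cell, number of classes

# One cell's output: [Pc, x, y, h, w, c1, c2, c3] repeated for each of the K boxes
cell = np.array([
    1.0, 0.48, 0.55, 0.90, 0.35, 0.92, 0.05, 0.03,   # anchor box 1 (tall shape)
    1.0, 0.52, 0.60, 0.40, 0.95, 0.04, 0.93, 0.03,   # anchor box 2 (wide shape)
]).reshape(K, 5 + C)

for k, box in enumerate(cell, start=1):
    pc, class_probs = box[0], box[5:]
    scores = pc * class_probs            # confidence = Pc * P(class | object)
    best = int(np.argmax(scores))
    print(f"box {k}: {CLASSES[best]} with score {scores[best]:.2f}")
# -> box 1: girl with score 0.92
# -> box 2: car with score 0.93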

Now, I guess we have enough to dive into the concept of insights of proportions. What it means is that any cell of the grid, while focusing on its own area, is aware of the bigger picture; each cell carries some context of the whole image. It’s not that the model only sees the area of one cell at a time: it takes the whole image into account while focusing on that area. So each cell is not restricted to predicting a bounding box inside its own area; it can predict a box larger than itself using its knowledge of objects and their proportions (x, y, w, h). A good analogy for this magic of YOLO is us humans, and how we can recognize an object just by seeing a part of it. For example, if I show you only the arms of a person, you can predict that this is a person and also draw a probable bounding box, because you are aware of the proportions. Hope that makes sense to you!

So, YOLO boils down to a combination of a regression and a classification problem. Hope you had already built this intuition before I wrote it down! :)

Okay, so now we are going to look at NMS, which is Non-Max Suppression. It is a post-processing step, but it is as important as the other stages. Let’s look at this:

I have one more question for you, which will serve as the intuition for this: there can be multiple bounding boxes for each object in an image, so how do we deal with that?

Multiple detections for each object

Yup, NMS will tackle this. Without NMS, any model will predict outputs like this; as you’ve already understood, this is not what we want. The model predicts many redundant, overlapping boxes.

Okay, but why does this happen?

Because we run the classification and localization predictions on every grid cell, it’s always possible that many cells claim that the chance of the same object being in their cell (their ‘Pc’) is high. It’s quite intuitive: we know it’s one cell’s responsibility to predict the object containing its midpoint, but nearby cells close to that midpoint will also claim that they contain the object, with varying confidence scores. And hence, you get your answer. :)

So, NMS or Non-Max Suppression suppresses the boxes that are not the maximum, hence the name. It is a key technique, applied per class, to make sure your algorithm detects each object only once.

How does this work?

  1. First, it removes all bounding boxes of each class whose score is less than a particular threshold (generally this value is taken as 0.6).
  2. Then it looks at the probabilities (confidence scores) associated with each of the remaining detections for a particular class.
  3. It takes the box with the largest score, i.e. the most confident detection for that class.
  4. Having done that, NMS looks at all remaining bounding boxes of that class, picks out those which have a high Intersection over Union (IoU) with the highest-scoring box (generally an IoU threshold of 0.5 is used), and suppresses them.
  5. Then it finds the highest score among the remaining bounding boxes and again suppresses the remaining boxes that have a high IoU with it, repeating until no boxes are left to process.

By doing this for every class, we end up with only one bounding box for each object.
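A minimal per-class NMS sketch (my own simplified, loop-based version; real implementations are vectorized) that follows the steps above could look like this, reusing the iou helper from earlier:

def non_max_suppression(boxes, scores, score_thresh=0.6, iou_thresh=0.5):
    """Keep at most one box per object for a single class.
    boxes: list of (x1, y1, x2, y2), scores: matching list of confidence scores."""
    # Step 1: drop low-confidence detections
    detections = [(b, s) for b, s in zip(boxes, scores) if s >= score_thresh]
    # Steps 2-3: sort so the most confident detection comes first
    detections.sort(key=lambda d: d[1], reverse=True)
    kept = []
    while detections:
        best_box, best_score = detections.pop(0)  # most confident remaining box
        kept.append((best_box, best_score))
        # Steps 4-5: suppress remaining boxes that overlap it too much, then repeat
        detections = [(b, s) for b, s in detections if iou(b, best_box) < iou_thresh]
    return kept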

So for this example:

  1. Let’s take the score threshold to be 0.7, so boxes scoring below it are discarded.
  2. The box with score 0.6 for Car 1 gets discarded.
  3. NMS takes the largest score, which is 0.9 in this case for Car 1 and 0.8 for Car 2.
  4. Now it checks the IoU of all the remaining bounding boxes against these, i.e. the 0.7 boxes for both Car 1 and Car 2.
  5. NMS will suppress the 0.7 box for Car 1, as it has a high IoU with the 0.9 box for Car 1, and the 0.7 box for Car 2, as it has a high IoU with the 0.8 box for Car 2. Like this, we get only one bounding box each for Car 1 and Car 2, highlighted in the image.
  6. This is just a special case where you got rid of all the redundant boxes in one go. If redundant boxes persist, you just do all of this in a loop!

So, this wraps up our YOLO algorithm.

Phew, that was indeed a lot! I tried to keep it simpler for you. Hope you get it!

Limitations of YOLOv1:

  1. Fixed Number of Predictions: YOLOv1 predicted a fixed number of bounding boxes per grid cell (in our case only 49 cells make predictions, because we have 49 grid cells), which might not be optimal for images with varying object counts.
  2. Poor Localization: YOLOv1 had trouble accurately figuring out where objects were in images and how big they were. It often drew boxes that didn’t fit objects well, especially if objects were small or close together. This happened because YOLOv1 couldn’t handle different object sizes very effectively, and its feature maps lost fine spatial detail through downsampling, making its predictions less precise.
  3. Limited in Detecting Small Objects: YOLOv1 struggled to accurately detect and localize small objects within images, as its single-scale approach wasn’t well suited to handling objects of varying sizes.

Evolution from YOLO to YOLOv8

First, I will go through the evolution from YOLO to YOLOv8 very briefly, just to give you an idea. It’s optional; if you’re not interested, just skip this section.

Credit : Encord

YOLO (You Only Look Once), a popular object detection and image segmentation model, was developed by Joseph Redmon and Ali Farhadi at the University of Washington. Launched in 2015, YOLO quickly gained popularity for its high speed and accuracy.

  • YOLOv2, released in 2016, improved the original model by incorporating batch normalization, anchor boxes, and dimension clusters.
  • YOLOv3, launched in 2018, further enhanced the model’s performance using a more efficient backbone network, multiple anchors and spatial pyramid pooling.
  • YOLOv4 was released in 2020, introducing innovations like Mosaic data augmentation, a new anchor-free detection head, and a new loss function.
  • YOLOv5 further improved the model’s performance and added new features such as hyperparameter optimization, integrated experiment tracking and automatic export to popular export formats.
  • YOLOv6 was open-sourced by Meituan in 2022 and is in use in many of the company’s autonomous delivery robots.
  • YOLOv7 added additional tasks such as pose estimation on the COCO keypoints dataset.
  • YOLOv8 is the latest version of YOLO by Ultralytics. As a cutting-edge, state-of-the-art (SOTA) model, YOLOv8 builds on the success of previous versions, introducing new features and improvements for enhanced performance, flexibility, and efficiency. YOLOv8 supports a full range of vision AI tasks, including detection, segmentation, pose estimation, tracking, and classification. This versatility allows users to leverage YOLOv8’s capabilities across diverse applications and domains.
  • And recently, just a few weeks ago, on Jan 10, 2024, they announced the YOLOv8 upgrade, v8.1. It introduced the concept of Oriented Bounding Boxes (OBB), which marks a significant step in object detection, especially for angled or rotated objects, enhancing accuracy and reducing background noise in applications such as aerial imagery and text detection. The simple intuition is that until now we only drew axis-aligned rectangular boxes, which can be imprecise for tilted objects; OBB is the solution to this problem.

So, you see, innovation goes on and on! The computer vision community, and the AI community in general, never sleeps, haha! But really, a lot is yet to come!

Why should I use YOLOv8?

A few of the main reasons you should consider using YOLOv8 in your next computer vision project are:

  • The best reason ever may be its almost no-code requirement for implementation.
  • YOLOv8 has better accuracy than previous YOLO models.
  • The latest YOLOv8 implementation comes with a lot of new features; we especially like the user-friendly CLI and GitHub repo.
  • It supports object detection, instance segmentation, and image classification.

The community around YOLO is incredible: just search for any edition of the YOLO model and you’ll find hundreds of tutorials, videos, and articles. I must mention that I learned a lot just by discussing in the GitHub community and from the documentation they’ve provided. These are the best resources one can get. I’ll provide all the relevant links in the References.

  • Training YOLOv8 will probably be faster than training two-stage object detection models.

One reason not to use YOLOv8:

  • At the current time, YOLOv8 does not support models trained at 1280-pixel resolution, so if you’re looking to run inference at high resolution, YOLOv8 is not recommended.

How does YOLOv8 compare to previous models?

The Ultralytics team has once again benchmarked YOLOv8 against the COCO dataset and achieved impressive results compared to previous YOLO versions across all five model sizes.

Source: GitHub

When comparing the performance of the different YOLO lineages and model sizes on the COCO dataset, we want to look at several metrics.

  • Performance: Mean average precision (mAP)
  • Speed: Inference speed (in FPS)
  • Compute (cost): The size of the model in FLOPs and params

YOLOv8 is available in 5 different sizes based on the number of parameters and complexity. They are: nano, small, medium, large and extra large.

For the object detection comparison of the 5 model sizes: the YOLOv8m model achieved an mAP of 50.2% on the COCO dataset, whereas the largest model, YOLOv8x, achieved 53.9% with more than double the number of parameters.

We’ve used the medium version (YOLOv8m) for our Helmet Detection Project.

Source: GitHub

Overall, YOLOv8’s high accuracy and performance make it a strong contender for your next computer vision project, just like it did for mine!

Whether you are looking to implement object detection in a commercial product, or simply want to experiment with the latest computer vision technologies, YOLOv8 is a state-of-the-art model that you should consider.

Next, let’s look at how to implement and use the model.

Implementing YOLOv8

Let us look at how to use and implement YOLOv8 into your workflows. The model comes bundled with the following pre-trained models that can be utilized off-the-shelf in your computer vision projects to achieve better model performance:

  • Instance segmentation models trained on the COCO segmentation dataset with an image resolution of 640.
  • Image classification models pre-trained on the ImageNet dataset with an image resolution of 224.
  • Object Detection models trained on the COCO detection dataset with an image resolution of 640.

Source: Ultralytics. Example of Classification, Object Detection, and Segmentation.

In the next sections, we will cover how to access YOLOv8 via the CLI and via Python.

How do I use the YOLOv8 CLI?

YOLOv8 can be accessed easily via the CLI and used on any type of dataset.

!yolo task=detect mode=predict model=yolov8n.pt source="image.jpg"

To use it, simply set the following arguments:

  • task in [detect, classify, segment]
  • mode in [train, predict, val, export]
  • model as an uninitialized .yaml or as a previously trained .pt file
  • source as the path/to/data
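For example, a training run on the small coco128 sample dataset would combine the same arguments like this (the model, dataset and epoch count here are just placeholder values for illustration):

!yolo task=detect mode=train model=yolov8n.pt data=coco128.yaml epochs=5 imgsz=640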

Can I pip install YOLOv8?

Complementary to the CLI, YOLOv8 is also distributed as a PIP package, perfect for all Python environments. This makes local development a little harder but unlocks all of the possibilities of weaving YOLOv8 into your Python code.

You can clone it from GitHub:

git clone https://github.com/ultralytics/ultralytics.git

Or install it from pip:

pip install ultralytics

After pip installing you can import a model and use it in your favorite Python environment:

from ultralytics import YOLO

# Load a pretrained model
model = YOLO("yolov8n.pt")

# Use the model
results = model.train(data="coco128.yaml", epochs=5)  # train the model
results = model.val()  # evaluate model performance on the validation data set
results = model("https://ultralytics.com/images/cat.jpg")  # predict on an image
success = YOLO("yolov8n.pt").export(format="onnx")  # export a model to ONNX
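If you want to inspect what a prediction actually returns, each element of results carries the detected boxes. As a rough sketch (attribute names follow the Ultralytics results API as I understand it; the image URL is just an example):

from ultralytics import YOLO

model = YOLO("yolov8n.pt")
results = model("https://ultralytics.com/images/bus.jpg")  # run inference on one image

for box in results[0].boxes:               # boxes detected in the first (and only) image
    cls_id = int(box.cls)                  # predicted class id
    conf = float(box.conf)                 # confidence score
    x1, y1, x2, y2 = box.xyxy[0].tolist()  # corner coordinates in pixels
    print(model.names[cls_id], round(conf, 2), (x1, y1, x2, y2))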

What is the Annotation Format of YOLOv8?

YOLOv8 has a simple annotation format which is the same as the YOLOv5 PyTorch TXT annotation format, a modified version of the Darknet annotation format.

Every image sample has one .txt file with one line for each bounding box. The format of each row is presented as follows:

class_id center_x center_y width height

Notice that each field is space delimited and the coordinates are normalized from zero to one.

YOLOv8 annotation format example:

1 0.317 0.30354206008 0.114 0.173819742489
1 0.694 0.33726094420 0.156 0.23605150214
1 0.395 0.32257467811 0.13 0.195278969957
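Reading such a label file back is straightforward. Here is a small sketch (my own helper, with a hypothetical file name and image size) that converts the normalized values into pixel-space corner coordinates:

def read_yolo_labels(path, img_w, img_h):
    """Parse a YOLO-format .txt label file into pixel-space (x1, y1, x2, y2) boxes."""
    boxes = []
    with open(path) as f:
        for line in f:
            cls_id, cx, cy, w, h = line.split()
            cx, w = float(cx) * img_w, float(w) * img_w   # de-normalize x and width
            cy, h = float(cy) * img_h, float(h) * img_h   # de-normalize y and height
            x1, y1 = cx - w / 2, cy - h / 2               # midpoint -> top-left corner
            boxes.append((int(cls_id), x1, y1, x1 + w, y1 + h))
    return boxes

# e.g. read_yolo_labels("sample.txt", img_w=640, img_h=480)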

The data.yaml file contains information used by the model to locate images and map class names to class ids.

train: ../train/images 
test: ../test/images
val: ../valid/images

nc: 5
names: ['fish', 'cat', 'person', 'dog', 'shark']
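Pointing the trainer at your own data.yaml then works exactly like the coco128 example earlier; a minimal sketch, assuming the file above is saved as data.yaml alongside the train/test/valid folders:

from ultralytics import YOLO

model = YOLO("yolov8n.pt")                            # start from a pretrained checkpoint
model.train(data="data.yaml", epochs=50, imgsz=640)   # train on the custom 5-class dataset
metrics = model.val()                                 # evaluate on the val split from data.yaml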

Conclusion

In conclusion, YOLO, and particularly its latest iteration YOLOv8, has revolutionized the field of computer vision with its speed, accuracy, and ease of use. From self-driving cars to surveillance systems, its applications are vast and its potential for future advancements is even more exciting. Whether you’re a developer, researcher, or simply curious about cutting-edge technology, YOLO is definitely worth exploring.

This blog has only scratched the surface of the fascinating world of YOLO. Now that you’re equipped with the basics, why not start experimenting? YOLO’s journey is far from over, and its potential to transform our world is immense.

Do follow and Upvote for more such content!!!


Ayush Raj

A passionate learner who loves to break complex concepts into simpler explanations. Research Interests include Deep Learning and Computer Vision.