Is YOLO v10 the best one SO FAR?

Hashim Kalam
6 min read · May 28, 2024


YOLO v10

YOLO (You Only Look Once) is renowned for being one of the fastest object detection algorithms available today. Its speed and efficiency have made it a standard approach to object detection in the field of Computer Vision (CV).

YOLO can process images in real-time, making it ideal for applications such as autonomous driving, security surveillance, and retail analytics.

How Does YOLO Actually Work?

Consider the scenario of image classification where the goal is to determine if an image contains a dog or a person.

In image classification, we must determine whether the image contains a dog or a person. The dog class would be 1 and the person class 0, since (as in the image above) only the dog is present.

Object detection algorithms, however, add something called object localization. Along with the class data from the image classification described above, we also pass in a bounding box.

Bounding Box

{Pc, Bx, By, Bw, Bh, C1, C2}
{1 , 50, 70, 60, 70, 1 , 0}

Pc is the probability that an object is present: 0 if neither class appears in the image, 1 otherwise.

Bx and By are the center coordinates of the annotated box, which tightly covers the object (in this case, the dog).

Bw and Bh are the width and height of the annotated box.

C1 is the dog class; since a dog is present, it is 1.

C2 is the person class; since no person is present, it is 0.

Naturally, if there is no object at all, Pc would be 0 and the remaining values carry no meaning.
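To make this concrete, here is a minimal Python sketch of how such a label vector could be assembled; the function name and class ordering are illustrative assumptions, not part of any YOLO library:

```python
# A minimal sketch of the 7-unit label vector {Pc, Bx, By, Bw, Bh, C1, C2}.
# The function name and class ordering here are illustrative assumptions.

def make_label(box=None, class_id=None, num_classes=2):
    """box = (Bx, By, Bw, Bh) in pixels; class_id: 0 = dog, 1 = person."""
    if box is None:
        # No object: Pc = 0 and the remaining values carry no meaning.
        return [0, 0, 0, 0, 0] + [0] * num_classes
    bx, by, bw, bh = box
    classes = [0] * num_classes
    classes[class_id] = 1  # one-hot class indicator (C1, C2)
    return [1, bx, by, bw, bh] + classes

# The dog example from above: center (50, 70), width 60, height 70.
print(make_label(box=(50, 70, 60, 70), class_id=0))
# -> [1, 50, 70, 60, 70, 1, 0]
```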

What if we have more than one object in an image?

When multiple objects are present in an image, YOLO divides the image into a grid and predicts both bounding boxes and class probabilities for each grid cell, allowing the model to detect and localize multiple objects simultaneously.

For instance, if the grid size is 4 by 4, each cell will produce a vector of predictions. Assuming each prediction vector consists of 7 units (Pc, Bx, By, Bw, Bh, C1, C2), the overall prediction tensor will have a size of 4 by 4 by 7.

The same approach is used if an image contains overlapping objects, such as a person holding a dog.
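As a rough sketch (using NumPy, with made-up image dimensions), the 4 by 4 grid of 7-unit vectors can be stored as a single target tensor, with each object assigned to the grid cell that contains its center:

```python
import numpy as np

GRID, UNITS = 4, 7                       # 4 by 4 grid, 7 values per cell
target = np.zeros((GRID, GRID, UNITS))   # Pc = 0 everywhere by default

# Assumed image size and the dog box from earlier (center at 50, 70).
img_w = img_h = 200
bx, by, bw, bh = 50, 70, 60, 70

# The cell containing the box center is responsible for predicting the dog.
col = int(bx / img_w * GRID)
row = int(by / img_h * GRID)
target[row, col] = [1, bx, by, bw, bh, 1, 0]

print(target.shape)  # (4, 4, 7), i.e. one 7-unit vector per grid cell
```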

Training the Neural Network

Once both the images and their corresponding vectors are obtained, we can treat the image samples as input data and their corresponding vectors as output data to be passed to the neural network.

These data samples, with their corresponding input and output matrices, can then be passed into a neural network. The network can be tuned by adjusting the number of nodes in the hidden layers, the activation functions, and more, to find the combination that gives the best accuracy.

The output size matches the grid the model has divided the image into. For a 4 by 4 grid, the output covers 16 cells, each with its own 7-unit prediction vector, giving the 4 by 4 by 7 tensor described above.

The reason it is called YOLO (You Only Look Once) is that the model makes all of its predictions in a single forward propagation pass. This lets the model detect patterns and objects quickly, regardless of the number of grid cells.
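As a toy illustration of this single-pass idea (a deliberately tiny PyTorch model, not the actual YOLO architecture), a network can map an image straight to the full 4 by 4 by 7 prediction tensor in one forward pass:

```python
import torch
import torch.nn as nn

class TinyGridDetector(nn.Module):
    """Toy stand-in for YOLO: one forward pass yields the whole grid."""
    def __init__(self, grid=4, units=7):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(grid),       # collapse to a grid x grid map
        )
        self.head = nn.Conv2d(32, units, 1)  # 7 prediction units per cell

    def forward(self, x):
        return self.head(self.backbone(x))

model = TinyGridDetector()
out = model(torch.randn(1, 3, 64, 64))    # one fake RGB image
print(out.shape)  # torch.Size([1, 7, 4, 4]): every cell predicted at once
```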

Issues with Object Detection Models

Although YOLO can be considered one of the best models for object detection, no model is perfect. With YOLO, one common issue is overlapping bounding boxes.

Overlapping Bounding Boxes

Consider the image above, which contains two objects: a person and a dog. YOLO might initially detect multiple bounding boxes for these objects, in this case resulting in five bounding boxes where there should ideally be just two (one for each object).

Bounding Box Overlaps

Overlapping bounding boxes occur when multiple predictions cover the same object. This redundancy needs to be resolved to ensure the model outputs the most accurate and minimal set of bounding boxes.

Intersection Over Union (IoU)

To address overlapping bounding boxes, YOLO uses a technique called Intersection over Union (IoU). IoU is a metric that measures the overlap between two bounding boxes.

IoU = Intersection Area / Union Area
  • Intersection Area: the area where the two bounding boxes overlap.
  • Union Area: the total area covered by both bounding boxes combined.

If two bounding boxes completely overlap, the IoU value is 1. If they do not overlap at all, the IoU value is 0.
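Here is a small Python sketch of the IoU computation, assuming boxes are given as (x1, y1, x2, y2) corner coordinates rather than the center/width/height format used earlier:

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    # Intersection rectangle (empty if the boxes do not overlap).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (0, 0, 10, 10)))  # identical boxes -> 1.0
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # partial overlap -> ~0.14
```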

Non-Maximum Suppression (NMS)

To eliminate redundant bounding boxes, YOLO applies Non-Maximum Suppression (NMS), sketched in code after the steps below:

  1. Calculate Confidence Scores: Each bounding box is assigned a confidence score representing the likelihood of the object being present.
  2. Select Highest Confidence Box: The bounding box with the highest confidence score is selected.
  3. Suppress Overlapping Boxes: Any bounding boxes with an IoU above a certain threshold (e.g., 0.5) with the selected box are suppressed (i.e., removed).
  4. Repeat: This process is repeated for the remaining boxes until only the most confident, non-overlapping boxes are left.
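Putting these steps together, here is a minimal sketch of greedy NMS in Python (reusing the iou helper from the previous sketch; the 0.5 threshold is just an example value):

```python
def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS over (x1, y1, x2, y2) boxes with confidence scores."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)  # box with the highest remaining confidence
        keep.append(best)
        # Suppress the remaining boxes that overlap the winner too much.
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: the near-duplicate box 1 is suppressed
```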

However, YOLO v10 eliminates the need for NMS by using consistent dual assignments during training. This innovation reduces computational overhead and latency while maintaining high accuracy, making YOLO v10 faster and more efficient for real-time applications.

YOLO v10: Performance and Comparisons

The latest version of YOLO, YOLO v10, stands out as the best and most improved version for several compelling reasons — here’s why:

  1. Higher Accuracy: looking at the first graph at the top of the article, we can see that the model achieves a higher COCO AP than previous versions and even other models, clearly indicating that the YOLO v10 models (particularly the larger variants) have the best overall performance.
  2. Improved Latency: latency is the time the model takes to respond to an input. Even with its higher accuracy, YOLO v10 maintains low latency, making it well suited for real-time applications.
  3. Better Object Localization: a main challenge in object detection is handling overlapping bounding boxes. YOLO v10 builds on techniques such as IoU while its NMS-free design resolves overlapping boxes more efficiently, leading to more accurate object detection.
  4. Lower Parameter Count: from the second graph at the top of the article, we can conclude that YOLO v10 models predict accurately with fewer parameters, making them suitable for deployment in resource-constrained environments.

That said, we cannot single out one YOLO v10 model as the best; it all depends on the use case.

Below is a list of all the variants of YOLO v10 models — each having its own strengths.

YOLO v10 Model Variants
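If you want to try the variants yourself, recent versions of the ultralytics Python package ship pretrained YOLO v10 weights; the snippet below assumes that package and its yolov10n.pt naming convention:

```python
# pip install ultralytics
from ultralytics import YOLO

# Nano variant: smallest and fastest; swap in yolov10s/m/b/l/x as needed.
model = YOLO("yolov10n.pt")
results = model("image.jpg")  # one forward pass over the image
results[0].show()             # draw the predicted bounding boxes
```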

Conclusion

While YOLO v10 has shown a clear jump in accuracy, latency, and efficiency, the right choice still depends on the use case you are working under, as each use case may have a model best suited to it. Nevertheless, based on the evidence provided, YOLO v10 stands out as a leading object detection model.

For more information, check out the official YOLO v10 GitHub repository and paper.

