Introducing YOLOBench: Benchmarking Efficient Object Detectors on Embedded Systems

Deeplite
8 min read · Oct 25, 2023


We are excited to release YOLOBench, a latency-accuracy benchmark of over 900 YOLO-based object detectors for embedded use cases (accepted at the ICCV 2023 RCV workshop; the full paper is available on arXiv).

TL;DR — Check out the interactive YOLOBench app on HuggingFace Spaces where you can find the best YOLO model for your edge device!

Background

Choosing a YOLO object detection model in 2023 is more difficult than it may seem. Since the YOLOv3 architecture was introduced back in 2018, a whole zoo of improved YOLO-based object detection architectures has appeared, from YOLOv4, YOLOR, and Scaled-YOLOv4 to, most recently, YOLOv8, YOLO-NAS, and Gold-YOLO. While these models keep pushing the Pareto frontier of inference latency and detection accuracy on benchmark datasets such as MS COCO, it is often unclear how a given model would perform on specific target domain data and what prediction latency per image it would reach on the target hardware.

Below are good examples of plots used to compare architectures from the YOLO series with each other:

Figure 1 — Comparison on MS COCO minival on a V100. Source: Alexey Bochkovskiy on Twitter
Figure 2 — Source: Ultralytics YOLOv8 repository

or maybe like this one:

Figure 3 — Source: arxiv.org/abs/2301.05586

Looking at these plots, it is not obvious which of the YOLO models is best suited for a given use case. All the benchmarks above report MS COCO minival mAP@0.5:0.95 numbers (on 640x640 images), but the latency numbers presented are collected from a V100 GPU in one case, an A100 GPU in another, and finally a Tesla T4 GPU. But what if the use case is to deploy a model to detect shelf items in a retail store on a Raspberry Pi 4 board with TFLite, or on a board with a specialized accelerator chip, like an Orange Pi or a Khadas VIM3? The numbers above don’t translate well to an embedded use case where the model runs on very different hardware.

Furthermore, distinctions within the YOLO series extend beyond mere variations in neural network architecture. They encompass diverse aspects of the training process, from distinct implementations of the training loop, various training strategies, loss functions, and data augmentation techniques, to a wide array of hyperparameter configurations. For example, some of the YOLO models, like YOLOv5 and YOLOv7, have anchor-based detection heads, while others, like YOLOv6 and YOLOv8, are anchor-free. This poses a multi-dimensional challenge for anyone attempting to disentangle the effects of the actual architectural advances (e.g. RepVGG-style blocks and ELAN blocks of YOLOv7) from the many other factors that also play a role in the final mean average precision score. Add to this the fact that we still don’t have latency numbers for all these models on embedded boards, the very hardware where we want to deploy our models!

To summarize, there are several issues one should consider when thinking about the generalization of existing benchmarks:

  • Approximate proxy metrics for latency, such as the number of parameters or the MAC count, are used in some benchmarks, while others report inference latency on server-grade GPUs such as the V100 or A100. Neither approach directly translates into actual latency on embedded devices (where inference can run on edge GPUs, CPUs, NPUs, etc.),
  • Mean average precision scores are reported on the COCO minival or test dataset, which might not directly map to data from a specific production domain,
  • Some architecture parameters are considered fixed during benchmarking (like the input image resolution, which is usually fixed at 640x640 for COCO), but are known to act as useful scaling “knobs” that move the model in the accuracy-latency space (refer to the TinyNet paper for more details on this),
  • Finally, as we mentioned above, many things differ between model families even within a single reported benchmark, from the training loop implementation all the way to using extra data (e.g. pre-training on the larger Objects365 dataset) or pseudo-labelling extra unlabelled images from the COCO dataset.

Introducing YOLOBench

To address the challenge of comparing modern YOLO models against each other on embedded hardware under fair and controlled conditions, we have created YOLOBench, a benchmark dataset of latency and accuracy values for over 900 models, ranging from YOLOv3 to YOLOv8 architectures, on 4 datasets and 5 initial hardware platforms. We focused our attention on the different backbone and neck topologies of YOLO models, while fixing all the other dimensions of the possible solution space, such as the training code, the hyperparameter settings, the detection head structure, and the loss function. This enabled us to isolate the effect that the YOLO backbone and neck have when all other training pipeline features are fixed. To accomplish this, we used the state-of-the-art training pipeline provided by Ultralytics for YOLOv8, combined with the detection head and loss function from that codebase.
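To give a feel for what such a fixed pipeline looks like in practice, here is a minimal sketch using the Ultralytics Python API. The architecture YAML, dataset config, and training settings shown here are placeholders rather than the actual YOLOBench configuration (see the paper for the exact setup).

```python
from ultralytics import YOLO

# Build a detection model from an architecture YAML; swapping this file is the
# kind of knob YOLOBench varies (backbone/neck structure), while the head, loss
# and training hyperparameters stay fixed. "yolov8n.yaml" is just a stand-in.
model = YOLO("yolov8n.yaml")

# Train with the YOLOv8 head/loss and the same hyperparameters across runs,
# changing only the input resolution (imgsz) between experiments.
model.train(data="VOC.yaml", imgsz=480, epochs=100)

# Evaluate mAP@0.5:0.95 on the validation split.
metrics = model.val()
print(metrics.box.map)
```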

While using other training pipelines or tricks might yield better results for a given YOLO model, our goal was to evaluate how the different backbone-neck structures introduced in the YOLO series compare against each other when the other pipeline components are fixed. We wanted to answer questions such as: does most of the gain in the latency-accuracy trade-off come from the DNN topologies proposed in YOLOv6, v7, and v8, compared to older models like YOLOv3 and v4?

There is quite a lot of innovation that came with each new YOLO architecture in terms of the backbone and neck structure:

Figure 4 — Model architectures and hyperparameters used to generate the YOLOBench search space

Each model in the series is distinct in the DNN blocks used to build the architecture, from the RepVGG-style blocks of YOLOv6 to the more efficient ELAN and C2f bottlenecks of YOLOv7 and YOLOv8. There are usually only a few model variations for each YOLO in the depth-width space, such as the n, s, m, and l models in the YOLOv5 and YOLOv8 releases. With YOLOBench, we have expanded the coverage by looking at 12 depth-width variations, each one at 11 different input resolutions, from 160x160 to 480x480. This results in 132 models per architecture in the YOLO series.
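As a quick sanity check of that count, here is a sketch of how such a grid can be enumerated. The depth/width multiplier values below are illustrative assumptions (a 3x4 grid), not the exact YOLOBench settings; the resolution range matches the one described above.

```python
from itertools import product

# Hypothetical depth/width multipliers: a 3 x 4 grid gives 12 depth-width
# variants (the exact values used in YOLOBench are an assumption here).
depth_multipliers = [0.33, 0.67, 1.00]
width_multipliers = [0.25, 0.50, 0.75, 1.00]

# 160 to 480 in steps of 32 yields the 11 input resolutions.
resolutions = list(range(160, 481, 32))  # [160, 192, ..., 480]

variants = list(product(depth_multipliers, width_multipliers, resolutions))
print(len(depth_multipliers) * len(width_multipliers))  # 12 depth-width variants
print(len(resolutions))                                 # 11 resolutions
print(len(variants))                                    # 132 models per architecture
```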

Applying this approach to all the YOLO families considered resulted in about 1000 architectures in the search space. We filtered out some of these models by training the whole search space on the VOC dataset from scratch and keeping only the models close to the latency-accuracy Pareto frontier. The selected models were then pre-trained on COCO, as is typically done, and fine-tuned on 3 different downstream datasets (VOC, WIDER FACE, and SKU-110k), each at different input resolutions.
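For reference, here is a minimal sketch of the Pareto-dominance check involved in that kind of filtering. The actual YOLOBench filter also keeps models that are merely close to the frontier, and the numbers below are made up for illustration.

```python
def pareto_frontier(models):
    """Keep models for which no other model is both faster and more accurate.

    `models` is a list of (name, latency_ms, map50_95) tuples -- a toy stand-in
    for the per-device YOLOBench records.
    """
    frontier = []
    for name, lat, acc in models:
        dominated = any(
            other_lat <= lat and other_acc >= acc and (other_lat, other_acc) != (lat, acc)
            for _, other_lat, other_acc in models
        )
        if not dominated:
            frontier.append((name, lat, acc))
    return sorted(frontier, key=lambda m: m[1])

# Toy example with made-up numbers:
points = [("A", 10.0, 0.52), ("B", 14.0, 0.55), ("C", 12.0, 0.50)]
print(pareto_frontier(points))  # "C" is dominated by "A" (slower and less accurate)
```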

The inference latency values were measured on an initial set of 5 different hardware platforms:

  • NVIDIA Jetson Nano GPU (ONNX Runtime, FP32)
  • Khadas VIM3 NPU (AML NPU SDK, INT16)
  • Raspberry Pi 4 Model B CPU (TFLite with XNNPACK, FP32)
  • Intel® Core™ i7-10875H CPU (OpenVINO, FP32)
  • Orange Pi NPU (RKNN-Toolkit2, FP16)
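The general recipe for collecting such numbers is a warmup phase followed by timed runs, reporting a robust statistic such as the median. Below is a hedged sketch of such a loop with ONNX Runtime; the model path, run counts, and execution provider are placeholders, and the actual YOLOBench measurement harness differs per platform (TFLite, OpenVINO, vendor NPU SDKs, etc.).

```python
import time
import numpy as np
import onnxruntime as ort

# "model.onnx" is a placeholder; on a Jetson-class device one would typically
# pass a GPU execution provider instead of the CPU one used here.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
inp = session.get_inputs()[0]
x = np.random.rand(1, 3, 480, 480).astype(np.float32)  # NCHW input at one benchmarked resolution

for _ in range(10):                      # warmup runs (not timed)
    session.run(None, {inp.name: x})

timings_ms = []
for _ in range(100):                     # timed runs
    start = time.perf_counter()
    session.run(None, {inp.name: x})
    timings_ms.append((time.perf_counter() - start) * 1000.0)

print(f"median latency: {np.median(timings_ms):.2f} ms")
```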

The Results

In our YOLOBench benchmarks, we made a fascinating discovery: your choice of hardware matters! We dug deep into the performance trends of YOLO models and found that the best choice isn’t just about what you’re trying to do, but also where you plan to run it. Instead of a one-size-fits-all model, there are many sweet spots among YOLO-based object detectors. In addition, adjusting depth, width, and resolution (in a smart way!) can have a big impact on accuracy and speed.

So, without further ado let’s look at the results starting with the VOC dataset:

Figure 5 — VOC dataset results on different hardware

Quite a lot going on there! Each point in the above plot represents one model (one architecture at one input resolution). What if, for a better visualization, we only plotted those points that belong to the Pareto frontier in the latency-accuracy space?

Figure 6 — Pareto Frontier, VOC on different hardware

Now we can see some peculiarities in these data. Here’s the juicy part: different hardware platforms have their own set of best models. As you can see, there is quite a lot going on in the optimal model set! YOLOv7 models are prominent in the higher-mAP region on all hardware except the VIM3 NPU, where YOLOv6-3.0 shines instead. Interestingly, thanks to a modern training pipeline and a new detection head, some YOLOv3 and YOLOv4 models also make it onto the Pareto frontier! The reality is that there is no single winner here. The optimal model depends on your task and the hardware you will deploy to.

Here is another good example. Below are the results for the SKU-110k dataset, which is designed specifically for retail item detection and is clearly a domain quite different from VOC and MS COCO:

Figure 7 — Pareto frontier, SKU-110k dataset on the same hardware as Figure 6

Quite the difference! The YOLOv7 models shine less, and more YOLOv5 models appear in the optimal set. On the VIM3 NPU, YOLOv6 models clearly take the lead, and they also maintain a significant presence on the ARM CPU Pareto frontier. On the other hand, the GPU and Intel CPU plots reveal a more diverse landscape. Surprisingly, the newer YOLOv8 models do not dominate these scenarios.

Conclusion

There is no silver bullet among YOLO-based models. The best pick strongly depends on your use case, your hardware, and how you tweak things like depth, width, and input resolution. The insights available in YOLOBench just might be your secret recipe for finding the perfect model. You can get more tips and tricks in our paper, or check out our interactive YOLOBench HuggingFace Spaces application! Select your preferred dataset and target hardware to see the results across the YOLO-based models, along with instructions on downloading and using your preferred model.

Want to add your hardware to YOLOBench?

Our initial set of benchmark hardware is just a start! Showcase your hardware to YOLOBench users by adding results for more models and datasets on a wider set of hardware devices. If you’d like to have your hardware benchmarked on over 900 YOLO-based object detectors, you can find the instructions here and we’ll get right back to you!

GET MY YOLOBENCH HARDWARE KIT!

Tune in to the next blog post in this series to see how YOLOBench data can be used to discover new YOLO-like architectures that are optimal for a specific embedded device of interest!

Original article authored by:
Ivan Lazarevich Ph.D., Deep Learning Engineer Lead, Deeplite, Inc.

