Bottlenecks in embedded computer vision

Anton Maltsev
Oct 27, 2022

When I talk with people who have only recently started developing computer vision on the edge, I often run into a curious misconception:
If I take an NPU (GPU/TPU/CPU) that is twice as fast, my solution will run twice as fast.
Ha-ha! Let's talk about why this is usually false, and find where the maximum performance is hiding!

Why does this happen? The main reason is that the bottleneck is usually not in the accelerator. Let's examine the general data-processing scheme for computer vision tasks (a minimal timing skeleton is sketched after the list):

  1. Obtaining an image
  2. Preparing the image to be loaded into the neural network
  3. Sending the image to the accelerator
  4. Getting the result from the accelerator
  5. Postprocessing
  6. Based on the postprocessing, it is sometimes possible to return to step 2 (for some tasks)
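To find out which of these steps is the real bottleneck, it helps to time each stage separately before optimizing anything. Below is a minimal sketch of such a measurement loop; read_frame, preprocess, run_network, and postprocess are placeholders for your own functions:

```python
import time

def timed(stage_times, name, fn, *args):
    """Run one pipeline stage and accumulate how long it took."""
    t0 = time.perf_counter()
    result = fn(*args)
    stage_times[name] = stage_times.get(name, 0.0) + time.perf_counter() - t0
    return result

stage_times = {}
for _ in range(100):
    frame = timed(stage_times, "acquire", read_frame)            # placeholder capture function
    blob = timed(stage_times, "preprocess", preprocess, frame)   # placeholder preprocessing
    raw = timed(stage_times, "inference", run_network, blob)     # placeholder inference call
    dets = timed(stage_times, "postprocess", postprocess, raw)   # placeholder postprocessing

print(stage_times)  # the largest entry is where your time actually goes
```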

Let’s look at the process in detail.

Image Acquisition

The data can be retrieved from a directly connected camera, over a network, or from a local disk. Let's look at each of these options:

Via the CSI interface. The standard provides speeds of up to almost 10 Gbit/s. That is roughly four thousand black-and-white frames at 640×480 resolution per second!
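A quick back-of-the-envelope check of that number (it ignores protocol overhead and the limits of the sensor itself):

```python
# Theoretical frame-rate ceiling of a 10 Gbit/s link for 8-bit grayscale 640x480 frames.
link_bandwidth_bps = 10e9            # 10 Gbit/s
bits_per_frame = 640 * 480 * 8       # 8-bit grayscale
print(f"Ceiling: {link_bandwidth_bps / bits_per_frame:.0f} FPS")  # ~4069 FPS
```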

But the reality depends on the camera controller, the controller on the board, the bit rate of the image, and, of course, the CSI generation. On the RPi, CSI delivers 90–200 frames per second, depending on the camera version:

Official RPi documentation

Also, do not forget that there are different CSI camera drivers, and the FPS will depend on them too. For example, if you grab the picture through OpenCV, the frame rate can be about three times lower.
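If you are unsure what your capture path delivers, it is worth simply measuring it. Here is a minimal sketch with OpenCV; the device index is a placeholder, and the result depends heavily on which backend (V4L2, GStreamer, etc.) OpenCV ends up using:

```python
import time
import cv2

cap = cv2.VideoCapture(0)                  # device index is a placeholder; adjust for your board
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)

n_frames, t0 = 0, time.time()
while n_frames < 300:
    ok, frame = cap.read()
    if not ok:
        break
    n_frames += 1

print(f"Measured: {n_frames / (time.time() - t0):.1f} FPS")
cap.release()
```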

If you decide to connect two cameras, that will not be possible on just any system. Among mass-market boards, I know only the Jetson family supports it; for the rest, there are no ready-made solutions on the market. And even there, the speed of most CSI cameras will be limited to 60 fps:

One of the boards on the market

As a result, the biggest problem when scaling such a system is connecting a new camera or increasing the FPS. The CPU can also be an issue, but for CSI it is rare.

USB, GigE. Unlike the CSI interface, the main problem with USB is higher CPU consumption (you can see the comparison here). It is possible to find a modern USB camera that produces 500 FPS at 640×480. But you probably cannot connect two such cameras to one system: usually the different USB ports sit on the same controller, and if you connect a second camera, the per-camera bandwidth may drop significantly. To add another camera, you have to use PCI-e to create an additional controller.
A GigE camera requires a dedicated network port, but its functionality is about the same as USB.

As a result, when scaling such a system, two problems are the most dangerous:

  • Not enough CPU
  • The hardware cannot accept another camera while maintaining FPS

LAN. You can receive a video stream or individual images from a camera over the LAN. On average, a LAN camera provides a lower FPS than the other interfaces at comparable quality settings. The following bottlenecks may occur when using a LAN:

  • Network load
  • The network controller cannot keep up with multiple streams.
  • Unlike CSI and USB, a LAN usually carries a compressed stream. Most modern processors have a hardware block for decompression, and it can be a bottleneck too: several times, I have seen a system processing dozens of cameras get overloaded exactly at the decompression stage (a minimal sketch of receiving such a stream is below).
Somewhere on the left is the H.265 decompression module
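For illustration, here is a minimal sketch of pulling a compressed stream from an IP camera through OpenCV's FFmpeg backend; the URL is a placeholder, and whether the H.264/H.265 decoding lands on a hardware block or burns CPU depends on how FFmpeg was built for your board:

```python
import cv2

RTSP_URL = "rtsp://192.168.0.10:554/stream1"   # placeholder; real cameras use vendor-specific paths

# OpenCV hands the stream to its FFmpeg backend; decoded BGR frames come back to the CPU.
cap = cv2.VideoCapture(RTSP_URL, cv2.CAP_FFMPEG)

while True:
    ok, frame = cap.read()   # the frame arrives already decompressed
    if not ok:
        break
    # ... hand the frame to the preprocessing stage ...

cap.release()
```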

There are several other, more hardware-oriented protocols, for example, DCMI for the STM32. But they are very specialized and require a different device design.

I did not mention a few other protocols, like Camera Link: in practice, I either have never encountered them or encountered them many years ago.

It should also be mentioned that you can read images from persistent storage, whether a hard drive or flash. There, the memory or controller performance is often the limit. Don't forget to take this into account! Here, for example, is probably one such case.

Data Preparation and Postprocessing

We have the frame in memory. Now we have to prepare it. The preparation can include the following (a minimal sketch is given after the list):

  • Frame decoding (discussed above for LAN cameras)
  • Changing resolution (interpolation, for example)
  • Input normalization for the neural network (-1..1, 0..1, 0..255, etc.)
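As a rough illustration, here is a minimal sketch of the resize and normalization steps, assuming a hypothetical network that expects 224×224 RGB input in the 0..1 range; the size and range must match whatever your model was trained with:

```python
import cv2
import numpy as np

def preprocess(frame_bgr, size=(224, 224)):
    """Resize and normalize one frame for a hypothetical network expecting 0..1 RGB input."""
    resized = cv2.resize(frame_bgr, size, interpolation=cv2.INTER_LINEAR)
    rgb = cv2.cvtColor(resized, cv2.COLOR_BGR2RGB)
    return rgb.astype(np.float32) / 255.0   # normalization range must match the model
```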

At the same time, if a cascade approach is used (e.g., detection -> crop -> processing -> transmission to the next neural network), additional crop/preparation/preprocessing operations are possible.
Additional processing operations may also be needed if depth, stereo, or thermal cameras are used (bird's-eye-view preparation, etc.).
Almost the same bottlenecks appear in postprocessing (a naive sketch of the first item follows after the list):

  • Non-maximum suppression
  • Heatmap analytics
  • etc.
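Non-maximum suppression is a good example of pure CPU work that remains after the network has finished. Below is a deliberately naive sketch; on a weak CPU with thousands of candidate boxes, exactly this kind of loop becomes the bottleneck (OpenCV's cv2.dnn.NMSBoxes is a ready-made, faster alternative):

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Naive non-maximum suppression; boxes is an (N, 4) array of x1, y1, x2, y2."""
    order = scores.argsort()[::-1]   # best score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the best remaining box against all the others
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_rest = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                    (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_rest - inter)
        order = order[1:][iou <= iou_threshold]
    return keep
```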

The same can happen with more complex algorithms (optical flow, SLAM, etc.).
And, of course, the weaker the processor and the more complicated the algorithm, the longer these operations will take.
Often, optimizing the CPU-side code gives a bigger performance boost than simplifying the neural network.

Special mention should be made of the NVIDIA platform (and other GPUs): there, you can do all the preprocessing and postprocessing on the GPU. When we worked with the first version of OpenPose in 2016–2017, we sped the algorithm up almost five times by porting that code to the GPU. At the same time, the FPS of the neural network itself decreased, since the GPU was now also busy with pre- and postprocessing.
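As a simple illustration (not the OpenPose code itself), here is a sketch of moving just the resize step onto the GPU with OpenCV's CUDA module; it assumes OpenCV was built with CUDA support, which is common on Jetson-class boards:

```python
import cv2

# Requires an OpenCV build with the CUDA modules (cv2.cuda), e.g. on a Jetson.
gpu_frame = cv2.cuda_GpuMat()

def resize_on_gpu(frame_bgr, size=(224, 224)):
    gpu_frame.upload(frame_bgr)                  # one CPU -> GPU copy
    resized = cv2.cuda.resize(gpu_frame, size)   # the resize itself runs on the GPU
    return resized                               # keep the result on the GPU for the next stage
```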

Sending an Image to an Accelerator

There are a lot of different approaches to neural network computation: CPUs, GPUs, NPUs, and so on. Sending the data can also be implemented in different ways. For example, on a GPU, the data is first copied into GPU memory, and only then is the computation started. Some NPUs have internal memory; some are connected directly. See, for example, how different the inference speeds are for several fast and slow boards:

The Jetson Nano has high latency but the maximum performance (the picture is from my previous article)

Is there any way to optimize this? Yes, on some platforms.

For example, on GPU-based platforms. With NVIDIA, you can write the pre- and postprocessing in CUDA or use a framework like DeepStream, and so exclude unnecessary data copying (CPU->GPU->CPU->GPU).
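The same idea in a few lines of PyTorch, as a sketch rather than DeepStream itself: copy the frame to the device once, keep the preprocessing, inference, and as much postprocessing as possible there, and copy back only the small final result. MyModel is a placeholder for your own network:

```python
import torch

device = torch.device("cuda")
model = MyModel().to(device).eval()   # MyModel is a placeholder for your own network

def run(frame_u8):                    # frame_u8: an HxWx3 uint8 numpy array
    x = torch.from_numpy(frame_u8).to(device)               # single CPU -> GPU copy
    x = x.permute(2, 0, 1).float().unsqueeze(0) / 255.0     # preprocessing on the GPU
    with torch.no_grad():
        scores = model(x)                                    # inference on the GPU
    return scores.argmax(dim=1).cpu()                        # copy back only a tiny result
```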

Or you can solve this in hardware. For example, OAK-series cameras configure the whole pipeline directly on the Movidius chip to avoid unnecessary transfers.

Here are a few small tricks that can improve speed:

  • If you have small neural networks, it may be worth choosing an NPU with faster memory access.
  • If you have a GPU, it may be worth minimizing data transfers: load everything at once and in advance.


The same applies to reading results back from the accelerator: the more operations you perform on the accelerator, the less you need to transfer back.

Afterword

I hope this article helps you!
You can subscribe to my channel on YouTube or follow me on LinkedIn.

And if you want to ask more questions or have an interesting project, feel free to write to me and our team!
