Hardware acceleration goes far beyond hardware
Can hardware accelerators deliver the performance they claim without deep collaboration?
Written by Márton Fehér
Unless hardware accelerators are designed with the rest of the system, software, and algorithms in mind, great benchmark results often fail to translate into great production solutions.
Hardware accelerators make things go faster, right?
Over the years, chip designers have realized that while CPUs can execute almost anything, that flexibility comes at a significant price. As a result, accelerators are usually reserved for well-understood applications such as video, graphics, cryptography, or data compression. Dedicated hardware designed to execute just one task can deliver substantial gains in performance, power, and silicon area, sometimes by orders of magnitude.
Yet hardware accelerators that deliver blisteringly fast benchmark results often disappoint application developers who try to realize similar performance in their own applications. If the accelerator is not designed to work with everything else in the system, the benefits claimed by highly specialized benchmarks are lost. The bottlenecks can stem from a variety of factors, such as insufficient memory bandwidth, poor prioritization of shared resources, or inefficient interfaces to the rest of the application.
Since AImotive develops a broad portfolio of automated driving software, hardware, algorithm, and other technologies, we are fortunate to have several complementary skillsets under one roof. We use that know-how to help us ensure that whatever we design is appropriately thought through for real applications, not just benchmarks. And since AImotive came from a graphics and SoC benchmarking background, we understand the difference!
The simple fact is: no matter how good a neural network (NN) accelerator is, for a real-time automotive inference application it must work perfectly with all the other parts of the system. Unless the complete system is engineered as a whole, any one part can destroy the performance of everything else.
A simple example is an NN accelerator sharing memory with a CPU. A high-performance 64-bit CPU cluster places enormous demands on external DRAM. Studies have shown that up to 50% of CPU performance can be lost to stalls while the CPU waits for data after cache misses. How frustrating to put all that work into designing a 2GHz CPU, only to find it runs no faster than a 1GHz version.
So imagine what happens when a 20 TOPS NN accelerator shares memory with the same CPU cluster. If, for example, the NN engine uses the DRAM shared with the CPU for intermediate calculation results, it will consume additional gigabytes per second of memory bandwidth. If that slows down the CPU cluster, it slows down the entire system, regardless of how fast the NN accelerator might otherwise go.
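A back-of-the-envelope sketch makes the scale of the problem concrete. The layer shapes, frame rate, and int8 activations below are purely illustrative assumptions, not aiWare or any specific SoC; the point is only how quickly intermediate feature maps spilled to shared DRAM add up to gigabytes per second.

```python
# Illustrative sketch: extra DRAM traffic when an NN accelerator spills
# intermediate feature maps to memory shared with the CPU cluster.
# All layer shapes and rates are hypothetical.

def feature_map_bytes(width, height, channels, bytes_per_elem=1):
    """Size of one intermediate activation tensor (int8 assumed)."""
    return width * height * channels * bytes_per_elem

def spill_bandwidth_gbs(layers, fps):
    """DRAM traffic if each layer's output is written out, then read back
    by the next layer (hence the factor of 2)."""
    traffic_per_frame = sum(2 * feature_map_bytes(*layer) for layer in layers)
    return traffic_per_frame * fps / 1e9

# Hypothetical vision backbone: early high-resolution layers dominate.
layers = [(960, 540, 32), (480, 270, 64), (240, 135, 128), (120, 68, 256)]
print(f"{spill_bandwidth_gbs(layers, fps=30):.1f} GB/s of extra DRAM traffic")
```

Even this modest hypothetical network generates roughly 2 GB/s of additional DRAM traffic at 30 frames per second, bandwidth that comes straight out of the budget the CPU cluster is relying on.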
As a result, a design might look great on paper but massively underperform in practice. And it only gets more challenging when the CPU cluster shares the same SoC with a GPU cluster, another big consumer of memory bandwidth.
Dataflow through the entire system, and the timing and prioritization of that dataflow, pose a massive challenge for any systems integrator. That's why anyone designing an NN accelerator must consider how it integrates with the host CPU and the rest of the hardware, or risk causing more problems than it solves.
Since a hardware accelerator is, by definition, high performance, it can only work as fast as data is fed to it, and the results are used by the rest of the application. This simple fact of life is often missed when hardware designers focus only on making hardware capable of delivering exceptional performance.
This is especially true for NN accelerators like aiWare. You need to understand where the data is coming from, where it is going, its burst timing and latency characteristics, how the software interacts with the NN processor, and many other factors. Each of these impacts the accelerator's ability to do its job in a real application.
That's why aiWare has dedicated external DRAM for larger configurations. Perhaps more importantly, that's why it has significantly more on-chip SRAM, distributed among its MACs, to ensure data is always flowing smoothly. Indeed, aiWare has as much as two orders of magnitude more on-chip memory bandwidth than other well-respected engines claiming similar TOPS performance. This attention to dataflow, both on-chip per clock cycle and off-chip, and to how memory is shared with the host CPU, is why aiWare can confidently claim up to 95% sustained efficiency for vision-related workloads. It also ensures that aiWare places the minimum possible demands on the host CPU and the rest of the hardware system.
Prototype to Embedded
If you believe hardware vendors, it is easy to move a trained NN from a development framework to an embedded platform: just click the button in the SDK, and it’s done!
If only life were that simple. In reality, porting any NN trained in FP32 on an unconstrained PC or server to a highly constrained embedded SoC with hardware accelerators is incredibly challenging and nuanced, no matter how good the hardware platform is. It's not just about tweaking some parameters: NNs almost always need to be substantially redesigned to work well in a limited embedded environment, where a host of sensors and other subsystems feed real data to a substantially lower-performance CPU that has far less memory and must meet tough timing and power constraints.
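One small piece of that porting work is quantizing FP32 weights for integer hardware. The sketch below shows a minimal symmetric per-tensor int8 scheme with made-up weight values; production toolchains do far more (calibration data, per-channel scales, graph rewrites), but even this toy version shows why accuracy can shift the moment a network leaves its FP32 training environment.

```python
# Minimal sketch of post-training int8 quantization (symmetric,
# per-tensor). Weight values are made up for illustration.

def quantize_int8(weights):
    """Map FP32 weights to int8 values plus a single scale factor."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate FP32 values; the difference is quantization error."""
    return [v * scale for v in q]

weights = [0.42, -1.27, 0.05, 0.9981]
q, scale = quantize_int8(weights)
recovered = dequantize(q, scale)
print(max(abs(a - b) for a, b in zip(weights, recovered)))
```

Each weight now carries a small, systematic error; summed across millions of parameters and dozens of layers, such errors are one reason networks must often be retrained or redesigned for the embedded target rather than simply converted.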
That's why we ensure that customers of any aiWare licensee gain access not only to our hardware support team, but also to the expertise of our NN research, automotive AI software development, and test-vehicle integration teams. We have many years' experience moving NNs from research into deployment, and our partners' customers use AImotive to help them port NNs onto their target platforms. Because we can bring hardware, software, and NN algorithm engineers together whenever we need to, we can help our customers solve the challenges that make the transition from prototype to production so tough.
Great hardware is about far more than great hardware engineers.
It only becomes excellent when creative collaboration happens…
between great hardware, software, systems, algorithms, safety, validation, and other engineers to ensure that everything works together, all the time, in every scenario.