AI Accelerators — Part V: Final Thoughts

Dec 5, 2021

So — Who Has the Best AI Acceleration Solution?

This question is obviously complicated. There is no crystal ball that shows what will happen 5–10 years from now. Furthermore, if you want to know what the industry will look like, don’t ask an engineer; we know about technology, but technologically superior products do not always end up as the prevalent ones.

Many great ideas were implemented in the past five years, but even those are a fraction of the staggering number of AI accelerator designs and ideas in academic papers; there are still many ideas that can “trickle down” from the academic world to industry practitioners. These things are also fairly random: it is possible that sometime in the future, a new study will present a “breakthrough model” that delivers state-of-the-art accuracy in existing use cases or in new domains we have not yet discovered. That model might be a “killer app” for a specific accelerator, and once it becomes an industry standard, it will make that particular accelerator more appealing.

Furthermore, the devil is always in the details; online sales pitch material and specifications shine a light on the good parts of each architecture, but they are not very revealing. The weaknesses only surface when the vision materializes into an actual product, and they stem from many hard-to-predict factors. My bet is that over the next 2–3 years, we will see the continued disaggregation of ideas and solutions, but around 2024–2025, things will settle down, and we will see both research and commercial AI starting to converge on a handful of existing solutions and best practices, with about 3–5 accelerated computing companies leading the pack.

Takeaways

The AI acceleration game is in full gear. There is no doubt that NVIDIA currently has the upper hand thanks to its mature CUDA software stack and its head start in MLPerf benchmarks (let’s not forget that almost all of these benchmarks were designed and optimized on NVIDIA platforms). It also has the means to control the entire system stack through acquisitions like Mellanox and maybe ARM (or maybe not?). However, there is plenty of room for innovation; the main bet of the startups is that NVIDIA is caught in the “innovator’s dilemma.” In the long haul, there’s a difference between adapting an existing solution while keeping its original features and foundations, and building an entirely new solution from scratch. Therefore, while NVIDIA has a head start of a few good years, once a few of those startups catch up, they will give NVIDIA a literal run for its money. Here are a few guidelines for the contenders in the race, as well as newcomers:

Be mindful of good architectural foundations based on solid research and long-term vision, and think about as many details of the target application space as possible. Although it might seem overly simplified and optimistic, I cannot stress this enough. Based on my own experience observing the landscape, even with significant funding and large, talented teams, both corporations and startups struggle to overcome sub-optimal architectural decisions made at an early stage, even after multiple generations of chips. Every architecture has a few weaknesses, and they stem from many different causes.
The weaknesses that are relatively easy to fix stem from underestimating resources (for example, not enough cores or scratchpads that are too small). Other weaknesses result from the lack of a “one-size-fits-all” technology, which forces vendors to choose one technology over another, sometimes as a by-product of real-world constraints. For example, the NVIDIA A100 relies on high-bandwidth memory (HBM), which has a higher bandwidth but a lower capacity (10s of GB) than traditional DRAM (up to a few TB). This is less a weakness than a conscious design decision made at the time, since most AI applications were more bandwidth-hungry than capacity-hungry. NVIDIA GPUs therefore enjoy fast memory accesses for workloads that fit inside the 10s of GB of HBM space. However, to train large models, e.g., multi-billion-parameter language models like GPT-3, users need to distribute and communicate data across a potentially large number of GPUs just to have enough memory space to contain all the processed data.
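
To put rough numbers on the capacity gap described above, here is a minimal back-of-the-envelope sketch in Python. All figures are illustrative assumptions rather than numbers from this post: 175 billion parameters for GPT-3, 2 bytes per parameter for half-precision weights, roughly 16 bytes per parameter for mixed-precision training state with an Adam-style optimizer, and 40 GB or 80 GB of HBM per device. The helper `min_devices_for_model` is a hypothetical name, and since it ignores activations, buffers, and fragmentation, it gives a lower bound only.

```python
import math


def min_devices_for_model(num_params: int, bytes_per_param: int, device_mem_gb: int) -> int:
    """Lower bound on the number of devices needed just to hold the model state.

    Ignores activations, communication buffers, and memory fragmentation.
    """
    total_bytes = num_params * bytes_per_param
    device_bytes = device_mem_gb * 10**9
    return math.ceil(total_bytes / device_bytes)


# Illustrative assumptions (not taken from the post):
GPT3_PARAMS = 175 * 10**9  # assumed parameter count
FP16_WEIGHTS = 2           # bytes per parameter, weights only
TRAINING_STATE = 16        # bytes per parameter, assumed: weights + gradients + optimizer state

for mem_gb in (40, 80):
    weights_only = min_devices_for_model(GPT3_PARAMS, FP16_WEIGHTS, mem_gb)
    training = min_devices_for_model(GPT3_PARAMS, TRAINING_STATE, mem_gb)
    print(f"{mem_gb} GB HBM: at least {weights_only} devices for weights, "
          f"at least {training} for training state")
```

Under these assumptions, the weights alone exceed a single device’s HBM several times over, and the training state pushes the count into the tens of devices before a single activation is stored.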

When designing a new architecture, don’t downplay the complexity of software stack and toolchain development: The weaknesses that are probably hardest to fix are wrong choices made in the fundamental stages that formed your architecture. For example, Wave Computing, SambaNova, and SimpleMachines took seemingly similar approaches (compiler-driven dataflow execution for a reconfigurable accelerator). Still, while Wave Computing filed for bankruptcy and, according to LinkedIn, all of SimpleMachines’s leadership is no longer there, SambaNova seems to be doing well. It’s hard to know the main reason for the difference (as there are probably multiple significant contributors), but based on some available online material, Wave Computing designed their DPU so that the hardware would be simple and abstract enough to support a wide range of applications, and treated achieving good performance as a software problem. Since the software cycle has a much faster turnaround than taping out a new chip, they decided to leave most of the heavy lifting to software. However, they discovered that while some compilation techniques worked well for simple kernels, they were hard to generalize and became computationally intractable for large programs. The lesson learned here is that in the early stage, the architecture must be carefully defined together with people from multiple layers of the stack. Furthermore, from an architectural standpoint, their design was very ambitious: toggling circuits at very high speeds (6.7GHz), combining slow and fast memory spaces (DDR and HMC), etc. All these underlying details needed to be abstracted away and controlled by very elaborate software to orchestrate the dataflow efficiently. People sometimes gloss over the complexity of software. However, as you design more computationally intensive systems, even just a few abstracting assumptions that make the hardware simpler or more general-purpose can result in months of tedious software development.
Arguably, the Wave Computing case is what happens when you are too far ahead of the curve. Most of their architectural foundations were probably laid around 2015, when compiler-driven AI designs were in their infancy. With limited accumulated knowledge, the Wave Computing leadership might have made some optimistic assumptions about software development simply because they were tackling new and unsolved problems.

Utilization is All You Need. Accelerators derive most of their appeal from one concept: parallelism. The ability to employ parallelism is the number one contributor to performance. The rest is just details, but like I said, details matter. You often hear about TOPS, TFLOPS, GOPS, etc., which refer to how many arithmetic operations the accelerator can theoretically perform in a second. However, the main challenge in the landscape is utilization, i.e., how many operations you can practically get per second, since that determines what the users really see. Back in undergrad, I drove a 15-year-old Renault Clio, and it could theoretically go 150 mph with no gas consumption, but only if you threw it off a cliff. While the maximal throughput determines how fast you compute when all arithmetic units on the chip are used, that is hardly the typical case. Often, you might find that the chip is busy doing other things, like synchronizing between different computation units, fetching data from off-chip memory, or communicating data across units and chips. To increase utilization, we need to avoid these overheads by building a sophisticated software stack capable of predicting and minimizing the impact of these hardware events for all real-world scenarios, all different neural architectures, and all tensor shapes. That’s why many AI “hardware” organizations have at least as many software engineers as they have hardware engineers.
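
The gap between peak and practical throughput can be made concrete with a small Python sketch. It is a simplified model under stated assumptions, not a measurement of any real chip: the 100 TOPS peak, the matrix sizes, and the overhead ratio are all hypothetical, as are the helper names `matmul_ops` and `utilization`.

```python
def matmul_ops(m: int, n: int, k: int) -> int:
    """Arithmetic operations in an (m x k) by (k x n) matrix multiplication.

    Counts one multiply and one add per inner-product term.
    """
    return 2 * m * n * k


def utilization(useful_ops: float, elapsed_seconds: float, peak_ops_per_second: float) -> float:
    """Fraction of the theoretical peak throughput that was actually achieved."""
    achieved_ops_per_second = useful_ops / elapsed_seconds
    return achieved_ops_per_second / peak_ops_per_second


# Hypothetical accelerator and workload (assumptions for illustration only):
PEAK_OPS_PER_SECOND = 100e12  # "100 TOPS" on the spec sheet
ops = matmul_ops(4096, 4096, 4096)

# Time the arithmetic would take if every unit were busy the whole time.
compute_seconds = ops / PEAK_OPS_PER_SECOND
# Assumed time lost to synchronization, off-chip fetches, and communication.
overhead_seconds = 3 * compute_seconds

u = utilization(ops, compute_seconds + overhead_seconds, PEAK_OPS_PER_SECOND)
print(f"utilization: {u:.0%}")
```

In this toy model the chip spends three times longer on overheads than on arithmetic, so the user sees only a quarter of the advertised peak, which is why shrinking `overhead_seconds` matters as much as raising the peak.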

Everything Old is New Again. As you might have noticed, almost every AI acceleration solution overviewed in this series started from an academic idea that is decades old: systolic arrays were introduced in 1978, VLIW architectures in 1983, the concept of dataflow programming dates back to 1975, and early processing-in-memory work was presented in the 1970s. One reason for this “conservatism” is that in such a competitive and fast-moving landscape, bold new architectural ideas are too risky, and people favor well-rounded ideas that have matured and been implemented in other contexts. The other reason is that, much like the AI renaissance, the hardware renaissance cannot be attributed only to algorithms and ideas, as there have been great ones over the years. It is the ability to materialize these ideas that brought innovation to our doorstep; it took the combined effort of: (i) chip material sciences and fabrication methodologies that made silicon chips orders of magnitude more performant and (ii) accumulated human knowledge that drives a better understanding of CAD tools, programming languages, and compilers, which is needed to master the modern, extremely complex hardware-software stack.

On a personal note, I hope that you enjoyed reading this series at least as much as I enjoyed writing it. I think that these are great times for AI enthusiasts and practitioners everywhere, regardless of whether you are a hardware or software engineer, a junior or senior computer scientist, an applied mathematician, or a technology aficionado. Times are moving fast. It would be interesting to revisit this series, say, just five years from now, and see which of these write-ups stood the test of time.

Previous Chapter: The Very Rich AI Accelerators Landscape

Series Follow Ups:

Ushering in The Next Wave of AI Acceleration
What’s next for AI Accelerators?
The AI Accelerators Blog Series: a Year’s Retrospect
Just a year went by, and a lot has changed…

Written by Adi Fuchs