Ushering in The Next Wave of AI Acceleration

Adi Fuchs
12 min read · Jun 13, 2022


The five-part blog series left me with some thoughts about the state of AI hardware acceleration: what merits drove acceleration’s popularity, what we should be doing better, and what future directions acceleration could take to better serve the AI community. Everything written here is based on my observations, so it is all debatable, the predictions in particular.

Scale-to-Commodity: How Revolutions are Made

Specialized hardware played a key role in (at least) two pivotal moments in the modern history of deep learning: AlexNet and Transformers. The papers presenting both architectures highlight the use of GPUs in their abstracts and elaborate on how the underlying computations were parallelized across multiple GPUs to speed up training.

Why was that so important? Because deep convolutional neural networks for computer vision existed almost two decades before the AlexNet paper, and likewise, LSTMs, RNNs, and self-attention mechanisms for natural language processing predated the Transformers paper; many of the algorithmic foundations were already there. However, by showing how large computations can be done at a reasonable cost, the authors created a breakthrough for these domains; they captured the revolutionary essence of AI acceleration: Scale-to-Commodity.

The 1000x Network Rate Increase of Routers, Wired, and Wireless Over 25 Years (Source: Tetsuya Miki)

While increasing performance per cost has been a long-term goal of computer scientists and engineers, at certain rates the gains become game-changers, turning what was once thought impossible into a commodity. A 100x increase in chip performance means that within two hours you can run a complex model that would previously have taken a week, or would have required hundreds of CPU-based workstations (assuming you were able to master all that parallelization and pay the outrageous electricity bills). Think of how different our lives have become in the past 20 years due to the 1000x increase in internet bandwidth; can you imagine “bingeing” your favorite show if you had to wait hours for each episode to download before watching it?

In order to sustain the revolutionary nature of AI, we need to think about what models can be commoditized and how to do it.

The Problem(s) With AI Acceleration Benchmarking

In recent years we have witnessed the rise of MLPerf, a benchmarking suite that sets out to standardize the performance numbers reported by AI hardware and systems vendors. MLPerf is a joint effort by industry leaders, and it is now part of a consortium spanning over 70 leading tech organizations (like NVIDIA, Google, AMD, and Intel), startups (like SambaNova, Cerebras, and Graphcore), and academic institutions (like Stanford, Harvard, and Berkeley).

At its core, MLPerf consists of multiple AI applications, for example ResNet50 and BERT, that are evaluated in different domain settings like datacenter training, datacenter inference, and mobile-device inference. Every few months, MLPerf accepts result submissions (typically application run times) from companies and other institutions and publishes them for each domain. That way, submitters can see how well they are doing against the competition, and potential customers can examine which AI systems are best suited to their needs.

MLPerf has an ambitious mission statement: enhancing AI adoption by giving users the means to reason about performance based on real applications, open datasets, and easy-to-understand metrics. As AI hardware becomes more ubiquitous, we need more industry standards and metrics, and understanding how AI hardware performs will become even more imperative. While MLPerf is a great step in this direction, I can think of three refinements that could improve its impact.

I. Normalize

Reasoning about AI accelerator performance is not trivial; the AI accelerator landscape is diverse: needs vary significantly across users (hobbyists, grad students, businesses, hyperscalers, etc.) and systems (mobile devices, workstations, autonomous cars, HPC/clouds, etc.). Furthermore, there are myriad ways to build an accelerator; given a computational domain, you can co-design hardware and software any way you want, bounded only by your imagination and creativity (well, that and market demands and the available workforce). It all boils down to many types of AI accelerators and systems; comparing one acceleration technique to another is impractically hard, and defining what counts as a fair comparison for accelerators is even harder.

MLPerf Training Results vs. Moore’s Law (Source: IEEE Spectrum/MLCommons)

In an article published in IEEE Spectrum, it was claimed that MLPerf submissions demonstrated AI training performance outpacing Moore’s Law. While Moore’s Law projects about a 2.2x increase in transistor density over 2.5 years, MLPerf training results improved at the much higher rates of 16–27x over that time. One could draw an immediate conclusion here: “If we are beating Moore’s law by that much, most of the performance boost cannot come from better silicon, but from better acceleration and software techniques.” However, there are some caveats to this conclusion, and they downplay hardware’s contribution by not properly normalizing performance. Here’s why:

Step-by-Step Normalization of ResNet50 Training Results (NUMBERS WERE UPDATED 10/2022)

Let’s dive into the numbers. Warning: if you find this paragraph too technical, see the image above as a tl;dr. Take datacenter training of ResNet50 as an example: the 2018 MLPerf baseline result was 6.2 minutes, achieved using 640 NVIDIA V100 GPUs. The 2021 result was 0.23 minutes, so it was indeed 27x faster. But it was achieved using 3456 TPUv4 accelerators, i.e., 5.4x more chips, so if we normalize per chip, performance actually improved by 5x, which is still impressive. However, that is not the whole story; a V100 GPU has 21.1 billion transistors, and while Google is not very revealing when it comes to TPU specifications, the transistor count of the TPUv4 is estimated at around 31 billion, so when we account for that, we are down to about 3.3x performance per transistor. Finally, the V100 was fabricated in a 12nm CMOS technology, while the TPUv4 is likely fabricated in 7nm CMOS (like the TPUv4i). According to the ITRS and other academic studies, 7nm circuitry should be about 1.5x faster than 12nm circuitry. So, accounting for all these factors (bigger budgets, faster and better hardware capabilities), we are down to about 2.2x of nominal performance gain. (EDIT 10/2022: this section was modified following the publication of a better estimate of the TPUv4 specs.)
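If you want to play with the arithmetic yourself, here is a minimal back-of-the-envelope sketch in Python. All of the inputs are the estimates quoted above (the TPUv4 transistor count and the 7nm-vs-12nm factor in particular are estimates, not official figures), and small rounding differences against the numbers in the text are expected.

```python
# Back-of-the-envelope normalization of the ResNet50 MLPerf training results.
# All inputs are the estimates quoted in the text, not official vendor figures.

baseline_minutes, baseline_chips = 6.2, 640      # 2018 baseline: 640x NVIDIA V100
new_minutes, new_chips = 0.23, 3456              # 2021 result: 3456x TPUv4

raw_speedup = baseline_minutes / new_minutes                  # ~27x
per_chip = raw_speedup / (new_chips / baseline_chips)         # ~5x after equalizing chip count

v100_transistors = 21.1e9
tpuv4_transistors = 31e9                                      # estimated
per_transistor = per_chip / (tpuv4_transistors / v100_transistors)   # ~3.4x

process_speedup = 1.5                                         # 7nm vs. 12nm circuit speed (estimate)
normalized = per_transistor / process_speedup                 # ~2.2-2.3x nominal gain

print(f"raw: {raw_speedup:.1f}x  per chip: {per_chip:.1f}x  "
      f"per transistor: {per_transistor:.1f}x  normalized: {normalized:.1f}x")
```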

Obviously, this is a first-order approximation. Actual performance is driven by many factors other than the computing silicon (memory bandwidth, caching, communication bandwidth and topology, etc.), more complex systems do need to be managed by a more elaborate and mature software stack, and having X times more chips or transistors does not guarantee X times more performance. But this is a mental exercise meant to illustrate normalization, and specifically transistor-independent performance improvement trends.

Why is transistor-normalized performance so imperative?

1. This is what accelerators are all about; if transistor technology stops improving (following the end of Moore’s law), the path forward is getting more performance out of our non-improving silicon.
2. It does not make sense to use absolute numbers in the absence of a fair comparison, much like it would not make sense for two people to compete on the same race track when one is riding a bike and the other is driving a Lamborghini.
3. Chip performance cannot improve indefinitely. When accelerator vendors develop a line of accelerators, they need to evaluate their hardware-software stacks across generations and see the true contribution of newer optimizations, while being aware that at some point they will stop coming up with better solutions for the same problem, hit diminishing chip specialization returns, and hit an “accelerator wall”.
4. The practice of improving performance by throwing more money at a problem is not scalable. MLPerf will remain relevant only if it sticks to the acceleration vision that enabled the AI renaissance in the first place: Scale-to-Commodity.

AlexNet and Transformers opened the gate to training complex models on a single system, which led to a breakthrough. We need a greater emphasis on normalized performance metrics, e.g., performance per watt (some effort is being made here, but it does not seem to get the same traction, since power measurements can be tricky and are not as prestigious as a new training-time world record) or performance per transistor.

Finally, while it is good to know the absolute performance numbers, does the average data scientist really care about how execution times improved on petascale systems with thousands of processors they cannot afford? This brings us to the next point in our discussion:

II. Democratize

The MLPerf application space is based on models from academic research papers published around 2012–2018, when GPUs were almost the only viable option for deep learning. Therefore, all these papers evaluate their performance on GPUs. As a result, the papers’ underlying architectural features, hyperparameters, and model sizes were all tuned to GPUs; their merits are coupled with the GPU’s architectural merits, and their results are bounded by the GPU’s limitations. The outcome of this practice is that MLPerf essentially becomes a “porting from GPU” competition that gives GPUs an advantage and perpetuates them as a synonym for AI hardware. Don’t get me wrong: we should all cherish GPUs for being the gate-openers of modern AI, since without their mature hardware/software stack we would not have witnessed the rise of AI applications. However, now that other solutions are out there, we should also explore the properties of models natively developed on other types of architectures, since they might have different limitations and can open new domains. Furthermore, we might not want a single software implementation, or a single type of architecture, dictating the gains and limits of the entire AI application space. It is fair to assume that if, for example, TPUs and the TensorFlow ecosystem had existed in 2010, the MLPerf application space would look different.

III. Generalize

“When a measure becomes a target, it ceases to be a good measure”
Charles Goodhart

The final challenge MLPerf needs to overcome is the lack of application generalization. About a year ago, I experimented with the performance measurements of one of the leading AI hardware vendors, and based on my experience, getting the expected performance was not trivial. Even after installing all the latest libraries on a fresh Docker image, the performance of the MLPerf model was not close to the published numbers. It was only possible to reach the MLPerf numbers after I used the vendor’s version of the model from a dedicated repo, which also included manual configuration files that override the compiler’s decisions for that model. Finally, when I used the exact same configuration but swapped in a different model with almost exactly the same layers and sizes, performance dropped by about 50%. I have heard of similar experiences from other folks as well (on other models and other vendors).

This behavior is a natural and unfortunate outcome of benchmarking “slow-moving targets”. MLPerf has been using the same set of applications for over three years. As a result, vendors get better performance numbers by producing these unrealistic “one-offs” for those applications; most data scientists do not have a profound understanding of the underlying hardware, compiler, and software libraries; they do not spend weeks on performance tuning, let alone create manual configuration files that override the compiler’s decisions. In my first blog post, I mentioned that one of the main challenges of AI is reaching “user-to-hardware expressiveness.” AI is a fast-moving field with hundreds of new research papers published every day, and instead of focusing on decade-old applications and allowing hacks like model-specific compilation files, we need to generalize our stacks so they run more cases with good performance “out of the box.” MLPerf should put more emphasis on what data scientists need: a robust acceleration stack that utilizes the underlying hardware for never-before-seen workloads. Currently, most AI hardware vendors, such as accelerator startups, prioritize customer models and needs over MLPerf applications, and if MLPerf does not address this by taking a more generalized and democratized path, it will end up as a competition between three or four companies and lose its appeal as an industry-wide benchmark.
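For readers who want to probe this themselves, below is a minimal sketch of the kind of “near-twin” test described above, written in PyTorch. Everything here is an assumption for illustration: the block structure, the widths (256 vs. 248 channels), and the availability of a CUDA device; it is not the vendor model or configuration from my anecdote. If a stack generalizes well, the two throughput numbers should be close.

```python
import time
import torch
import torch.nn as nn

def make_block(width):
    # A generic conv/batchnorm/ReLU block; shapes here are purely illustrative.
    return nn.Sequential(
        nn.Conv2d(width, width, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(width),
        nn.ReLU(inplace=True),
    )

reference = nn.Sequential(*[make_block(256) for _ in range(8)])  # stand-in "benchmark" model
near_twin = nn.Sequential(*[make_block(248) for _ in range(8)])  # same layers, slightly different width

def throughput(model, width, iters=50, batch=32, device="cuda"):
    # Assumes a CUDA device is available; measures images/second out of the box.
    model = model.to(device).eval()
    x = torch.randn(batch, width, 56, 56, device=device)
    with torch.no_grad():
        for _ in range(5):                  # warm-up iterations
            model(x)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize()
    return iters * batch / (time.time() - start)

# A large gap between two models this similar suggests the faster one benefits
# from model-specific tuning rather than a genuinely general compilation path.
print("reference:", throughput(reference, 256))
print("near-twin:", throughput(near_twin, 248))
```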

Rethinking Hardware Acceleration for a Better AI Future

Since hardware acceleration has been one of the key enablers of the AI field, it is important to understand its underlying assumptions, its current conceptual limitations, and the new directions it should take to drive the AI field further.

Crossing the Academia-Industry Innovation Gap

AI is a very research-driven field that enjoys a tight, symbiotic relationship between industry and academia. The most cutting-edge AI models used in production systems, and the foundations for such models, all originated in papers published at academic venues such as NeurIPS, ICML, CVPR, and ICLR. Notable examples are AlexNet, ResNet, YOLO, Transformers, BERT, GPT-3, GANs, and Vision Transformers. Furthermore, many of those papers were the outcome of collaborations between academic institutions and research labs at companies like Google, Facebook, NVIDIA, and Microsoft.

Number of AI Accelerator Papers Published in Top Computer Architecture Venues (Stats from Biren Research)

In contrast to the AI application space, in the AI accelerator world industry has been lagging behind academia. While dozens of new accelerator papers are presented each year at top-tier architecture venues like ISCA, MICRO, HPCA, and ASPLOS, the core architectural ideas of the accelerators deployed in production are decades old (processing-in-memory, dataflow computing, systolic arrays, VLIW, and others). We need more ideas and innovation to trickle down and materialize into new products. The reason for the slow adoption of new ideas is that chips take years to manufacture, and since the AI hardware field is very competitive, vendors find daring new ideas too risky, as they might come at the cost of shaking up their current feature set. However, now that the field has matured somewhat, the foundations have been laid out and we already know what works. Therefore, the time has come for vendors to try newer ideas and disrupt the field with exciting new discoveries. There’s still headroom for innovation!
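To make one of those decades-old ideas concrete, here is a toy, cycle-by-cycle sketch of an output-stationary systolic-array matrix multiply in plain Python/NumPy. It is purely illustrative (no stalls, skew registers, or vendor details), but the dataflow is the same idea, dating back to Kung and Leiserson’s late-1970s work, that underlies matrix engines like the TPU’s MXU.

```python
import numpy as np

def systolic_matmul(A, B):
    """Toy simulation of an output-stationary systolic array computing C = A @ B.

    Cell (i, j) holds the accumulator for C[i, j]; operands are skewed so that
    row i of A streams in from the left and column j of B from the top, and
    cell (i, j) sees the pair (A[i, k], B[k, j]) at cycle t = i + j + k.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N))
    for t in range(M + N + K - 2):      # cycles until the last cell finishes
        for i in range(M):              # in hardware all cells update in parallel;
            for j in range(N):          # here we just sweep them sequentially
                k = t - i - j
                if 0 <= k < K:
                    C[i, j] += A[i, k] * B[k, j]
    return C

A = np.random.rand(4, 6)
B = np.random.rand(6, 5)
assert np.allclose(systolic_matmul(A, B), A @ B)
```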

Make AI Acceleration REALLY About AI

Modern AI has intrigued our minds; it got us thinking about the endless possibilities for the future, and it raised philosophical, sociological, and ethical questions, which led to a very vibrant discussion between the titans of the field on what AI can do and where it is going. But before we can imagine what is possible, we must be mindful of what I see as a “chicken-and-egg” problem for AI: because hardware enabled the models we have today, it is hard to think of new classes of models on which existing hardware performs poorly, and you need new classes of hardware for these new classes of models. Since creating new hardware is very expensive, you need financial justification to embark on that exploratory journey, but you cannot demonstrate the value of the new models if you don’t have the hardware to run them within reasonable times, right? Without the practical means of demonstrating new models, AI innovation is shackled by the constraints set by existing hardware solutions.

AI to Deep Learning: What Are We Accelerating? (Original Diagram: “Deep Learning Book”, Goodfellow et al.)

In this context, it is worth mentioning that calling what our current hardware does “AI acceleration” is actually an overclaim. In reality, it is closer to “deep learning acceleration”, or acceleration of the intersection of tensor-based deep learning and the machine learning algorithms that run well on matrix multiplication engines. Some might say: “well, we mostly hear about deep learning because that’s the ‘killer app’ that delivers the most value nowadays,” but that is exactly the point: we mainly explore deep learning because our libraries and hardware are tuned for deep learning, and therefore deep learning is what we are currently able to explore. This is one of the caveats of developing accelerated hardware; we have evolved from building general-purpose processors to a variety of architectures by co-designing algorithms and hardware, but we optimized our datapaths for fixed-size variables, since we have fixed-size arithmetic engines and fixed word sizes for memories and interconnects, and we try to use matrix multiplication as much as possible because it has simple and predictable control flow. We do not want the performance of our overly tuned hardware to drop off a cliff and hinder our algorithmic progress by blocking the ability to explore new capabilities.
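To make the “predictable control flow” point concrete, here is a toy NumPy sketch (shapes and the random graph are made up) contrasting a dense, fixed-shape matrix multiply, which is exactly what today’s matrix engines are built for, with a simple irregular computation: a graph-style neighborhood aggregation whose per-output work and memory accesses depend on the data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Regular: a fixed-shape dense matmul. Every output element needs the same,
# statically known amount of work, so control flow and memory accesses are
# fully predictable: ideal for a fixed-size matrix engine.
A = rng.standard_normal((1024, 1024))
B = rng.standard_normal((1024, 1024))
C = A @ B

# Irregular: a toy graph-style aggregation. Each node averages the features of
# a different (data-dependent) number of neighbors, so the work per output and
# the memory access pattern are only known at run time.
num_nodes, feat_dim = 1024, 64
features = rng.standard_normal((num_nodes, feat_dim))
neighbors = [rng.integers(0, num_nodes, size=rng.integers(1, 32))
             for _ in range(num_nodes)]          # variable-degree adjacency lists

aggregated = np.stack([features[nbrs].mean(axis=0) for nbrs in neighbors])
```

The first computation maps perfectly onto a fixed-size matmul engine; the second forces variable-length gathers and data-dependent control flow, which is exactly where tensor-tuned hardware tends to fall off that cliff.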

Although I am not a neuroscientist, I know the brain has many complex structures communicating in irregular ways, and we probably cannot approximate all of them using matrix multiplications, fixed-width datapaths, and non-linear functions. So we need to ask whether this hardware evolution has made us converge to a local optimum in the application-hardware co-design space. Maybe, if we digress from the tensor-intensive model, we can create “AI acceleration 2.0”, the next wave of AI hardware, which does not multiply matrices as well but supports irregular computation patterns, flexible representations, and arbitrarily parallelizable compute engines. An accelerator (or a heterogeneous integration of such accelerators) with these properties would span a much wider application scope, one closer to true artificial intelligence.

To summarize, we need to ask: is attention all we need? Are CNNs? Like it or not, they make up most of what we have today, but maybe (probably) there is more to AI that we can scale to commodity.
