AI Accelerators — Part I: Intro

Adi Fuchs
Dec 5, 2021

The modern world as we know it is undergoing a revolution. Never before has the human experience been so tightly coupled with technology. Sure, technological advancements have always made our lives more comfortable, but the prevalence of modern artificial intelligence has made the relationship between humanity and technology more bilateral and more immediate. In the “old world”, tech companies improved their products by observing users' behavior, studying market trends, and refining their product lines in a cycle that typically took months, or even years. Nowadays, artificial intelligence has paved the way for self-improving algorithms that drive the human-machine feedback loop without any manual human intervention: better human experiences reward better technological solutions, which in turn evolve to provide better human experiences. It is all done at the scale of millions (or even billions) of users, and it dramatically shortens the product refinement cycle. The success of artificial intelligence is attributed to three important trends: (i) novel research projects driving new algorithms and applicable use cases, (ii) the ability to have centralized entities (e.g., cloud services) that collect, organize, and analyze an abundance of user data, and (iii) novel computing infrastructure capable of processing large amounts of data at massive scales and/or with fast turnaround times.

In this series, I focus on the third trend. Specifically, I will give a high-level overview of accelerators for artificial intelligence applications: what they are, and how they became so popular. As discussed in later posts, accelerators stem from a broader concept rather than just a particular type of system or implementation. They are also not purely hardware-driven; in fact, much of the AI accelerator industry’s focus has been on building robust and sophisticated software libraries and compiler toolchains.

Le Penseur (“The Thinker”) by Auguste Rodin (source: Musee Rodin)

This series is intended for curious readers who want to know how computer architecture principles drive artificial intelligence, how processors became a crucial part of today’s technology, and what ideas are being implemented by some of the world’s leading AI companies. It does not require an in-depth background in computer architecture, and it should be understandable to people who have a good grasp and intuition of software engineering, high-level programming principles, and how a computer system is built. People with a deeper hardware background can also benefit from reading this as a “back to basics” refresher that demonstrates how fundamental ideas have come together to drive multi-billion dollar industries. So, let’s begin.

AI is Not All About Software and Algorithms

The foundations of AI, machine learning, and deep learning have been around for a while, with ideas dating back more than 50 years. However, they were popularized mostly in the last decade and are now basically everywhere. So why is that? Why now? What happened?

The view held by many is that the deep learning renaissance started back in 2012 with a paper presenting a deep neural network known as “AlexNet”, developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton from the University of Toronto. AlexNet won the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC). In this competition, teams needed to perform a task called object recognition; that is, given an image showing an object and a list of object classes like “airplane”, or “bottle”, or “cat” (of course), each team’s implementation needed to identify to which of the different classes the object in the image belongs.

AlexNet’s performance was revolutionary. It was the first time that the winning team used a class of deep learning architectures called convolutional neural networks (CNNs). AlexNet performed so well that the ImageNet winners of the following years all used CNNs as well. It was a pivotal moment for computer vision, and it sparked interest in applying deep learning to other domains such as natural language processing, robotics, and recommendation systems.

ImageNet Winners’ Classification Error Over the Years — Lower is Better (Source: CS231n, Stanford University)

Interestingly, AlexNet’s fundamental structures were not significantly different from those of existing CNN architectures like LeNet-5, suggested by Yann LeCun et al. in 1998. This is, of course, not to diminish the novelty of AlexNet, but it does make one wonder: “if CNNs were already there, what other factors contributed to AlexNet’s huge success?” While the authors did employ some new algorithmic techniques, their abstract hints at another interesting novelty:

“To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation.”

As it turns out, the AlexNet authors spent a fair amount of time mapping the time-consuming convolution operations to a GPU (“graphics processing unit”), which, compared to standard processors, executes workloads in specific domains faster, such as computer graphics and linear algebra-based computation (which CNNs have in abundance). Their efficient GPU implementation allowed them to shorten the overall time it takes to “train” the network (i.e., the process in which the network attempts to “learn” from existing examples of labeled data). They also elaborate on how they mapped their network to multiple GPUs, which enabled them to deploy a deeper and wider network and train it at a fast rate.
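To get a feel for why the convolution operation is both time-consuming and a natural fit for parallel hardware, here is a minimal sketch in plain Python. It is not the AlexNet implementation; the function names (`conv2d_valid`, `count_macs`) and the tiny input are made up for illustration, and it assumes a single channel, no padding, and a stride of 1.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the output map.

    As in most deep learning frameworks, the kernel is not flipped, so strictly
    speaking this is a cross-correlation.
    """
    img_h, img_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h, out_w = img_h - k_h + 1, img_w - k_w + 1

    output = [[0.0] * out_w for _ in range(out_h)]
    for r in range(out_h):            # every output row
        for c in range(out_w):        # every output column
            acc = 0.0
            for i in range(k_h):      # every kernel row
                for j in range(k_w):  # every kernel column
                    acc += image[r + i][c + j] * kernel[i][j]
            output[r][c] = acc
    return output


def count_macs(img_h, img_w, k_h, k_w):
    """Number of multiply-accumulate operations performed by conv2d_valid."""
    return (img_h - k_h + 1) * (img_w - k_w + 1) * k_h * k_w


# Hypothetical 4x4 "image" and 2x2 kernel, chosen only to keep the numbers small.
image = [[1, 2, 3, 0],
         [4, 5, 6, 1],
         [7, 8, 9, 2],
         [1, 0, 2, 3]]
kernel = [[1, 0],
          [0, -1]]

result = conv2d_valid(image, kernel)
macs = count_macs(4, 4, 2, 2)
print(result, macs)
```

Two things are worth noticing. The work grows with the product of the output size and the kernel size, and a real CNN layer multiplies that again by the number of input and output channels. Also, no output element depends on any other output element, so all of them can be computed at the same time, which is the kind of parallelism a GPU is built to exploit.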

Using the seminal AlexNet paper as a case study, we can get a hint of the answer to the questions posed at the start of this section. While algorithmic progress is important, it was the use of specialized GPU hardware that made it possible to learn more complex relations (deeper and larger networks = more variables used for prediction) in a reasonable time, thus improving the network’s overall accuracy over the state of the art at the time. Without the computing capabilities needed to process all of our data within reasonable timeframes, we would not have witnessed the widespread adoption of deep learning applications.

If I am an AI Practitioner, Should I Care About Processors Now?

As an AI practitioner, you want to focus on exploring new models and ideas, and you would rather not worry about seemingly irrelevant problems like how your hardware behaves. Therefore, while the ideal answer would be “no, you don’t need to know about processors”, currently the answer is “well, probably yes”. You would be surprised by how much familiarizing yourself with the underlying hardware, and with how to debug performance, can change the run times of your inference and training applications.

Speedup of Various Parallelization (and Other) Techniques for Matrix Multiplication (source: D.A. Patterson and J.L. Hennessy, CACM 2019)

Depending on what’s going on under the hood, you could be 2 or 3 times slower, sometimes even an order of magnitude slower (yes, instead of a few hours, you could end up running a job for days). Simply changing the way you do matrix multiplication can turn out to be a huge performance win (or loss). Suboptimal performance could affect your productivity and the amount of data you can process, and essentially kill your AI cycle. For a business doing AI at scale, it amounts to millions of dollars lost.

So why is optimal performance not guaranteed? Because we have not yet reached a reasonable level of “user-to-hardware expressiveness” (my term, for lack of a better one). There are many applications and many patterns, and while there is a critical mass of use cases that work well, we still have not been able to generalize and get ideal performance “out of the box” (meaning, automatically getting the best out of your hardware for a brand new AI model you’ve just written, without any manual tweaking of the compiler or software stack).
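The remark about changing the way you do matrix multiplication can be tried out even without special hardware. The sketch below is a small, standard-library-only illustration, not the experiment behind the figure above: it computes the same product in two ways and times both. The function names and the 60x60 size are arbitrary choices of mine.

```python
import random
import time
from operator import mul


def matmul_naive(a, b):
    """Textbook triple loop: every multiply-add runs in the Python interpreter."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                c[i][j] += a[i][p] * b[p][j]
    return c


def matmul_rowwise(a, b):
    """Same math, but B is transposed once so that each output element is a
    dot product of a row of A with a row of B-transposed, computed by the
    built-in sum/map instead of an explicit inner loop."""
    b_t = list(zip(*b))  # columns of B as tuples
    return [[sum(map(mul, row, col)) for col in b_t] for row in a]


def time_it(fn, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - start


random.seed(0)
n = 60  # arbitrary size, small enough to run in well under a second
a = [[random.random() for _ in range(n)] for _ in range(n)]
b = [[random.random() for _ in range(n)] for _ in range(n)]

c_naive, t_naive = time_it(matmul_naive, a, b)
c_rowwise, t_rowwise = time_it(matmul_rowwise, a, b)

# The two versions may differ in the last few bits because of floating-point rounding.
max_diff = max(abs(x - y)
               for row1, row2 in zip(c_naive, c_rowwise)
               for x, y in zip(row1, row2))
print(f"naive: {t_naive:.4f}s  row-wise: {t_rowwise:.4f}s  max diff: {max_diff:.2e}")
```

On CPython the second version is usually noticeably faster, because its inner loop runs inside built-in functions rather than in interpreted code, but the exact ratio depends on your interpreter and machine, so measure it yourself. The gaps shown in the figure come from much deeper changes (compiled code, parallel loops, memory-aware blocking, vector instructions), yet the lesson is the same: identical math, very different run times.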

AI User-to-Hardware Expressiveness (images credit: Shutterstock)

The above diagram illustrates the main challenge of “user-to-hardware expressiveness”: we need to accurately capture what the user wants and translate it into a language that the hardware layer (processors, GPU, memory, network, etc.) understands. The main problem is that, while the left arrow (programming frameworks) is mostly user-facing, the right arrow, which takes your code and transforms it into machine code, is not. Thus, we need to rely on smart compilers, libraries, and interpreters to seamlessly transform (or: “lower”) your high-level code into a machine representation that not only works but also performs well.
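As a loose analogy for what “lowering” means, you can look at a lowering step that happens every time you run Python: CPython compiles your source code into bytecode for its virtual machine. This is not machine code, and it is far simpler than what an AI compiler stack does, but it shows the same idea of a high-level expression becoming a sequence of primitive instructions. The function `axpy` below is a made-up example.

```python
import dis


def axpy(a, x, y):
    """A tiny 'high-level' computation: scale x by a and add y."""
    return a * x + y


# CPython has already compiled the function body to bytecode; list the opcode names.
opnames = [instr.opname for instr in dis.get_instructions(axpy)]
print(opnames)
```

The exact opcode names differ between Python versions, which is a small reminder of the point made above: the user-facing code stays the same while the lowered representation is decided by layers you do not control directly.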

Bridging this semantic gap is difficult for two reasons.

(i) There are abundant ways to express a complex computation in hardware. You need to know the number of available processing elements (e.g., GPU processing cores), the amount of memory your program needs, the types of memory access patterns and data reuse your program exhibits, and the relations between different parts of your computation graph; any one of these might stress different parts of your system in unexpected ways. To overcome that, we need to establish an understanding of how all the different layers of the hardware/software stack interact. While you can get good performance for many common scenarios, there is a practically endless “tail” of corner cases in which you might see really bad performance.

(ii) While in the compute world software is slow and hardware is fast, the deployment world works the opposite way: the deep learning landscape is changing rapidly, and new ideas and software updates are released on a daily basis, but it takes more than two years to architect, design, and “tape out” (the term for sending a finished chip design off to manufacturing) a high-end processor. In the meantime, the targeted software might have changed significantly, so the novel ideas and design assumptions that processor engineers had two years earlier might be rendered obsolete.
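To make the data-reuse point in (i) a bit more concrete, here is a back-of-the-envelope sketch that compares how many arithmetic operations two computations perform per byte of data moved. It is an idealized model under assumptions of my own (square matrices, 4-byte elements, every input and output crossing the memory interface exactly once), and the function names are hypothetical.

```python
def matmul_intensity(n, bytes_per_element=4):
    """Operations per byte for an (n x n) @ (n x n) matrix multiplication,
    assuming each matrix is moved to or from memory exactly once."""
    flops = 2 * n ** 3                             # n^3 multiplies + n^3 adds
    bytes_moved = 3 * n * n * bytes_per_element    # read A, read B, write C
    return flops / bytes_moved


def vector_add_intensity(n, bytes_per_element=4):
    """Operations per byte for adding two length-n vectors (no reuse possible)."""
    flops = n                                      # one add per element
    bytes_moved = 3 * n * bytes_per_element        # read x, read y, write z
    return flops / bytes_moved


for n in (64, 1024):
    print(n, matmul_intensity(n), vector_add_intensity(n))
```

In this model the matrix multiplication reuses every loaded value about n times, so its operations-per-byte ratio grows with n, while the vector addition stays at a fixed, low ratio no matter how large it gets. The first kind of program tends to stress the processing elements and the second tends to stress memory, which is one reason the same hardware can behave so differently across workloads.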

Therefore, it is still up to you, the user, to find proper ways to identify the bottlenecks that make your runs take a long time. And for that, well, yes: you need to know about processors, specifically about modern AI accelerators and how they interact with your AI programs.

Next Chapter: Transistors and Pizza (or: Why do we need Accelerators?)
