AI Accelerators — Part II: Transistors and Pizza (or: Why Do We Need Accelerators)?

Adi Fuchs
14 min read · Dec 5, 2021


We have arrived at the key motivator of the entire series, a fundamental question often asked by venture capitalists when pitched a new startup, or by executives when pitched a new project: "why now?"

To answer that, we will take a crash course on the history of processors and the significant changes the industry has undergone in recent years.

What is a Processor?

Simplistically speaking, the processor is the part of the computer system in charge of the actual number crunching. It receives user input data (represented numerically) and generates new data per the user's request, i.e., as reflected by the set of arithmetic operations the user wishes to perform. The processor employs its arithmetic units to produce the computation's result, which is what it means to run the program.

Processors were commoditized in personal computers in the 1980s. They gradually became an integral part of our daily lives in laptops, mobile phones, and the global compute infrastructure connecting billions of users in clouds and datacenters. With the rising popularity of complex, compute-hungry applications and the abundance of new user data, modern systems must serve an ever-growing demand for processing capabilities. We are always in need of better processors, which historically meant faster processors (the computation takes less time to complete), but now it can also mean more efficient processors: the computation takes the same time, but it uses less energy, which saves battery, cuts down electricity costs, and reduces the carbon footprint of warehouse-scale computers.

Processor Evolution: Exponential Scaling and Power Limitations

The evolution of computer systems is one of humanity's most outstanding engineering achievements. It took us about 50 years, almost instantaneous on a historical scale, to get to a place where the average smartphone in one's pocket houses a million times more computing power than the room-sized computers that landed a person on the moon during the Apollo missions. The key to this evolution lies in the semiconductor industry and how it improved processors' speed, power, and cost.

Intel 4004 — the First Commercially Available Microprocessor, Released in 1971 (Source: Intel)

Processors are made of electrical elements called "transistors." Transistors are logical switches used as the building blocks for everything from primitive logic functions (e.g., AND, OR, NOT) to complex arithmetic (floating-point addition, sine functions) and memories (e.g., ROMs / DRAMs), and they have been continuously shrinking over the years. Even if you're not a processor enthusiast and don't regularly read transistor datasheets, you've probably heard of "Moore's law," named after Intel's co-founder, Gordon Moore.

The Number of Transistors per Chip, on a Logarithmic Scale (Source: Gordon Moore / ACM)

In 1965, Gordon Moore observed that the number of transistors that fit into an integrated circuit doubled every year (later updated to every 18–24 months). He projected that this trend would continue for at least a decade. While one can argue that it is not so much a "law" as an "industry trend," it did last for about 50 years, making it one of the longest-lasting man-made trends in history.

Electrical Properties of Transistor Scaling (source: Robert Dennard IEEE)

But there is another law, one not as famous but equally important: "Dennard scaling," formulated by Robert Dennard in 1974. While Moore's law projected that transistors would shrink over the years, Dennard asked: "aside from being able to fit more transistors on a single chip, what are the actual benefits of having smaller transistors?" Dennard's observation was that when scaling down a transistor by a factor of k (say k = √2 every two years), the supply voltage and electrical current decrease by k as well. Furthermore, because electrons travel a smaller distance, we end up with a transistor that is k times faster, and most importantly, its power goes down by k². So in total, we can fit more transistors, our logical functions will run roughly k times faster, and our chip's power consumption will not increase.
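For the numerically inclined, here is a back-of-the-envelope restatement of those constant-field scaling rules (same factor k as above, e.g., k = √2 per generation); it is just the arithmetic behind the paragraph, not a formula from the original article:

```latex
% Constant-field (Dennard) scaling of a transistor by a factor k:
V \to \frac{V}{k}, \qquad I \to \frac{I}{k}, \qquad t_{\mathrm{delay}} \to \frac{t_{\mathrm{delay}}}{k}
\quad \text{(the transistor becomes } k \text{ times faster)}

P_{\mathrm{transistor}} = V \cdot I \;\to\; \frac{V}{k} \cdot \frac{I}{k} = \frac{P_{\mathrm{transistor}}}{k^{2}}

\text{Power density} \;=\; \underbrace{k^{2}}_{\text{transistors per unit area}} \times \frac{P_{\mathrm{transistor}}}{k^{2}} \;=\; \text{constant}
```

So each generation packs k² times as many transistors into the same area, each one k times faster, while the chip's total power stays roughly flat.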

Processor Evolution - Phase I: The Frequency Era (1970–2000s)

Evolution of Microprocessor Frequency Rates (source: My Ph.D. dissertation, Based on cpudb)

In the early years, the microprocessor industry focused mainly on central processing units ("CPUs"), since they were the main workhorse of computer systems in those days. Microprocessor vendors exploited the scaling laws to their full extent. Specifically, they aimed at increasing CPU frequency, simply because faster transistors enabled the processor to perform the same computation at a higher rate (higher frequency = more computations per second, potentially). It's a somewhat simplistic way of looking at things; lots of architectural innovation went into processors, but ultimately, frequency was a great contributor to performance in the early years: from 0.5MHz in the Intel 4004, through 50MHz in the 486 and 500MHz in the Pentiums, to around 3–4GHz in the Pentium 4 series.

Power Density of Microprocessors In the 1970–2000 and Projections Beyond the Year 2000 (source: Intel)

Around 2000, Dennard scaling started breaking down. Specifically, the supply voltage stopped dropping at the same rate while frequencies kept climbing, and so power density increased. If this trend were to continue, chips would become unsustainably hot, requiring cooling solutions that either did not exist or were too expensive to commoditize. Therefore, vendors could no longer rely on increasing CPU frequencies for more performance and needed to come up with something else.

Processor Evolution — Phase II: The Multicore Era (2000s–mid-2010s)

Stagnating CPU frequencies meant it became significantly harder to speed up a single application written as a sequential stream of instructions. But, as Moore's law suggests, we could still get twice as many transistors on our chip every 18 months or so. Therefore, instead of speeding up a single processor, the solution was to divide the chip into multiple identical processing cores, with each core executing its own stream of instructions.

Evolution of CPU and GPU Core Count (source: My Ph.D. dissertation, Based on TechPowerUp)

For a CPU, it is natural to have multiple cores, since it is already concurrently executing multiple independent tasks like your internet browser, word processor, and sound player (more accurately, the operating system does a great job of creating the abstraction of this concurrent execution). Therefore, one application can run on one core while another application runs on another core. With this practice, a multicore chip can execute more tasks in a given time. However, to speed up a single program, the programmer needs to parallelize it, which means breaking down the original program's stream of instructions into multiple "sub-streams" of instructions, or "threads." Simplistically speaking, a group of threads can run concurrently on multiple cores, in any order, without any thread interfering with another thread's execution. This practice is called "multi-threaded programming," and it is the most prevalent way for a single program to gain performance from multicore execution.
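To make the decomposition concrete, here is a minimal Python sketch (mine, not the article's) of splitting one sequential job into independent sub-streams of work. Because CPython's global interpreter lock limits true thread-level parallelism for pure-Python arithmetic, the sketch uses worker processes, but the idea is exactly the multi-threaded decomposition described above; the function names are made up for illustration.

```python
# A sketch of parallelizing a single sequential job across cores.
from concurrent.futures import ProcessPoolExecutor  # processes side-step CPython's GIL

def partial_sum(chunk):
    # The independent "sub-stream": each worker runs this on its own slice of the data.
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(data, n_workers=4):
    # Break the original stream of work into n_workers independent chunks...
    chunk_size = (len(data) + n_workers - 1) // n_workers
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    # ...and let the chunks execute concurrently, in any order, without one
    # worker interfering with another; then combine the partial results.
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(partial_sum, chunks))

if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(1_000_000))))
```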

A common form of multicore execution is in Graphics Processing Units, or “GPUs.” While CPUs consist of a small number of fast and sophisticated cores, GPUs rely on a large number of simpler cores. Historically, GPUs focused on graphic applications, since graphic images (for example in a video) consist of thousands of pixels, which can be processed independently with a series of simple and predetermined computations. Conceptually, each pixel can be assigned a thread and execute a simple “mini-program” that computes its behavior (e.g., color and brightness levels). The high degree of pixel-level parallelism makes it natural to exploit thousands of processing cores. So in the next wave of processor evolution, instead of speeding up a single task, CPU and GPU vendors leveraged Moore’s law to increase the number of cores since they were still able to get and use more transistors on a single chip.

Utilization Wall Problem and Projections of “Dark Silicon” (source: Bespoke Processor Group, UW and Alternative Computing Technologies Lab, UCSD)

Unfortunately, things became even more complicated when, in the late 2000s, Dennard scaling reached the end of the road. The main reason was that the transistor's supply voltage was nearing its physical limits and could not be scaled down further. Whereas previously it was possible to double the number of transistors within the same power budget, now twice as many transistors meant roughly twice as much power; physical dimensions kept scaling, but power did not. The demise of Dennard scaling meant modern chips would hit a "utilization wall": it no longer matters how many transistors our chip has, because as long as there is a power constraint (limited by our ability to cool the chip), we cannot utilize more than a given fraction of the chip's transistors (and consequently, cores) at once. The remaining parts of the chip must be powered down, a phenomenon also known as "dark silicon."
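The arithmetic behind the utilization wall is short. Assuming, as described above, that transistor counts still double every generation while the power per transistor no longer shrinks, a fixed power (i.e., cooling) budget caps how many transistors can be active at once; this is my own back-of-the-envelope sketch, not a figure from the article:

```latex
N_{g} = 2^{g} N_{0} \;\; \text{(transistors at generation } g\text{)}, \qquad
p \approx \text{const. (power per active transistor)}, \qquad
P_{\max} = \text{fixed power budget}

u_{g} \;=\; \frac{P_{\max}/p}{N_{g}} \;=\; \frac{u_{0}}{2^{g}}
\quad \Rightarrow \quad \text{the fraction of the chip we can power at once roughly halves each generation.}
```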

Processor Evolution — Phase III: The Accelerator Era (2010s-?)

Dark silicon was essentially the grand preview of "the end of Moore's law," and times became challenging for processor manufacturers. On one hand, compute demands were skyrocketing: smartphones became ubiquitous and crammed in copious amounts of computing power, cloud servers needed to handle growing numbers of customers and services, and, "worst of all," AI was (re-)discovered and has been gobbling up compute resources at staggering rates. On the other hand, in unfortunate timing, dark silicon became a limiter on what transistor-based chips can achieve. So now, just when we need to improve our processing capabilities more than ever, doing so has become harder than ever before.

Compute Needed to Train State-of-The-Art AI Models (source: OpenAI)

Since newer chip generations became bounded by dark silicon, the computer industry started gravitating towards hardware accelerators. The accelerator premise is the following: if we're not going to get more usable transistors, let's make better use of the ones we've got. How? By specialization. Traditional CPUs were designed to be general-purpose; they employ the same hardware structures to run the code of all of our applications: operating system, word processor, calculator, internet browser, email client, media player, and so on. These hardware structures need to support a large number of logical operations and capture many possible patterns and program-induced behaviors. It all amounts to hardware that is great for usability but fairly inefficient. If we focus on only some applications, we confine our problem domain and can remove a lot of structural redundancy from our chips.

General Purpose CPUs vs. Application-Specific Accelerators — Qualitative View: You Can Be Very Efficient on Certain Applications, or You Can have Reasonable Efficiency in the Entire App Range (source: L. Yavitz et al.)

Accelerators are chips specialized for specific applications or domains, meaning they will not run all applications (you don't want them running your operating system, for example); instead, they are designed at the hardware level to run a narrow spectrum of applications very efficiently, since: (a) their hardware is structured to cater only to the mission-specific operations, and (b) the interface between hardware and software is simpler. Specifically, because an accelerator operates within a given domain, the accelerator program's code can be more compact, as it encodes less information.

The Pizza Analogy: Think of it this way: you are given a fixed electricity budget that can power 1000 square feet of real estate, and you want to build your restaurant. You need to decide what your menu is going to look like, and you have two options. Option a: you serve lots of different dishes: pizzas, vegan dishes, burgers, sushi, and so on. Option b: you specialize in certain dishes, say pizza. What are the tradeoffs?

If you go for option a, you cater to many types of cuisines and a variety of potential diners. That has multiple downsides: your chef needs to make many types of dishes but likely won't excel at all of them. Furthermore, you might need multiple refrigerators and pantries for the different ingredients, and you need to keep tabs on which ingredients you ran out of and which went bad, as they have different expiration dates. Your kitchen table becomes packed with different types of groceries, and things get mixed up because some dishes take more time to make than others. All in all, you give a decent experience to many types of customers, at the cost of a lot of bookkeeping and overhead. Conversely, with option b, you can hire a chef who is a world expert in pizzas, you keep track of just a handful of ingredients, and you can use a specially-ordered Italian pizza oven. Your kitchen is efficient and can be organized into stations: one table for dough-making, one for sauce and cheeses, one for toppings; all in all, pretty neat. With the same 1000 square feet, you will give a great experience to pizza lovers, serving high volumes of top-notch pizzas. By specializing, you shed a lot of the inefficiencies of a restaurant offering a wide range of dishes, but on the flip side, it could be a risky gamble: what if tomorrow there's no demand for pizzas at all (improbable, but again, what if?), or what if there's demand only for pizzas with a dough that your specially-ordered oven cannot make? You are already invested in specializing your kitchen, but you could go out of business unless you pivot. That would cost a lot of time and money, and by the time you are done reorganizing, you may find out that your customers have changed their preferences yet again.

Back to the Processor World: In our analogy, CPUs are option a and very specialized domain-specific accelerators are option b. The 1000 square feet is your silicon budget; how will you design your chip? Obviously, the reality is not as polarized as these two extremes; there is a spectrum of specialization in which one trades generality for efficiency to varying degrees. Early hardware accelerators were designed for a few specific domains, such as digital signal processing and network processing, or served as auxiliary co-processors to the main CPU.

The first major shift from CPUs to an accelerated application space came with GPUs. A CPU harbors a handful of complex processing cores, each employing various tricks, like branch predictors and out-of-order execution engines, to speed up a single-threaded job as much as possible. GPUs are structured differently; a GPU consists of many simple cores with simple control flow that run simple programs. Originally, GPUs were used for graphic applications such as computer games, since these involve images with thousands or millions of pixels, each of which can be computed independently, in parallel (here's a visual example of the difference between a GPU and a CPU). A GPU program usually consists of a handful of core functions, called "kernels." Each kernel contains a series of simple computations and is executed thousands of times on different data portions (e.g., a pixel or a patch containing several pixels). These properties make graphic applications a natural target for hardware acceleration: the behavior is simple, so there is no need for complex instruction control flow in the form of branch predictors, and there is just a handful of operations, so there is no need for complex arithmetic units (like one that computes sine functions or does 64-bit floating-point division). It later turned out that these properties are not exclusive to graphic applications, so GPUs' applicability was extended to other domains, like linear algebra and scientific applications. Nowadays, accelerated computing goes beyond just GPUs; there is a wide spectrum ranging from CPUs, which are fully programmable but inefficient, to application-specific integrated circuits (ASICs), which are highly efficient but rigid, with limited programmability.
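As a toy, CPU-side illustration of the "one kernel, thousands of independent pixels" model (this is not a real GPU kernel; the brightness example, array shapes, and function name are all made up):

```python
import numpy as np

def brightness_kernel(pixel, gain=1.2):
    # The per-pixel "mini-program": the same few operations for every pixel,
    # with no data-dependent branching and no exotic arithmetic.
    return min(int(pixel * gain), 255)

image = np.random.randint(0, 256, size=(1080, 1920), dtype=np.uint16)

# Scalar view: every pixel runs the kernel independently, so all of these
# iterations could, in principle, execute at the same time. A GPU does exactly
# that by launching one lightweight thread per pixel.
patch = image[:4, :4]
scalar_result = np.array([[brightness_kernel(p) for p in row] for row in patch])

# Vectorized view: the same elementwise math applied to the whole frame at once,
# a CPU-side stand-in for that massive pixel-level parallelism.
vectorized_result = np.minimum((image * 1.2).astype(np.uint16), 255)

assert (vectorized_result[:4, :4] == scalar_result).all()
```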

Processing Alternatives for Deep Neural Networks (Source: Microsoft)

Nowadays, accelerators are gaining a lot of traction, as more and more applications that exhibit "good" properties become targets for acceleration: video codecs, database processors, cryptocurrency miners, molecular dynamics, and of course, AI.

What Makes AI Such a Good Target for Acceleration?

Financial Viability: It costs a lot to design a new chip: you need to hire domain specialists, use expensive CAD tools for chip design and verification, develop prototypes, and manufacture silicon. It all amounts to tens of millions of dollars if you use cutting-edge silicon technology (e.g., 5nm CMOS nowadays). Luckily, for AI, financial viability is not a problem; the potential benefits of AI are huge, and AI platforms are expected to generate trillions of dollars in the near future. If your idea is good enough, you should be able to fund this endeavor fairly easily.

AI is an "Acceleratable" Application Domain: AI programs have all the properties that make them suitable for hardware acceleration. First and foremost, they are massively parallel: the majority of the compute time is spent on tensor operations like convolutions or self-attention operators. If possible, one can also increase the batch size, so the hardware processes several samples at a time, improving hardware utilization and driving parallelism even further; parallel computation is the main factor from which hardware processors derive their ability to run things fast. Second, AI computations are confined to a handful of operations: multiplication and addition for the dominating linear algebra kernels, some non-linear operators (e.g., ReLU) to mimic synaptic activation, and exponential operations for softmax-based classification. Having a narrow problem space enables us to simplify our arithmetic hardware and focus on certain operators. Finally, since an AI program can be represented as a computation graph, the control flow is known at compile time, much like a for-loop with a known number of iterations; the communication and data-reuse patterns are also fairly confined, so it is possible to characterize which network topologies we need for communicating data between different compute units, and to use software-defined scratchpad memories to control how data is stored and orchestrated.
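A tiny NumPy sketch (illustrative only; the layer sizes and batch size are made up) of the first two points: a layer's work collapses into one tensor operation built from a handful of primitive ops, and batching turns many small independent computations into a single larger, more parallel one:

```python
import numpy as np

def dense_layer(x, W, b):
    # One matmul, one add, one ReLU: a handful of operation types does almost all the work.
    return np.maximum(x @ W + b, 0.0)

W = np.random.randn(1024, 4096).astype(np.float32)
b = np.zeros(4096, dtype=np.float32)

# One sample at a time: a (1, 1024) x (1024, 4096) matrix multiplication.
single = dense_layer(np.random.randn(1, 1024).astype(np.float32), W, b)

# Batching 64 samples turns it into one (64, 1024) x (1024, 4096) multiplication,
# exposing far more parallel work to the hardware without changing the per-sample math.
batch = dense_layer(np.random.randn(64, 1024).astype(np.float32), W, b)
```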

AI Software is Structured in a Hardware-Friendly Fashion: Not so long ago, if you wanted to innovate in the computer architecture field, you might have been tempted to say: "I have an idea for a new architectural improvement that could speed things up significantly; all I need to do is change the programming interface a little bit and have the programmer use this feature." That would kill the idea then and there. The programmer's API was not to be touched, and it is hard to justify burdening the programmer with low-level details that break the "clean" semantic flow of the program. Furthermore, it's not good practice to mix underlying architectural details with programmer-facing code: (a) it would not be portable, as some architectural features change between chip generations, and (b) it might be programmed incorrectly, because most programmers don't have a profound understanding of the underlying hardware. While you could say that GPUs and multicore CPUs have already digressed from the traditional programming model by relying on threads (and sometimes, god forbid, memory fences), we resorted to multi-threaded programming because that was our only option once single-threaded performance was no longer improving exponentially; multi-threaded programming is still hard to master and requires a lot of education. Luckily, when people write AI programs, they structure computation graphs using neural layers and other well-defined blocks. The high-level programmer code (for example, in TensorFlow or PyTorch) is already written in a way where you can stamp out parallel blocks and construct a dataflow graph. Therefore, in theory, you could build rich software libraries and a compiler toolchain elaborate enough to understand the program's semantics and lower it efficiently into a hardware representation without any involvement from the application programmer; let the data scientists do their thing, they couldn't care less which hardware they run on. In practice, it still takes time for compilers to mature to that degree.
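As a hedged example of that last point: PyTorch's torch.fx can symbolically trace an ordinary high-level model definition into an explicit dataflow graph of tensor operations, which is the kind of representation a compiler toolchain can then lower toward hardware. A minimal sketch (the TinyNet module and its sizes are made up for illustration):

```python
import torch
import torch.nn as nn
from torch import fx

class TinyNet(nn.Module):
    # An ordinary high-level model: well-defined layer blocks, no hardware details.
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(128, 256)
        self.fc2 = nn.Linear(256, 10)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

# Symbolic tracing turns the Python-level definition into an explicit dataflow
# graph of operations, with no involvement from the model's author.
graph_module = fx.symbolic_trace(TinyNet())
print(graph_module.graph)
```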

Because There's Practically No Other Choice: AI is everywhere: in large data centers, smartphones, smart sensors, robots, and autonomous cars. Each system has different real-world constraints: you wouldn't want an autonomous car hitting an object on the road just because its hardware was too slow to detect it, and you wouldn't want to waste days and thousands (or even millions) of dollars on electricity just because you trained large language models in your data center on slow and inefficient hardware. There cannot be a "one size fits all" approach when it comes to AI hardware. The computing demands are huge, and every bit of efficiency amounts to a lot of time, energy, and cost. Without proper accelerated hardware to cater to your AI needs, your ability to experiment with AI and make discoveries will be limited.

Next Chapter: Architectural Foundations
Previous Chapter: Intro

About Me
