AI Accelerators — Part III: Architectural Foundations

Adi Fuchs
11 min read · Dec 5, 2021


Many different companies build AI accelerators. The landscape is filled with ideas and assumptions about which problem domains across the AI application space to target and which solutions tackle them best.
Initially, I wrote this chapter and the next as a single chapter that outlines some of the solutions in the AI accelerator landscape and provides the background for the main ideas behind them. It came out quite long, and several people advised me to split it up. Feel free to skip to the next chapter if you have a solid background in processor architectures. Alternatively, if you want a deeper understanding of the concepts described here, take a computer architecture class or read a few textbooks; this post will not replace them.

Instruction Set Architecture — ISA

An ISA describes how instructions and operations are encoded by the compiler and are later decoded and executed by the processor. It is the programmer-facing part of the processor architecture. Common examples are Intel’s x86, ARM, IBM Power, MIPS, and RISC-V. An ISA can be thought of as a vocabulary of all the operations supported by the processor. Usually, it consists of arithmetic instructions (like “add” or “multiply”), memory operations (“load” or “store”), and control operations (for example, “branches” that are used in if-statements). Another way to think about it is that the ISA is the hardware equivalent of an application programming interface (API), which describes the interface of functions in a program.
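
To make the "vocabulary" analogy more concrete, here is a minimal sketch in Python of a made-up toy ISA (not modeled on any real processor): a small vocabulary of operations plus an interpreter loop that decodes and executes them.

```python
# A toy, hypothetical ISA: a small vocabulary of operations the "processor"
# understands, plus an interpreter loop that decodes and executes them.
REGS = [0] * 4      # four general-purpose registers
MEM  = [0] * 16     # a tiny word-addressable memory

def execute(program):
    for op, *args in program:          # "decode": look at the mnemonic
        if op == "load":               # load  rd, addr
            rd, addr = args
            REGS[rd] = MEM[addr]
        elif op == "store":            # store rs, addr
            rs, addr = args
            MEM[addr] = REGS[rs]
        elif op == "add":              # add   rd, rs1, rs2
            rd, rs1, rs2 = args
            REGS[rd] = REGS[rs1] + REGS[rs2]
        else:
            raise ValueError(f"unknown instruction: {op}")

MEM[0], MEM[1] = 3, 4
execute([("load", 0, 0), ("load", 1, 1), ("add", 2, 0, 1), ("store", 2, 2)])
print(MEM[2])  # 7
```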

VRSQRT28PD Description. (Source: Intel 64 and IA-32 Architectures Software Developer’s Manual)

CPU ISAs are classified as Reduced Instruction Set Computing (RISC) or Complex Instruction Set Computing (CISC). A RISC ISA consists of simple instructions: it supports a small number of basic operations (add, multiply, etc.), and all instructions have the same bit-length (for example, 32 bits); therefore, a hardware decoder for RISC instructions is considered simple. Conversely, in a CISC ISA, different instructions can have different lengths, and a single instruction can describe a complex combination of operations and conditions, like: "VRSQRT28PD: Approximation to the Reciprocal Square Root of Packed Double-Precision Floating-Point Values with Less Than 2^-28 Relative Error" from Intel's x86 ISA. Typically, CISC programs have smaller code footprints (i.e., the amount of memory needed to store the program's instructions) than their equivalent RISC programs. This is because (i) a single CISC instruction can span multiple RISC instructions, and (ii) variable-length CISC instructions are encoded so that the most common instructions take the fewest bits. However, to benefit from the complex instructions, the compiler needs to be sophisticated enough to identify the parts of the program that can be mapped to them.
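
To illustrate why fixed-length instructions simplify decoding, here is a hedged sketch using made-up encodings (not any real ISA): with a fixed 32-bit width, the decoder just slices bit fields at constant offsets, whereas a variable-length format must first determine each instruction's length before it even knows where the next one starts.

```python
# Decoding a made-up fixed-width 32-bit RISC-style instruction:
# [ 8-bit opcode | 8-bit rd | 8-bit rs1 | 8-bit rs2 ]
def decode_fixed(word32):
    opcode = (word32 >> 24) & 0xFF
    rd     = (word32 >> 16) & 0xFF
    rs1    = (word32 >> 8)  & 0xFF
    rs2    =  word32        & 0xFF
    return opcode, rd, rs1, rs2

# Decoding a made-up variable-length "CISC-style" byte stream: the first byte
# encodes the opcode, which implies the instruction length, so the decoder must
# walk the stream sequentially just to find instruction boundaries.
LENGTH_OF = {0x01: 2, 0x02: 4, 0x03: 6}   # opcode -> total instruction bytes

def decode_variable(byte_stream):
    instructions, i = [], 0
    while i < len(byte_stream):
        opcode = byte_stream[i]
        length = LENGTH_OF[opcode]
        instructions.append(bytes(byte_stream[i:i + length]))
        i += length
    return instructions

print(decode_fixed(0x01020304))                         # (1, 2, 3, 4)
print(decode_variable(bytes([0x01, 0xAA, 0x02, 1, 2, 3])))
```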

Computing Market Projection of the Diminishing Fraction of x86 (Orange) compared to ARM (Purple) — (Source: AMD/ExtremeTech)

Back in the 1980s, 1990s, and early 2000s, there was the "RISC vs. CISC war," with x86-based Intel and AMD leading the CISC side and ARM's ISA leading "camp RISC." There are pros and cons to each approach, but ultimately, and largely due to the boom of ARM-based smartphones, RISC gained the upper hand in mobile devices. It is now also becoming more prominent in the cloud, with designs like Amazon's ARM-based line of AWS Graviton processors.

Domain-Specific ISAs

In the context of accelerators, it is worth noting that both RISC and CISC are general-purpose instruction set architectures used to build general-purpose processors. That said, one of the main lessons learned from the CISC vs. RISC war is that simplicity, and specifically simple instruction decoding, contributes significantly to hardware efficiency; hence RISC has been more favorable (at least for smartphones).

Energy Spent performing an ADD Instruction in a 45nm CMOS processor (source: M.Horowitz ISSCC 2014)

It has become common practice for accelerator companies to employ domain-specific ISAs, which take the RISC idea a step further. Given an existing reduced instruction set architecture (and potentially its processing cores), it is possible to reduce it even further by supporting only the subset of instructions needed in the targeted application domain, in our context, AI. Domain-specific ISAs further simplify the processing cores and the hardware/software interface to achieve an efficient accelerator design. AI applications typically consist of linear algebra and non-linear activations, so there is no need for many "exotic" types of operations, and the ISA can be designed to support a relatively narrow operation scope.
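
As a hedged illustration (the operation names below are hypothetical, not taken from any vendor's ISA), a domain-specific instruction set for AI inference might contain only a handful of coarse tensor operations, which keeps the decoding and control logic small:

```python
import numpy as np

# A hypothetical AI-domain "vocabulary": a few coarse tensor operations are
# enough to express the bulk of a typical inference workload.
DOMAIN_ISA = {
    "matmul": lambda a, b: a @ b,              # linear-algebra workhorse
    "add":    lambda a, b: a + b,              # bias addition
    "relu":   lambda a:    np.maximum(a, 0.0), # non-linear activation
}

def run(program, tensors):
    for dst, op, srcs in program:
        tensors[dst] = DOMAIN_ISA[op](*(tensors[s] for s in srcs))
    return tensors

t = {"x": np.random.randn(2, 8), "w": np.random.randn(8, 4), "b": np.zeros(4)}
run([("y", "matmul", ["x", "w"]),
     ("y", "add",    ["y", "b"]),
     ("y", "relu",   ["y"])], t)
print(t["y"].shape)  # (2, 4)
```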

The benefits of using a reduced version of an existing RISC ISA is that some RISC companies (like ARM or SiFive) sell IPs which are existing processing cores that support the ISA (or some subset of it), which can be used as a baseline for the customized processing core that will be used in the accelerator chip. This way, the accelerator vendors can rely on a baseline design that has already been verified and potentially deployed in other systems; this is a more solid alternative to designing a new architecture from scratch, and it is particularly appealing for startups that have limited engineering resources, want to have the support of an existing processing ecosystem, or want to shorten ramp-up times.

Very-Long Instruction Word (VLIW) Architectures

Much like domain-specific ISAs can be thought of as an extension of the RISC idea (simpler instructions with fewer supported operations), some architectures extend CISC's notion of combining multiple operations into a single complex instruction. These architectures are known as Very-Long Instruction Word (VLIW) architectures, and they were introduced in the early 1980s.

VLIW architectures consist of a heterogeneous datapath array of arithmetic and memory units. The heterogeneity stems from the differences in timing and supported functionality of each unit: for example, while a simple logical operation can take 1–2 cycles to compute, a memory operation can take hundreds of cycles.

Block Diagram of a Simple VLIW Datapath (Source: Princeton University)

VLIW architectures rely on a compiler that combines multiple operations into a single, complex instruction that dispatches data to the units in the datapath array. For example, in an AI accelerator, an instruction could direct a tensor to a matrix-multiply unit and, in parallel, send other data portions to a vector unit and a transpose unit, and so on.

The upside of VLIW architectures is that the cost of orchestrating the processor datapath via instructions is potentially reduced significantly; the downside is that we need to guarantee that the workload is balanced across the various units in the datapath to avoid underutilized resources. Therefore, to achieve performant execution, the compiler needs to be able to do complex static scheduling. More concretely, the compiler needs to analyze the program, assign data to units, know how to time the different datapath resources, and break the code into individual instructions in a way that utilizes the most units at any given time. Bottom line, the compiler needs to be aware of the different datapath structures and their timings and solve computationally complex problems to extract high instruction-level parallelism (ILP) and deliver performant execution.
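
The sketch below (hypothetical slot names, not any real VLIW encoding) shows the core idea: the compiler packs one operation per functional-unit slot into a "bundle", and any slot it cannot fill is wasted as a no-op. This toy packer ignores data dependencies and unit latencies, which a real VLIW compiler must honor; that is exactly where the complexity of static scheduling comes from.

```python
# A hypothetical 3-slot VLIW bundle: one slot per functional unit.
SLOTS = ("matmul_unit", "vector_unit", "load_store_unit")

def pack_bundles(ops):
    """Greedily pack operations (tagged with the unit they need) into bundles.
    Slots that cannot be filled in a given cycle become explicit no-ops."""
    bundles, pending = [], list(ops)
    while pending:
        bundle, used = {}, set()
        for op, unit in list(pending):
            if unit not in used:            # one op per unit per cycle
                bundle[unit] = op
                used.add(unit)
                pending.remove((op, unit))
        for unit in SLOTS:
            bundle.setdefault(unit, "nop")  # unfilled slot = wasted issue width
        bundles.append(bundle)
    return bundles

ops = [("C=A@B", "matmul_unit"), ("v=relu(v)", "vector_unit"),
       ("load T0", "load_store_unit"), ("load T1", "load_store_unit")]
for i, b in enumerate(pack_bundles(ops)):
    print(f"cycle {i}: {b}")
```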

Systolic Arrays

The systolic array, introduced by H. T. Kung and C. E. Leiserson in 1978, is a structure in which multiple processing elements (PEs) are hardwired in a fixed order. The entire array operates in "beats": each PE processes a portion of the data in every compute cycle and communicates it to the next interconnected PE.

Matrix multiplication example via a 4x4 Systolic Mesh (source: NJIT)

The systolic structure is an efficient way of performing matrix multiplications (which DNN workloads have in abundance): the partial multiplications and accumulations are performed in parallel and in a pre-determined, orderly fashion. Google's TPU was the first widespread use of systolic arrays for AI; since then, several other companies have integrated systolic execution units into their accelerated hardware, like NVIDIA's Tensor Cores.
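
To make the "beats" intuition concrete, here is a minimal cycle-by-cycle sketch of an output-stationary systolic-style matrix multiply (an idealized model, not how any particular chip implements it): operands are skewed in time, each PE multiplies the pair that arrives at it and accumulates locally, and forwards the A operand rightward and the B operand downward every beat.

```python
import numpy as np

def systolic_matmul(A, B):
    """Toy simulation of an n x n output-stationary systolic array."""
    n = A.shape[0]
    acc = np.zeros((n, n))      # one accumulator per PE (ends up as C[i][j])
    a_reg = np.zeros((n, n))    # A operands currently held by each PE
    b_reg = np.zeros((n, n))    # B operands currently held by each PE
    for t in range(3 * n - 2):  # enough beats for all operands to drain
        a_reg = np.roll(a_reg, 1, axis=1)   # A operands hop one PE rightward
        b_reg = np.roll(b_reg, 1, axis=0)   # B operands hop one PE downward
        for i in range(n):                  # feed row i of A, skewed by i beats
            k = t - i
            a_reg[i, 0] = A[i, k] if 0 <= k < n else 0.0
        for j in range(n):                  # feed column j of B, skewed by j beats
            k = t - j
            b_reg[0, j] = B[k, j] if 0 <= k < n else 0.0
        acc += a_reg * b_reg                # every PE multiply-accumulates
    return acc

A, B = np.random.randn(4, 4), np.random.randn(4, 4)
print(np.allclose(systolic_matmul(A, B), A @ B))  # True
```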

Reconfigurable Processors

The typical flow of CPUs, GPUs, and some accelerators relies on a pre-determined set of arithmetic units, with runtime behavior determined by the executed program's instructions. However, there is another class of processors called "reconfigurable processors."

Basic FPGA Architecture (Source: Xilinx)

Reconfigurable processors consist of replicated arrays containing interconnected compute units, memory units, and a control plane that orchestrates how the data traverses and gets manipulated between the various units as the targeted program executes on the chip. To run a program, a special-purpose compiler constructs a configuration file that contains the control bits that set the behavior of each element in the array. The most common class of reconfigurable processors is the Field-Programmable Gate Array (FPGA). FPGAs support a wide computational spectrum by enabling bit-level configurability: the arithmetic units can be configured to implement functions that operate on numbers of arbitrary widths, and the on-chip memory blocks can be fused to construct memory spaces of varied sizes. One upside of reconfigurable processors is that they can model chip designs written in hardware description languages (HDLs); this gives companies the ability to test their designs within a few hours instead of taping out chips, a process that can take months or even years. The downside of FPGAs is that fine-grained bit-level configurability is inefficient: typical compilation times can take many hours, and the extra wiring required takes up a lot of space and is energetically wasteful. Therefore, FPGAs are commonly used for prototyping a design before it gets taped out, as the resulting chip is more performant and more efficient than its FPGA equivalent.
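
As a hedged, highly simplified sketch (real FPGA fabrics are far more elaborate), the basic configurable element is a lookup table (LUT): a few configuration bits store a truth table, so the same physical element implements whichever Boolean function its configuration describes.

```python
# A toy 2-input LUT: 4 configuration bits fully define any 2-input Boolean
# function. "Programming" the FPGA element means choosing those bits.
def make_lut2(config_bits):
    """config_bits[i] is the output for inputs (a, b) with i = (a << 1) | b."""
    def lut(a, b):
        return config_bits[(a << 1) | b]
    return lut

and_gate = make_lut2([0, 0, 0, 1])   # truth table of AND
xor_gate = make_lut2([0, 1, 1, 0])   # same hardware, different configuration

# Compose configured LUTs into a 1-bit half adder: sum = a XOR b, carry = a AND b
def half_adder(a, b):
    return xor_gate(a, b), and_gate(a, b)

print([half_adder(a, b) for a in (0, 1) for b in (0, 1)])
# [(0, 0), (1, 0), (1, 0), (0, 1)]
```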

Performance, Power and Flexibility Comparison of Processor Architectures (Source: ACM Computing Surveys)

While FPGAs suffer from performance and power overheads, reconfigurability is still a highly desirable property for AI accelerators. New AI models are experimented with and presented on a daily basis, while the design cycle of a new chip is on the order of 2–3 years; therefore, a chip that just came back from fabrication and cost millions of dollars was designed based on assumptions about AI models that existed more than two years ago and might be irrelevant for current models. To combine high efficiency and performance with reconfigurability, some startups design reconfigurable processors that belong to another class called Coarse-Grained Reconfigurable Arrays (CGRAs), first suggested in 1996. Compared to FPGAs, CGRAs have more rigid structures and interconnection networks: they retain a high degree of reconfigurability, but at a coarser granularity, sacrificing the fine-grained bit-level configurability that is often unnecessary.
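
A hedged sketch of the contrast (hypothetical word-level operations, not any vendor's PE): where an FPGA element is configured at the bit level, a CGRA element is configured by selecting one word-level operation and its routing, so the configuration is far smaller but also far less flexible.

```python
import operator

# A toy CGRA processing element: its configuration is just an opcode choice
# and the indices of the neighbors it reads from -- word-level, not bit-level.
WORD_OPS = {"add": operator.add, "mul": operator.mul,
            "max": max, "pass": lambda a, _b: a}

def make_pe(opcode, src_a, src_b):
    op = WORD_OPS[opcode]
    def pe(values):                      # values: outputs of all PEs last cycle
        return op(values[src_a], values[src_b])
    return pe

# Configure a tiny 1D array to compute max(x0 * x1, x2 + x3) in two "cycles".
stage1 = [make_pe("mul", 0, 1), make_pe("add", 2, 3)]
stage2 = make_pe("max", 0, 1)

inputs = [3, 4, 10, 5]
intermediate = [pe(inputs) for pe in stage1]   # [12, 15]
print(stage2(intermediate))                    # 15
```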

Dataflow Processing

Dataflow machines have been around for a while, dating back to the 1970s. They are an alternative form of computing to the traditional von Neumann model, in which programs are represented as a sequence of instructions and temporary variables. In the dataflow model, the program is represented as a dataflow graph (DFG): portions of the input data are computed using predetermined operations, and the data "flows" all the way to the output according to the graph, which is represented and computed by graph-like hardware. It is worth noting that hardware is inherently parallel; the sequential model stems from the way programming languages and practices were conceived and developed over the years.

Deep Learning Software to Dataflow Graph Mapping Example (source: Wave Computing — HotChips 2017)

There are two significant benefits of dataflow execution in the context of AI accelerators: (1) deep learning applications are structured so that there is a "computation graph" dictated by the hierarchy of the application's layers, so essentially, the dataflow graph is already baked into the code. In contrast, von Neumann applications are first serialized into a sequence of instructions that later needs to be (re-)parallelized to feed the processor; and (2) the dataflow graph is an architecture-agnostic representation of the computational problem. It abstracts away all the unnecessary constraints stemming from the architecture itself (e.g., registers or operands supported by the instruction set), and the program's parallelism is limited only by the degree of innate parallelism of the computational problem itself, not by the number of cores, processor registers, or available execution thread pools.
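
A minimal sketch of the dataflow model (the graph format here is made up for illustration): the program is just a graph of operations and dependencies, and any node whose inputs are ready may fire. In hardware, all ready nodes would execute in parallel, so the available parallelism is a property of the graph rather than of an instruction sequence.

```python
import numpy as np

# A tiny dataflow graph: node -> (operation, list of input nodes).
# Nodes with no inputs are the graph's external inputs.
GRAPH = {
    "x":  (None, []),
    "w1": (None, []),
    "w2": (None, []),
    "h":  (lambda x, w: np.maximum(x @ w, 0.0), ["x", "w1"]),   # matmul + ReLU
    "y":  (lambda h, w: h @ w,                  ["h", "w2"]),
}

def run_dataflow(graph, inputs):
    values = dict(inputs)
    remaining = {n for n in graph if n not in values}
    while remaining:
        # fire every node whose operands are ready; dependencies alone
        # determine what can run concurrently
        ready = [n for n in remaining if all(s in values for s in graph[n][1])]
        for n in ready:
            op, srcs = graph[n]
            values[n] = op(*(values[s] for s in srcs))
            remaining.remove(n)
    return values

out = run_dataflow(GRAPH, {"x": np.ones((1, 4)),
                           "w1": np.ones((4, 8)), "w2": np.ones((8, 2))})
print(out["y"])   # [[32. 32.]]
```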

Processing in Memory

A great deal of effort has been spent on increasing an accelerator's computational throughput (FLOPs), i.e., the maximal number of computations a chip (or system) delivers per second. However, on-chip computational throughput is not the whole story; memory bandwidth is often a performance bottleneck, as on-chip computation speeds exceed the rate at which off-chip memory can deliver data. Furthermore, from an energy point of view, memory access costs are an acute problem in many AI models: moving data to and from main memory is a few orders of magnitude more costly than doing the actual computation.

Typical Memory and Compute Costs on 45nm CMOS Technology (source: ISSCC 2014 / M.Horowitz)

The common practice employed by AI accelerator companies to reduce memory costs is the "near-data processing" approach. Companies design small and efficient software-controlled memories (also known as "scratchpad memories") that store portions of the processed data on-chip, near the processing cores, for high-speed and low-power parallel processing. By reducing the number of accesses to the off-chip memory (the "large and far" memory), they take the first step in reducing the time and energy costs of accessing data.
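
A hedged sketch of the idea (a software model, not any specific chip's memory hierarchy): by tiling a matrix multiplication so that each tile fits in a small on-chip scratchpad, each off-chip element is fetched once per tile rather than once per use, and we can count the savings explicitly.

```python
import numpy as np

def tiled_matmul(A, B, tile=32):
    """Blocked matrix multiply modeling a small on-chip scratchpad: each
    (tile x tile) block of A and B is 'loaded' once and reused tile times,
    cutting the number of off-chip accesses roughly by a factor of `tile`."""
    n = A.shape[0]
    C = np.zeros((n, n))
    offchip_loads = 0
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # "DMA" the tiles into the scratchpad (counted as off-chip traffic)
                a_tile = A[i0:i0 + tile, k0:k0 + tile]
                b_tile = B[k0:k0 + tile, j0:j0 + tile]
                offchip_loads += a_tile.size + b_tile.size
                # all reuse of the tiles happens on-chip
                C[i0:i0 + tile, j0:j0 + tile] += a_tile @ b_tile
    return C, offchip_loads

n = 256
A, B = np.random.randn(n, n), np.random.randn(n, n)
C, loads = tiled_matmul(A, B, tile=32)
print(np.allclose(C, A @ B), loads, "vs no-reuse baseline", 2 * n ** 3)  # ~32x fewer
```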

The idea of Processing-in-Memory (or "PIM") is an extreme next step beyond near-data processing, and it dates back to the 1970s. In PIM systems, the main memory modules are manufactured with digital logic elements (like adders or multipliers), so the processing happens inside the memory itself; there is no need to transfer the stored data through intermediary buffers to and from the chip. Commercialized PIM solutions are still not very common, since manufacturing technologies and methodologies are still stabilizing and the designs are often considered rigid (once a logic element resides in memory, it is hard to repurpose it).

Neuromorphic Computing based on Analog Processing of Dot Products (Source: Nature Communications)

Many processing-in-memory solutions rely on analog computation. Specifically, in AI applications, weighted dot products are computed in the analog domain in a way that resembles how the brain processes signals, which is why this practice is also commonly known as "neuromorphic computing." Since the computation is done in the analog domain but the input and output data are digital, neuromorphic solutions require special analog-to-digital and digital-to-analog converters, which can be expensive in both area and power.
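
A hedged numerical sketch of the analog dot-product idea (an idealized model; real devices add noise, non-linearity, and limited conductance levels): weights are stored as conductances, input activations are applied as voltages through a DAC, currents sum along each column by Kirchhoff's current law, and an ADC digitizes the result, with quantization error introduced at both converters.

```python
import numpy as np

def quantize(x, bits, full_scale):
    """Uniform quantizer modeling a DAC or ADC with a given bit width."""
    levels = 2 ** bits - 1
    step = 2 * full_scale / levels
    return np.clip(np.round(x / step) * step, -full_scale, full_scale)

def analog_dot_product(weights, activations, dac_bits=8, adc_bits=8):
    """Idealized crossbar: conductances encode weights, voltages encode inputs,
    and the summed column currents give the analog dot products."""
    voltages = quantize(activations, dac_bits, full_scale=1.0)   # DAC
    currents = weights.T @ voltages                              # analog MAC
    return quantize(currents, adc_bits, full_scale=np.abs(currents).max())  # ADC

W = np.random.randn(64, 16) * 0.1      # 64 inputs, 16 crossbar "columns"
x = np.random.uniform(-1, 1, 64)
approx = analog_dot_product(W, x)
exact = W.T @ x
print(np.max(np.abs(approx - exact)))  # small quantization error
```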

Next Chapter: The Very Rich AI Accelerator Landscape

Previous Chapter: Transistors and Pizza (or: Why do we Need Accelerators?)

About me
