AI Accelerators — Part IV: The Very Rich Landscape

Adi Fuchs
26 min read · Dec 5, 2021


One of the hottest fields in the global tech industry is AI hardware. Billions of dollars have been poured into companies developing accelerated AI solutions, via startup acquisitions and funding rounds and via the stock market. In this article, we will go over the current state of the AI hardware industry and present an overview of the different bets companies are making in their search for the best way to tackle the AI hardware acceleration problem.

Finally, in the spirit of full disclosure, I am employed by one of the companies in this overview. The data presented here is available online and does not disclose any confidential material, and the commentary represents my own views and opinions. Now that that's out of the way, let us begin.

A Rising Tide Lifts All Boats

It seems like recent years have been a golden age for many AI hardware companies; NVIDIA's stock skyrocketed by about 500% over the past three years, dethroning Intel as the world's most valuable chip company, and the startup scene seems to be just as hot: billions of dollars have been poured into funding AI hardware startups that challenge NVIDIA's AI leadership.

AI Hardware Startups — Total Funding as of 4/2021 (Source: AnandTech, FWIW Nuvia is not in the AI Business)

Furthermore, there have been intriguing acquisition stories as well. In 2016, Intel bought Nervana for $350M to serve as its AI accelerator arm in the growing datacenter business. Interestingly, in late 2019 Intel bought another AI startup, Habana, which replaced the solution provided by Nervana. The more interesting part of the story is the whopping sum of $2B Intel paid for Habana; over the course of three years, it was willing to pay almost six times more for another solution that tackles the same problem: datacenter inference and training. The takeaway is twofold: (i) Intel believes that the value proposition of AI (in the datacenter) has greatly increased, and (ii) Intel believes AI is so important that it was willing to shift its focus away from the Nervana project, at the cost of millions of acquisition dollars and years of engineering effort, and aim for what it believes is a more promising solution.

The AI chip landscape, or more accurately, the AI accelerator landscape (by now, it goes well beyond just chips) contains a myriad of solutions and approaches, so let’s go over some of them, focusing on the main principles of each approach.

101 Ways to Cook AI Accelerators

NVIDIA: Started with GPUs + CUDA, Aims for Full Control.

“If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?” (Seymour Cray)

NVIDIA was founded in 1993 and has been one of the first major companies in accelerated computing. It pioneered the graphics processing unit ("GPU") industry and later became the world leader, providing a diverse line of GPUs for gaming consoles, workstations, and laptops. As discussed in earlier posts, a GPU employs thousands of simple cores rather than the handful of powerful cores found in a CPU. Originally, GPUs were used mainly for graphics, but around the mid-to-late 2000s they became extensively used for scientific applications like molecular dynamics, weather prediction, and physics simulations. The new applications, as well as the introduction of C-like software frameworks like CUDA and OpenCL, paved the way for porting new domains to GPUs, and GPUs gradually became general-purpose GPUs, or "GPGPUs."

ImageNet Challenge: Winning Error and Percentage of Teams using GPUs (source: NVIDIA)

Historically, one might claim that NVIDIA was lucky that modern AI started when CUDA was so popular and mature. Alternatively, one could argue that it was the maturity and popularity of GPUs and CUDA that made it possible for researchers to develop AI applications conveniently and efficiently. Either way, history is written by the winners — and the fact is that the most influential AI works like AlexNet, ResNet, and Transformers were implemented and evaluated on GPUs, and when the AI Cambrian explosion took place NVIDIA was leading the pack.

SIMT Execution Model: Code Divergence and Synchronization Example (source: NVIDIA)

GPUs follow a programming model called single-instruction-multiple-threads (SIMT), in which the same instruction executes concurrently on different cores/threads, each working on its own portion of the data as dictated by its assigned thread ID. The cores run their threads synchronously, in lock-step, which greatly simplifies the control flow and works well for domains like dense linear algebra, which neural network applications heavily rely on. On the flip side, SIMT is still conceptually a multi-threaded, C-like programming model that was repurposed for AI rather than designed specifically for it. Since both neural network applications and the processing hardware can be described as computation graphs, it might be more natural and efficient to have a programming framework that captures graph semantics, rather than serializing an AI graph into a sequence of instructions and then re-parallelizing it so it can run on the parallel hardware.
This is the main selling point of AI hardware startups and of companies that develop their own AI hardware. While shifting from CPUs to GPU architectures was a big step in the right direction, it is still not enough. GPUs are still conventional architectures that rely on the same computing model as CPUs. CPUs were confined by their architectural limits and got replaced by GPUs in domains like scientific computing because GPUs provided a better alternative. Therefore, by co-designing a computing model and hardware specifically for AI, newcomers can propose architectures whose potential gains exceed those of GPUs and claim their stake in the AI application market.
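To make the SIMT model concrete, here is a tiny Python sketch (purely illustrative, not CUDA or any vendor's API) of the "same kernel, different thread ID" style a GPU program follows:

```python
import numpy as np

def saxpy_kernel(tid, a, x, y, out):
    # Every "thread" runs this same code; the thread ID picks its data element.
    out[tid] = a * x[tid] + y[tid]

n = 1024
x, y = np.random.rand(n), np.random.rand(n)
out = np.empty(n)

# On a GPU these "threads" would run in lock-step across thousands of cores;
# here we simply loop over thread IDs to illustrate the programming model.
for tid in range(n):
    saxpy_kernel(tid, 2.0, x, y, out)

assert np.allclose(out, 2.0 * x + y)
```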

NVIDIA’s Roadmap of GPUs, CPUs, and DPUs (source: NVIDIA)

On NVIDIA’s side, it charges into the AI battlefield from two main angles: (i) asserting its market dominance by strengthening its line of GPUs with new architectural innovations in the form of Tensor Cores (accelerator cores for systolic operations) and strong collaboration with the research community, and (ii) taking full control of the system stack, with multi-billion dollar acquisitions like Mellanox, to attain smart networking capabilities in the form of Data Processing Units (or ”DPUs”), and the pending acquisition of ARM, which sells CPU cores and IP that are part of billions of phones and mobile devices around the world. The first announced ARM-NVIDIA collaboration is a datacenter CPU called “Grace,” which would give NVIDIA’s GPUs fast access to large DRAM memory spaces, since its current datacenter GPU lines (V100 and A100) are equipped with HBM memories, which are fast but limited in size. NVIDIA envisions lines of mobile systems, autonomous cars, laptops, and datacenters, each with a vertical “all-NVIDIA” software-hardware-system stack of CPUs and GPUs (and DPUs for datacenters), and that would set a high bar for NVIDIA’s competitors.

Cerebras: Go Big with Wafer-Scale Engines

One of the more interesting companies in the landscape is Cerebras, which was founded in 2016. As AI models become increasingly complex, training them requires more memory, communication, and computing power than a single processing chip can provide. Therefore, Cerebras has designed a Wafer-Scale Engine (WSE), a “chip” the size of a pizza box. To give some context as to why I wrote “chip” in air quotes and what WSE actually means, I’ll get a little bit into the process of chip manufacturing.

Andrew Feldman, Cerebras CEO with the WSE-2 (source: IEEE spectrum)

Typical processing chips are fabricated on a piece of silicon called a “wafer.” As part of the manufacturing process, the wafer is dissected into smaller pieces called “dies,” each of which becomes what we call a “processing chip.” A typical wafer holds hundreds or even thousands of such chips, with die sizes typically ranging from 10 mm² up to around 830 mm² (the manufacturing limit for a single die). NVIDIA’s A100 GPU is considered almost the biggest chip possible at 826 mm², which allows it to pack 54.2 billion transistors powering around 7,000 processing cores.
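For a rough sense of the numbers (my own back-of-the-envelope arithmetic, ignoring scribe lines, edge loss, and yield), here is how many dies of each size could fit on a standard 300 mm wafer:

```python
import math

wafer_diameter_mm = 300                                 # standard wafer size
wafer_area = math.pi * (wafer_diameter_mm / 2) ** 2     # ~70,686 mm^2

for die_area in (10, 100, 826):                         # die sizes from the text (mm^2)
    upper_bound = wafer_area // die_area                # crude upper bound on dies per wafer
    print(f"{die_area:4} mm^2 die -> at most ~{int(upper_bound)} dies per wafer")

# A reticle-limited ~826 mm^2 die (A100-class) yields well under a hundred
# candidate dies per wafer, before accounting for defects and edge waste.
```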

Cerebras WSE-2 vs NVIDIA A100 spec comparisons. (Source: BusinessWire)

Now back to Cerebras: their “chip” is a single wafer. There is no dissection; all the manufactured cores stay on the big piece of silicon, and that is the chip. Compared to the A100, Cerebras’ second-generation WSE (WSE-2) houses 123x more cores and 1,000x more on-chip memory at 56x the size. While they did not explicitly disclose power consumption numbers, they are assumed to be in the 15-16 kilowatt range, which is about 37-40x more power than the A100’s 400W power envelope. Their premise is that the chip is large enough that even big models can fit in the WSE-2’s 40GB of on-chip memory, which is the same size as the first A100’s off-chip memory (later upgraded to 80GB).
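The headline ratios above can be reproduced from the publicly reported spec-sheet figures; the numbers below are approximate and my own compilation, so treat them as ballpark values:

```python
# Approximate publicly reported figures (WSE-2 vs. A100); exact values vary by source.
wse2 = {"cores": 850_000, "on_chip_mem_GB": 40,   "area_mm2": 46_225, "power_W": 15_500}
a100 = {"cores": 6_912,   "on_chip_mem_GB": 0.04, "area_mm2": 826,    "power_W": 400}

for key in wse2:
    print(f"{key:15s}: ~{wse2[key] / a100[key]:.0f}x")
# cores          : ~123x
# on_chip_mem_GB : ~1000x
# area_mm2       : ~56x
# power_W        : ~39x
```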

Cerebras is not only about housing supercomputer capabilities on a single large chip; they also provide a glimpse of their software stack and compiler toolchain via collaborations with academic institutes and US national labs. Their software framework is based on a Linear-Algebra Intermediate Representation (LAIR) and a C++ extension library that can either be used by low-level programmers to write manual kernels (similar to NVIDIA’s CUDA) or be used to seamlessly lower high-level Python code from frameworks like PyTorch or TensorFlow.

To conclude, Cerebras’ unconventional approach has intrigued many in the industry, as it seems a very daring challenge to take on even for a multi-billion dollar corporation, let alone a startup. From an engineering perspective, they are in uncharted territory in almost every part of the stack: larger chips mean a higher probability of cores failing due to defects, so how do you control manufacturing defects? How do you cool down nearly a million cores? How do you synchronize them, and how do you program them? How can you maintain the integrity of a signal traveling such a long route? A lot of open questions, but one thing is for sure: Cerebras has gotten a lot of people’s attention.

GraphCore: In-House Software Stack + Multi-threaded Dataflow Execution

GraphCore is among the first startups to present a commercial AI accelerator, called the “Intelligence Processing Unit” (or IPU). They have announced several collaborations with Microsoft, Dell, and other commercial and academic institutions, and they are already shipping their second-generation IPU. Their solution is based on an in-house software stack called “Poplar.” Poplar lowers PyTorch, TensorFlow, or ONNX-based models to imperative, C++-compatible code, following what the company calls “vertex programming.” Like NVIDIA’s CUDA, Poplar also supports low-level C++ programming of kernels to achieve better potential performance (compared to Python applications that were lowered via the Poplar compiler).
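For the high-level path, GraphCore also provides a PyTorch wrapper called PopTorch that hands a model to the Poplar toolchain for compilation; the sketch below is a hedged, minimal example of that flow, and the exact API may differ between SDK versions:

```python
import torch
import poptorch  # GraphCore's PyTorch integration on top of Poplar

model = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 10),
)

opts = poptorch.Options()                         # device/compilation options
ipu_model = poptorch.inferenceModel(model, opts)  # wraps the model for IPU execution

x = torch.randn(8, 128)
y = ipu_model(x)                                  # first call triggers Poplar compilation
```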

2nd Generation IPU High-Level Chip Diagram (source: GraphCore)

An IPU consists of a “tiled” many-core design. Tiled architectures were invented in the early 2000s at MIT and describe 2D grids of replicated structures, each combining a network switch, a small local memory, and a processing core. The first-generation IPU had 1,216 tiles, and the current second-generation IPU has 1,472 tiles. Each IPU core can execute up to six threads, which are streams of code in GraphCore’s proprietary instruction set architecture (ISA). GraphCore calls this the “Tile ISA” and notes that it was designed specifically for AI applications.

Example of a Bulk Synchronous Parallel Execution Timeline (source: GraphCore)

The IPU has a “Bulk Synchronous Parallel” execution model. The idea is that all of the chip’s tiles operate synchronously according to three general execution phases. (i) Compute: all tiles perform the mathematical computations specified by their assigned threads. (ii) Sync: a phase in which we wait for all tiles to finish their execution. (iii) Exchange: all the computed output data is written to the exchange memory, and if needed, it will be used as input by (potentially) other tiles in the next Compute-Sync-Exchange phase.
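Here is a toy Python sketch of the Compute-Sync-Exchange pattern (my own simplified model of BSP, not GraphCore code):

```python
import numpy as np

n_tiles = 4
local = [np.random.rand(8) for _ in range(n_tiles)]   # each tile's local data
inbox = [np.zeros(8) for _ in range(n_tiles)]         # exchange buffers

for step in range(3):                                  # a few BSP "supersteps"
    # (i) Compute: every tile works independently on its local data.
    results = [x * 2.0 for x in local]

    # (ii) Sync: conceptually a barrier; in this serial sketch the loop itself
    # guarantees all tiles finished before the exchange begins.

    # (iii) Exchange: each tile sends its result to its neighbor (a ring here).
    for t in range(n_tiles):
        inbox[(t + 1) % n_tiles] = results[t]

    local = [buf.copy() for buf in inbox]              # next superstep's inputs
```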

Reconfigurable Dataflow — Wave Computing, SambaNova, SimpleMachines

Wave Computing, SambaNova, and SimpleMachines are three startups whose accelerator chips build on two concepts (overviewed in the previous chapter): (i) reconfigurability: from a processor-taxonomy point of view, their accelerators are classified as Coarse-Grained Reconfigurable Arrays (CGRAs), originally suggested in 1996. CGRAs describe a software-defined hardware approach in which the compiler determines the structure of the computational datapaths and the behavior of the on-chip memories. (ii) dataflow: their designs rely on graph-oriented hardware driven by the dataflow graph laid out by the AI application. Conceptually, since AI applications can be defined by a computational graph, they can be naturally expressed as dataflow programs.
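To illustrate the dataflow idea, here is a tiny Python sketch (illustrative only) that treats a model as a graph of operators and fires each node as soon as its inputs are ready, instead of walking a serialized instruction stream:

```python
import numpy as np

# A toy computation graph: node -> (operation, list of input nodes)
graph = {
    "x":    (None, []),                       # graph input
    "w":    (None, []),                       # weights
    "mm":   (lambda a, b: a @ b, ["x", "w"]),
    "relu": (lambda a: np.maximum(a, 0), ["mm"]),
}

values = {"x": np.random.randn(4, 8), "w": np.random.randn(8, 2)}

# Fire any node whose inputs are available (dataflow-style scheduling).
while len(values) < len(graph):
    for node, (op, inputs) in graph.items():
        if node not in values and all(i in values for i in inputs):
            values[node] = op(*[values[i] for i in inputs])

print(values["relu"].shape)  # (4, 2)
```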

Wave Computing (No longer active)

Time-Based Mapping of DPU Kernels (Source: Wave Computing)

Wave Computing was founded in 2008, before the golden age of AI had started. It was in stealth mode for a while, getting funding from various sources and probably pivoting over the years before converging on an AI accelerator called the “Dataflow Processing Unit” (DPU), which became accessible around mid-2017 (before many of the startups overviewed here were even founded).
In retrospect, Wave Computing’s DPU was revolutionary and daring (arguably, Wave Computing’s overly ambitious approach might have contributed to its demise). <Overly_Detailed_Hardware_Alert> It ran 16,000 processing cores at a very high clock frequency of 6.7GHz; at such speeds it is impossible to synchronize the core clocks across remote parts of the chip, so they had to create a globally asynchronous design. They had controllers that access both high-speed, low-capacity memory cubes (HMCs) and low-speed, high-capacity DRAM (DDR4), so they also needed smart off-chip memory scheduling to master the heterogeneity of the two memory types. </Overly_Detailed_Hardware_Alert>
Around 2019, Wave Computing decided to abandon its AI business in favor of the IP rights it had bought for the general-purpose MIPS instruction set architecture and its line of processor IPs, but unfortunately, in 2020, it filed for bankruptcy.

SambaNova

Founded in late 2017, SambaNova has gained a lot of traction, repeatedly announcing the biggest-ever funding rounds in the landscape.

SambaNova’s RDU Block Diagram (source: SambaNova)

SambaNova is building chips, racks, and software stacks for datacenters, targeting AI inference and training. At the core of its architecture lies the “Reconfigurable Dataflow Unit” (RDU). The RDU chip contains an array of compute units (called “PCUs”) and scratchpad memory units (called “PMUs”) organized in a 2D-mesh hierarchy and interconnected with NoC (network-on-chip) switches. The RDU accesses off-chip memory using a hierarchy of units called AGUs and CUs.

SambaNova’s Key Use Cases (source: HPCWire)

SambaNova’s software stack (called “Sambaflow”) takes high-level python applications (e.g., in PyTorch, TensorFlow) and lowers them into a representation that can program the chip’s switches, PCUs, PMUs, AGUs, and CUs at compile-time, with the dataflow inferred from the original application’s computation graph. SambaNova has showcased how the RDU architecture runs complex NLP models, recommender models, and high-resolution vision models.

SimpleMachines (No longer active?)

SimpleMachines was founded in 2017 by a group of academic researchers from the University of Wisconsin. Their group has been exploring reconfigurable architectures that rely on heterogeneous datapaths combining both von Neumann (instruction-by-instruction) and non-von Neumann (i.e., dataflow) execution. Much of the data provided by the company refers to the original research papers published in top-tier venues. The guiding architectural principles seem to be somewhat similar to what SambaNova is doing: develop a reconfigurable architecture that can support non-conventional programming models to achieve flexible execution capable of dealing with a highly-changing AI application space. SimpleMachines tries to expand the model further to support both traditional (von Neumann) and non-traditional datapaths under the concept of “composable computing” (i.e., dividing a configurable fabric into sub-functional datapaths and fusing them to provide a mix of generic and domain-specific accelerations.)

SimpleMachines’ Mozart Chip Block Diagram (source: SimpleMachines)

In 2020, SimpleMachines released their first accelerator generation, based on a chip called “Mozart.” Mozart consists of an array of configurable tiles that rely on the specialization of control, compute, data-gather, and synchronization elements (originally, the primary axes of specialization were concurrency, computation, data reuse, coordination, and communication, as suggested in one of the papers that laid the foundations for Mozart). Finally, according to LinkedIn, just before I started doing this survey, the people listed as the company’s leadership all moved on to positions elsewhere. Therefore, it is likely that SimpleMachines is no longer active.

Hailo: Efficient Dataflow for Edge Inference

Hailo is an Israeli startup founded in early 2017. It targets edge devices, i.e., mobile devices like phones or cameras, which typically have low power budgets (<5 Watts) since they cannot pack massive cooling solutions and usually need to retain long battery life.

In contrast to datacenter-based platforms, the edge-based platform space spans different design points with different goals; typical AI practice for edge devices uses models with a more modest number of parameters and/or low computation precision (e.g., 4-bit integers), simply because it is impossible to have complex hardware operating at high computing rates without exceeding the chip’s power budget. Furthermore, since model predictions are made on a small context (for example, a single image at a time), edge devices sometimes have real-time requirements. This means the computation’s latency, i.e., the time between when the computation starts and when it finishes, is bounded by a rigid constraint. For example, it is critical for an object-detection system in an autonomous car to detect objects and communicate its decisions within a few milliseconds to avoid hitting an obstacle on the road.
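A minimal sketch of the low-precision trick mentioned above, with weights mapped to small integers plus a scale factor (my own illustrative example, not Hailo's toolchain):

```python
import numpy as np

def quantize(w, n_bits=4):
    """Symmetric linear quantization of a weight tensor to n_bits integers."""
    qmax = 2 ** (n_bits - 1) - 1                  # e.g., 7 for 4-bit
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

w = np.random.randn(64, 64).astype(np.float32)
q, scale = quantize(w, n_bits=4)
w_hat = q.astype(np.float32) * scale              # dequantized approximation

print("max abs error:", np.abs(w - w_hat).max())  # small, but not zero
```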

Hailo Hardware-Software Stack (source: Hailo)

Hailo developed an in-house dataflow compiler with native TensorFlow and Keras support (and ONNX support, which means it can import a portable format from other frameworks like PyTorch). The dataflow compiler parses the high-level Python code and lowers it to the operations that construct the model’s layers. Next, if needed, the compiler reduces the numeric representation, and it then maps the layers onto the available on-chip resources: compute (i.e., arithmetic units that perform operations such as multiply-accumulate), memory (software-defined scratchpad memories holding data like convolution-layer weights), and control units (which are in charge of orchestrating and timing the operations).

Possible Mapping of an 8 Layer Model to a Pool of Compute+Control+Memory Resources (source: Hailo)
Hailo 8 Based PCBs (source: Hailo)

Hailo’s main line of products is based on the Hailo-8 accelerator chip, which supports a throughput of up to 26 TOPS at a typical power consumption of 2.5W (no data was provided on maximum power). It is sold on printed circuit boards (PCBs) that connect to an existing computer system via PCIe as external cards. Hailo currently has three offerings that differ in form factor and PCIe interface width.
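For context, the efficiency implied by those two numbers (using the typical rather than worst-case power figure):

```python
tops, typical_watts = 26, 2.5
print(f"~{tops / typical_watts:.1f} TOPS/W at typical power")   # ~10.4 TOPS/W
```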

Systolic Arrays + VLIW: TPUv1, Groq, and Habana

TPUv1

One of the world’s first processors tailored specifically for AI was Google’s Tensor Processing Unit, or “TPU.” The TPU was presented in 2017 as a programmable Application-Specific Integrated Circuit (ASIC) for Deep Neural Networks (DNNs), meaning it was a chip designed and fabricated from the ground up for a specific task (or set of tasks). The motivation for the TPU was an internal study by Google’s scientists, which concluded that the rise in computing demands of Google voice search alone would soon require them to double their datacenter capacity if they were to rely on traditional CPUs and GPUs. Therefore, in 2015 they started working on their own accelerator chip designed to target their internal DNN workloads, such as speech and text recognition.

Block Diagram of The First-Generation TPU Architecture (source: arXiv)

The first-generation TPU was built for inference-only workloads in Google’s datacenters and combined a systolic array, called the “Matrix-Multiply Unit,” with a VLIW architecture. In later generations, Google’s engineers designed the TPUv2 and TPUv3 to do AI training. They used larger matrix-multiply units and added new tensor units like a transpose-permute unit. They also used liquid cooling and a purpose-built, torus-based interconnection architecture that scales to thousands of accelerators, forming an AI training supercomputer. If you would like to dig further into the process of architecting AI training accelerators, I highly recommend reading Google’s TPUv2/3 CACM paper or watching Prof. David Patterson’s talk at UW.
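As a rough functional sketch of what a systolic Matrix-Multiply Unit computes (my own simplified model that ignores the cycle-level skewing of real hardware): weights stay resident in a grid of multiply-accumulate cells, activations stream across the rows, and partial sums accumulate down the columns.

```python
import numpy as np

def systolic_matmul(X, W):
    """Functional model of a weight-stationary MAC grid computing X @ W."""
    B, K = X.shape
    K2, N = W.shape
    assert K == K2
    Y = np.zeros((B, N))
    for b in range(B):                 # each input row is streamed through the array
        for k in range(K):             # activation X[b, k] traverses row k of PEs
            for n in range(N):         # PE (k, n) holds W[k, n] and adds to the
                Y[b, n] += X[b, k] * W[k, n]   # partial sum flowing down column n
    return Y

X, W = np.random.randn(4, 8), np.random.randn(8, 3)
assert np.allclose(systolic_matmul(X, W), X @ W)
```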

Groq

While TPUs are available in Google’s cloud offerings, their goal was to satisfy Google’s AI demands and serve its own internal workloads. As such, Google tailored the TPUs to its specific needs; it is not particularly aiming for massive commercialization of TPUs or competing head-to-head with other companies. Therefore, in 2016 a team of some of the TPU’s architects left Google to design a new processor with baseline characteristics similar to the TPU’s and commercialize it in a new startup called “Groq.”

Groq TSP Execution Block Diagram (source: Groq)

At the core of Groq’s solution lies the Tensor Streaming Processor (or TSP), targeting datacenter inference. The TSP architectural structures have a lot in common with the TPU; both architectures heavily rely on a systolic array to do the heavy lifting. Compared to the 1st generation TPU, the TSP added a vector unit and transpose-permute unit (which can also be found on the second and third-generation TPUs).

Groq VLIW Instruction Set and Description (source: Groq)

Habana

Aside from the technological aspect, Habana has an interesting approach and story. It was founded back in early 2016 as an AI accelerator company targeting training and inference in datacenters. In contrast to most startups in the AI datacenter landscape, Habana took an open approach to its chip offerings and their performance. In about three years, they showcased two chips for different applications: Goya for inference and Gaudi for training. Given that it takes at least two years from initial planning to a working chip, and that is when you already have a team (rather than building one in your new startup) and are lucky enough to have a successful tape-out, they operated on very ambitious timelines.

Goya and Gaudi High-level Block Diagrams (sources: Habana 1,2)

The Goya and Gaudi chips share the same basic architecture. Both rely on a GEMM engine, a systolic matrix-multiply unit, working side by side with an array of tiles. Each tile contains a local, software-controlled scratchpad memory and a Tensor Processing Core (TPC) with vector computation units of varying precision, i.e., they can compute vectorized operations on 8-bit, 16-bit, or 32-bit operands. The TPCs and the GEMM engine communicate via DMA and a shared memory space, and they communicate with the host processor via PCIe.

By examining the differences between the Goya and Gaudi architectures, it is possible to deduce the different emphases of inference and training.

a. Scale-out capability: Importantly, Gaudi has a built-in engine for Remote DMA (RDMA, or more accurately RDMA over Converged Ethernet, “RoCE,” but let’s not go there at this time). The benefit of RDMA is the ability to perform direct accesses (no CPU or operating system involved) to the memory spaces of other systems in the datacenter. By doing that, you communicate data efficiently and can scale the application out beyond a single chip’s memory space, thus leveraging the compute and memory power of multiple chips. This matters because training modern AI models is very compute and memory heavy; for large models like the GPT-3 language model, training would take a few lifetimes on a single high-performance GPU. High-performance RDMA communication was one of the main reasons for NVIDIA‘s Mellanox acquisition, but compared to Mellanox’s Infiniband RDMA offering, Habana’s RoCE is a more affordable alternative.

b. Memory intensity: While Goya uses traditional 16GB DRAM as off-chip memory, Gaudi uses 32GB of High Bandwidth Memory, or “HBM.” As the name suggests, HBMs are memories with high bandwidth, needed to sustain the high data rates of modern training applications: compared to Goya, Gaudi doubles the memory size and increases the memory bandwidth by an order of magnitude (Gaudi’s HBMs deliver 1TB/s, while Goya’s use of two DDR4 channels implies a typical bandwidth of about 30-40GB/s).

c. BF16 support: One of the main problems of AI is finding the sweet spot between efficiency and accuracy; specifically, floating-point operations are common in training since they are used by batch normalization (an algorithm that speeds up training convergence). While 32-bit floating-point (FP32) variables are accurate, they cost a lot of computing power, and while 16-bit floating-point (FP16) variables are cheap, they were not sufficiently accurate. In 2018, a team of engineers from Google Brain determined that the main problem with FP16 was that its representation range was not wide enough, so they devised the “Brain Float16” (BF16) format, which uses fewer bits for the mantissa (which determines the precision, i.e., the smallest fraction you can represent) and more bits for the exponent (which determines the number range, i.e., the largest absolute values you can express).
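The range-versus-precision trade-off behind BF16 is easy to see from the formats' metadata; the following small PyTorch check reflects properties of the IEEE FP16 and BF16 formats themselves:

```python
import torch

for dtype in (torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):15s} max ~{info.max:.3e}   eps ~{info.eps:.2e}")
# float16  : max ~6.550e+04, eps ~9.77e-04 -> fine precision, narrow range
# bfloat16 : max ~3.390e+38, eps ~7.81e-03 -> coarse precision, FP32-like range

# A value that overflows FP16 but not BF16:
print(torch.tensor(1e5, dtype=torch.float16))   # inf
print(torch.tensor(1e5, dtype=torch.bfloat16))  # finite (rounded to the nearest BF16 value)
```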

Habana’s Goya MLPerf Inference Results (source: Habana)

While other startups focused on engaging with private customers to tackle their private models, Habana was one of the few startups releasing results for “MLPerf,” the industry’s joint effort at a standardized benchmark suite for AI processors. Habana released its Goya inference performance results in November 2019 and went head to head with big corporations like NVIDIA and Google, and, importantly, beat Intel’s datacenter inference processor, the NNP-I. A month after the MLPerf results were published, Intel acquired Habana for 2 billion dollars to replace its existing solutions, so it definitely seems that Habana’s MLPerf approach paid off (pun intended; sorry, I couldn’t help myself). On a side note, I have some reservations about MLPerf, but that’s a topic for a whole other post.

RISC-Based AI Accelerators: Esperanto and TensTorrent

Esperanto

Esperanto was founded back in 2014 and remained in stealth mode for quite some time, until announcing its first product, the ET-SoC-1 chip, at the end of 2020. The ET-SoC-1 is a RISC-V based heterogeneous many-core chip, with 1,088 “Minion” low-power, low-voltage cores for vectorized computations and 4 out-of-order “Maxion” high-power general-purpose cores, which enable using the ET-SoC-1 as a host processor (meaning the processor running the operating system) and not only as a standalone accelerator connected to a host via PCIe. The ET-SoC-1 targets datacenter inference workloads, currently demonstrated on large recommendation models.

Full-Chip Block Diagram of Esperanto’s ET-SoC1 with a Handful of Big “Maxion” Cores and Many Little “Minion” Cores (source: Esperanto/HotChips)

An interesting property of Esperanto’s ET-SoC-1 is its low operating power of 20W, which is unusual in a landscape of chips reaching 200W-400W power budgets (a subtlety here is that the power budget is not the same as typical power, but a worst-case maximum). This is (presumably) enabled by the programming capabilities of the general-purpose RISC-V cores, which can benefit from programmer-controlled sparsity in the large embedding tables common in commercial recommendation models.
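A quick illustration of the embedding-table access pattern that makes this possible (my own example, not Esperanto's software): each recommendation query touches only a handful of rows of a huge table, so a programmable core can fetch just those rows instead of streaming the entire table.

```python
import numpy as np

num_items, dim = 10_000_000, 64
# The full table is a multi-GB structure sitting in off-chip memory.
table_nbytes = num_items * dim * 4            # FP32 entries

batch_indices = np.random.randint(0, num_items, size=(256, 20))  # 20 items per user
touched_rows = np.unique(batch_indices).size
touched_nbytes = touched_rows * dim * 4

print(f"rows touched: {touched_rows} of {num_items}")
print(f"bytes read:   {touched_nbytes / 1e6:.1f} MB of {table_nbytes / 1e9:.2f} GB")
```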

Physical Properties of x86 CPU vs. Esperanto ET Minion many-core Chip (source: Esperanto/HotChips)

To achieve high energy efficiency, Esperanto invested heavily in micro-architectural and low-level optimizations in chip circuitry and floorplanning, eliminating much of the dynamic energy spent on processing data and communicating it over the chip’s wires.

TensTorrent

Toronto-based TensTorrent was founded in 2016 and is currently valued at 1 billion dollars. Based on several talks given at venues like HotChips and the Linley Processor Conference, TensTorrent offers several lines of chips targeting not only datacenters but smaller platforms as well. They also offer access to their own DevCloud.

TensTorrent Approach — Graph Parallelism and Tensor-Slicing (source: YouTube/TensTorrent)

From online talks, TensTorrent’s approach is built upon a “resizable” chip architecture that can be customized to produce small chips as well as large ones (scale-up), and potentially multi-rack supercomputers built with a scalable interconnect hierarchy that connects processing cores as well as systems (scale-out).

TensTorrent Core (source: YouTube/TensTorrent)

The core idea of TensTorrent is a combination of pipelining the computation graph and slicing the model’s tensors into packets while exploiting tensor sparsity (tensor data contains many zeros, so matrix multiplications can be done efficiently by removing a lot of redundant operations). TensTorrent’s chips are based on a tiled architecture, which combines an AI compute unit with processing cores based on the RISC-V instruction set architecture. Specifically, they recently announced a partnership with the RISC-V startup SiFive, in which they will use SiFive’s Intelligence X280 RISC-V processors as the baseline core for their next-generation accelerators. The X280 is an in-order processor that has both vector and scalar compute units and memory. Lastly, as I was writing this post, TensTorrent announced that they are building their own RISC-V based high-performance CPUs that will connect to their AI accelerators through the same communication network (NoC), which will allow them to extend their clusters, combine high-performance host processors and accelerators, and gain from having control over how the processors and accelerators communicate.
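As a toy sketch of the zero-skipping idea (illustrative only, not TensTorrent's implementation), when an activation vector is mostly zeros the multiply-accumulate work drops roughly in proportion to its density:

```python
import numpy as np

def sparse_matvec(W, x):
    """Multiply only where x is non-zero, skipping redundant MACs."""
    nz = np.flatnonzero(x)                          # indices of non-zero activations
    return W[:, nz] @ x[nz], len(nz) * W.shape[0]   # result, MACs actually performed

W = np.random.randn(512, 512)
x = np.random.randn(512)
x[np.random.rand(512) < 0.9] = 0.0                  # ~90% sparse activation vector

y, macs_done = sparse_matvec(W, x)
assert np.allclose(y, W @ x)
print(f"MACs done: {macs_done} vs dense {W.size} (~{macs_done / W.size:.0%})")
```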

Mythic

Mythic is one of the earliest startups in the AI hardware landscape. It was founded back in 2012, at a time when AI hardware was in its infancy. Mythic’s approach is based on analog matrix processing. The core idea is that the most compute- and data-intensive parts of AI applications are matrix multiplication kernels.

Weights and Input/Output data Difference in a Matrix Multiply Operation (source: Mythic)

Specifically, in inference applications the inputs and weights have different characteristics: weights typically occupy a lot of memory, while each weight is used only once to compute the output (weights are not updated when doing inference). In contrast, the input data is smaller, and each input element is used multiple times. Therefore, while input data can be stored in on-chip memory (SRAM-based scratchpads or cache memories), the weight data must be stored off-chip, and accessing the weights becomes costly. Their first observation is therefore that efficient DNN processing requires reducing the cost of weight memory accesses. The second observation (also shown by other PIM studies) is that since DNN computations are greatly dominated by “multiply-and-accumulate” (MAC) primitives, i.e., OUT_k = W_i*IN_i + W_j*IN_j + …, it is possible to design a highly parallel analog compute unit using basic electrical circuit theorems.
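In code, the kernel Mythic targets is just a nest of multiply-accumulates; in their analog scheme each weight effectively becomes a flash-cell conductance and each input a voltage, so the accumulation happens as currents summing on a wire. The sketch below is only the digital reference computation:

```python
import numpy as np

def matvec_as_macs(W, x):
    """out_k = sum_i W[k, i] * x[i]: the primitive an analog MAC array replaces."""
    out = np.zeros(W.shape[0])
    for k in range(W.shape[0]):
        for i in range(W.shape[1]):
            out[k] += W[k, i] * x[i]     # one multiply-accumulate per weight
    return out

W, x = np.random.randn(16, 32), np.random.randn(32)
assert np.allclose(matvec_as_macs(W, x), W @ x)
```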

Mythic Analog Compute Engine — Flow Diagram and Analog Computation (source: Mythic)

To address these issues, Mythic designed an Analog Compute Engine (ACE) that uses flash memory, not DRAM, to store the weights. Essentially, instead of fetching both the input and the weight data from memory, they transfer the input data to the weight flash memory, convert it to the analog domain, perform the MAC computations in the analog domain, and convert the result back to get the output data, and by doing that they avoid the costs of reading and transferring weights from memory.
While Mythic originally targeted inference for both datacenters and edge devices, its current line of Analog Matrix Processing (AMP) products targets edge devices only. One probable reason for this decision is that for edge devices, the impact of core-to-memory efficiency is higher and easier to predict, whereas for larger integrations such as datacenter environments, one would need to take into account a wide array of system-level overheads, like network stress or power-supply considerations. Another subtlety is that using flash memories would probably not work for AI training, as weights need to be updated; flash memories often endure only a limited number of writes (typically a few thousand) before they become unusable.

LightMatter: Photonics-Based Analog Computing

As transistor scaling is stagnating and has an unclear roadmap, LightMatter is seeking alternative technology mediums that might hold great promise for a sustainable compute-intensive future. Specifically, they target AI datacenter inference applications using silicon photonics, which means they have fiber optic cables connected to their chip and use lasers.

Photonics vs. Electronics Calculation Properties (Source: HotChips/LightMatter)

They designed a systolic-array-based engine that performs multiply-and-accumulate operations by manipulating photonic input signals, with operands encoded as phase shifts in the light waves (similar to how signals are modulated when sent over fiber optic cables). As the photonic data flows at the speed of light, LightMatter’s chips can potentially perform matrix and vectorized operations at very high speeds and with potentially orders of magnitude less power.

Envise-Based Server Block Diagram (source: LightMatter)

LightMatter’s product offering consists of three components: the “Envise” processor, which is embedded in their servers; the “Passage” interconnect, which offers up to wafer-scale optical processor-to-processor communication to potentially build photonics-based supercomputers; and the “Idiom” software framework, which takes Python computation graphs (e.g., PyTorch, TensorFlow, etc.) and stamps out matrix and vector kernels to be computed on the optical systolic fabric.

The technology seems very promising, and as Moore’s law is slowing down, eventually there will be a limit to what we can get from transistor-based chips. I truly believe that without innovation in the physics domain, or in domains that do not rely on transistors alone, even hardware acceleration will reach the end of the road (source: shameless self-promotion of past research). I think the main challenge here is the conversion to and from the analog domain, since like other analog compute systems they need digital-to-analog and analog-to-digital converters. The conversion from analog to digital is power-hungry, and it has been shown that the power and area requirements increase by 2-8x for every bit added, which is why they support only up to 16-bit operations.
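To see why converter cost caps the precision, here is the compounding implied by the cited 2-8x-per-bit figure (my own arithmetic):

```python
# Cost multiplier of an ADC/DAC when adding extra bits, at 2x and 8x per bit.
for extra_bits in (2, 4, 8):
    low, high = 2 ** extra_bits, 8 ** extra_bits
    print(f"+{extra_bits} bits -> {low}x to {high}x more converter power/area")
# +8 bits at the pessimistic end is roughly 16 million x, which is why analog
# designs stop at modest precisions (e.g., 16-bit) rather than chasing FP32.
```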

NeuReality: TCO-Driven Approach to Eliminate System Overheads

NeuReality was founded in early 2019 and “semi” came out of stealth mode in 2021 (“semi” because they mainly share a vision but have yet to disclose many details about their system). While accelerator vendors sometimes see their hardware as the central and most important element in the cloud, other things are happening in the cloud, and these affect all of the cloud’s applications. NeuReality’s observation is that system-level imbalance harms the productivity of AI-as-a-Service clouds: there is a great deal of heterogeneity in the application space, which is hard to manage and leaves accelerators and/or CPUs underutilized; it is often hard to efficiently schedule different applications in a way that employs the variety of processors in a cloud or datacenter environment while taking into account network traffic and the management overheads of the operating system (and hypervisor). At cloud scale, these inefficiencies amount to millions of dollars lost daily due to wasted electricity or suboptimal performance.

NeuReality NR1-P Prototype (source: ZDNet)

In February 2021, NeuReality unveiled the “NR1-P,” their AI-centric cloud rack prototype, based on Xilinx FPGA cards that contain both general-purpose (ARM) processors and a VLIW-based acceleration fabric capable of running AI applications. In November 2021, NeuReality announced a partnership with IBM that includes licensing IBM’s low-precision AI cores to build the “NR1,” a production-grade (non-prototype) server which, compared to an FPGA prototype, would have higher efficiency for AI applications.

From online interviews, it appears that NeuReality’s high-level vision is a cloud server that treats AI applications and their OS as a contained application space running separately from other cloud apps. By doing that, they would be able to co-design CPUs with accelerators, and potentially the network and memory as well, to deliver a system-level optimized cloud server.

Conclusions

Many companies are developing their own AI accelerators. I highlighted some of their implementations here, though not all of them, as my writing could not keep up with the influx of new announcements.

It appears that in the field of datacenter AI accelerators, many companies are targeting a better coupling of the accelerator with the CPU and the network:

  • NVIDIA is betting on new lines of CPUs and DPUs.
  • NeuReality is building system-centric AI servers.
  • TensTorrent decided to design their own RISC-V processors that will connect to their accelerators via their proprietary NoC.
  • Esperanto’s chip has a combination of many little cores for AI and a few big cores capable of running the operating system.

It was also interesting to see the relationship between academia and industry. On the application side, almost all widespread AI models originated in academic papers, many of which were affiliated with companies. On the hardware side, however, there seems to be some disconnect between academia and industry. While hundreds of AI accelerator papers have been published over the past five years in leading venues like ISCA, HPCA, MICRO, DAC, ISSCC, and VLSI, only a handful of those ideas made it into actual products. Most companies’ core architectures are based on “stable” ideas that have already been experimented with in different contexts. Notably, companies combine VLIW architectures with systolic arrays (Google, Groq, Habana) or support the dataflow execution of compiler-lowered computational graphs (Wave Computing, SambaNova, SimpleMachines, GraphCore, Hailo, and possibly TensTorrent and Cerebras).

I think hardware’s academia-industry disparity stems from the fact that things move much more slowly than in the applications world, and projects are more costly. Experimenting with new hardware ideas in a competitive landscape is risky, since it requires a lot of human and financial resources to materialize those ideas. The typical scope of hardware research papers in academic projects is also much narrower than that of mature products, as hardware verification alone can take many years of simulation time. This is not to say that there is a lack of innovation; on the contrary, the optimist in me thinks there is still a lot of headroom for it. I think that once the landscape establishes its baseline architectures, we will see more of these ideas getting materialized.

Next Chapter: Final Thoughts
Previous Chapter: Architectural Foundations

About me
