A Future of AI: Through The Semiconductor Looking Glass

Adi Fuchs
26 min read · Feb 5, 2024


“A computer chip with a human brain on it. The chip is holding a magnifying glass” (Created by Bing Create)

It has been a while since I sat down to organize my thoughts and wrap them in a blog post. It seems that every time I publish a new post, the AI semiconductor world has drastically changed since my last one, and this time — AI compute demands and costs are off the charts, so it seems like there has never been a more interesting time to talk about AI in the context of compute.

In this post, I explore the relationship between AI and accelerated processors. I will show why accelerated hardware is advocated as one of the promising drivers of the semiconductor industry in a post-Moore’s law era and how it plays a crucial role in the renaissance of modern AI. I will also show where accelerated processors fall short, how their limitations apply to the AI landscape, and how they will likely affect the future progress of the AI industry. I conclude the post with a discussion on what future directions the AI industry should pursue to achieve an evermore capable AI.

Once again, as I am still trying to preserve my personal touch in these posts — I still did not use any LLMs to generate the textual content of this blog, so it might be less polished, but (hopefully) with fewer hallucinations. I hope this will be an enjoyable and useful read!

The Three Pillars of Modern AI

Artificial intelligence has intrigued our imagination for over 50 years. People have long been excited about the prospect of intelligent machines that interact with us, serve us, and can change many aspects of our daily lives to enhance the human experience. During these years, many science fiction novels have been written about artificial intelligence, and countless scientific experiments have explored ways to make AI useful. Through the years, the scientific community's approach to AI has shifted between optimism and harsh realism, as most ideas to transform AI from obscurity to practice reached a dead end or were deemed impractical. As AI scientists and practitioners failed to live up to the hype, the field descended into decades of lower interest and experimentation. These time periods became known as "AI winters".

Around 2010, optimism returned to the AI field with the rise of a new sub-discipline called “Deep Learning” which involved stacking multiple layers of neural networks. Three pillars laid the foundations for deep learning and made the different pieces finally click:

The Algorithmic Pillar: New algorithmic discoveries enabled new applications and formed a new means of statistical computing, an alternative to traditional “expert systems” which are algorithms and software engineering solutions for precise computing.

In the past decade, we saw many novel studies that demonstrated algorithmic breakthroughs that use neural architectures in several fields, most notably (but not limited to) computer vision and natural language processing.

The Data Pillar: At the core of neural architectures as a means of statistical computing lies the data-driven approach, also known as "Software 2.0". In Software 2.0, rather than explicitly defining a set of "rules" for a specific computation task, the task is defined and constructed in a "learn by example" fashion; the neural architecture is trained on prior examples of {input → output} solutions to the task, and based on the learned neural weights, the architecture can later approximate solutions, i.e., provide outputs for new inputs. The ability of a neural architecture to provide good approximations for never-before-seen inputs depends on a sufficiently large set of examples. In other words, neural networks need lots of data to be effective.
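As a toy illustration of the "learn by example" idea, here is a minimal sketch that fits weights to {input → output} examples and then approximates outputs for new inputs. It uses a single linear "layer" rather than a deep network, and the data is synthetic; it is only meant to show the contrast with rule-based programming.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Prior examples" of the task: inputs x and their desired outputs y.
x = rng.uniform(-1, 1, size=(200, 1))
y = 3.0 * x + 0.5 + 0.05 * rng.standard_normal(x.shape)   # hidden rule + noise

w, b = 0.0, 0.0                                            # the "neural weights"
for _ in range(2000):                                      # learn by example
    pred = w * x + b
    grad_w = 2 * np.mean((pred - y) * x)
    grad_b = 2 * np.mean(pred - y)
    w -= 0.1 * grad_w
    b -= 0.1 * grad_b

print(w, b)              # close to the hidden rule (3.0, 0.5), never written explicitly
print(w * 0.25 + b)      # approximate output for a never-before-seen input
```

The "rule" (3.0, 0.5) is never coded anywhere; it is recovered from data, which is exactly why the amount and quality of examples matter so much.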

As technology becomes more immersed in our daily lives, more sensory interactions are being recorded, digitized, and stored in cloud entities. As a result, copious amounts of data are now stored in structured and semi-structured datasets, providing more inputs for existing tasks (e.g., more images to train computer vision networks) as well as new input sets to new application domains (e.g., recorded LIDAR data in autonomous driving cars).

The Compute Pillar (The Topic of This Post): Finally, it would not be possible to experiment with new algorithms on large amounts of data without the availability of powerful compute fabrics (high-end silicon chips, memories, and interconnect); if you cannot meet the computing demands for training your algorithm on your dataset (or inferencing it on new data), it would not be possible to prove that your neural architecture actually works.

One notable example of how computing has given a much-needed boost to algorithms is the revolutionary AlexNet case. As presented in an earlier blog post, many of AlexNet's algorithmic foundations were known years before it was published; people had experimented with CNN architectures for image recognition at small scales, most notably LeNet in 1998, which was simpler and was demonstrated on a smaller dataset called MNIST. As AlexNet was one of the first studies to run CNNs on highly parallel processors called graphics processing units (or GPUs), the authors could train on large amounts of data efficiently and within a reasonable timeframe. It was these processors, specifically GPUs, that enabled a radical paradigm shift and became the enablers and drivers of revolutions in deep learning, large language models, generative AI, and more. GPUs are a class of highly parallel processors stemming from a broader notion of specialized processors, also known as hardware accelerators (or simply: accelerators). In a past blog post, I gave an overview of what accelerators are and how the semiconductor landscape gravitated towards accelerators (if you are not familiar with the concept of accelerators, feel free to go back and read that post).

Therefore, if accelerators are both the pillar and the driver of AI, it would be worth exploring their past contributions in other fields to assess how they will shape AI for years to come.

Accelerator Lessons Learned: Is AI Heading Towards Stagnation?

Nowadays, accelerators (GPUs, FPGAs, and ASICs) are everywhere, and that is by no means a coincidence — in many cases, it is a necessity. In my view, the accelerator premise is as follows: "As semiconductor scaling laws (e.g., Moore, Dennard) no longer stand, silicon will stop exponentially improving. Therefore, the way to continuously improve chip gains is by using silicon more efficiently, i.e., by developing specialized hardware accelerators."

Compared to CPUs, accelerators trade programmability and applicability for efficiency by customizing their compute engines, on-chip memories, and on-chip interconnects to a targeted (narrower) application scope. (VERY) Roughly speaking, much of the gain of acceleration comes from narrow data paths and interfaces (for example, compact numeric representations like INT8), customized arithmetic units that support both special numeric representations and the fusion of multiple operations, customized memory banking and management, and, importantly, compute parallelism. In contrast to traditional CPU applications, accelerated applications like AI have an abundance of parallelizable computations and therefore scale better on silicon; i.e., parallel applications are more likely to benefit from having more silicon (e.g., more cores) than their CPU-based counterparts that are not parallelizable. As such, accelerated hardware has been advocated as the natural successor to Moore's law in the post-Moore's law era.
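To make the "compact numeric representation" point concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization for a matrix multiplication. The scheme is deliberately simplified (real accelerators combine this with fused, specialized arithmetic units and finer-grained scaling), and the shapes and tolerances are illustrative only.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization: 4-byte floats become 1-byte ints plus one scale."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 64)).astype(np.float32)
b = rng.standard_normal((64, 64)).astype(np.float32)

qa, sa = quantize_int8(a)
qb, sb = quantize_int8(b)

# Accumulate in int32 (as INT8 hardware typically does), then rescale back to float once.
c_int8 = (qa.astype(np.int32) @ qb.astype(np.int32)) * (sa * sb)
c_fp32 = a @ b
print("relative error:", np.abs(c_int8 - c_fp32).max() / np.abs(c_fp32).max())
```

The narrow 8-bit operands are what let a fixed silicon budget hold more multipliers and move less data per operation, which is where much of the efficiency gain comes from.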

CPU vs. GPU Trends (Source: NVIDIA) — Better Specialization or Better Silicon?

The Bad News (or: why am I skeptical): In contrast to decades of Moore's law, specialization does not scale exponentially. When you specialize your chip, you do not change the laws of physics; you are not creating new transistors, making them faster, or making them more power efficient. Given an application domain and a silicon budget, a team of computer engineers and scientists searches for solutions in the application/silicon optimization space. For the accelerator approach to match the exponentially improving powers of Moore's law, one would need to find a solution that is exponentially better than all previously found solutions within the same silicon-application optimization space, every two years or so. At some point, that becomes impossible, because you cannot expect to solve the same problem (any problem) exponentially better every time. You will get better at solving it, and at some point, you will have exercised all your capabilities. The same goes for designing accelerated hardware; there is a finite number of ways to express a given computation problem and map it to a chip under a given silicon budget (say, 100 million transistors). One can think of it as the problem of losslessly compressing data from a specific domain (e.g., pictures or audio files) onto a given storage device; you can explore better compression algorithms and techniques, but ultimately you will converge on one that compresses the data well, and at that point you will not be able to store more data on that device, simply because efficiency does NOT scale exponentially.

Quantifying Acceleration: How Efficiently Do We Use Our Silicon?

Accelerators employ specialization to improve silicon efficiency and keep the semiconductor industry going in a post-Moore’s law era. It is, therefore, worth exploring exactly how (and if) that premise empirically holds.

A few years back, I led a research project that quantified how much accelerators improve from the silicon standpoint; in other words, for a given application, how much does chip specialization improve over time, empirically? To do that, we used thousands of chip specs and constructed a model that estimates silicon behavior in terms of speed, power, and energy. We used that model to examine hundreds of accelerator chips across a variety of popular accelerator application domains. Our goal was to decouple the two factors that contribute to application gains: better silicon and better specialization. To avoid getting too technical, I am only showing the principle here, not our entire thought process and rigorous formulation. I use the term "performance" broadly; it could mean throughput, speed, throughput per power / per area / per energy, etc.

Chip Performance: Driven by Specialization and by Moore’s Law
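The equation itself appears in the original post as an image; based on the description in the paragraph below, the identity it expresses can be sketched as:

$$
\text{Chip Performance} \;=\; \underbrace{\frac{\text{Chip Performance}}{\text{CMOS Potential}}}_{\text{Chip Specialization Return (CSR)}} \;\times\; \underbrace{\text{CMOS Potential}}_{\text{silicon / Moore's law}}
$$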

The above equation is mathematically trivial, yet it conveniently separates three factors: chip performance, chip silicon, and performance per silicon. The first two factors are known:

(i) Chip performance could be the attained performance number on real applications, like a chip that trains ResNet-50 at a throughput of 3000 samples per second, or it could be a number specified by the vendor, like peak throughput, for example, a chip that can deliver a maximum of 400 BF16 TFLOPs.

(ii) Chip silicon is driven mainly by better-grade silicon, e.g., newer CMOS nodes from newer silicon foundries that produce faster transistors, more active transistors per power budget, etc. We use our constructed model and the chip's physical properties (e.g., number of transistors, CMOS node, thermal power budget, etc.) to produce a measure that serves as the chip's CMOS potential.

(iii) Finally, the third factor is the most interesting one: performance per silicon, which answers "How well does the chip perform compared to its silicon budget?" The idea here is to account for a fair comparison of chips, which surprisingly gets glossed over too often. When benchmarking chips, one should not only account for the attained performance but also for how it was attained. For example, one cannot compare the throughput of a 100-million-transistor 32nm chip and a 40-billion-transistor 5nm chip without accounting for the vastly different silicon properties of the two chips. Furthermore, the performance per silicon reflects the quality of the accelerated stack, i.e., better architectural primitives, better compilation frameworks, more appropriate structural sizing of hardware blocks, better chip design disciplines and tools, and a better mixture of compute vs. memory vs. communication. Ultimately, this metric is the performance gained by chip specialization quality, and that is why we refer to it as "chip specialization return" (CSR).

While it is generally hard (if not impossible) to quantify the individual factors that contribute to specialization return, using the two other known factors of the equation (performance and silicon potential), one can project the trends for all factors given an accelerated application and a group of accelerator chips. Finally, in a post-Moore's law era, specialization might be the only way for silicon to improve, so it is worthwhile to see how specialization design cycles behave, as these provide a good glimpse into the future of the semiconductor industry.

Acceleration Scaling Trends: Where (And Why) Do Accelerators Fall Short?

AI is not the first application that depends heavily on accelerated hardware. There are many specialized processors already around us, like signal processors, which were first used in the 1970s in TVs, radios, and radars, and are now used in a wide range of consumer electronic devices like smartphone cameras, speech recognition devices, and so on. However, as noted in the previous section, specialization scaling trends behave differently than their transistor scaling counterparts, and I will demonstrate this using the "transistors vs. CSR" decomposition presented above for two popular applications.

(i) Gaming Frame Rates on GPUs

Games are the classic case of extensive GPU use; the video gaming industry pulls in many billions of dollars per year. As the graphics engines in games get more sophisticated, graphics rendering tasks become more compute-hungry and require better-grade GPUs.

GPU Graphics: Throughput vs. Silicon vs. Specialization (Original Gaming Numbers: AnandTech)

The above example examines several video games. As shown, over six years graphics frame rates improved by about 4–5x. However, newer GPUs also have more transistors, more cores, and faster circuitry (as they were fabricated using newer transistor technology); the speedup from silicon alone was around 4x, while the silicon-relative performance (or chip specialization return) improved at a much more modest rate of at most 1.27x. The main reason for this behavior is that the graphics domain is both well-studied (not much left to innovate) and massively parallel, which means that applications are likely to scale well as more cores are added. As a result, GPU vendors are incentivized to pack more cores on a chip, since that has a greater impact on performance than most micro-architectural optimizations.

(ii) Bitcoin Miners

Another application space that became radically dependent on accelerated hardware is Bitcoin. At a (very) high level, new blocks are added to the Bitcoin network following a process of "mining", which involves discovering a new cryptographic hash via brute-force computation of many SHA256 functions. Mining is a compute-intensive process that can cost a lot due to wasted power; to cut the cost-to-profit ratio, miners explore various mining hardware solutions. We studied the properties of hundreds of chips, ranging from general-purpose CPUs to GPUs to FPGAs to the most efficient, yet function-specific, ASIC-based miners.
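To make the computation pattern concrete, here is a minimal, illustrative sketch of the proof-of-work loop. It is simplified: real miners search over a block header and compare against a 256-bit target derived from the "bits" difficulty field, which is reduced here to a count of leading zero bytes.

```python
import hashlib

def mine(block_header: bytes, difficulty_zero_bytes: int, max_nonce: int = 2**32):
    """Brute-force search for a nonce whose double-SHA256 hash meets a toy target."""
    target_prefix = b"\x00" * difficulty_zero_bytes
    for nonce in range(max_nonce):
        candidate = block_header + nonce.to_bytes(4, "little")
        digest = hashlib.sha256(hashlib.sha256(candidate).digest()).digest()
        if digest.startswith(target_prefix):
            return nonce, digest.hex()
    return None

# Example with a toy difficulty so it finishes quickly:
print(mine(b"example block header", difficulty_zero_bytes=2))
```

The entire workload is dominated by the two SHA256 calls inside the loop, which is exactly why the most efficient miners reduce to arrays of SHA256 engines.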

Bitcoin Miners Evolution: CPU, GPU, FPGA, ASIC — Throughput per Area (Absolute vs. CSR)

Each bright dot in the figure signifies the throughput per (chip) area of a different Bitcoin miner; a dark dot at the same x coordinate indicates the relative chip specialization return of that miner (compared to a baseline CPU miner); the bright and dark arrows show the trend lines for performance and specialization improvements, respectively. The figure encapsulates many details; however, several insights stand out:

1. ASICs do perform incredibly well, as they outperform CPUs by a rate of about 700,000x.

2. Switching to a new architectural platform, i.e., moving from CPU to GPU or GPU to ASIC, gives a boost in specialization (each dark arrow starts at a much higher mark than the previous one), but:

3. After switching to a new platform, specialization does not improve (the dark arrows are mostly flat).

This experiment highlights a typical life cycle of accelerated hardware: a team discovers new ways to map the computation problem to a new chip architecture and gets a boost in specialization, but after a few chip design iterations, specialization returns diminish because there is not much innovation left to drive further architectural gains.

The study explored other accelerators and applications, and they all demonstrated similar phenomena that are in line with the insights shown here. To summarize, I list the pitfalls that were inherent limiters for chip specialization return, as observed by the study.

(i) Overly Relying on Parallelism: While parallelism is a sure performance win for most accelerated applications, it is not a measure of efficiency or specialization. Accelerators give one the ability "to do more (compute) with the same cost (silicon, power, money, etc.)," while parallelism is the ability "to do more with more cost (more cores → more silicon)," which does not abide by the accelerator premise: if transistor scaling stops, you will not be able to pack more cores, and you will not be able to drive chip parallelism beyond a certain degree without increased costs.

(ii) Design Maturity: Chip development is a HARD business. It takes years of engineering effort and potentially several design generations to optimize a line of chips for a targeted application domain. As development moves forward, the domain matures, and so do the hardware/software co-design practices. From this point, the organization and its engineers converge on a single stable architectural solution and move to an "evolutionary, not revolutionary" path, simply because it becomes very costly to break out entirely and explore a different hardware/software design optimization space.

(iii) Domain Confinement: In some applications, computation is dominated by a handful of patterns or blocks. For example, Bitcoin mining is an application mostly dominated by a single computation, the SHA256 hash calculation. The most efficient ASIC Bitcoin miners simply pack many SHA256 cryptographic engines in parallel with a shared control mechanism for management. As there are not that many ways to devise a circuit that computes SHA256, the efficiency of a single SHA256 computation cannot improve by much. This last limiting factor brings us back to the world of AI.

AI Acceleration: the Systolic Lock-in

“A computer locked in a chain” (Created by Bing)

Much like the dominance of SHA256 calculations in Bitcoin mining, modern AI applications are dominated by a specific type of computational block (or pattern): matrix multiplication. Convolutional neural network models are lowered to groups of matrix multiplications, and transformer-based models use matrix multiplications as part of the multi-head attention mechanism. Hardware optimization of matrix multiplication is more complicated than SHA256; the design space involves not only compute, but also data access, scheduling, and communication. Nevertheless, the presence of a dominating block will eventually result in convergence to specific design points, without the ability to make further progress.
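To illustrate how matrix multiplication dominates, here is a minimal NumPy sketch of single-head scaled dot-product attention with toy, illustrative shapes; every heavy step is a matmul (the Q/K/V projections, QKᵀ, and the product with V), which is exactly the pattern systolic hardware is built to run.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def single_head_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_head)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # three matmuls (projections)
    scores = (Q @ K.T) / np.sqrt(Q.shape[-1])  # matmul: (seq_len, seq_len)
    return softmax(scores) @ V                 # matmul: (seq_len, d_head)

rng = np.random.default_rng(0)
X = rng.standard_normal((128, 64))
Wq, Wk, Wv = (rng.standard_normal((64, 32)) for _ in range(3))
print(single_head_attention(X, Wq, Wk, Wv).shape)  # (128, 32)
```

Everything that is not a matmul here (the softmax, the scaling) is comparatively cheap, which matches the operation breakdown shown in the figure below.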

Execution Time of Different Operations for a Vision Transformer on an NVIDIA 2080 Ti GPU (Source: Improving the Efficiency of Transformers for Resource-Constrained Devices)

Matrix multiplications became prominent in AI (and other domains like image processing and scientific applications) because they can be computed efficiently using a special circuit called a systolic array. I briefly mentioned systolic arrays in a post on architectural foundations. Systolic arrays exploit the regular nature of matrix multiplication, i.e., data gets accessed and deterministically communicated between multiply-and-accumulate nodes. The systolic structure reduces off-chip memory access costs by a factor of N, which is the number of times the data gets reused between multiply-add nodes before it gets written back to the memory.
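As a rough, first-order model of that reuse (not a faithful simulation of any particular systolic design), the sketch below counts off-chip operand reads for a matmul computed naively versus with each operand element streamed once and forwarded between neighboring multiply-accumulate nodes; it ignores output traffic, tiling limits, and on-chip buffer capacity.

```python
def memory_traffic(M, N, K):
    """Compare off-chip operand reads for C = A(MxK) @ B(KxN)."""
    naive_reads = 2 * M * N * K        # every multiply-accumulate fetches both operands
    streamed_reads = M * K + K * N     # each input element fetched once, then reused via register hops
    return naive_reads, streamed_reads, naive_reads / streamed_reads

print(memory_traffic(M=256, N=256, K=256))  # reuse factor ~256 for square matrices
```

For square matrices, the reuse factor grows with the matrix dimension, which is the factor-of-N reduction described above.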

Operations Energy Cost for 45nm CMOS (Source: UIUC, based on “Computing’s Energy Problem” M. Horowitz)

Since modern AI applications (like state-of-the-art LLMs) require a lot of memory, as reflected in the table above, it is very beneficial to run them on systolic arrays since these arrays convert the most energy-consuming operation, which is off-chip memory access, to a pipelined register access, which is hundreds of times cheaper.

However, it seems that the AI hardware industry is moving toward a “systolic lock-in”; a systolic array achieves an unparalleled reduction in memory costs → consequently, most (if not ALL) AI accelerator designs are centered around systolic circuitry → therefore, AI researchers only explore models that heavily rely on matrix multiplication, and cannot pursue other directions of a potentially more general form of artificial intelligence, simply because there is no hardware that can run these efficiently → since no one explores algorithms that are not matrix multiplication based, it is hard to incentivize anyone to build a different type of accelerated hardware since there are no new algorithms to provide clear design assumptions and guidelines.

As we have seen for the other accelerated hardware domains, the benefits of reusing the same architectural primitives (in this case, systolic-based accelerated hardware) are limited. Furthermore, with optimized stacks, TPU and GPU FLOP utilization rates are already over 50%, and therefore gains are expected to stop improving soon enough. As we saw in the Bitcoin mining case, the dominance of matrix multiplication produces domain confinement: there is a computational building block that we already know how to build efficient circuitry for, and that circuitry defines the minimal cost of running a deep learning application. If you want to run bigger models that use X times more matrix multiplications, it will cost you at least X times more in energy, and this is where things get tricky and interesting.

Training Computation FLOPs of Popular AI Models (Original Graph Source: Our World In Data)

AI training FLOPs have skyrocketed in the past decades. The compute needed to train the contemporary largest model (Gemini Ultra) is about 10⁸ times the computation used in the largest models circa 2012; this is also a testament to the great leap in compute efficiency over the years: since large model training costs are around tens of millions of dollars, it would have cost less than 10 cents to train the largest 2012 model on today's efficient hardware/software stack.

NVIDIA MLPerf Training Performance on H100 and A100 compared to 2021 numbers (Source: Trends in Deep Learning Hardware)

However, as we saw for other accelerated applications, the limiters of specialization returns will eventually catch up, and accelerated hardware benefits will stagnate. There are only so many ways to optimize a slow-moving target. As seen in NVIDIA's MLPerf numbers, in 2.5 years the performance of the same chip improved by a maximum of 2.5x, and for more mature applications it improved at lower rates. In a previous blog post, I analyzed the results of the industry-leading MLPerf benchmarks and saw that when normalizing by the number of chips, number of transistors, and transistor performance, the rate of nominal improvement is modest. Therefore, I do not expect an additional 10⁸ growth in compute demand for future models, because even in the VERY unlikely optimistic scenario that we will be able to improve single-chip performance by 1000x (which took NVIDIA ten years), the cost of training a single model would be about a quadrillion dollars, roughly ten times the current global economy. However, if single-chip performance does not improve by much (as implied by the systolic lock-in), models will stop growing at the same pace within about 1–2 years.

So What’s Next For AI and Compute?

As the world is still exploring the potential of deep learning applications, and while we don’t know what the future of AI holds, the following will likely remain true:

  1. Computation will continue to dominate AI costs.
  2. Accelerated hardware will continue to dominate the AI computation market.

As AI computation costs have already reached staggering levels, we must overcome the limits imposed by the likely stagnation of efficiency in current accelerated hardware, since that is the only way to get to better AI. Here are a few possible directions:

AI’s Dark Silicon Moment: Rethink Model Topologies and Algorithms

It is worthwhile to revisit modern AI's three pillars: compute (hardware), algorithms (models and learning), and data. On the compute side, we can either (1) stick to systolic arrays, which means that one can roughly estimate the cost of training or inference for a model by counting the matrix multiplications needed, or (2) find a new architectural paradigm that can amortize the access costs of the model's parameters, and that would require a change on the model's side.
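As a rough sketch of option (1), a forward pass can be costed by summing 2·M·K·N FLOPs over the model's matmul shapes. The shapes and the per-FLOP energy below are illustrative placeholders, not measurements of any real chip or model.

```python
# Each layer described as a matmul of an (M x K) activation by a (K x N) weight.
# Shapes are made up for illustration.
matmul_shapes = [
    (1, 4096, 4096),    # e.g., an attention projection for a single token
    (1, 4096, 11008),   # e.g., an MLP up-projection
    (1, 11008, 4096),   # e.g., an MLP down-projection
]

def forward_flops(shapes):
    # A matmul of (M x K) by (K x N) takes ~2*M*K*N FLOPs (one multiply + one add each).
    return sum(2 * M * K * N for (M, K, N) in shapes)

flops = forward_flops(matmul_shapes)
joules_per_flop = 1e-11   # illustrative assumption, roughly 100 GFLOPs per joule
print(f"{flops:.3e} FLOPs, ~{flops * joules_per_flop:.2e} J per token")
```

Under the systolic lock-in, this kind of back-of-the-envelope count is effectively a lower bound on cost: more matmuls means proportionally more energy.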

Following the introduction of transformer-based architectures, the AI industry followed a path that somewhat resembles the one the semiconductor industry has followed over the past two decades. Much like the increase in core count produced more capable parallel processors, the increase in the number of model parameters produced higher-quality models that not only perform better at existing tasks but also apply AI to previously unexplored domains. Consequently, companies kept increasing the number of model parameters, and model training and inference costs are skyrocketing. Much like the rise in power costs limited the fraction of a silicon chip that can be active at a given time, forming a regime known as "dark silicon", the AI industry is now experiencing its own "dark silicon" moment. AI companies realize that it would be impractical to perform training and inference using all of a model's parameters, and only a certain fraction can be used for a given task. Consequently, much like the semiconductor companies' shift towards hardware accelerators that specialize in certain computation domains, the AI industry is now moving towards sparser models, as well as various forms of "mixture of experts", where an "expert" model specializes in certain classes of tasks.
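A mixture of experts makes "use only a fraction of the parameters" concrete: a small gating network picks the top-k experts per input, and only those experts run. Below is a minimal, illustrative NumPy sketch of top-k routing, not any specific production router.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 8, 2

W_gate = rng.standard_normal((d_model, n_experts))                  # router weights
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]

def moe_layer(x):
    """x: (d_model,). Routes the token to its top-k experts only."""
    logits = x @ W_gate
    chosen = np.argsort(logits)[-top_k:]                            # indices of the top-k experts
    weights = np.exp(logits[chosen]) / np.exp(logits[chosen]).sum() # softmax over the chosen experts
    # Only top_k of the n_experts expert matmuls actually execute:
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

y = moe_layer(rng.standard_normal(d_model))
print(y.shape, f"active experts per token: {top_k}/{n_experts}")
```

The total parameter count grows with the number of experts, but the compute (and hence the matmul energy) per token grows only with top_k, which is the "dark silicon"-style trade-off described above.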

While the AI community is still debating whether the mixture of experts is the right approach or whether a single complex model is the right way toward an intelligent future, the limits of systolic-based hardware acceleration will eventually stop the drop in "price per model parameter". Therefore, reducing the number of effectively used parameters via experts or other forms of model sparsity will have a limited effect, and we need to explore new intelligent hardware/software co-designs that are not as dependent on matrix multiplications. It is worth noting that natural intelligence is not as regular as matrix multiplications of fixed dimensions; connectivity-wise, the number of synapses, i.e., the number of connections to a single neuron, varies greatly, from a few synapses up to about 100,000–200,000 synapses on a single neuron in the case of Purkinje cells. Also, the brain is highly specialized, as the number of synapses per neuron depends on its functionality. Different tasks require different amounts of input and different neural network topologies.

Better Understanding of Efficient Learning

In the concluding post of the original five-part blog series, I wrote that "Utilization is All You Need" and that we should strive to get the chip doing as many FLOPs as possible. However, there is a caveat to that claim. While utilization is a necessary measure, it merely indicates that a chip is doing some computation; it does not measure what that computation is or how effectively it achieves its goal from the application's perspective. To demonstrate, imagine that you are driving a car and you squeeze the gas pedal, but your car is in first gear: you hear the engine screaming, so the car is not idle, and you see that you are moving forward, so you assume you are making progress. But you are not using your car's resources effectively, because if you had shifted to a higher gear, you would have wasted less gas and moved faster towards your goal. When we talk about AI training, the goal, at a high level, is for our model to learn. However, since researchers are still exploring ways to make AI explainable, we still do not know what our models learn, so we certainly do not know how efficiently they learn whatever it is they are learning, or whether there is a better way for them to learn. Therefore, the path to efficient learning passes through the development of more explainable AI and more elaborate benchmarking techniques for learning.
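For context, here is a minimal sketch of how FLOP utilization is typically computed; the numbers are illustrative placeholders, and the point is that nothing in this metric says whether the FLOPs actually advanced learning.

```python
def flop_utilization(model_flops_per_step, steps, wall_clock_seconds, peak_flops_per_second):
    """Fraction of the chip's peak FLOP/s consumed by model arithmetic.

    'Consumed' still only means 'arithmetic performed', not 'learning progress achieved'.
    """
    achieved = model_flops_per_step * steps / wall_clock_seconds
    return achieved / peak_flops_per_second

# Illustrative numbers: 5e12 FLOPs/step, 1000 steps in 20 s, on a 1e15 FLOP/s (peak) chip.
print(f"utilization = {flop_utilization(5e12, 1000, 20.0, 1e15):.0%}")
```

A run with higher utilization but a poorly chosen model or learning recipe is the engine screaming in first gear: busy silicon, little progress toward the goal.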

Verticalization and Last-Mile Services: Spend Millions and Save Billions?

Question: What is the one thing the top 9 NASDAQ companies all have in common?

NASDAQ Top 10 Companies (circa December 1st, 2023)

Well… they all develop their own chips for AI.

  1. Apple: The Apple Neural Engine (ANE) has been part of Apple's consumer-product SoCs starting in 2017, when it was introduced as part of the A11 Bionic SoC (in the iPhone 8).
Evolution of Apple’s ANE peak throughput from 2017 to 2021 (Source: Apple.com)

With newer generations of the Bionic SoC, the ANE, or Neural Processing Unit (NPU), has become increasingly important. Over the course of 5 years, the NPU more than tripled in size while other SoC components did not significantly change, and over that period its peak performance improved by a staggering 26x; an obvious testament to the growing importance of AI in Apple's consumer electronics business.

2. Microsoft: Initially, Microsoft used FPGAs in its cloud to accelerate Bing's indexing and ranking in what was known as Project Catapult, which was one of the main motivations for Intel's $16.7B acquisition of FPGA vendor Altera. FPGAs were later used to accelerate inference as part of Project Brainwave in 2017. With the recent surge in AI compute demands (and to reduce its dependence on NVIDIA), Microsoft decided to move beyond the suboptimal FPGAs and fabricate its own efficient in-house AI accelerator chip, the "Maia 100", for AI training and inference in the Azure cloud; it is integrated alongside Microsoft's own in-house cloud CPU, named "Cobalt" (both seem like Microsoft's counter to NVIDIA's Grace CPUs and Hopper GPUs).

Microsoft MAIA 100 (source: Microsoft)

3. Amazon: Following the acquisition of Annapurna, Amazon started its own line of AI chips for AWS. It announced the first generation of the "Inferentia" chip in late 2018 for ML inference (used, for example, in the Amazon Alexa services) and "Trainium" for ML training in late 2020. Amazon also offers AWS instances that come with Trainium chips and runtime libraries.

AWS Trainium (Source: AWS post)

4. NVIDIA: Well… I guess that goes without saying. NVIDIA’s GPUs have been powering much of the global AI training and inference market.

NVIDIA H100 (Source: NVIDIA)

5. Meta: Meta is going all-in on AI, sparing no cost and leaving no stone unturned in its effort to build a huge and diverse infrastructure (including an estimated $10B it will spend in 2024 on H100 GPUs alone). Meta has been exploring multiple avenues of AI acceleration: in 2021 it presented an initial study on the deployment of an AI accelerator for inference, and in 2023 it announced its own in-house "MTIA" ASIC chip.

Meta’s MTIA AI Inference Accelerator Chip (Source: Meta)

6. Broadcom: Broadcom is NOT technically building AI chips per se, but it DOES build chips for network switches specifically tailored to power scale-out AI workloads, so I count them as chips that target AI workloads.

7+8. (I know I'm cheating here a bit) Google/Alphabet: Google is the most notable example of large-scale deployment of DNN-specific cloud ASICs, called "Tensor Processing Units" (TPUs). Back in 2011, Google ran some back-of-the-envelope calculations on voice recognition and concluded that if people were to use voice search for only three minutes a day, Google would have to double its CPU-based datacenters. Therefore, it decided to design the first-generation TPU, which was released in late 2016 and targeted cloud inference for services like voice recognition, and following generations (all the way to TPUv5e and TPUv5p) that come in both inference and training versions.

Google TPUv5e Rack (Source: Google)

9. Tesla: As a major player in the autonomous driving landscape, Tesla deals with an abundance of data. A single vehicle produces dozens of TBs of data every day, and its decision making combines both classical "rule-based" algorithms and AI models trained on data collected from past driving experiences. To meet these demands, Tesla designed a line of AI training chips called "Dojo". The idea behind Dojo is that its tile-based chips can scale out into wafer-scale integrations that power Tesla's AI supercomputers.

Tesla D1 Dojo Chip (Source: Tom’s Hardware)

Bottom line — why do all these companies build their own in-house chips? My takeaway is that, counterintuitively, building in-house AI chips is NOT (only) about the chips themselves. It is about full-stack control and customization. From the hardware perspective, most of these chips sport common architectural foundations: (obviously) systolic computation circuitry, vectorized computation pipelines, interconnects that support tree-like reductions, and some peripheral computation. So when you look at the "number crunching" engines, these chips do not vastly differ from one another. The differentiating factor is vertical control: the company's ability to have full-stack control over the entire process of developing the hardware, software, and compilation framework, and to tailor its AI stack to its own needs.

AI is very computationally costly and important; therefore, every company with enough computational demand and hundreds of hardware and software engineers is willing to invest tens of millions of dollars to develop a full stack and tailor it to meet its exact computational needs. With the proper mixture of memory, compute, and interconnect, the money the company gains from AI efficiency can far exceed the development costs.

Since all the big companies pursue verticalization, other accelerator companies (like startups and emerging AI infrastructure service providers) need to build teams that specifically tackle the full-stack path from AI application to silicon together with their prospective customers. If I were a company developing AI hardware, having a robust stack would be my primary goal. Efficient hardware is great, but unless you have a clear physical superiority, efficiency will not likely be your moat; you need the ability to cater to your potential clients' computation goals effectively and easily (i.e., without significant effort from your client or your company).

Data as the Next Frontier?

While LLMs are conceptually easy (but expensive) to scale up, it is worth noting that much of the quality of larger models stems from their ability to capture, differentiate, and encode many different scenarios. By doing that, large models account for "noisy" training data as well as "good" training data, since (at least the way I see it) the law of large numbers suggests that if a model trains on enough real-world data, it will statistically converge to "reasonable" behavior. However, if the training data is too noisy and the model is not big enough, the trained model might put emphasis on the undesired training samples and behave unexpectedly. Furthermore, different training samples might contend and drive the model weights in opposite directions, effectively "canceling each other out".

While there are known ways to benchmark the quality of models, and recipes for "what works and what does not work" in the model world, we are still ramping up our understanding of data pipelines, data safety and transparency, and data-model interactions. However, if models were trained on curated datasets, we might be able to distill knowledge into smaller and more stable models and save much of the FLOPs needed to train them. Organizations already understand the importance of high-quality data and treat proprietary data as one of their best-kept secrets for commercial success, and the MLCommons consortium recently announced a working group for data-centric ML. In a position statement we wrote on VentureBeat, we outlined the parallels between AI and analytics and the reasons why data management and data analytics will become an important building block of future AI systems, with its own architecture and infrastructure. Although data processing, curation, and pipeline development require more manual handling and maintenance and have a reputation of being less glamorous than model development, training a foundational model on cleaner, high-quality data would result in a smaller model that distills a more cohesive knowledge base, can generalize based on a consistent set of rules, and might be more predictable and less susceptible to noisy behavior (e.g., LLM hallucinations).

Final Thoughts

“a light sparking out of a computer chip that has a human brain on it, realistic” (Created by Bing Create)

We are witnessing one of the biggest technological shifts in human history, as the world gravitates towards AI, and whole industries will be changed by AI in the upcoming years. The recent surge of AI interest came after decades of struggling research and was sparked by the coming together of three pillars: algorithms, data, and compute in the form of accelerated hardware.

To ensure a clear path for AI to continue evolving, we need to solve the various AI efficiency problems. Accelerated hardware became a key enabler of modern AI, as it has improved silicon's computing efficiency by several orders of magnitude. However, given the increasing dependence of accelerated AI hardware on systolic circuitry for matrix multiplications, the recent AI performance benchmarking trends, and the trajectories of accelerated hardware in other domains, AI hardware efficiency may stop improving soon.

Therefore, we must improve the compute pillar by exploring new hardware/software co-designs that are less dependent on matrix multiplications while delivering the same efficiency. We must also strive to pursue efficiency in the other AI pillars. Model and learning efficiency are promising paths on the algorithmic side: we need to explore new topologies that express model sparsity better, and new methodologies to evaluate and improve the efficiency with which a model learns from training data. Finally, focusing on the data pillar by building more curated datasets will reduce the number of FLOPs needed to learn and generalize, producing smaller models that generalize better, with more knowledge distilled into each of the model's parameters.

For further reading, you can check my earlier blog posts:

The Five-Part Blog Post Series On AI Accelerators — 12/2021

Part I: Intro — What are AI Accelerators?

Part II: Transistors and Pizza (or: Why Do We need Accelerators?)

Part III: Architectural Foundations

Part IV: The Very Rich Landscape (of AI Accelerators)

Part V: Final Thoughts

Ushering in the Next Wave of AI Acceleration — 6/2022

The AI Accelerators Blog Series: (a little over) A Year’s Retrospect — 3/2023
