CPUs, GPUs, TPUs: What is powering AI under the hood?

Poatek · Jul 26, 2023

Artificial intelligence (AI) is computationally demanding. Training the latest state-of-the-art models requires extensive computational power, often involving many powerful computers running for days. It is no coincidence that AI’s rapid growth went hand in hand with advances in computational capacity. For instance, in 2020, Microsoft announced that the supercomputer it built for OpenAI had 285,000 CPU cores and 10,000 GPUs [1].

But why do these models train faster on GPUs? Why do we not, then, use GPUs for everything? And why is there still room for improvement with other types of units, such as TPUs?

Let’s start with the ubiquitous computational unit: the CPU (Central Processing Unit). CPUs serve as the backbone of countless devices, from smartphones to satellites. Initially, a CPU was a single processing unit, capable of handling one stream of instructions at a time. Although it had multiple internal building blocks, only one operated at any given moment. This approach is known as SISD (Single Instruction, Single Data) [2]. However, there is a limit to how fast a single processing unit can go; power consumption and heat dissipation are the main culprits. So how can we accelerate these computationally hungry problems? (There is a whole universe of optimizations and parallelism inside a single CPU core, but that is a story for another time.)

One of the most straightforward ideas was to put multiple CPUs together on a single chip that could work on the same problem, or on several problems, in parallel, and in this way multi-core CPUs were born. This is the design we see in commercial CPUs today. In 2006, the Intel® Core™2 Duo shipped with two cores [3], while today there are CPUs with 128 cores. In these systems, each core is complex, powerful, and expensive.

The other idea was to have multiple chips per motherboard, with several slots (sockets) for distinct CPUs. In both approaches, the resulting system can execute different instructions, each over its own data. Such systems are known as MIMD (Multiple Instruction, Multiple Data).

Multi-core CPU architecture
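To make the MIMD idea concrete, here is a minimal sketch using only Python’s standard multiprocessing module (the functions and data sizes are illustrative): two worker processes execute two different functions over two different datasets, so each core follows its own instruction stream over its own data.

# MIMD-style parallelism on a multi-core CPU: different instructions, different data.
from multiprocessing import Pool


def sum_of_squares(numbers):
    # one instruction stream: accumulate squares
    return sum(x * x for x in numbers)


def count_evens(numbers):
    # a different instruction stream: count even values
    return sum(1 for x in numbers if x % 2 == 0)


if __name__ == "__main__":
    data_a = list(range(1_000_000))
    data_b = list(range(2_000_000))

    with Pool(processes=2) as pool:
        # Two cores can execute these two different tasks at the same time.
        r1 = pool.apply_async(sum_of_squares, (data_a,))
        r2 = pool.apply_async(count_evens, (data_b,))
        print(r1.get(), r2.get())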

While these systems can, in principle, tackle any solvable problem in parallel, for many workloads there are cheaper and smarter options. The main trick? Many algorithms apply the same operation over different data, a property known as data parallelism (DP).

An essential operation for many AI methods and other computationally intensive tasks is matrix multiplication. Each cell of matrix C, the result of multiplying matrices A and B (C = AB), is computed as:

C_ij = Σ_k A_ik · B_kj

Thus, the same operations (loading and storing memory, multiplying, and summing) are performed over and over on different, contiguous pieces of data.
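To see the data parallelism hiding in this formula, consider a small Python sketch: the triple loop below applies exactly the same load-multiply-accumulate steps to compute every cell of C, only over different data. NumPy is used just to check the result.

import numpy as np

def matmul_naive(A, B):
    """C = A @ B computed cell by cell: every C[i, j] runs the same
    fetch, multiply, and accumulate steps, only over different data."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc += A[i, p] * B[p, j]   # same instructions, different data
            C[i, j] = acc
    return C

A = np.random.rand(64, 32)
B = np.random.rand(32, 48)
assert np.allclose(matmul_naive(A, B), A @ B)

Because every output cell is independent of the others, the work can be handed out to as many identical processing units as are available.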

The graphics pipeline, responsible for generating images, operates in a similar way: it applies the same operation repeatedly to different data, such as polygons or pixels. To handle such tasks efficiently, the GPU (Graphics Processing Unit) was developed. Unlike CPUs, whose complex units can each execute different instructions, GPUs employ simpler units that each handle their own input data while globally performing the same instruction. This architecture, known as SIMD (Single Instruction, Multiple Data), simplifies the hardware and allows more (simpler, slower, and less capable) units to be packed into the same chip area. A nice analogy comparing one CPU core to many GPU cores is:

Source: https://9gag.com/gag/adV8AxQ

The potential of this approach became evident thanks to its widespread availability (mainly due to video games). Researchers ingeniously repurposed GPUs, “hacking” the graphics pipeline to perform general computations: they encoded algebraic operations as graphical inputs and extracted the results from the color matrix of the final rendered image!

This led NVIDIA, a major GPU company, to evolve its GPUs into GPGPUs (General-Purpose Graphics Processing Units). Enabled by CUDA, introduced in 2007, GPGPUs became programmable for general-purpose computation and proved instrumental in accelerating AI advancements. For instance, the 2012 publication of AlexNet, a deep neural network with remarkable performance, was only feasible because GPGPUs were used during training. In 2021, approximately 70% of the 500 fastest supercomputers in the world integrated NVIDIA GPGPUs [4].
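As a rough illustration of what this programmability looks like today, and assuming PyTorch and a CUDA-capable NVIDIA GPU are available, the same matrix multiplication can be dispatched to a GPGPU in a couple of lines:

import torch

# Fall back to the CPU when no CUDA-capable GPU is present.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

A = torch.randn(4096, 4096, device=device)
B = torch.randn(4096, 4096, device=device)

# The multiplication is written identically for CPU and GPU; on a GPU it is
# executed by thousands of simple cores working on different cells of C.
C = A @ B
if device.type == "cuda":
    torch.cuda.synchronize()  # wait for the asynchronous GPU kernel to finish
print(C.shape, device)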

The general NVIDIA GPGPU architecture is depicted in the figure below. The main chip comprises SMs (Streaming Multiprocessors). Each SM consists of multiple blocks of 32 cores, and the threads running on such a block, collectively called a warp, all execute the same instruction. It is also important to note that a GPGPU is a co-processor: it requires a CPU to control it.

General GPU architecture
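This “all cores run the same instruction” model can be sketched in Python with the numba package (assuming an NVIDIA GPU and its CUDA drivers are installed; the kernel and sizes are illustrative): every thread executes the same kernel body and differs only in the index it computes, and threads are scheduled onto the cores in groups of 32, i.e., warps.

import numpy as np
from numba import cuda


@cuda.jit
def scale_kernel(x, out, factor):
    # Every GPU thread executes this same code; only the index differs.
    i = cuda.grid(1)
    if i < x.size:
        out[i] = x[i] * factor


x = np.arange(1_000_000, dtype=np.float32)
out = np.empty_like(x)

threads_per_block = 128          # a multiple of the 32-thread warp size
blocks = (x.size + threads_per_block - 1) // threads_per_block
scale_kernel[blocks, threads_per_block](x, out, 2.0)

assert np.allclose(out, x * 2.0)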

Considering all those cores, one may wonder: why use CPUs at all?

Source: https://www.scienceabc.com/innovation/what-is-a-gpu-how-exactly-does-it-help-in-running-high-graphic-games.html

Despite the advantages of GPUs for certain applications, they are not universally applicable due to the data-level parallelism requirement. Algorithms that lack such parallelism and operate on a single stream of instructions may not benefit from GPUs, especially considering that a single GPGPU core is typically slower than a single CPU core.
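A simple example of such an inherently sequential computation is a recurrence in which every step depends on the previous result, so there is no independent data to hand out to thousands of GPU cores (the recurrence below is just an illustrative toy):

def recurrence(x0, steps):
    """x[t+1] = 0.5 * x[t] + 1.0: each step needs the previous result,
    so the loop cannot be split across many cores."""
    x = x0
    for _ in range(steps):
        x = 0.5 * x + 1.0
    return x

print(recurrence(0.0, 10))

For dependent chains like this, one fast, complex CPU core beats many slow, simple GPU cores.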

So the situation is simple, right? We use GPUs for algorithms with data parallelism and CPUs for everything else. But the story does not end here. For AI tasks that rely heavily on matrix multiplications, GPGPUs serve as excellent accelerators. However, they still carry hardware and operations that go unused, since they are designed as general-purpose units catering to a wide range of problems.

To address this, Google introduced an ASIC (Application-Specific Integrated Circuit) tailored specifically for AI workloads: the TPU (Tensor Processing Unit) [5]. Unlike GPGPUs, TPUs are not designed to perform general operations; the chip provides only matrix multiplications and a few other AI-oriented operations, with the number of internal units precisely tuned for such workloads. The TPU v4 architecture (shown in the figure below) includes four matrix multiplication units, a scalar unit, and a vector unit per TensorCore. Currently, TPUs are available only on Google Cloud, and widely used software such as TensorFlow and PyTorch had to be adapted (by Google) to take advantage of them.

TPU v4 architecture. Source: Google Cloud (https://cloud.google.com/tpu/docs/system-architecture-tpu-vm)
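As a rough sketch of what that adaptation looks like in practice, here is the standard TensorFlow 2.x TPU setup (it assumes the code runs on a Cloud TPU VM or a similar TPU runtime; the model itself is just a placeholder):

import tensorflow as tf

# "local" targets the TPU attached to a Cloud TPU VM; other environments
# (e.g., Colab) may need a different resolver argument.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="local")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    # Variables created here live on the TPU, and the dense layers'
    # matrix multiplications run on the TPU's matrix units (MXUs).
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(784,)),
        tf.keras.layers.Dense(1024, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )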

In conclusion, GPUs are fast for AI applications because they can efficiently apply the same operation to many pieces of data simultaneously, which is the core principle behind their performance. However, not every problem can exploit this kind of parallelism, and some algorithms operate on a single stream of instructions, making GPUs less suitable in those cases. As technology continues to evolve, ever more specialized units such as TPUs are being introduced to address specific challenges, expanding the range of computational options available for the community to validate and adopt.

Lucas Nesi

Sources

[1] https://news.microsoft.com/source/features/innovation/openai-azure-supercomputer/

[2] John L. Hennessy and David A. Patterson. 2011. Computer Architecture, Fifth Edition: A Quantitative Approach (5th. ed.). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.

[3] https://web.archive.org/web/20060701231201/http://intel.com/pressroom/archive/releases/20060626comp.htm

[4] https://blogs.nvidia.com/blog/2021/11/15/cloud-computing-ai-supercomputers/

[5] https://cloud.google.com/tpu/docs/system-architecture-tpu-vm
