
Understanding the architecture of a GPU

Vitality Learning · Published in CodeX · Mar 25, 2021 · 11 min read


Recently, in the story The evolution of a GPU: from gaming to computing, we discussed the historical evolution of CPUs and GPUs and underlined how GPUs can be significantly more powerful than commercial CPUs. We now ask ourselves why PCs are still based on CPUs and not entirely made of GPUs.

The answer is that CPUs work in a totally different way from GPUs, and the figure below helps us understand the main differences.

Left: CPU architecture; right: GPU architecture. Source: https://www.omnisci.com/technical-glossary/cpu-vs-gpu.

The color convention is that green represents the computational units, or cores, orange the memories and yellow the control units.

Computational units (cores)

At first glance, the computational units of a CPU are “bigger” but few in number, while those of a GPU are “smaller” but numerous. The size and the number hint at what a CPU or GPU core is capable of doing and at how many cores a device hosts.

A CPU core is faster and “smarter” than a GPU core.

Over time, CPU cores have benefited from a progressive increase in clock speed to improve performance (The evolution of a GPU: from gaming to computing). By contrast, GPUs have experienced clock slow-downs to limit power consumption and allow installation in mobile or embedded devices. A Jetson Nano installed on a robot for indoor mapping and navigation is a relevant example of the need to keep power consumption at a minimum to extend battery life (see Indoor Mapping and Navigation Robot Build with ROS and Nvidia Jetson Nano).

A proof of the “smartness” of a CPU core is its capability to perform out-of-order execution. For the sake of optimization, a CPU can execute instructions in an order different from the one in which they arrive, or it can predict the instructions most likely needed in the near future when encountering a branch (multiple branch prediction). In this way, it can prepare the operands and execute those instructions in advance (speculative execution), thus saving time.

On the contrary, a GPU core does nothing that complicated and, in particular, performs little in terms of out-of-order execution. Roughly speaking, the house speciality of a GPU core is performing floating-point operations like multiply-add (MAD) or fused multiply-add (FMA).

Multiply-Add (MAD) and Fused Multiply-Add (FMA) operations. Source: http://www.netlib.org/utk/people/JackDongarra/WEB-PAGES/SPRING-2011/Lect03.pdf.
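To make the difference concrete, here is a minimal CUDA sketch (the kernel and array names are illustrative, not taken from any library): each thread computes a·b + c once as a separate multiply followed by an add (MAD-style, two roundings) and once with the fmaf intrinsic (FMA, a single rounding).

#include <cuda_runtime.h>

// Each thread computes the same a*b + c in two ways:
// - MAD-style: __fmul_rn forces a rounded multiply, preventing the compiler
//   from fusing it with the following add, so product and sum are rounded separately;
// - FMA: fmaf performs the fused operation with a single rounding at the end.
__global__ void madVsFma(const float *a, const float *b, const float *c,
                         float *mad, float *fma, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        mad[i] = __fmul_rn(a[i], b[i]) + c[i];  // multiply, round, then add
        fma[i] = fmaf(a[i], b[i], c[i]);        // fused: one rounding step
    }
}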

As a matter of fact, the cores of the most recent GPU architectures are not limited to MADs and FMAs, but also perform more complicated operations such as tensor operations (tensor cores) or ray-tracing operations (ray-tracing cores).

Tensor core. Source: https://www.anandtech.com/show/12673/titan-v-deep-learning-deep-dive/3.
Ray tracing core. Source: https://www.techspot.com/article/2109-nvidia-rtx-3080-ray-tracing-dlss/.

Tensor cores are meant to serve tensor operations in artificial intelligence, while ray-tracing cores are meant to serve hyper-realistic real-time rendering.

Simple or not, GPU cores do not reach the flexibility of CPU cores. It is worth adding that the GPU programming model is SIMD (Single Instruction Multiple Data), meaning that all the cores execute exactly the same operation, but on different data. Evidently, the strength of a GPU lies not so much in the processing capabilities of its cores as in their massive parallelism.

Rowers rowing at the rate of battle. Source: https://247sports.com/college/kansas/Board/103726/Contents/Ben-Hur-is-getting-a-remake-71260436/.

The cores act somewhat like the rowers of a Roman galley: a drummer sets the pace (clock) and the rowers row (compute) in parallel at the rate of battle.

The SIMD programming model allows accelerating a large class of applications. Scaling all the pixels of an image is an example: by mapping each pixel to a different core (assuming there are enough of them), each core simply scales its own pixel, and this can occur in a massively parallel way, as in the sketch below. While a sequential machine would solve the problem in N clock cycles, where N is the number of pixels, the GPU at hand would solve it in a single clock cycle, assuming there are enough cores to cover the entire computational load. A problem like the scaling of an image is an embarrassingly parallel problem, that is, a problem which requires no effort to be separated into a number of parallel tasks. Indeed, the scaling of one pixel is totally independent of the scaling of the others. Nevertheless, to be run on a GPU, a problem does not need to be embarrassingly parallel; it is enough that it matches the SIMD computational scheme, namely, that it can be decomposed by repeating, at each instant of time, the same operation on different data.
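As a minimal CUDA sketch of this example (the kernel name, the flat image layout and the 16×16 block size are assumptions of this sketch, not taken from the story), one thread is mapped to one pixel and every thread executes the same multiplication on different data:

#include <cuda_runtime.h>

// One thread per pixel: every thread runs the same instruction (a scaling)
// on its own datum, which is exactly the SIMD scheme described above.
__global__ void scaleImage(float *pixels, float factor, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        pixels[y * width + x] *= factor;   // each core scales its own pixel
}

// Possible launch: enough 16x16 blocks to cover the whole image.
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// scaleImage<<<grid, block>>>(d_pixels, 0.5f, width, height);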

Obviously, not every problem matches the SIMD model. This is especially true for asynchronous problems, namely, problems with no synchronous structure, in which the processors may need to communicate with each other at any time and for which the computation structure can be very irregular and the load imbalanced.

Memories

Starting again from the first image of this story, we can also discuss the differences between CPU and GPU in terms of memories.

The CPU memory system is based on a Dynamic Random Access Memory (DRAM) which, in desktop PCs, amounts to a few GBytes (e.g., 8), but in servers can reach hundreds of GBytes (e.g., 256).

Another pillar of the CPU memory system is the cache, which serves to reduce the time needed to access data stored in the DRAM. A cache is a smaller (e.g., tens of KBytes per core), faster memory, located closer to the processor cores, which stores copies of the data held in the DRAM. Cache memory can have a hierarchical organization, typically in three levels: L1 cache, L2 cache and L3 cache. The closer a cache is to the cores, the smaller but the faster it is. For example, the L1 cache can be of 64KBytes per core, the L2 cache of 256KBytes per core and the L3 cache of 4MBytes per core.

Assume we need to fetch a datum stored at address i100 in DRAM. It will be moved from DRAM to cache along with its neighboring elements, for example, the data stored at addresses i98, i99, i101 and i102. The assumption is that, if address i100 is needed at a certain time, the contents of i101 and i102 will likely be needed by the next computations (think of a for loop consecutively scanning the elements of an array). Thanks to the cache, when a datum is needed, it is searched for in the L1 cache first. If found, it is moved to the CPU at maximum speed. If not, it is searched for in the L2 cache. If found there, it is fetched at high speed, although lower than that of L1. If not, the datum is sought in the L3 cache. Ultimately, if the L3 cache also misses the datum, the fetch occurs from DRAM. The amount of orange area in the figure at the top is a measure of the importance of DRAM and cache for a CPU.
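The lookup order just described can be condensed into a toy, host-side C++ sketch (the Level type and the L1/L2/L3/DRAM objects below are purely illustrative models, not real hardware interfaces):

#include <cstdint>
#include <unordered_map>

// Toy model of the hierarchical lookup described above: each level is just a
// map from address to value; DRAM is modeled as the last level, which always
// provides the datum.
struct Level {
    std::unordered_map<std::uintptr_t, float> data;
    bool contains(std::uintptr_t a) const { return data.count(a) > 0; }
};

Level L1, L2, L3, DRAM;

float fetch(std::uintptr_t addr) {
    if (L1.contains(addr)) return L1.data[addr];   // L1 hit: fastest path
    if (L2.contains(addr)) return L2.data[addr];   // L2 hit: slower than L1
    if (L3.contains(addr)) return L3.data[addr];   // L3 hit: slower than L2
    return DRAM.data[addr];                        // miss everywhere: go to DRAM
}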

From the figure at the top, the GPU is equipped with its own DRAM, named global memory or GMEM. GMEM is smaller than the DRAM of a CPU: in the cheapest cards, a couple of GBytes is available, while in the most performant ones, GMEM can be as large as 40 or 80 GBytes. The limited size of GMEM was the first main criticism against the use of GPUs in scientific computing. Ten years ago, indeed, graphics cards were equipped with as little as 512MBytes, an issue that has now been fully overcome.

Concerning caching, from the figure at the top we can deduce that the caching mechanism is implemented by all those small orange rectangles next to the cores. However, GPU caching presents differences with respect to the CPU that will be pointed out shortly.

Understanding the GPU architecture

To fully understand the GPU architecture, let us take the chance to look again at the first image, in which the graphics card appears as a “sea” of computing cores.

In the image-scaling example, the cores do not need to collaborate since their tasks are totally independent. Nevertheless, the problems that can be tackled with a GPU are not necessarily so simple. Let us convince ourselves with an example.

Suppose we want to sum the elements of an array. Such an operation belongs to the reduction family, since it amounts to “reducing” a sequence to a single number. Summing the array elements appears, at first sight, intrinsically sequential: we need to fetch the first element, sum it with the second, take the result, sum it with the third element, take the new result, sum it with the fourth element, and so on.

Sequential reduction. Source: https://www.eximiaco.tech/en/2019/06/10/implementing-parallel-reduction-in-cuda/.

Surprisingly, something that appears intrinsically sequential can be turned into a parallel algorithm. Assuming an array of length 8, it is enough to perform, in a first step, two-by-two summations in parallel, thus obtaining 4 partial results. In a second step, the partial results are summed up, again in a two-by-two fashion. Finally, the last 2 partial results are summed up to get the final result. The described hierarchical scheme is illustrated below:

Parallel reduction. Source: https://www.eximiaco.tech/en/2019/06/10/implementing-parallel-reduction-in-cuda/

The sum of 8 numbers thus requires only three steps, unlike the sequential case, which requires 8. Generally speaking, to sum N numbers, with N a power of 2 (N = 2ⁿ), n steps are sufficient or, equivalently, log₂(N) steps.

From the GPU point of view, assuming the cores are numbered from 0 to 3, namely c0, c1, c2 and c3, in a first clock cycle all four cores are employed, see the figure below. In a second clock cycle, cores c0 and c2 exploit the partial results previously worked out by the four cores. The partial results on which c0 and c2 operate must be stored in a memory accessible by the involved cores. In a third clock cycle, only core c0 is active: it sums up the results worked out by cores c0 and c2 at the previous step. These partial results, too, must be stored in some memory accessible by c0.

Parallel reduction with a GPU. Source: https://www.eximiaco.tech/en/2019/06/10/implementing-parallel-reduction-in-cuda/

The consequence of this reasoning is that the cores must be able to collaborate through a shared memory space in which partial results can be stored and fetched. Unfortunately, a GPU can host thousands of cores and it would be very difficult and expensive to enable each core to collaborate with all the others. For this reason, GPU cores are organized into groups forming the Streaming Multiprocessors, or SMs.
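As a minimal CUDA sketch of this collaboration (assuming, for simplicity, a single block whose size equals the array length and is a power of two; names are illustrative), each thread loads one element into shared memory and, at every step, half of the threads add the element a certain stride away, mirroring the behavior of cores c0–c3 above:

#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 8

// Block-level parallel reduction: partial results are exchanged through
// shared memory, the on-chip memory visible to all threads of the block.
__global__ void reduceSum(const float *in, float *out) {
    __shared__ float partial[BLOCK];
    int tid = threadIdx.x;
    partial[tid] = in[tid];                          // step 0: one element per thread
    __syncthreads();

    for (int stride = BLOCK / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];   // two-by-two partial sums
        __syncthreads();                             // wait before the next step
    }
    if (tid == 0) *out = partial[0];                 // thread 0 holds the final result
}

int main() {
    float h_in[BLOCK] = {1, 2, 3, 4, 5, 6, 7, 8}, h_out = 0.0f;
    float *d_in, *d_out;
    cudaMalloc(&d_in, BLOCK * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, BLOCK * sizeof(float), cudaMemcpyHostToDevice);
    reduceSum<<<1, BLOCK>>>(d_in, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %f\n", h_out);                     // expected: 36
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}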

The ultimate GPU architecture

The architecture of GPUs for the Turing family is shown in the image below:

The Turing architecture. Source: https://bit-tech.net/features/tech/graphics/nvidias-turing-architecture-explained/2/.

The green parts are again computational units. The green blocks are the SMs and the yellow RT COREs are drawn close to them. The structure of an SM for the Turing architecture is reported below:

The Turing SM. Source: https://developer.nvidia.com/blog/nvidia-turing-architecture-in-depth/.

Within the Turing SM, among the green parts, we distinguish different kinds of cores.

FP32 Cores. They perform single-precision floating-point operations. The TU102 GPU has 64 FP32 cores per SM and 72 SMs, so the overall number of FP32 cores on the card is 4608.

FP64 Cores. Among the floating-point cores, the 2 FP64 cores per SM, which perform double-precision floating-point operations, should also be mentioned, although they are not included in the above image.

Integer Cores. These cores perform operations on integers (for example, address computations) and can execute instructions concurrently with the floating-point datapath. In previous GPU generations, the floating-point pipeline would stall whenever non-floating-point operations were needed. In TU102, 4608 integer cores are present, 64 per SM.

Tensor Cores. Tensor Cores are groupings of FP16 units, namely half-precision units, devoted to tensor computations that accelerate common Deep Learning operations. Turing Tensor Cores can also perform INT8 and INT4 precision operations for workloads that tolerate quantization and do not require FP16 precision. In TU102, we have 8 Tensor Cores per SM, 576 overall on the card.
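As an illustration of what a Tensor Core computes, here is a hedged sketch based on CUDA's WMMA API (the 16×16×16 tile size, the matrix layouts and the kernel name are choices of this example, which requires a Tensor Core architecture, e.g., compiling with -arch=sm_75 for Turing): one warp computes D = A·B + C with A and B in half precision and the accumulator in single precision.

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp performs a 16x16x16 matrix multiply-accumulate on a Tensor Core:
// D = A*B + C, with A and B in FP16 and C, D in FP32.
__global__ void tensorCoreMma(const half *A, const half *B, const float *C, float *D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> accFrag;

    wmma::load_matrix_sync(aFrag, A, 16);                       // leading dimension 16
    wmma::load_matrix_sync(bFrag, B, 16);
    wmma::load_matrix_sync(accFrag, C, 16, wmma::mem_row_major);

    wmma::mma_sync(accFrag, aFrag, bFrag, accFrag);             // the Tensor Core operation

    wmma::store_matrix_sync(D, accFrag, 16, wmma::mem_row_major);
}

// Possible launch with a single warp:
// tensorCoreMma<<<1, 32>>>(dA, dB, dC, dD);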

Having described the execution side of GPUs in broad strokes, let us get back to the issue raised above concerning collaboration.

In the bottom part of the SM, an L1 cache/shared memory is present. This is the memory through which the cores can cowork. Each SM has a single, dedicated L1 cache/shared memory. Being on-chip, the L1 cache/shared memory has a limited size (96KBytes for the Turing architecture), but it is very fast, surely much faster than GMEM.

Actually, the L1 cache/shared memory plays the double role of cache for GMEM accesses and of shared memory. When the cores need to cowork and exchange partial results, the programmer instructs the threads to store the partial results in shared memory so that they can be subsequently fetched. The other purpose of this memory is caching: when the cores need to access GMEM, the data are first searched for in the L1 cache. If not found, they are searched for in the L2 cache, which is common to all the SMs. The L2 cache is bigger, but slower, than the L1 cache. If the data are not found in L2 either, they are fetched from GMEM. The data in the cache persist unless they are evicted by “fresher” data. From this point of view, if some data need to be accessed many times, the programmer can explicitly keep them in shared memory to speed up their fetching; shared memory can then be regarded as a controlled cache. As a matter of fact, L1 cache and shared memory are obtained from the same circuitry, and the programmer can decide whether the card must devote more of it to caching or to shared memory, as sketched below.
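On architectures where L1 and shared memory share the same circuitry, the CUDA runtime exposes this choice as a per-kernel hint; in the sketch below (the kernel name and the 50% split are placeholders of this example), the programmer asks that about half of the unified space be devoted to shared memory:

#include <cuda_runtime.h>

__global__ void myKernel() { /* hypothetical kernel relying on shared memory */ }

int main() {
    // Hint: devote about 50% of the unified L1/shared memory space to shared
    // memory for this kernel; the rest remains available for L1 caching.
    cudaFuncSetAttribute(myKernel,
                         cudaFuncAttributePreferredSharedMemoryCarveout, 50);
    myKernel<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}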

Last but not least, the memories that can be used for storing partial results are not limited to shared memory. Registers, indeed, represent the closest and fastest, but smallest, memory available to the cores (see the register file in the image above). The underlying idea is that each thread has registers at its disposal in which it can store temporary results. Each register is private to a single thread, although threads belonging to the same warp, i.e., a group of 32 consecutive threads, can exchange the contents of their registers, as in the sketch below.
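As a minimal sketch of this register-level cooperation (the kernel name is illustrative; 32 is the standard CUDA warp size), each thread keeps its value in a register and __shfl_down_sync lets it read the register of another thread of the same warp, so a warp can reduce 32 values without touching shared memory:

#include <cstdio>
#include <cuda_runtime.h>

// Warp-level reduction: values stay in registers and are exchanged between
// the lanes of one warp via shuffle instructions.
__global__ void warpSum(const float *in, float *out) {
    float val = in[threadIdx.x];                           // one value per thread, in a register
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);  // read another lane's register
    if (threadIdx.x == 0) *out = val;                      // lane 0 owns the warp total
}

int main() {
    float h_in[32], h_out = 0.0f, *d_in, *d_out;
    for (int i = 0; i < 32; ++i) h_in[i] = 1.0f;           // expected sum: 32
    cudaMalloc(&d_in, 32 * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, 32 * sizeof(float), cudaMemcpyHostToDevice);
    warpSum<<<1, 32>>>(d_in, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("warp sum = %f\n", h_out);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}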

In PyCUDA, Google Colab and the GPU, we will get started with programming GPUs using PyCUDA on a freely available, browser-based platform. Moreover, in Running CUDA in Google Colab, we will show how to run CUDA codes under the Google Colab environment.

Vitality Learning · CodeX

We have been teaching, researching and consulting on parallel programming on Graphics Processing Units (GPUs) since the release of CUDA. We also play with Matlab and Python.