In this post, I’ve compiled the journey of NVIDIA GPUs, tracing them from their humble beginnings to their current prominence in the technological landscape. This narrative serves as a resource for anyone seeking a deeper understanding of NVIDIA GPUs, whether they’re starting to work with these powerful processors, delving into CUDA programming, or simply satisfying their curiosity.

First of all, I will start with CPUs. CPUs are sequential processors that execute instructions one after another. Advancements have made modern CPUs capable of processing more than one instruction, and more than one thread, concurrently. CPUs are good at executing programs with limited parallelism and heavy computation, such as mathematical calculations and serial sorting algorithms. But consider a simple image-processing task in which 1024*1024 pixels, each with 3 color channels, need to be processed at once for rendering: a CPU is slow because it processes them serially (or semi-serially; I say this because CPUs can do better by exploiting instruction-level parallelism (ILP) within a thread and executing more than one thread simultaneously). This is where computer architects thought of a co-processor that offers much higher parallelism but performs simple processing, e.g., changing each pixel’s value under a gray-scale filter, and is dedicated only to graphics processing.
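To make the contrast concrete, here is a minimal sketch of that gray-scale filter expressed in today’s CUDA C/C++ (the kernel name and launch sizes are hypothetical, chosen only for illustration): every pixel gets its own thread, whereas a CPU would walk over the pixels in a loop.

```cuda
#include <cuda_runtime.h>

// Each thread converts exactly one RGB pixel to gray-scale.
// 'rgb' holds 3 bytes per pixel, 'gray' holds 1 byte per pixel.
__global__ void grayscale(const unsigned char *rgb, unsigned char *gray, int num_pixels) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global pixel index
    if (i < num_pixels) {
        float r = rgb[3 * i + 0];
        float g = rgb[3 * i + 1];
        float b = rgb[3 * i + 2];
        gray[i] = (unsigned char)(0.299f * r + 0.587f * g + 0.114f * b);  // standard luma weights
    }
}

// Launch: one thread per pixel, e.g. for a 1024x1024 image:
//   int n = 1024 * 1024;
//   grayscale<<<(n + 255) / 256, 256>>>(d_rgb, d_gray, n);
```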

Early GPUs were special-function co-processors designed only for graphics processing. They implemented a fixed-function graphics pipeline.

GPUs only computed the color of each pixel, which is also referred to as rendering. Programming them was possible through graphics APIs like DirectX and OpenGL. However, employing GPUs for other computation (e.g., scientific computation) was not straightforward, since it required knowledge of computer graphics and mapping the problem and its data onto images and shading operations.

Transformation layers required to execute graphics processing (or other computation mapped onto graphics) on a conventional GPU

Before delving into more details of the early GPUs’ hardware structure, we need to define some terminology and make sure that we all are on the same page. The terms include shader, vertex, and pixel.

Shader: a computer program performing graphics-related tasks

Vertex: a data structure describing the attributes of a point in 2D or 3D space (e.g., its position), or of multiple points on a surface.

Vertex shader: a program that transforms each vertex’s 3D position in virtual space to the 2D coordinate at which it appears on the screen.

Pixel or fragment shader: a program that computes the color, brightness, contrast, and other attributes of a single pixel or fragment.

In conventional GPUs, the units responsible for processing vertex and fragment shaders were separate. In other words, the GPU architecture consisted of separate stages of vertex and pixel (fragment) processing units. The following figure shows the architecture of an early GPU. To learn the details of the vertex and fragment processing unit architecture, read the “GeForce 6800” paper by J. Montrym and H. Moreton [1].

Image credit [1]

The following table shows the different NVIDIA GPU microarchitectures from 1998 to 2004, with specifications such as the number of transistors, fabrication process, supported OpenGL and DirectX versions, and the GPU cards powered by each chip. None of these GPUs is programmable with CUDA. The effect of Moore’s law is evident in the table in how the number of transistors increases over the years.

Source: Category:Nvidia microarchitectures — Wikipedia and other related Wikipedia pages

Notes:

The GeForce 256 card is usually referred to as “the first GPU”. It had a fixed-function 32-bit floating-point vertex transform and lighting processor with a fixed-function integer pixel-fragment pipeline, programmed with OpenGL and the Microsoft DX7 API.

The GeForce 3 series introduced the first programmable vertex processor executing vertex shaders, along with a configurable 32-bit floating-point fragment pipeline, programmed with DX8 and OpenGL.

The following table shows the GPU microarchitectures that are programmable with CUDA.

Source: Category:Nvidia microarchitectures — Wikipedia and other related Wikipedia pages

To start with CUDA: CUDA stands for Compute Unified Device Architecture and is a parallel computing platform and API that allows software to use certain types of NVIDIA GPUs for general-purpose computing [2]. C/C++ programmers can use ‘CUDA C/C++’ to program NVIDIA GPUs; the code is compiled to PTX by the NVCC compiler (a minimal example follows after the notes below).

1. PTX stands for Parallel Thread Execution, a low-level parallel thread execution virtual machine and instruction set architecture (ISA) used in the NVIDIA CUDA environment. Programming GPUs in PTX is like developing assembly programs for CPUs.

2. NVCC stands for NVIDIA CUDA Compiler, an LLVM-based compiler.
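For concreteness, here is a minimal vector-addition sketch in CUDA C/C++ (the file and kernel names are hypothetical). NVCC compiles it to an executable, and `nvcc -ptx` emits the intermediate PTX mentioned above.

```cuda
// vector_add.cu -- compile with: nvcc vector_add.cu -o vector_add
// To inspect the intermediate PTX instead: nvcc -ptx vector_add.cu
#include <cuda_runtime.h>
#include <cstdio>

__global__ void vector_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_a = new float[n], *h_b = new float[n], *h_c = new float[n];
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    vector_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);  // one thread per element
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    printf("c[0] = %f\n", h_c[0]);   // expect 3.0
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    delete[] h_a; delete[] h_b; delete[] h_c;
    return 0;
}
```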

The following figure shows how a CUDA C/C++ program gets compiled by NVCC.

image credit [3]

To have a deeper understanding of what happens to a CUDA C/C++ program before becoming executable, consider the following figure.

Tesla microarchitecture (2006): the big change in NVIDIA GPUs

In 2006, NVIDIA introduced a new microarchitecture implementing a unified shader model. The goal was to address the complexity problem and to enable better management of hardware resource utilization and load balancing between the vertex and fragment (pixel) stages. The load was often unbalanced between vertex and fragment processors because the number of pixels does not match the number of vertices. With the unified shader model, there was no longer a need for separate vertex and pixel stages: the same hardware could be used for all shader processing, which simplified the design of the GPU.

Following that, the cores inside the GPU became simpler: sequential, scalar units able to work on only one computation at a time. These cores are what we call CUDA cores nowadays. They were grouped under the streaming multiprocessor (SM), a bigger umbrella that replaced the separate stages of vertex and fragment units, and the load-balancing challenge was eased simply by swapping kernels. Each SM receives threads in groups of 32, called warps. All threads in a warp execute the same instruction at the same time but on different data, which is why this model is called Single Instruction, Multiple Threads (SIMT). Here is a quote from Jonah Alben, senior vice president of GPU engineering at NVIDIA, about the Tesla microarchitecture:

We pretty much threw out the entire shader architecture from NV30/NV40 and made a new one from scratch with a new general processor architecture (SIMT), that also introduced new processor design methodologies.

The following figure shows the architecture of NVIDIA Tesla 2006.

The following figure shows a TPC (Thread Processing Cluster) with more detail.

And the following one shows an SM with more details.

The execution model of GPUs after Tesla (i.e., CUDA-capable GPUs)

Each grid (a CUDA kernel launch) contains multiple thread blocks, also called CTAs (Cooperative Thread Arrays). Each CTA contains a specific number of threads. When a CUDA program is launched for execution on the GPU, it is not deterministic which SM will serve which CTA. The following figure shows how the execution happens, and the short code sketch below expresses the same hierarchy.
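A minimal sketch of this hierarchy in CUDA C/C++ (hypothetical kernel name and sizes): the launch configuration defines the grid of CTAs, and each thread recovers its position from its block and thread indices.

```cuda
// Hypothetical 2D example: one thread per element of a width x height matrix.
__global__ void scale(float *data, int width, int height, float factor) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column within the grid
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row within the grid
    if (x < width && y < height)
        data[y * width + x] *= factor;
}

// Launch: a grid of CTAs (thread blocks), each CTA holding 16x16 = 256 threads.
// Which SM runs which CTA is decided by the hardware scheduler at run time.
//   dim3 block(16, 16);
//   dim3 grid((width + 15) / 16, (height + 15) / 16);
//   scale<<<grid, block>>>(d_data, width, height, 2.0f);
```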

Warp scheduling

A warp is a group of threads with consecutive thread IDs. The rationale behind warps is to make hardware scheduling lighter and simpler. The number of threads in a warp is specified by the hardware architecture. Note that GPU programmers do not need to know anything about warps for correctness, although warp behavior matters for performance tuning. The following figure shows how warp scheduling makes hardware scheduling easier and simpler.

Image credit [4]

SIMT architecture’s challenge: Divergence

Divergence happens when some of the threads in a warp need to execute one instruction while the rest must execute another. This can occur because of a branch in the code, as the following figure shows. It is a performance challenge because it limits parallelism and serializes the execution of those groups of threads.

Divergence causes serialization that degrades performance [5]
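Here is a minimal sketch of code that diverges (hypothetical kernel): threads of the same warp take different sides of the branch, so the hardware executes the two paths one after the other with part of the warp masked off.

```cuda
__global__ void divergent(int *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Even and odd threads of the same 32-thread warp take different paths,
    // so the 'if' and 'else' bodies run serially, not in parallel.
    if (i % 2 == 0)
        out[i] = i * 2;      // half of the warp is active here
    else
        out[i] = i * 3;      // the other half is active here
}
// A divergence-free alternative keeps whole warps on the same path, e.g.
// branching on (i / 32) % 2 == 0 so every thread of a warp agrees on the branch.
```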

How do GPUs differ from CPUs in avoiding stalls?

CPUs try to execute a program faster by employing techniques such as caches, speculative execution, and so on. GPUs, in contrast, hide stalls by keeping thousands of threads in flight and saturating the memory bus to serve them. That is why performance through high throughput is emphasized for GPUs: they are known as devices suited to throughput-oriented programs, whereas CPUs target latency-oriented programs.

Fermi microarchitecture (2010)

In this microarchitecture, NVIDIA’s architects doubled the number of streaming processors (SPs, also called CUDA cores) inside each streaming multiprocessor (SM), which enables two half-warps (16 threads each) to execute at once on an SM. The cores could not perform 64-bit floating-point operations on their own; such an operation was made possible by combining two CUDA cores. The arithmetic logic unit (ALU) was changed to a 32-bit one (Tesla’s was 24-bit). Fermi also added more C++ features, and the TPC was replaced with the graphics processing cluster (GPC). The following figure shows the overall architecture of Fermi.

The GigaThread engine in the figure is the unit responsible for scheduling thread blocks (CTAs) onto SMs. The following figure shows the structure of an SM in Fermi. The dual warp scheduler selects two warps and issues one instruction from each warp to a group of 16 cores, 16 load/store units, or 4 special function units (SFUs).

As shown in the figure above, the shared memory/L1 cache can be used to cache data for individual threads (register spilling/L1 cache) and/or to share data among the threads of a thread block.
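As a short sketch of the “share data among the threads of a thread block” use of this memory (hypothetical kernel, assuming 256 threads per block), here is a block-level sum reduction in shared memory:

```cuda
__global__ void block_sum(const float *in, float *block_results, int n) {
    __shared__ float tile[256];                      // lives in shared memory / L1 storage
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[tid] = (i < n) ? in[i] : 0.0f;              // each thread loads one element
    __syncthreads();                                 // wait until the whole tile is filled

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) tile[tid] += tile[tid + stride];
        __syncthreads();                             // all threads see the partial sums
    }
    if (tid == 0) block_results[blockIdx.x] = tile[0];  // one partial sum per CTA
}
// Launch with 256 threads per block so the tile matches blockDim.x, e.g.
//   block_sum<<<(n + 255) / 256, 256>>>(d_in, d_block_results, n);
```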

Kepler microarchitecture (2012)

In Kepler, the focus was on energy efficiency, so the critical decisions were lowering the clock frequency and unifying the core clock with the card clock. Hyper-Q and MPS (Multi-Process Service) were introduced with Kepler, making it possible for more than one CPU thread to launch kernels on a single GPU simultaneously; the aim was to increase GPU utilization. The hardware scheduler was removed in favor of a software (compiler-assisted) one. Also, more SMs were packed in, each containing more resources, and a new abbreviation was coined for the SM: “SMX”, standing for next-generation streaming multiprocessor.

The most important addition was dynamic parallelism, an extension to CUDA that enables a CUDA kernel to create new thread grids by launching new kernels. In earlier GPUs, kernels could be launched only from host (CPU) code. Not being able to create new work from inside a kernel was a problem, because algorithms with recursion, irregular loop structures, or time-space variation do not fit a flat, single level of parallelism and had to be implemented with multiple kernel launches. The advantage of dynamic parallelism is decreasing the host’s burden and the host-device communication. The following figure shows the difference between GPUs with dynamic parallelism and older ones without it.
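In code, dynamic parallelism looks roughly like the following sketch (hypothetical kernels; it assumes compilation with relocatable device code, e.g. `nvcc -rdc=true -arch=sm_35` or newer): a parent kernel decides at run time how much child work to launch, without a round trip to the host.

```cuda
#include <cstdio>

__global__ void child(int parent_id, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i == 0) printf("child of parent %d processing %d items\n", parent_id, n);
    // ... per-item work would go here ...
}

__global__ void parent(const int *work_sizes) {
    int id = threadIdx.x;
    int n = work_sizes[id];                       // amount of work discovered on the device
    if (n > 0)                                    // launch a child grid directly from the GPU
        child<<<(n + 255) / 256, 256>>>(id, n);
}

// Host side (sketch): parent<<<1, 32>>>(d_work_sizes);
```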

Kepler’s SM became more capable than Fermi’s. It had four warp schedulers able to process a whole warp in one clock, whereas Fermi processed a half-warp at a time. Each scheduler was equipped with a second dispatch unit to issue a second, independent instruction from the same warp. However, this was not always possible, since one column of 32 CUDA cores was shared by two dispatchers. Check out Kepler’s SMX structure and the overall structure of the Kepler microarchitecture below.

Maxwell microarchitecture (2014)

Maxwell also focused on power efficiency. It decreased the number of CUDA cores from 192 in Kepler’s SMX to 128 in its SMM, aligning the core count with the warp size. It improved die partitioning, saving area and power, and adopted simpler scheduling logic compared to Kepler. Maxwell also increased the size of the level-2 cache from 256KB to 2MB. The following figure shows the overall architecture of Maxwell and the architecture of the SMM.

Pascal Microarchitecture (2016)

In Pascal, the SMs did not change compared to the previous generation. The 16nm process allowed packing more transistors into the same die area; these transistors were used to increase the size of the register file (RF) and the shared memory (scratchpad memory). Unified memory was introduced, which enables the CPU and GPU to access the same memory address space with the help of a “Page Migration Engine.” NVLink was also introduced with Pascal: a high-bandwidth, wire-based communication protocol faster than PCIe, whose name stands for NVIDIA Link. Each GPU could have multiple NVLinks, and GPUs could use mesh networking to communicate instead of connecting to a central hub for data exchange. It is also important to mention that instruction-level and thread-level preemption became possible starting with this microarchitecture.

What are instruction-level and thread-level preemption? Read more here: [Breaking Down Barriers — Part 4: GPU Preemption — The Danger Zone (wordpress.com)]

Furthermore, HBM2 memory (3D-stacked memory) is used as the GPU’s main memory.
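Coming back to Pascal’s unified memory, here is a short sketch (hypothetical kernel): `cudaMallocManaged` returns a pointer that both the CPU and the GPU can dereference, and pages migrate between them on demand.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void increment(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int n = 1024;
    int *data;
    cudaMallocManaged(&data, n * sizeof(int));     // one pointer, visible to CPU and GPU
    for (int i = 0; i < n; ++i) data[i] = i;       // touched on the CPU first

    increment<<<(n + 255) / 256, 256>>>(data, n);  // pages migrate to the GPU on access
    cudaDeviceSynchronize();                       // then migrate back when the CPU touches them

    printf("data[0] = %d\n", data[0]);             // expect 1
    cudaFree(data);
    return 0;
}
```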

The following figure shows the overall architecture of Pascal.

Volta (workstation) microarchitecture 2017

With the introduction of Volta, NVIDIA revealed NVLink 2.0, and first-generation tensor cores were added to the SMs. A tensor core multiplies two 4*4 16-bit floating-point matrices, then adds a third 16-bit or 32-bit matrix to the product using fused multiply-add (FMA) operations, producing a 32-bit floating-point result that can optionally be demoted to a 16-bit floating-point result. Tensor cores essentially implement the idea of the systolic array, which dates back to the late 1970s. A systolic array is an array of processing elements (PEs) orchestrating the flow of data, transforming each piece of data before outputting it to memory, which balances computation, memory, and I/O bandwidth. The integration of tensor cores into the GPU’s SMs accelerated the training and inference of deep learning applications. The following figure shows what a tensor core does.
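In CUDA C/C++, tensor cores are exposed through the warp-level WMMA API. The following is a hedged sketch (hypothetical kernel name) for a 16x16x16 FP16 tile with FP32 accumulation; the hardware decomposes such a tile into the small matrix FMAs described above.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes C (16x16, FP32) = A (16x16, FP16) * B (16x16, FP16) + 0.
__global__ void wmma_16x16x16(const half *A, const half *B, float *C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);               // start the accumulator at zero
    wmma::load_matrix_sync(a_frag, A, 16);           // leading dimension 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // executed on the tensor cores
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}

// Requires compute capability 7.0 (Volta) or newer and a launch with at least one full warp,
// e.g. wmma_16x16x16<<<1, 32>>>(dA, dB, dC);
```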

Turing (consumer) microarchitecture 2018

Big changes happened in the Turing microarchitecture, oriented toward (1) artificial intelligence (AI), by adding tensor cores to the SMs, and (2) ray-traced graphics rendering, by adding ray-tracing (RT) cores to the SMs. Turing is similar to pre-Tesla architectures in that it has a layered architecture, as the following figure shows.

The three major changes that happened in Turing were:

  1. SPs or CUDA cores became superscalar processors able to execute integer and floating-point instructions in parallel, as we see in CPUs such as Intel’s Pentium.
  2. The GDDR6 memory subsystem with 16 memory controllers provided significantly larger bandwidth.
  3. Threads in a warp no longer share a single instruction pointer (IP); each thread has its own IP, so threads can be scheduled independently. This means fine-grained, thread-level scheduling (in contrast to warp-level scheduling) at the cost of more hardware overhead; a sketch of the programming implications follows below.
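A hedged sketch of what per-thread instruction pointers change for the programmer (hypothetical kernel, assuming one 32-thread warp per block): because threads of a warp may no longer re-converge implicitly, warp-level cooperation should be made explicit with primitives such as `__syncwarp()`.

```cuda
// Warp-level reduction in shared memory. On older GPUs the lockstep execution of a
// warp made the __syncwarp() calls unnecessary; with per-thread instruction pointers
// the re-convergence and visibility must be requested explicitly.
__global__ void warp_reduce(float *data) {
    __shared__ float buf[32];
    int lane = threadIdx.x;              // assume blockDim.x == 32 (a single warp)
    buf[lane] = data[lane];
    __syncwarp();
    for (int offset = 16; offset > 0; offset >>= 1) {
        if (lane < offset) buf[lane] += buf[lane + offset];
        __syncwarp();                    // make every lane's update visible to the warp
    }
    if (lane == 0) data[0] = buf[0];     // lane 0 writes the warp-wide sum
}
```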

Ampere microarchitecture (2020)

NVIDIA describes the Ampere architecture as follows:

Ampere is designed for the age of elastic computing, delivers the next giant leap by providing unmatched acceleration at every scale.

Just for the sake of learning:

Elastic computing is a cloud service feature that allows users to adjust the amount of resources they use, such as CPU, memory, storage, and bandwidth, according to their changing needs and workload demands [6].

With Ampere, NVIDIA introduced third-generation tensor cores. The new tensor cores (TCs) are sparsity-optimized, meaning they handle sparse calculations faster and in an optimized manner. TensorFloat-32 (TF32) and 64-bit floating-point (FP64) support provide higher precision for high-performance computing and up to 20X speedup for AI applications. The following image shows how tensor cores can increase computation performance. To learn more about the differences, refer to the NVIDIA page on tensor cores [7].

With the Ampere microarchitecture, NVIDIA introduced a new feature called Multi-Instance GPU (MIG). This feature enables a GPU to be partitioned into multiple smaller GPU instances. These smaller GPU instances are secured and isolated from each other at the hardware level, with their own high-bandwidth memory, cache, and compute cores. The following figure shows how a GPU can be divided into smaller GPUs with completely isolated compute and memory resources.

image credit [NVIDIA MIG user guide page]

If you want to delve into more details about MIG feature, check this post.

Multi-Instance GPU (MIG) of NVIDIA GPUs | by Ehsan Yousefzadeh-Asl-Miandoab | Medium

Furthermore, NVIDIA introduced NVLink version 3, which provides about 10X higher bandwidth than PCIe. With this microarchitecture, more improvements came to the ray-tracing (RT) cores, which are used specifically in graphics processing. It is also worth noting that more memory bandwidth and an enormously larger level-2 cache (7X larger than the previous generation) were brought into the GPUs.

Hopper microarchitecture (2022) — data centers

Compared to the Ampere microarchitecture, this architecture improves many components: the memory system (HBM3), second-generation MIG technology, new confidential computing support, fourth-generation NVIDIA NVLink, third-generation NVSwitch, a new NVLink Switch system, and PCIe Gen 5. The new SM improves performance with fourth-generation tensor cores, new DPX (dynamic programming) instructions, faster IEEE 64-bit and 32-bit floating-point calculations, new thread block clusters, distributed shared memory, and new asynchronous execution features, including the Tensor Memory Accelerator (TMA), according to the NVIDIA blog post [8].

Ada Lovelace microarchitecture (2022) — workstations

This microarchitecture was designed for consumer purposes (workstations) and as NVIDIA describes it:

NVIDIA Ada Lovelace Architecture

Designed to deliver outstanding gaming and creating, professional graphics, AI, and compute performance.

This microarchitecture (or architecture) is equipped with fourth-generation tensor cores, third-generation ray-tracing cores, Shader Execution Reordering (SER) technology, and new eighth-generation NVIDIA encoders (NVENC) with AV1 encoding, according to the NVIDIA page [9].

Conclusion

In this post, I’ve crafted a timeline detailing the microarchitecture (also known as architecture) of NVIDIA GPUs along with their key features. My intention is to refine and enhance this post by ensuring coherence, reviewing and potentially adjusting details, and rectifying any inaccuracies.

References

[1] J. Montrym and H. Moreton, “The GeForce 6800,” IEEE Micro, vol. 25, pp. 41–51, 2005. doi: 10.1109/MM.2005.37.

[2] Wikipedia contributors. (2024, February 29). CUDA. In Wikipedia, The Free Encyclopedia. Retrieved 19:01, March 3, 2024, from https://en.wikipedia.org/w/index.php?title=CUDA&oldid=1210964651

[3] GitHub — ehsanyousefzadehasl/PCwGPGPUs: Parallel Computing with GPGPUs

[4] Lindholm, E., Nickolls, J.R., Oberman, S.F., & Montrym, J. (2008). NVIDIA Tesla: A Unified Graphics and Computing Architecture. IEEE Micro, 28.

[5] CUDA C++ Programming Guide (nvidia.com)

[6] What Is Elastic Computing? Definition, Examples, and Best Practices — Spiceworks

[7] Tensor Cores: Versatility for HPC & AI | NVIDIA

[8] NVIDIA Hopper Architecture In-Depth | NVIDIA Technical Blog

[9] The NVIDIA Ada Lovelace Architecture | NVIDIA

Nvidia GPUs through the ages: The history of Nvidia’s graphics cards (pocket-lint.com)
