GPU Compute and Memory Architecture

Tushar Gautam
12 min read · Jul 11, 2024

In the previous blog post, I gave a high-level overview of CPUs, GPUs, and the GPGPU programming model (CUDA). Using the vector addition example, I explained how to write functions that can run directly on the GPU. Ultimately, the goal is to speed up applications and decrease their execution time. By writing a simple function that returns a high-resolution timestamp and wrapping it around the CPU and GPU vector addition calls, I can measure their respective execution times (Figure 1).

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>

#define USECPSEC 1000000ULL

// Returns the current wall-clock time in microseconds
unsigned long long myCPUTimer()
{
    struct timeval tv;
    gettimeofday(&tv, 0);
    return ((tv.tv_sec * USECPSEC) + tv.tv_usec);
}

int main(int argc, char const *argv[])
{
    // Length of arrays
    int N = 10000000;

    // Array definition and allocation
    .
    .
    .

    // Vector addition on a CPU
    unsigned long long t1_cpu = myCPUTimer();
    vec_add_cpu(A, B, C, N);
    unsigned long long t2_cpu = myCPUTimer();
    printf("CPU execution time: %llu microseconds \n", t2_cpu - t1_cpu);

    // Vector addition on a GPU
    unsigned long long t1_gpu = myCPUTimer();
    vec_add_gpu(A, B, C, N);
    unsigned long long t2_gpu = myCPUTimer();
    printf("GPU execution time: %llu microseconds \n", t2_gpu - t1_gpu);

    // Verifying and printing results
    .
    .
    .
}
Figure 1: Execution time (in microseconds) when performing vector addition using vectors of length 10 million on the CPU and GPU

This is not what I want. Even for such a large problem (vectors 10 million elements long), the GPU execution time is orders of magnitude higher than the CPU's, and I have no idea why this is the case. The main objective of this blog post is to explain the hardware concepts that are essential for CUDA C programmers to understand and reason about the performance of their code. I will do this by answering five simple questions.

  1. What is the architecture of a modern GPU?
  2. What are the different memory types in a GPU?
  3. How do CUDA blocks/threads work with different GPU hardware components?
  4. What are the common coding practices to get the best out of a GPU?
  5. Is there a way to determine the resources available for a GPU?

Modern GPU Architecture

Figure 2 shows a high-level CUDA C programmer’s view of a CUDA-capable GPU’s architecture. There are four key features in this architecture:

  • The GPU is organized into an array of highly threaded streaming multiprocessors (SMs).
  • Each SM has several processing units called streaming processors or CUDA cores (shown as green tiles inside the SMs) that share control logic (shown as a gray rectangle labeled Control inside each SM).
  • Each SM also contains a block of on-chip memory (shown as a gray rectangle labeled Memory) that is shared amongst the CUDA cores of that SM.
  • The GPU also has a much larger off-chip memory, known as global memory or VRAM.

Note that CUDA cores in different SMs cannot interact with each other directly or access another SM's on-chip memory (they can, however, all access the global memory).

As a programmer, it’s important to understand the relationship between programming constructs like grid/blocks/threads and different memory components. It is also important to know about the properties (like size, access latency, etc.) of these memory units.

There are five types of memory in a CUDA device:

Global Memory

  • It can be written and read by the host and device.
  • It has long latency and low bandwidth.
  • Large memory space that resides off-chip.
  • It can be accessed by all threads.

Local Memory (placed inside global memory)

  • It is similar to global memory but can only be used by a single thread.
  • Each thread has its own share of local memory where it can place data private to that thread.

Constant Memory

  • It can be written and read by the host.
  • It can only be read by the device.
  • It has short latency and high bandwidth for reads because it is aggressively cached on-chip.
  • Small memory space (~64 KB) that physically resides in off-chip device memory.
  • It can be accessed by all threads.

Shared Memory

  • It can be written and read by the threads within a block.
  • It has a short latency and high bandwidth.
  • Very small memory space that resides on-chip (~16 KB per SM on older GPUs; newer architectures provide considerably more).
  • It can be accessed by all threads in a block.

Registers

  • It can be written and read by an individual thread.
  • Extremely fast memory (orders of magnitude faster than Global and Shared memory).
  • Very small memory space that resides on-chip (~8 KB per SM on older GPUs; recent architectures have much larger register files).
  • It is private to each thread.

Latency is the time it takes for a memory access (or, more generally, for data to travel from one point to another) to complete; for GPU memories it is usually measured in clock cycles or nanoseconds. Bandwidth is the maximum amount of data that can be transferred in a given time, usually measured in GB/s for GPU memory.

All this might feel a bit too overwhelming. So, here are the key points to remember when writing an application:

  • When copying data from the host, it goes into global memory. Accessing data from global memory takes a long time.
  • Register memory access is very fast (usually about two orders of magnitude faster than global memory access).
  • Shared memory access is fast (however, slower than a register).

All memories have different functionalities, latencies, and bandwidth. It is important to understand how to declare a variable so that it will reside in the intended type of memory.
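As a quick, hedged illustration (the kernel and variable names below are mine, not part of the original example), here is how the usual CUDA declarations map onto the memory types listed above:

__constant__ float scale_factor;          // constant memory: written by the host, read-only on the device
__device__ float device_offset;           // global memory (statically declared), visible to all threads and the host API

__global__ void memory_demo(const float *in, float *out, int n)
{
    __shared__ float tile[256];           // shared memory: one copy per block (this sketch assumes blockDim.x <= 256)

    int i = blockIdx.x * blockDim.x + threadIdx.x;  // automatic variable -> register, private to this thread

    if (i < n) {
        tile[threadIdx.x] = in[i];        // in/out point to global memory allocated with cudaMalloc on the host
        __syncthreads();
        out[i] = tile[threadIdx.x] * scale_factor + device_offset;  // illustrative arithmetic only
    }
}

Automatic variables like i normally live in registers; if a kernel uses too many of them, the compiler spills the excess to local memory, which is carved out of global memory.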

Hardware-Software Interactions

Thread Blocks and SMs

When a kernel is called, the CUDA runtime system launches a grid of threads that execute the same kernel code. These threads are assigned to SMs on a block-by-block basis, i.e., all threads in a block are simultaneously assigned to the same SM. Consider the vector addition example where vectors are of length 17, and a grid is launched such that each block has 4 threads (with a total of 5 such blocks). Multiple blocks will likely get simultaneously assigned to the same SM. For example, 2 different SMs could end up with 2 and 3 blocks, respectively.
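As a rough sketch of what that launch looks like in code (vec_add_gpu_kernel and the _d device pointers are placeholder names, assumed to be set up as in the previous post):

int N = 17;
int threads_per_block = 4;                  // deliberately tiny, to match the example above
int blocks_per_grid = (N + threads_per_block - 1) / threads_per_block;  // ceil(17 / 4) = 5 blocks

// Launch 5 blocks of 4 threads each; the runtime assigns the blocks to SMs block-by-block
vec_add_gpu_kernel<<<blocks_per_grid, threads_per_block>>>(A_d, B_d, C_d, N);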

Blocks must reserve hardware resources (like registers, shared memory, etc.) to execute. Hence, a limited number of blocks can be simultaneously assigned to an individual SM.

The example discussed above is quite small. In real-world problems, there are many more blocks, and to ensure that all of them get executed, the runtime system maintains a list of blocks that have not yet been assigned to any SM and assigns blocks from this list as previously assigned blocks complete execution. This block-by-block assignment of threads guarantees that threads in the same block are executed simultaneously on the same SM, which:

  • Makes interaction between threads in the same block possible.
  • Allows (shared) memory to be shared between threads in the same block.

For a moment, it might look like an odd choice not to let threads in different blocks interact with each other. However, this restriction allows different blocks to run independently and in any order, resulting in transparent scalability: the same code can run on different hardware with different amounts of execution resources. This, in turn, reduces the burden on software developers and ensures that the application continues to run correctly on new generations of hardware while automatically benefiting from the additional resources.

Warps

In the previous section, I explained that blocks can execute in any order relative to each other, but I did not say anything about how threads inside each block are executed. Conceptually, the programmer should assume that threads in a block can execute in any order, and the correctness of the algorithm should not depend on the order in which threads are executed.

Thread scheduling in CUDA GPUs is a hardware implementation concept that varies depending on the type of hardware used. In most implementations, once a block is assigned to an SM, it is divided into 32-thread units called warps. The knowledge of warps is useful for understanding and optimizing the performance of CUDA applications.

Warp size can be different for future hardware generations!

Each warp consists of 32 consecutive threads. For a 1D block, this is straightforward: threads are partitioned into warps based on threadIdx.x, so threads 0-31 form the first warp, threads 32-63 the second, and so on.

If the number of threads in a block is not a multiple of 32, the last warp gets padded with inactive threads to fill up the 32 thread positions.
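As a small sketch (the kernel and array names are mine), each thread in a 1D block can derive its warp and lane index with simple integer arithmetic:

__global__ void warp_info_demo(int *warp_of_thread, int *lane_of_thread, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global 1D thread index

    if (i < n) {
        warp_of_thread[i] = threadIdx.x / warpSize;  // warp index within the block (warpSize is 32 on current GPUs)
        lane_of_thread[i] = threadIdx.x % warpSize;  // position within the warp (0..31)
    }
}

For example, a block of 100 threads is split into 4 warps; the last warp holds threads 96-99 plus 28 inactive padding slots.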

An SM is designed to execute all threads in a warp following the SIMD model, i.e., at any instant in time, one instruction is fetched and executed for all threads in the warp. Because one instruction is shared across multiple execution units, only a small fraction of the hardware needs to be dedicated to control, leaving a large fraction dedicated to arithmetic throughput (i.e., cores).

Control Divergence

SIMD execution on a warp means that for optimum results, all threads in a warp should follow the same execution path or control flow. Consider an if-else statement where the first 24 threads in a warp take the if route and the remaining 8 take the else route. In a situation like this, the hardware takes multiple passes through the diverging paths (one for if and one for else), consuming extra execution resources because some threads are inactive in each pass.

When threads in the same warp follow different execution paths, then it’s said that the threads exhibit control divergence. It is also important to note that as the size of data increases, the performance impact of thread divergence decreases. For example, in vector addition:

  • When the vector size is 100, 1 of the 4 warps will have control divergence, i.e., roughly 25% of the warps are affected.
  • When the vector size is 1000, only 1 of the 32 warps will have control divergence, i.e., only ~3% of the warps are affected.

In Pascal and prior architectures, these passes were sequential. However, from the Volta architecture onwards, the passes may be executed concurrently (but there’s no guarantee), and this feature is known as independent thread scheduling.
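The boundary check in the vector addition kernel is the classic source of this kind of divergence. Here is a sketch (the kernel name is mine) showing where it happens:

__global__ void vec_add_kernel(const float *A, const float *B, float *C, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // With N = 100 and 128 threads launched, threads 96-99 of the last warp take the
    // "if" path while threads 100-127 of that same warp do nothing, so only 1 of the
    // 4 warps diverges; the other 3 warps execute the body uniformly.
    if (i < N) {
        C[i] = A[i] + B[i];
    }
}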

Warp Scheduling and Latency Tolerance

There are usually more threads assigned to an SM than it has cores. This is done so that GPUs can tolerate long-latency operations (like global memory accesses). With enough warps, the SM can find a warp to execute while others are waiting for long-latency operations (like fetching data from global memory). Filling the latency time of operations from some threads with work from others is called latency tolerance or latency hiding. The selection of warps ready for execution does not introduce any computational cost because GPU hardware is designed for zero-overhead thread scheduling.

This context switching can also be done with CPU threads. In CPUs, before switching to a different thread, the current execution state is moved from registers to memory and later loaded back into registers. This saving and restoring of register contents on a CPU can incur significant overhead.

GPUs perform context switching differently: the execution state of every resident thread is held in place, in the registers assigned to that thread. The ability to tolerate such long-latency operations is why GPUs have smaller cache memories and less control logic but far more registers. In other words, the large GPU register file is not there to give each thread more storage; it acts as a staging area that keeps the state of all resident threads available so the SM can switch between warps instantly.

There is a limit (set by CUDA) on the number of warps that can be assigned to an SM. However, it is not always possible to assign an SM the maximum number of warps it supports because of constraints on execution resources (like registers and shared memory). These resources are dynamically partitioned, so an SM can execute many blocks with few threads each or a few blocks with many threads each.

For example, an Ampere A100 GPU can support

  • 32 blocks per SM
  • 64 warps (2048 threads) per SM
  • 1024 threads per block

If a grid is launched with 1024 threads in a block (maximum allowed)

  • Each SM can accommodate 2 blocks (with 2048 threads total, matching the maximum allowed per SM).

If a grid is launched with 512 threads in a block

  • Each SM can accommodate 4 blocks (with 2048 threads total, matching the maximum allowed per SM).

If a grid is launched with 256 threads in a block

  • Each SM can accommodate 8 blocks (with 2048 threads total, matching the maximum allowed per SM).

If a grid is launched with 64 threads in a block

  • Each SM can accommodate 32 blocks (with 2048 threads total, matching the maximum allowed per SM).

The ratio of the number of warps assigned to an SM to the maximum number it supports is known as occupancy.

A negative situation might arise when the maximum number of threads supported per SM is not divisible by the block size. For example, an Ampere A100 GPU can support 2048 threads per SM.

So, if a grid is launched with 700 threads in a block

  • The SM can hold only 2 blocks (totaling 1400 threads), and the remaining 648 thread slots go unused.
  • The occupancy in this case is 1400 (assigned threads) / 2048 (maximum threads) ≈ 68.36%.

In these examples, I did not consider the impact of other resources like registers and shared memory. Automatic variables declared in a CUDA kernel are stored in registers, so a kernel that uses many registers per thread might not allow a large number of blocks to be assigned to an SM.
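Rather than working these limits out by hand, the CUDA runtime can report how many blocks of a given kernel fit on an SM once registers and shared memory are accounted for. A minimal sketch, assuming my_kernel is the kernel of interest (a placeholder name) and device 0 is being queried:

int block_size = 700;
int max_blocks_per_sm = 0;

// How many blocks of my_kernel can be resident on one SM at this block size,
// given the kernel's register and shared-memory usage?
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks_per_sm, my_kernel, block_size, 0);

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);

float occupancy = (float)(max_blocks_per_sm * block_size) / prop.maxThreadsPerMultiProcessor;
printf("Resident blocks per SM: %d, occupancy: %.1f%%\n", max_blocks_per_sm, occupancy * 100.0f);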

Query Device Properties

The compute capability of a GPU indicates the amount of resources available (generally, a larger compute capability means more compute resources). I can find out the resources available on a GPU using the following API functions:

  • Number of available GPUs: cudaGetDeviceCount()

int dev_count;
cudaGetDeviceCount(&dev_count);

  • Device properties of a GPU: All available GPUs are numbered 0 to dev_count - 1. The API function cudaGetDeviceProperties() can be used to get the properties of a selected GPU.

cudaDeviceProp dev_prop;
for (int i = 0; i < dev_count; i++)
{
    cudaGetDeviceProperties(&dev_prop, i);

    // Further details of different properties
    .
    .
    .
}

For the selected device (inside the loop), I can get the following device properties:

  • Max number of threads allowed in a block: dev_prop.maxThreadsPerBlock
  • Number of SMs in the GPU: dev_prop.multiProcessorCount
  • Clock frequency of the GPU cores (in kHz): dev_prop.clockRate
  • Max number of threads allowed in each dimension of a block: X Dimension: dev_prop.maxThreadsDim[0] , Y Dimension: dev_prop.maxThreadsDim[1] , Z Dimension: dev_prop.maxThreadsDim[2]
  • Max number of blocks allowed in each dimension of a grid: X Dimension: dev_prop.maxGridSize[0] , Y Dimension: dev_prop.maxGridSize[1] , Z Dimension: dev_prop.maxGridSize[2]
  • Number of registers available per block: dev_prop.regsPerBlock (the per-SM total is available as dev_prop.regsPerMultiprocessor)
  • Warp size: dev_prop.warpSize
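Putting these pieces together, a minimal self-contained query program could look like the sketch below (which properties to print, and the formatting, are my choices):

#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
    int dev_count = 0;
    cudaGetDeviceCount(&dev_count);

    for (int i = 0; i < dev_count; i++) {
        cudaDeviceProp dev_prop;
        cudaGetDeviceProperties(&dev_prop, i);

        printf("Device %d: %s (compute capability %d.%d)\n",
               i, dev_prop.name, dev_prop.major, dev_prop.minor);
        printf("  SMs:                     %d\n", dev_prop.multiProcessorCount);
        printf("  Max threads per block:   %d\n", dev_prop.maxThreadsPerBlock);
        printf("  Max block dimensions:    %d x %d x %d\n",
               dev_prop.maxThreadsDim[0], dev_prop.maxThreadsDim[1], dev_prop.maxThreadsDim[2]);
        printf("  Max grid dimensions:     %d x %d x %d\n",
               dev_prop.maxGridSize[0], dev_prop.maxGridSize[1], dev_prop.maxGridSize[2]);
        printf("  Registers per block:     %d\n", dev_prop.regsPerBlock);
        printf("  Shared memory per block: %zu bytes\n", dev_prop.sharedMemPerBlock);
        printf("  Warp size:               %d\n", dev_prop.warpSize);
    }
    return 0;
}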

Conclusions

Now that I've explained the details of GPU architecture, I will summarize everything by answering the questions asked at the start of this blog post.

  • What is the architecture of a modern GPU?

Ans. A modern GPU has three main components: CUDA cores, memory, and control logic. CUDA cores are grouped into streaming multiprocessors (SMs), and memory is divided into registers, shared memory, and global memory (among others).

  • What are the different memory types in a GPU?

Ans. There are five types of memory in a CUDA device: Global Memory, Local Memory, Constant Memory, Shared Memory, and Registers.

  • How do CUDA blocks/threads work with different GPU hardware components?

Ans. When a kernel is called, all threads in a block are simultaneously assigned to the same SM. Once a block is assigned to an SM, it is divided into 32-thread units called warps. There are usually more threads assigned to an SM than its cores. This is done so that GPUs can tolerate long-latency operations (like global memory accesses).

  • What are the common coding practices to get the best out of a GPU?

Ans. SIMD execution on a warp suggests that for optimum results, all threads in a warp must follow the same execution path or control flow, i.e., there should not be any control divergence of threads.

  • Is there a way to determine the resources available for a GPU?

Ans. Several CUDA API functions, like cudaGetDeviceCount() and cudaGetDeviceProperties(), can be used to determine the resources available on a GPU.

Next, I will demonstrate how we can improve the performance of matrix multiplication by using the concepts learned in this blog post. If you want to stay updated with the latest blog posts, please subscribe to my email newsletter at 0mean1sigma.com.

I do not spam anyone with unnecessary emails, and you will only receive emails when I publish a blog post (usually once every 7–10 days).

I also have a YouTube channel (0Mean1Sigma) where I post similar videos every 1–2 months.
