CUDA — GPU Device Architecture

Raj Prasanna Ponnuraj · Published in Analytics Vidhya · Sep 25, 2020
NVIDIA Turing Architecture

In this post we shall talk about the basic architecture of an NVIDIA GPU and how its available resources can be used optimally in parallel programming. This is Part 3 of the series; the previous posts are Part 1 and Part 2.

I’m using an NVIDIA GeForce GTX 1650 GPU, which is based on the NVIDIA Turing architecture. Most of the details in this post are therefore specific to Turing, but I’ll try to generalise the concepts.

The compute cores in a GPU are grouped into units called Streaming Multiprocessors (SMs for short). The GTX 1650 has 14 SMs, each with 64 CUDA cores (FP32 cores) and 64 INT32 cores. Each SM has its own warp schedulers, dispatch units, registers, shared memory and L1 cache.
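
If you want to verify these numbers on your own card, the CUDA runtime exposes them through cudaGetDeviceProperties. A minimal sketch (device 0 is assumed; compile with nvcc):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    printf("Device:               %s\n", prop.name);
    printf("SM count:             %d\n", prop.multiProcessorCount);
    printf("Warp size:            %d\n", prop.warpSize);
    printf("Registers per SM:     %d\n", prop.regsPerMultiprocessor);
    printf("Shared memory per SM: %zu bytes\n", prop.sharedMemPerMultiprocessor);
    printf("Max threads per SM:   %d\n", prop.maxThreadsPerMultiProcessor);
    return 0;
}

On a GTX 1650 this should report 14 SMs and a warp size of 32.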

Streaming Multiprocessor

The execution model of a GPU is Single Instruction Multiple Threads (SIMT). Threads are executed in groups called warps; the warp is the basic unit of execution on a GPU. Generally, the number of threads in a warp (the warp size) is 32. Even if only one thread has work to do, the warp scheduler still launches a full warp of 32 threads, with that single thread active. Hence, we should try to keep all the threads in a warp active for better utilisation of GPU resources.
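
As a quick illustration, here is a hypothetical kernel launch where the block size is chosen as a multiple of 32, so no warp is launched partially populated (the kernel, names and sizes are made up for this example):

#include <cuda_runtime.h>

// Scale an array by a factor; threads past the end of the array do nothing.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main() {
    const int n = 1000;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    int threadsPerBlock = 128;                                 // 4 full warps per block
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up to cover n
    scale<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);

    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}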

Based on its readiness, a warp is classified into one of three states:

  1. Selected warp — a warp that is actively executing
  2. Eligible warp — a warp with all its operands available, ready to execute but waiting for a scheduler slot
  3. Stalled warp — a warp that is not ready for execution

Memory Hierarchy

(Image courtesy: ResearchGate)

Registers — the registers are 32 bits wide, and a maximum of 255 registers can be allocated to a single thread. In total, 64K 32-bit registers (a 256 KB register file) are available per SM.

Shared Memory — shared memory is allocated per thread block, and up to 64 KB of shared memory is available per SM.

The registers and shared memory have much lower access latency than global memory. Hence, we should make optimal use of the available registers and shared memory for latency hiding.
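
For instance, a common pattern is to stage data in shared memory once and then reuse it from there instead of re-reading global memory. A minimal sketch, assuming the kernel is launched with 256 threads per block (the kernel name and sizes are illustrative):

// Sum each block's slice of the input using low-latency shared memory.
__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float tile[256];                  // one element per thread
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // single global read per element
    __syncthreads();                             // wait until the tile is full

    // Tree reduction carried out entirely in shared memory
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            tile[threadIdx.x] += tile[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];               // one partial sum per block
}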

Latency Hiding

There are two types of latency — compute latency and memory transfer latency.

Compute latency

In the Volta and Turing architectures, core arithmetic operations take 4 clock cycles to execute. So we need 4 warps per warp scheduler in the pipeline to hide this latency. With 4 warp schedulers per SM, that means 16 warps, or 512 threads (16 × 32), per SM for 100% utilisation of the compute cores.

The GTX 1650 has 14 SMs, so theoretically 224 warps (14 × 16), i.e. 7,168 threads, are needed to hide compute latency.
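
These numbers are easy to recompute for any device. A small host-side sketch, with the 4-cycle latency and 4 schedulers per SM hard-coded as the stated assumptions:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int latencyCycles   = 4;   // assumed arithmetic instruction latency
    const int schedulersPerSM = 4;   // Turing SMs have 4 warp schedulers

    int warpsPerSM = latencyCycles * schedulersPerSM;        // 16
    int warpsTotal = warpsPerSM * prop.multiProcessorCount;  // 224 on a GTX 1650
    printf("Warps to hide compute latency: %d (%d threads)\n",
           warpsTotal, warpsTotal * prop.warpSize);          // 7168 threads
    return 0;
}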

Memory transfer latency

We can assume a global memory transfer latency of about 350 clock cycles.

To calculate the number of warps required to hide this latency, we need to know the memory bandwidth and memory clock rate of our device. Both can be obtained with the following CLI command:

$ nvidia-smi -q -d CLOCK

On my system the GPU memory clock rate is 4 GHz, and the GTX 1650 (GDDR5) has a memory bandwidth of 128 GB/s. Dividing bandwidth by clock rate gives 32 bytes transferred per clock cycle.

So, in 350 clock cycles, 350 × 32 = 11,200 bytes can be in flight. Each thread loads 2 FP32 operands of 4 bytes each, i.e. 8 bytes, so hiding the latency takes 11,200 / 8 = 1,400 threads in total, or 100 threads per SM (1,400 / 14). That is about 4 warps per SM (100 / 32 ≈ 3.125, rounded up) to hide the memory transfer latency.
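
The same arithmetic written out as a sketch; every constant below is one of the assumptions stated above, not a value queried from the device:

#include <cstdio>

int main() {
    const double latencyCycles  = 350;    // assumed global memory latency
    const double bytesPerCycle  = 32;     // 128 GB/s at a 4 GHz memory clock
    const double bytesPerThread = 2 * 4;  // 2 FP32 operands of 4 bytes each
    const int    smCount        = 14;     // GTX 1650
    const int    warpSize       = 32;

    double bytesInFlight = latencyCycles * bytesPerCycle;       // 11,200 B
    double threadsNeeded = bytesInFlight / bytesPerThread;      // 1,400 threads
    double warpsPerSM    = threadsNeeded / smCount / warpSize;  // ~3.125
    printf("Warps per SM to hide memory latency: %.2f (round up to 4)\n",
           warpsPerSM);
    return 0;
}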

Occupancy

Occupancy is the ratio of active warps per SM to the maximum number of warps allowed per SM.

If a kernel is compute bound or memory bound, increasing the occupancy ratio can improve performance. The occupancy of a kernel can be calculated as follows.

First, find the register and shared memory usage of the kernel with the following command:

$ nvcc --ptxas-options=-v -o output.exe cuda_file.cu

Then feed the values obtained into NVIDIA’s Occupancy Calculator tool. The tool is basically an Excel file with built-in macros and can be downloaded from NVIDIA’s developer site.
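
Alternatively, the CUDA runtime can report a kernel’s theoretical occupancy directly via cudaOccupancyMaxActiveBlocksPerMultiprocessor; this is not part of the spreadsheet workflow above, just a programmatic route. A minimal sketch, where mykernel and the block size of 256 are placeholders:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void mykernel(float *x) { /* your kernel body */ }

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int blockSize = 256, numBlocks = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &numBlocks, mykernel, blockSize, 0 /* dynamic shared memory */);

    int activeWarps = numBlocks * blockSize / prop.warpSize;
    int maxWarps    = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("Theoretical occupancy: %.0f%%\n", 100.0 * activeWarps / maxWarps);
    return 0;
}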

Caution — if the kernel is not compute bound or memory bound, increasing the occupancy will not necessarily improve performance. At times it may even degrade performance by adding extra instructions or divergent code.

In the next part, I’ll talk about warp divergence and a few ways to minimise it. We shall also discuss how NVIDIA Nsight Compute can be used to profile our kernels in order to optimise them.
