Demystifying GPUs for CPU-centric programmers

Pekka Enberg
Feb 27, 2024 · 7 min read


Over the years, I have learned a lot about how CPUs execute code and what their internals look like because I have worked on projects such as the Linux kernel and ScyllaDB, which are close to the metal. I’ve even played a bit with Verilog in a futile attempt to build my own RISC-V core.

Unlike CPUs, GPUs have mostly been a black box to me, despite some exposure to them. I remember playing around with an NVIDIA RIVA 128 or something similar with DirectX back when GPUs were still just 3D graphics accelerators, and I have tried to keep up with the times by doing some basic shader programming on contemporary GPUs. But I never really went deep on GPUs and therefore still think in CPU-centric terms.

However, with the explosion of AI workloads, including large language models (LLMs), GPUs are becoming essential for modern computing. AI workloads are essentially massive applications of tensor operations such as matrix addition and multiplication, which is exactly the kind of job GPUs are built for. But how does a modern GPU execute them, and why are GPUs so much more efficient at these workloads than CPUs?

Defining kernels to run on GPU

To dive into the world of GPUs, let’s look at CUDA, NVIDIA’s programming language that extends C to exploit data parallelism on GPUs. While most AI programming today happens through high-level libraries such as PyTorch or TensorFlow, CUDA is still great to know for building intuition about how GPUs operate.

In CUDA, you write code for the CPU (host code) or the GPU (device code). CPU code is mostly just plain C, but CUDA extends the language in two ways: it lets you define functions that run on the GPU, called kernels, and it provides a way to launch those kernels on the GPU. If you are wondering why GPU functions are called kernels, it’s because a kernel defined in CUDA is executed in parallel by many threads, unlike a regular function, which executes serially.

If we take the example of vector addition, where you perform an element-wise addition of vectors A and B into an output vector C, the kernel (the thing that runs on the GPU) would look something like this:

__global__
void vecAddKernel(float *A, float *B, float *C, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

The __global__ keyword in CUDA marks a C function as a kernel that runs on the GPU. The input vectors A and B and the output vector C are passed as parameters, and all of them are n elements in size.

The part where the kernel is doing actual vector addition is:

if (i < n) C[i] = A[i] + B[i];

But what are the blockDim, blockIdx, and threadIdx variables that the kernel uses to calculate the element index i and to perform the bounds check?

Launching kernels on GPU from CPU

To answer the question, let’s first jump to the host code that is needed to launch the kernel, which would look something like this:

void vecAdd(float *A, float *B, float *C, int n) {
    float *A_d, *B_d, *C_d;
    int size = n * sizeof(float);

    // Allocate device (GPU) memory for the input and output vectors.
    cudaMalloc((void **) &A_d, size);
    cudaMalloc((void **) &B_d, size);
    cudaMalloc((void **) &C_d, size);

    // Copy the input vectors from host memory to device memory.
    cudaMemcpy(A_d, A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(B_d, B, size, cudaMemcpyHostToDevice);

    // Launch the kernel on ceil(n/256) blocks of 256 threads each.
    vecAddKernel<<<ceil(n/256.0), 256>>>(A_d, B_d, C_d, n);

    // Copy the result back to host memory and release the device memory.
    cudaMemcpy(C, C_d, size, cudaMemcpyDeviceToHost);

    cudaFree(A_d);
    cudaFree(B_d);
    cudaFree(C_d);
}

As there are no special keywords here, this is just a plain C function with the same parameters. The first thing you probably notice is the cudaMalloc, cudaMemcpy, and cudaFree calls, which are part of CUDA’s heterogeneous memory management. Remember, this function executes on the CPU, which has access to host memory. But the GPU has its own memory, so you must allocate memory on the GPU and copy the input vectors over. When the GPU kernel completes, you need to copy the results back to host memory and free the memory used on the GPU.
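For brevity, the example ignores error handling, but each of these CUDA runtime calls returns a cudaError_t that you can check. A minimal sketch of what that might look like (the CHECK_CUDA macro is my own shorthand, not part of CUDA):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a readable message if a CUDA call fails.
#define CHECK_CUDA(call) \
    do { \
        cudaError_t err = (call); \
        if (err != cudaSuccess) { \
            fprintf(stderr, "CUDA error at %s:%d: %s\n", \
                    __FILE__, __LINE__, cudaGetErrorString(err)); \
            exit(EXIT_FAILURE); \
        } \
    } while (0)

// For example, wrap the calls in vecAdd like this:
// CHECK_CUDA(cudaMalloc((void **) &A_d, size));
// CHECK_CUDA(cudaMemcpy(A_d, A, size, cudaMemcpyHostToDevice));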

Also, the CUDA syntax for launching the kernel looks like a weird templatized function call, but what are the two additional configuration parameters? The first configures the number of blocks in a grid, and the second specifies the number of threads in a block. But what on earth is a grid? In CUDA, a kernel executes on a grid, which is a set of blocks of threads. When we launch a kernel, we define the grid with these two configuration parameters.
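If the angle-bracket syntax feels too magical, the same launch can be written with CUDA’s dim3 type, which spells out the grid and block dimensions explicitly; this sketch is equivalent to the shorthand used above for our one-dimensional case:

// Equivalent launch with explicit dim3 dimensions; the unused y and z
// dimensions are set to 1 because our vector is one-dimensional.
dim3 dimGrid(ceil(n / 256.0), 1, 1);
dim3 dimBlock(256, 1, 1);
vecAddKernel<<<dimGrid, dimBlock>>>(A_d, B_d, C_d, n);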

And if we go back to the kernel definition, we can see the special variables blockDim, blockIdx, and threadIdx in use; this is where each thread determines which part of the grid it is working on.

__global__
void vecAddKernel(float *A, float *B, float *C, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

The blockDim variable specifies the block dimensions, because CUDA allows blocks to be up to three-dimensional. We only have a one-dimensional block in our vector example, so that part is straightforward. The blockIdx variable tells us which block within the grid a thread belongs to, and the threadIdx variable tells us which thread within that block is executing.
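To see why the extra dimensions exist at all, here is a rough sketch of what the index calculation could look like in a two-dimensional case, such as element-wise addition of two n-by-n matrices (matAddKernel is a hypothetical example, not something the vector addition needs):

__global__
void matAddKernel(float *A, float *B, float *C, int n) {
    // Each thread computes one element of an n x n matrix, using both
    // the x and y dimensions of the grid and block to find its position.
    int col = blockDim.x * blockIdx.x + threadIdx.x;
    int row = blockDim.y * blockIdx.y + threadIdx.y;
    if (row < n && col < n)
        C[row * n + col] = A[row * n + col] + B[row * n + col];
}

For the vector example, only the x dimension matters.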

In the kernel launch, we had the following, which means there are n/256 blocks, rounded up to the nearest integer, with 256 threads per block.

void vecAdd(float *A, float *B, float *C, int n) {
    // ...
    vecAddKernel<<<ceil(n/256.0), 256>>>(A_d, B_d, C_d, n);
    // ...
}

For example, if the vector size n is 1000, we get four blocks with 256 threads each, for a total of 1024 hardware threads. And that’s where the bounds check i < n in the kernel definition comes in handy: the first 1000 threads perform an addition, while the remaining 24 threads are essentially a no-op.
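To make that concrete, here is a sketch of what calling vecAdd from the host might look like for n = 1000 (this main function is my own illustration, not part of the original example):

#include <stdlib.h>

int main(void) {
    int n = 1000;
    float *A = (float *)malloc(n * sizeof(float));
    float *B = (float *)malloc(n * sizeof(float));
    float *C = (float *)malloc(n * sizeof(float));

    // Fill the input vectors with some values.
    for (int i = 0; i < n; i++) {
        A[i] = (float)i;
        B[i] = (float)(2 * i);
    }

    // Launches ceil(1000/256) = 4 blocks of 256 threads, i.e. 1024 threads.
    vecAdd(A, B, C, n);

    free(A);
    free(B);
    free(C);
    return 0;
}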

So how is the execution on the GPU different from the CPU?

If we were to implement the same vector addition on a CPU, we would start with something like the following, which uses an explicit loop to iterate over all n elements, performing one addition per iteration:

void vecAdd(float *A, float *B, float *C, int n) {
    for (int i = 0; i < n; i++) {
        C[i] = A[i] + B[i];
    }
}

You might then attempt to speed up the function with SIMD instructions, either by manually unrolling the loop and using SIMD via compiler intrinsics or assembly, or by turning on your compiler’s auto-vectorization flags so the compiler does it for you. Either way, the loop in the function still runs serially, processing one batch of elements at a time.
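For illustration, a manually vectorized version might look something like the following sketch, assuming an x86 CPU with AVX support (this is just one way to write it, not something from the original example):

#include <immintrin.h>

void vecAddAVX(float *A, float *B, float *C, int n) {
    int i = 0;
    // Process eight floats per iteration using 256-bit AVX registers.
    for (; i + 8 <= n; i += 8) {
        __m256 a = _mm256_loadu_ps(&A[i]);
        __m256 b = _mm256_loadu_ps(&B[i]);
        _mm256_storeu_ps(&C[i], _mm256_add_ps(a, b));
    }
    // Handle the remaining tail elements one by one.
    for (; i < n; i++)
        C[i] = A[i] + B[i];
}

Each iteration adds eight elements at once, but the iterations themselves still execute one after another on a single core.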

With GPUs, however, we specify the parallelism as part of the kernel launch, and the CUDA kernel only defines the serial part of the vector addition that runs on each GPU hardware thread. That is, on the GPU there is no loop for vector addition. Instead, every element-wise addition runs in its own thread. And now, if you’re a CPU-centric programmer, you’re probably wondering how that can ever be fast.

That’s because a GPU hardware thread costs very little compared to the operating system threads you would use on a CPU. GPUs internally have a large register file compared to CPUs, which means switching between threads is almost free because there’s no need to save and restore registers. Furthermore, thread scheduling happens in hardware, so there’s no need to transition into privileged kernel code to schedule. What this all adds up to is that GPU thread creation and scheduling takes a few GPU cycles, whereas a POSIX thread, for example, can take microseconds to create and schedule.

Furthermore, GPUs push parallelism even further with warps: sets of 32 hardware threads that all execute the same instruction at the same time. You can now imagine 32 ALUs executing the per-element additions in parallel while sharing the same control logic. For a CPU-centric programmer, think of it as a really wide SIMD instruction. (Control flow within a warp can also diverge, but the GPU thread scheduler handles that as well, by executing only parts of the warp at a time, depending on which path each thread takes.)
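As a contrived sketch of what divergence means, consider a hypothetical kernel where neighboring threads in a warp take different branches (divergeKernel is my own example, and not something you would want in performance-critical code):

__global__
void divergeKernel(float *data, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        // Even and odd threads in the same warp take different paths, so
        // the hardware runs the two paths one after the other, masking off
        // the threads that did not take the path currently being executed.
        if (i % 2 == 0)
            data[i] = data[i] * 2.0f;
        else
            data[i] = data[i] + 1.0f;
    }
}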

However, a key difference from SIMD is that if there is a cache miss on the GPU, execution can quickly switch to other warps and make progress while the memory loads are in flight. In other words, GPUs optimize for high throughput, where you throw a massive number of hardware threads at a data-parallel problem, whereas CPUs optimize for low-latency execution of individual instructions.

tl;dr

If you’re a CPU-centric programmer, one way to think about the execution model of GPUs is that instead of looping over a data set, you break the job down into lots and lots of threads, each working on a subset of the data. Of course, there are some key differences on a GPU compared to a CPU. First of all, you define a kernel instead of a function and specify the parallelism explicitly when the kernel is launched. Context switching between GPU threads is also orders of magnitude cheaper than between OS threads, which is why you can just throw lots and lots of threads at a problem. Finally, GPUs further optimize thread execution by grouping threads into warps, which execute on parallel ALUs using shared control logic. And if there is a cache miss, the GPU can quickly switch to another warp while the memory load completes in the background.
