CUDA — High-Level Guide

Ozgun Ozerk
10 min read · Oct 5, 2021


What should you expect from this post?

This post won’t teach you how to code in CUDA. In order to code in CUDA, we first have to learn how to approach a problem in a GPU setting. Think of it like this: without structuring the algorithm, we can’t even begin to code. This post aims to give you the necessary GPU mindset to approach a problem, and then construct an algorithm for it.

The Fundamental GPU Vision

The best way to compare a GPU to a CPU is to compare a sports car with a bus. A sports car can go much faster than a bus, but it can carry far fewer passengers. If we have too many passengers for the sports car, the bus can get them there faster: the sports car would have to make multiple trips, whereas the bus can carry everyone in a single one.

When to use GPU?

Say we have a big problem, chunked into parallelizable smaller pieces. Let’s refer to the count of the pieces with piece_count.

Both GPU and CPU can handle some work in parallel. Let’s refer to their maximum worker counts by GPU_worker_count and CPU_worker_count. And yes, worker is roughly a thread in this case (there will be exceptions, that is why it is roughly).

But GPU speed and CPU speed do differ. Let’s define CPU_worker_speed and GPU_worker_speed as well; these correspond to how fast a single worker runs. Now we can establish the following:

CPU_max_worker_count = min(piece_count, CPU_worker_count)

GPU_max_worker_count = min(piece_count, GPU_worker_count)

Now, recalling basic logic:

  1. The more workers you have, the faster the job will get done
  2. The faster your workers work, the faster the job will get done

CPU_performance = (CPU_max_worker_count) * (CPU_worker_speed)

GPU_performance = (GPU_max_worker_count) * (GPU_worker_speed)

By comparing GPU_performance to CPU_performance, we can select the greater one, and that processing unit will be faster for the job. As one can see, the size of the work itself, and how well it can be split into independent pieces, have a huge impact on the decision. These shortcut equations are just the longer form turned around: total_time ≈ ceil(piece_count / max_worker_count) / worker_speed, i.e. the number of batches the workers need, divided by how fast a single worker runs. Since piece_count is the same for both devices, comparing max_worker_count * worker_speed gives the same answer.
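As a rough sketch of this heuristic in code (all the numbers below are made-up placeholders; real worker counts and speeds depend on the actual hardware and problem):

#include <stdio.h>

// Illustrative heuristic only, not a benchmark.
double performance(double piece_count, double worker_count, double worker_speed) {
    double max_workers = piece_count < worker_count ? piece_count : worker_count;
    return max_workers * worker_speed;   // performance = usable workers * speed per worker
}

int main(void) {
    double piece_count = 1000000.0;                       // hypothetical problem size
    double cpu = performance(piece_count, 16.0, 3.0);     // e.g. 16 CPU workers at "speed 3"
    double gpu = performance(piece_count, 4096.0, 1.0);   // e.g. 4096 GPU workers at "speed 1"
    printf("%s looks faster for this problem\n", gpu > cpu ? "GPU" : "CPU");
    return 0;
}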

GPU Programming In A Nutshell

Threads have IDs on the GPU. To decide which piece of work goes to which thread, we do not write separate copies of the same code, nor do we send work to individual threads. In GPU programming, the same code is run by every thread; however, each thread substitutes its own ID wherever threadID appears in the code. So, to distribute work amongst threads, we have to structure our work accordingly. For example, say we want to increment each element of a vector by 1. The corresponding GPU code will be:

vector[threadID] += 1
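As a concrete sketch, that one-liner would look roughly like this as a real CUDA kernel (the kernel name increment and the length parameter n are my own additions, assuming a 1D launch):

__global__ void increment(int *vec, int n) {
    int threadID = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's global ID
    if (threadID < n) {  // guard: the last block may contain more threads than elements
        vec[threadID] += 1;
    }
}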

Whether a line of code is meant for parallel work or not, it will be run by every thread!

This is very important, since print statements added for debugging purposes will also be printed thousands of times. If we want a command to be executed by a single GPU thread, we can do it like this:

if (threadID == 0) {
    printf("Hey, I'm thread 0\n");
}

So, this code will be run by the other threads as well, but they will not enter the if statement, since their threadIDs differ from 0.

Great! But how do the GPU and CPU even communicate? Or do they have to? Let’s address that next.

CPU & GPU connection

We cannot invoke the GPU code by itself, unfortunately: the CPU has to tell the GPU to do the work. The good news is that CUDA code does not run only on the GPU; a CUDA program also contains host code that runs on the CPU. We use this host-side CUDA code to set up the GPU, send work to it, and get the results back from the GPU to the CPU. Basically (a minimal host-side sketch follows the list):

  1. we will write some CUDA code for the CPU (which is responsible for allocating memory on the GPU, and other technical details)
  2. then this code will invoke the kernels written for the GPU
  3. the GPU code will be run
  4. if we want to use the processed data (which is the case 99% of the time), we will copy the data from the GPU back to the CPU
  5. if there is more to do, we can do whatever we want with this data on the CPU as well.
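A minimal host-side sketch of these five steps could look like the following (error handling omitted; increment is the hypothetical kernel from the earlier example, and the sizes are arbitrary):

#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const int n = 1 << 20;                 // one million elements (arbitrary size)
    const size_t bytes = n * sizeof(int);
    int *h_vec = (int *)malloc(bytes);     // buffer on the CPU (host)
    for (int i = 0; i < n; i++) h_vec[i] = i;   // fill with some input data

    int *d_vec;
    cudaMalloc(&d_vec, bytes);                                // step 1: allocate GPU memory
    cudaMemcpy(d_vec, h_vec, bytes, cudaMemcpyHostToDevice);  //         copy the input to the GPU

    increment<<<(n + 255) / 256, 256>>>(d_vec, n);            // steps 2-3: launch the kernel on the GPU

    cudaMemcpy(h_vec, d_vec, bytes, cudaMemcpyDeviceToHost);  // step 4: copy the result back to the CPU

    // step 5: keep working with h_vec on the CPU...
    cudaFree(d_vec);
    free(h_vec);
    return 0;
}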

Kernels & Streams

Kernel

The code that is run by the GPU is called a kernel. We call the kernel function from the CPU, and then the code inside this kernel function is executed by the GPU. The kernel’s code is executed by all the threads of that kernel launch, simultaneously. And yes, we can select how many threads a kernel will use. However, if we send multiple kernels to the GPU (on the same stream), these kernels will be executed one after another. Here is an example:

Kernel1 → uses 1000 threads

Kernel2 → uses 4000 threads

Kernel3 → uses 50 threads

If we issue these 3 kernels to the GPU like that, Kernel1 will finish first, after that Kernel2 will be processed, and finally Kernel3 will be handled. If the kernels do not depend on each other, we can also run them concurrently, which brings us to streams.

Stream

Each kernel is issued on a Stream in the GPU. Think of streams as the lanes of a highway. If we do not specify a stream when calling a kernel, it will be issued on the default stream (say, the leftmost lane). Kernels issued on the same stream must wait for each other (without switching lanes, cars have to wait for the cars in front of them). In other words, each stream can handle one kernel at a time. By issuing kernels to different streams, we can run multiple kernels simultaneously (cars don’t have to wait for cars in other lanes).
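A rough sketch of two independent kernels issued on two different streams (kernelA, kernelB, d_a, and d_b are placeholder names, not code from this post):

cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

// the 4th launch parameter selects the stream (the 3rd is dynamic shared memory, 0 here)
kernelA<<<64, 256, 0, s1>>>(d_a);   // lane 1
kernelB<<<64, 256, 0, s2>>>(d_b);   // lane 2: may overlap with kernelA on the GPU

cudaStreamSynchronize(s1);          // wait for each lane to drain
cudaStreamSynchronize(s2);
cudaStreamDestroy(s1);
cudaStreamDestroy(s2);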

Now that we have the basics of how GPU code is invoked from the CPU, and some basic GPU terminology, let’s move on to the more essential parts.

Thread, Block, Grid

On the GPU, each thread is located in a block, and each block is located in a grid. There is only one grid per kernel launch, but we can have many blocks inside this grid, laid out in a 2-dimensional way. For example, we can have 6 blocks in a grid: 3 blocks per row, 2 blocks per column. (On modern GPUs a grid can have a third dimension too, but 2D is the common picture.)

Why would we do that?

We may want to do that in order to structure the problem in a more natural way. Say we are going to work on a 2-dimensional array instead of a 1-dimensional one. So it really depends on the problem, and we can structure our workers in a multi-dimensional way.

Similarly, for each block, we can have many threads, and these threads can be placed in a 3-dimensional fashion in a block.

How are these useful, and how do they affect the code?

Inside the code, each thread has access to (see the sketch below):

  1. its x, y, and z coordinates within its block (exposed as threadIdx),
  2. how many threads there are in its own block, i.e. the size of the block’s x, y, and z dimensions (blockDim),
  3. how many blocks there are in the grid, in the x and y dimensions (gridDim),
  4. its block’s x and y coordinates within the grid (blockIdx).
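These built-in variables are usually combined into a global index. A minimal sketch of a 2D kernel (the name scale2D, the doubling operation, and the row-major matrix layout are illustrative assumptions):

__global__ void scale2D(float *matrix, int width, int height) {
    // threadIdx gives (1), blockDim gives (2), gridDim gives (3), blockIdx gives (4)
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // global column index
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // global row index
    if (row < height && col < width) {                // guard against extra threads
        matrix[row * width + col] *= 2.0f;            // e.g. double every element
    }
}

A matching launch from the host could look like: dim3 block(16, 16); dim3 grid((width + 15) / 16, (height + 15) / 16); scale2D<<<grid, block>>>(d_matrix, width, height);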

Streaming Multiprocessors

Streaming Multiprocessors (SMs) are like the brains/managers of the GPU. Threads do not run by themselves; SMs run threads. Each SM has its own scope, and each SM has its own resources.

Each SM can issue an instruction to 32 cores at a time (this group of 32 threads is called a warp). So basically, the GPU’s maximum parallelization limit will be its SM_count x 32, and this is also why each SM can run at most 32 threads truly in parallel. In other words, the total core count of an NVIDIA GPU gives the total number of threads that can be run in parallel.

However, the concurrency limit is higher than that parallelism level. Using their caches, SMs can context switch very quickly, so the performance limit of a GPU is not purely determined by the maximum thread count that can run in parallel. Hence the earlier caveat:

And yes, worker is roughly a thread in this case (there will be exceptions, that is why it is roughly).

An advanced trick: an SM actually runs a single instruction for all 32 threads of a warp in parallel. If the instructions to be run for these 32 threads differ due to branching (an if statement can do that), the different paths will be run sequentially. This is called warp divergence, and it means performance degradation. It is really important not to branch inside the kernel. Try to write branchless code :)

An example:

Bad for the GPU (the SM may have to execute the command inside the if for some threads of the warp and skip it for others):

if (a[threadID] > 2) {
    a[threadID] += 5;
}

Good for the GPU (no matter what, SM will execute this code for all 32 threads):

a[threadID] += 5 * (a[threadID] > 2);

Let’s go back to SMs. Everything a thread uses is under the SM’s responsibility. In the next and last section, we will talk about the Memory Hierarchy, and we will see that each thread has its own registers. However, threads do not reach their registers on their own; SMs control this. Everything related to threads is done by the SM. Think of threads as puppets attached to strings pulled by an SM (which has 32 puppets in this metaphor).

Memory Hierarchy

This topic is somewhat overrated in its complexity. Yes, it is harder than the other CUDA topics, yet it is quite simple to understand when properly explained.

We are starting right away!

Definitions And Use Cases

  • Global Memory: this is the general-purpose memory. It is the default, and the biggest one. We can use it for everything; however, if some other memory type sounds more specific to what we are doing, it will probably be more efficient. Global memory should be our last-resort, general-purpose option. The Host (CPU) can directly write to Global, Constant, and Texture Memory. All the other memory types (Shared Memory, Registers, Local Memory) get their data from Global Memory, Texture Memory, or Constant Memory. In other words, we cannot write directly into, say, Shared Memory from the Host. And most of the time the intermediate step will be Global Memory (because of its huge capacity, and because it is not read-only).
  • Constant Memory: the name is quite informative. For the constant values in our program, we should use this instead of the global memory. Constant memory is way faster than global memory.
  • Texture Memory: this one is good for spatial access patterns. If we are accessing neighbors in a matrix, for example, texture memory may come in handy. The difference lies here: with the other memory types, either the rows or the columns are consecutive in memory, but not both. Texture memory is optimized for this, and grants faster access times when rows and columns are read together. However, it is read-only.
  • Shared Memory: if we are using the same data more than once (read/write, it doesn’t matter), and if the data is small enough to fit in shared memory, this should be our choice instead of global memory. Remember that threads are located in blocks, and there is another restriction for shared memory: threads cannot read/write the shared memory of another block. If we can deal with this restriction, and our data is not gigabytes in size, using shared memory instead of global memory will improve performance a lot. A good comparison: imagine reading the same data (which resides in Global Memory) 5 times. If we do not utilize Shared Memory, we access Global Memory 5 times. If we do utilize Shared Memory, we access Shared Memory 5 times + Global Memory 1 time (to copy what we need into Shared Memory, we have to access Global Memory at least once). So, if 5 Shared Memory accesses cost less than 4 Global Memory accesses, we gain performance (a small kernel sketch follows this list).
  • Registers: each thread has registers. These are the fastest memory type, as expected, and we will use them even if we don’t want to. There is one nifty detail though: each thread runs inside a warp, and threads in the same warp can read each other’s registers (via warp shuffle instructions). This warp-level register exchange is even faster than shared memory. It will be hard to create a scenario that benefits from this, but when it happens, we should not forget about this little hack!
  • Local Memory: when a kernel needs more registers than the SM has available, the overflowing variables spill into local memory. This is not nearly as fast as registers, and much, much slower than shared memory; hence it should be avoided at all costs. If we are using too many registers and data starts spilling into local memory, we should consider moving some of that data into shared memory instead.
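As a concrete illustration of the Shared Memory point above, here is a rough sketch: each block copies its slice of global memory into shared memory once, then every thread reads several neighboring elements from the fast shared copy instead of going back to global memory (the blur operation and the 5-element reuse are just illustrative, and it assumes the input length is a multiple of the block size):

#define BLOCK_SIZE 256

__global__ void blockBlur(const float *input, float *output) {
    __shared__ float tile[BLOCK_SIZE];              // shared memory, visible to this block only
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = input[i];                   // one global-memory read per thread
    __syncthreads();                                // wait until the whole tile is loaded

    float sum = 0.0f;
    for (int offset = -2; offset <= 2; offset++) {
        // clamp to this block's tile: we cannot see other blocks' shared memory
        int j = min(max((int)threadIdx.x + offset, 0), BLOCK_SIZE - 1);
        sum += tile[j];                             // 5 reads, served from fast shared memory
    }
    output[i] = sum / 5.0f;
}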

And yes! Concept-wise, that’s nearly all of it. Kudos to you, now you know how to approach a problem in GPU coding.
