What I wish I knew when I started programming CUDA (Part 1)

Robby
5 min read · Jul 11, 2020


Back when I learned CUDA I found a plethora of online material and resources, yet felt somewhat lost in the jungle. CUDA was the first GPU programming language I ever learned, as well as my first encounter with parallel programming, so I naturally stumbled a lot, running into bugs, not sure what to look out for or how to think about certain aspects of the language. If you find yourself in the same position, maybe this article will help you a bit, or at least deepen your understanding.

Photo by Fatos Bytyqi on Unsplash

Which learning material to use: When I learned CUDA I quickly found a lot of material. The problem, however, was that most of it targeted older releases while newer versions of CUDA already existed, making much of the older material obsolete. On top of that, many resources were too detailed and cluttered my attention with specifics that only matter later. I personally found the article series written by Mark Harris enough to get started and beyond. While CUDA has progressed since then, the content is still up to date, and the newer additions to CUDA are mostly details for more specific use cases. I still check it out once in a while for good examples. In case the series doesn’t help, I highly recommend the CUDA best practices guide, and for those interested in the further inner workings of CUDA, the CUDA toolkit documentation. And finally, I recently came across this section from Tutorialspoint, which I found very appealing at first sight.

SMs, the Grid, Blocks and Threads: This was a point I was particularly confused about. Grids, blocks, and threads made a whole lot of sense to me, but I did not see how they connect to the Streaming Multiprocessors (SMs) that NVIDIA keeps referring to. Let’s start with the grid, blocks and threads, since they are easily explained.

The grid with its blocks and threads.

In order to reason about the resources on the GPU, CUDA introduces the grid, which can be thought of as a board consisting of smaller computing units. Each unit of the grid is a block, which in turn hosts multiple threads. These make up the execution configuration of a kernel invocation, e.g.

foo<<<512, 1024>>>(output, inputA, inputB, 500);

runs on 512 blocks with 1024 threads each. That is 512 * 1024 = 524288 threads in total, each executing the kernel once. It should be noted that this number is fixed at launch. For example, take the kernel below.
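The kernel itself is not reproduced in this text, so what follows is only a minimal sketch of the kind of element-wise kernel the surrounding description implies; the signature with output, inputA, inputB and N is assumed from the launch above.

__global__ void foo(float *output, const float *inputA, const float *inputB, int N)
{
    // Global index of this thread within the whole grid.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        output[i] = inputA[i] + inputB[i]; // each thread handles exactly one element
}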

If N, the length of the input and output arrays, were larger than 524288, not every element in the arrays would be processed. How would the kernel know the length of the arrays anyway? After all, in the normal case we hand it raw pointers, and while more complex data structures are possible, they are generally discouraged, for reasons I will talk about later.
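A common way around this, assuming the element-wise kernel sketched above, is to derive the number of blocks from N at launch time instead of hard-coding it:

int threadsPerBlock = 1024;
int blocks = (N + threadsPerBlock - 1) / threadsPerBlock; // round up so every element gets a thread
foo<<<blocks, threadsPerBlock>>>(output, inputA, inputB, N);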

So how do the SMs come into the equation? According to the CUDA Programming Guide we get the following:

When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed to multiprocessors with available execution capacity. The threads of a thread block execute concurrently on one multiprocessor, and multiple thread blocks can execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors.

That is, the grid with its blocks and threads serves as an interface NVIDIA gives us to abstract their hardware, and the blocks with their threads are then mapped internally onto the SMs. The main reason I can think of for this design is scalability: NVIDIA can expose one simple interface for all CUDA capable devices across all generations.
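The article does not show this, but if you are curious how many SMs your own device has, a small query with the CUDA runtime API does the trick; this sketch assumes device 0:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0); // properties of device 0
    printf("%s has %d SMs\n", prop.name, prop.multiProcessorCount);
    return 0;
}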

Warps: When I started learning CUDA I got confused about warps and what they represent. The concept is quite important, though, and understanding it will deeply enhance your understanding of why NVIDIA gives the advice they do. To be honest, I underestimated the power of a warp, so let’s dive in so you can get better results faster.

A warp is pretty much like the AVX concept introduced by Intel and adopted by AMD, that is, a set of processor instructions together with the necessary hardware. The idea is that extra-wide registers can hold multiple instances of the same data type. Say, for example, a floating point number occupies 32 bits on the machine; then a 128-bit SIMD register can hold 4 floating point numbers, and simple arithmetic operations can be parallelised by operating on registers that each hold 4 floating point numbers at once. This concept is also called SIMD, which stands for Single Instruction Multiple Data. The figure below depicts the data flow.

Simple AVX arithmetics.
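As a concrete illustration (not from the original article), here is a minimal sketch in host C++ using the 128-bit SIMD intrinsics from immintrin.h; the function name add4 is just illustrative:

#include <immintrin.h>

// Add two arrays of 4 floats each with a single SIMD add instruction.
void add4(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);    // load 4 floats from a
    __m128 vb = _mm_loadu_ps(b);    // load 4 floats from b
    __m128 vc = _mm_add_ps(va, vb); // one instruction adds all 4 lanes
    _mm_storeu_ps(out, vc);         // store 4 results at once
}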

Now in CUDA a similar concept applies, called SIMT, which stands for Single Instruction Multiple Threads. That is, the GPU tries to execute the same instruction for multiple threads at the same time, more precisely the instructions of 32 threads. Those 32 threads are grouped together, and that group is what is called a warp. So take again the kernel below:
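Again only a sketch, since the original listing is not reproduced here; it assumes the same element-wise vector add as above.

__global__ void foo(float *output, const float *inputA, const float *inputB, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        output[i] = inputA[i] + inputB[i];
}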

Here, inputA and inputB are two arrays of type float and length N, and output is an array of length N where the result is stored. Now every time the output is assigned, the GPU does not merely fetch inputA[i] for a single thread, but the elements for all 32 threads of the warp, and the same for inputB, and it executes the addition simultaneously on all 32 floating point numbers with one single instruction. Then all 32 results are stored simultaneously in the output array. In summary, a warp is a group of 32 threads fetching similar data and executing similar instructions on it simultaneously, increasing throughput by roughly a factor of 32.

And this concludes my post for today. In my following post I am going to talk a bit about the consequences of warps and how to best deploy them. Thank you for reading, and stay tuned.

