Per-Thread Program Counters: A Tale of Two Registers

Farzad Khorasani
May 9, 2018 · 4 min read

Nvidia Volta GPUs arrived with a bag of new features, from those dazzling Tensor Cores to independent thread scheduling. This post discusses per-thread Program Counters (PCs) in Volta and their impact on a kernel’s register consumption.

Since Tesla, the first generation of CUDA-enabled GPUs, the scheduling of all threads within a warp has been tracked by a single program counter (PC). If a branch divides the control flow of the threads within a warp, this single PC forces the warp to visit every possible execution path, masking off the threads that are not semantically active on that path.

This is not the case for Volta (and presumably post-Volta architectures; we gotta wait for Turing to know). In Volta, every thread within a warp has its own program counter, which allows it to be scheduled independently of the other threads inside the warp.

Picture from “Inside Volta” by Giroux and Durant, GTC’17.

This actually exposes some interesting programming possibilities. For example, we can now synchronize threads of the same warp while they execute different paths within the kernel; something that was not possible on previous CUDA-enabled GPUs, where a manual implementation could leave threads waiting indefinitely.

Picture from “Inside Volta” by Giroux and Durant, GTC’17.
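As a sketch of what this enables (an example of my own, not from the talk; `__syncwarp()` is the warp-level barrier introduced in CUDA 9), threads on the two sides of a branch can synchronize with each other before the branch reconverges. On pre-Volta hardware a full-warp barrier issued from divergent paths like this would deadlock:

```cuda
// Sketch: cross-branch warp synchronization, legal on Volta (CC 7.0+).
// The first half of the warp produces values that the second half consumes;
// both halves meet at __syncwarp() while still on divergent paths.
__global__ void divergent_sync(int *data) {
    int lane = threadIdx.x % 32;
    if (lane < 16) {
        data[lane] = lane;             // producer branch
        __syncwarp();                  // waits for lanes 16..31 too
    } else {
        __syncwarp();                  // joins the same barrier from the other path
        data[lane] = data[lane - 16];  // consumer branch: reads the produced values
    }
}
```

`__syncwarp()` with the default full mask orders the memory accesses of all 32 lanes, so the reads in the consumer branch see the producer branch’s writes.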

What’s the cost to pay?

Tracking the PC for every active thread on the GPU requires dedicating some resources. Where do you think these resources come from in Volta’s case? Well, from the resource budget of your application. Take a look at the footnote of the table on page 18 of the Volta whitepaper, which says:

The per-thread program counter (PC) that forms part of the improved SIMT model typically requires two of the register slots per thread.

Let’s see what this means in action by analyzing a toy CUDA kernel commonly known as SAXPY (Single-Precision A·X Plus Y):

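The kernel listing itself is a minimal SAXPY (a sketch; the exact signature in the post’s original listing may differ):

```cuda
// A minimal SAXPY kernel: y[i] = a * x[i] + y[i].
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

// Compile and inspect register usage, e.g.:
//   nvcc -arch=sm_61 -cubin saxpy.cu -o saxpy.cubin
//   cuobjdump -res-usage saxpy.cubin
```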
Compiled for SM 6.1 with full optimization, the cuobjdump tool with the -res-usage flag reports 8 physical registers in use. Each of these registers maps one-to-one to an architected register, indexed R0 through R7, inside the kernel’s SASS code:

Compiling the same kernel for Volta (SM 7.0), however, results in 10 physical registers being reported by cuobjdump -res-usage. Interestingly, the convenient one-to-one mapping between the reported physical register usage and the architected registers visible in the kernel binary is gone. Even if we count RZ as one of the architected registers and assume the number of consumed registers is rounded up to the nearest even integer (i.e., a register-allocation granularity of 2), where are R3 and R5??

All of this means that for every active thread in your kernel, like it or not, two of your precious registers are gone. How bad is that? Well, at full theoretical occupancy, with 2048 threads active on an SM, 16 KB of the SM’s 256 KB of registers (one sixteenth) is out of your control.

Is it a big deal?

Well, it really depends on the application. If your kernel doesn’t consume more than 30 registers per thread, you definitely won’t affect the theoretical occupancy: the Volta SM register file holds 65,536 registers, or 32 per thread at the 2048-thread maximum, which leaves room for 30 kernel registers plus the 2 PC slots. And you may actually (actually!?) benefit from the provided feature. But if the occupancy of your kernel is super-important, if your app sits near one of the steps in the occupancy-calculator chart, and if it is register usage (not shared-memory usage or thread-block size) that limits the theoretical occupancy (that’s a lot of ifs), the benefits of independent thread scheduling may be cancelled out by the lack of enough resident threads on the SM. I am sure there were internal discussions and application benchmarking within Nvidia before they committed to such a change, and they presumably made sure the benefits for the majority of the apps they care about outweighed the drawbacks.

How to disable this independent thread scheduling?

If you’re targeting CUDA Compute Capability (CC) 7.0 or higher for the compilation of your app, independent thread scheduling is enforced by the compiler and you cannot disable it. A way around is to compile the app for an architecture with CC less than 7.0; in that case, however, you lose the other cool features available only in CC 7.0 (and maybe higher).
