Per-Thread Program Counters: A Tale of Two Registers

Nvidia Volta GPUs came in with a bag of new features. From those dazzling tensor cores to independent thread scheduling. This post discusses the effect of per-thread Program Counters (PCs) in Volta and their impact on kernel’s register consumption.

Since Tesla, the first generation of CUDA-enabled GPUs, the scheduling of all the threads within the warp has been tracked by only one program counter (PC). If you have had a branch that divides the program flow for threads within the warp, because of this single PC, the warp has to visit all the possible execution paths, and mask off threads when they semantically aren’t supposed to be active.

This is not the case for Volta (and possibly post-Volta; we gotta wait for Turing). In Volta, every thread within the warp has a Program Counter that allows it to be scheduled independent of other threads inside the warp.

Picture from “Inside Volta” by Giroux and Durant, GTC’17.

This can actually expose some interesting programming possibilities. For example, we can now synchronize threads from the same warp executing different paths within the kernel; something that would not be possible in previous CUDA-enabled GPUs, or its manual implementation could make threads wait indefinitely.

Picture from “Inside Volta” by Giroux and Durant, GTC’17.

What’s the cost to pay?

Tracking the PC for every active thread on GPU would require dedicating some resources. Where do you think these resources come from in the case of Volta? Well, it’s from the resource domain of your application. Take a look at the footnote for a table on page 18 for Volta whitepaper that says

The per-thread program counter (PC) that forms part of the improved SIMT model typically requires two of the register slots per thread.

Let’s see what it means in action. Let’s analyze a toy CUDA kernel commonly known as SAXPY (Single-precision A*X Plus Y):

__global__ void saxpy(float* out, float a, float* x, float* y) {
out[ threadIdx.x ] = a * x[ threadIdx.x ] + y[ threadIdx.x ];

Compiled for SM 6.1 (most-optimized), cuobjdump tool with -res-usage flag reports using 8 physical registers. Each of these registers can be mapped to architected registers indexed from R0 to R7 inside kernel’s SASS code (first use of registers bolded):

code for sm_61
Function : _Z5saxpyPffS_S_
.headerflags @"EF_CUDA_SM61 EF_CUDA_PTX_SM(EF_CUDA_SM61)"

/*0008*/ MOV R1, c[0x0][0x20];
/*0010*/ S2R R0, SR_TID.X;
/*0018*/ SHL R6, R0.reuse, 0x2;

/*0028*/ SHR.U32 R0, R0, 0x1e;
/*0030*/ IADD R2.CC, R6.reuse, c[0x0][0x150];
/*0038*/ IADD.X R3, R0.reuse, c[0x0][0x154];

/*0048*/ { IADD R4.CC, R6, c[0x0][0x158];
/*0050*/ LDG.E R2, [R2]; }
/*0058*/ IADD.X R5, R0, c[0x0][0x15c];

/*0068*/ LDG.E R4, [R4];
/*0070*/ IADD R6.CC, R6, c[0x0][0x140];
/*0078*/ IADD.X R7, R0, c[0x0][0x144];

/*0088*/ FFMA R0, R2, c[0x0][0x148], R4;
/*0090*/ STG.E [R6], R0;
/*0098*/ EXIT;

/*00a8*/ BRA 0xa0;
/*00b0*/ NOP;
/*00b8*/ NOP;

Compiling the same kernel for Volta (SM 7.0), however, results in 10 physical registers to be reported by cuobjdump -res-usage. Interestingly, convenient one-to-one mapping between reported physical register usage and architected registers visible in the kernel binary is not there anymore. If we include RZ as one of architected registers, and assume the number of consumed registers is rounded up to the closest even integer (i.e., register allocation granularity 2), where are R3 and R5??

code for sm_70
Function : _Z5saxpyPffS_S_
.headerflags @"EF_CUDA_SM70 EF_CUDA_PTX_SM(EF_CUDA_SM70)"
/*0000*/ @!PT SHFL.IDX PT, RZ, RZ, RZ, RZ;

/*0010*/ MOV R1, c[0x0][0x28];

/*0020*/ S2R R6, SR_TID.X;

/*0030*/ MOV R7, 0x4;

/*0040*/ IMAD.WIDE.U32 R2, R6.reuse, R7.reuse, c[0x0][0x170];

/*0050*/ IMAD.WIDE.U32 R4, R6, R7, c[0x0][0x178];

/*0060*/ LDG.E.SYS R2, [R2];

/*0070*/ LDG.E.SYS R4, [R4];

/*0080*/ IMAD.WIDE.U32 R6, R6, R7, c[0x0][0x160];

/*0090*/ FFMA R0, R2, c[0x0][0x168], R4;

/*00a0*/ STG.E.SYS [R6], R0;

/*00c0*/ BRA 0xc0;

/*00d0*/ NOP;

/*00e0*/ NOP;

/*00f0*/ NOP;

This whole thing means that for every active thread you’ve got on your kernel, like it or not, two of your precious registers are gone. How bad is that? Well, in the full theoretical occupancy where 2048 threads are active on an SM, 16 KB of the total 256 KB of SM registers (one sixteenth) are out of your control.

Is it a big deal?

Well it really depends on the application. If your kernel doesn’t consume more than 30 registers per thread, you definitely won’t affect the theoretical occupancy. And you may actually (actually!?) benefit from the provided feature. If the occupancy of your kernel is super-important, if your app is near one of the steps in the occupancy calculator chart, and if it is the register usage (not shared memory usage or thread-block size) that is affecting the theoretical occupancy (that’s a lot of ifs), the benefits provided by independent thread scheduling may be cancelled out by the lack of enough resident threads on the SM. I am sure there have been internal discussions and application benchmarking within Nvidia when they intended to do such a thing. And they probably made sure that benefits for the majority of the apps they care about outweighed the drawbacks.

How to disable this independent thread scheduling?

If you’re targeting CUDA Compute Capability (CC) 7.0 or higher for the compilation of your app, it is enforced by the compiler and you cannot disable it. A way around would be to compile the app for the architecture with CC lass than 7.0; well, in that case you lose other cool features available only in CC 7.0 (and maybe higher).