The new techbro generation ignores the older C tricks.

Jose Crespo
2 min read · Apr 1, 2024

I recently came across some GPU code sent to me for review. Unfortunately, over the last few years I have watched the quality of NVIDIA CUDA code decline alarmingly, to the point that older GPUs running carefully written kernels can outmatch newer models with a much higher compute capability (CC).

As an example, the simple technique of unrolling loops to access the dataset in chunks, with nothing more than trivial bitwise operations for the index arithmetic, can boost performance by 50 to 100% in many scenarios. Yet newer CUDA developers seem to have forgotten these older C tricks and simply expect the magic of CUDA to solve the problem under the hood.
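
To make the point concrete, here is a minimal sketch of that pattern; the kernel name, unroll factor, and launch configuration are my own illustration, not code from the reviews I mention.

```cuda
#include <cuda_runtime.h>

// Sketch: a grid-stride loop unrolled by 4, with the loop advance written as
// a shift (stride << 2, i.e. stride * 4). The four accesses per iteration are
// `stride` apart, so neighboring threads still touch neighboring addresses.
__global__ void scale_unrolled(float* data, float factor, int n)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    int i = tid;
    for (; i + 3 * stride < n; i += (stride << 2)) {
        data[i]              *= factor;
        data[i + stride]     *= factor;
        data[i + 2 * stride] *= factor;
        data[i + 3 * stride] *= factor;
    }
    // Tail: elements the unrolled loop did not cover.
    for (; i < n; i += stride)
        data[i] *= factor;
}

// Host-side launch, e.g.: scale_unrolled<<<64, 256>>>(d_data, 2.0f, n);
```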

Another worrisome tendency is an incomplete understanding of the memory hierarchy: sloppy use of global memory where shared memory is the obvious choice, register spilling into local memory, texture memory ignored entirely, and almost no use of C++ RAII wrappers that automatically free device memory in complex scenarios. Most alarming of all is watching very expensive GPU models sit with most of their threading capability idle exactly when the workload is most critical, because threads of the same warp follow different execution paths through poorly coded conditional branches; or, at the other extreme, plenty of threads are launched but with the wrong stride (the CUDA execution model wrong in their heads), so threads end up starving and colliding as they all fight over the very same segments of data.
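
A small sketch of the shared-memory and stride point, using the classic tiled transpose as an illustration (the tile size and names are my own, not the reviewed code): the tile is staged in shared memory so that both the global read and the global write stay coalesced, and an extra padding column avoids shared-memory bank conflicts.

```cuda
#include <cuda_runtime.h>

constexpr int TILE = 32;   // launch with dim3 block(TILE, TILE)

__global__ void transpose_tiled(const float* __restrict__ in,
                                float* __restrict__ out,
                                int width, int height)
{
    // +1 column of padding so threads of a warp hit different banks.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced read

    __syncthreads();

    // Swap the block coordinates for the write, so consecutive threads still
    // write consecutive addresses instead of striding by `width`.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}
```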

Also worrisome is the absence of asynchronous data transfers to hide memory transfer latency, which only requires dividing the workload into chunks that can be processed independently. CUDA graphs for repeated multi-kernel launches are completely ignored. Simple arithmetic tricks, such as switching from double to single or even half-precision floating point where the accuracy allows it, can multiply throughput several times over and are also overlooked.
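
A host-side sketch of that chunking idea, under my own assumptions (pinned host buffers, a device buffer large enough for the whole array, at most 8 streams); the kernel is the one sketched earlier, but any kernel that works on independent chunks fits.

```cuda
#include <cuda_runtime.h>
#include <algorithm>

// Kernel from the earlier sketch; any per-chunk-independent kernel works here.
__global__ void scale_unrolled(float* data, float factor, int n);

void process_in_chunks(const float* h_in, float* h_out, float* d_buf,
                       int n, int n_streams, int chunk)
{
    // h_in / h_out must be pinned (cudaMallocHost) for the async copies
    // below to actually overlap with the kernels.
    cudaStream_t streams[8];                 // assumes n_streams <= 8
    for (int s = 0; s < n_streams; ++s)
        cudaStreamCreate(&streams[s]);

    int s = 0;
    for (int off = 0; off < n; off += chunk, s = (s + 1) % n_streams) {
        int count = std::min(chunk, n - off);
        size_t bytes = count * sizeof(float);

        // Copy-in, compute, and copy-out of one chunk are queued on one
        // stream; chunks queued on other streams overlap with them.
        cudaMemcpyAsync(d_buf + off, h_in + off, bytes,
                        cudaMemcpyHostToDevice, streams[s]);
        scale_unrolled<<<(count + 255) / 256, 256, 0, streams[s]>>>(
            d_buf + off, 2.0f, count);
        cudaMemcpyAsync(h_out + off, d_buf + off, bytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    for (int i = 0; i < n_streams; ++i) {
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
    }
}
```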

Forget about PTX-level optimization; even the simple use of C techniques within CUDA seems to have been forgotten or, worse, ignored by the new generation of GPU developers.
