CUDA: Shared memory

Rustam
Apr 4, 2024 · 5 min read

CUDA shared memory is a type of memory accessible to all threads within the same block. It resides on the GPU chip itself, making it significantly faster to access compared to off-chip global memory.

Shared memory size depends on the GPU architecture and configuration. For example, on a GPU with compute capability 8.6 (Ampere architecture), the shared memory capacity per SM is up to 100 KB, and the maximum shared memory per thread block is 99 KB.

The maximum amount of shared memory available per thread block is typically smaller than the maximum shared memory partition available per Streaming Multiprocessor (SM). 1 KB of Shared Memory is reserved for system use and is not made available to thread blocks.
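
If you want to check these limits on your own GPU, here is a minimal sketch using the runtime API (it queries device 0; error checking is omitted):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int perBlock = 0, perBlockOptin = 0, perSM = 0;

    // Default limit per block, the opt-in limit per block, and the capacity per SM
    cudaDeviceGetAttribute(&perBlock, cudaDevAttrMaxSharedMemoryPerBlock, 0);
    cudaDeviceGetAttribute(&perBlockOptin, cudaDevAttrMaxSharedMemoryPerBlockOptin, 0);
    cudaDeviceGetAttribute(&perSM, cudaDevAttrMaxSharedMemoryPerMultiprocessor, 0);

    printf("Per block: %d B, per block (opt-in): %d B, per SM: %d B\n",
           perBlock, perBlockOptin, perSM);
    return 0;
}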

Dynamic & Static

  • The size of statically allocated shared memory is known at compile time and remains constant throughout the kernel's execution. (It may give better performance, since the compiler can optimize memory access patterns more effectively.)
__shared__ int sharedMem[256];

But static allocation has a restriction: you cannot allocate more than 48 KB of shared memory; exceeding it leads to the compile-time error "uses too much shared data".
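
As a minimal illustration (the kernel name and array size are my own), here is a kernel that stages a block's segment of an array in statically allocated shared memory and writes it back reversed:

// Illustrative kernel: reverse each block's 256-element segment in place.
// Launch with exactly 256 threads per block, e.g. reverseBlock<<<N / 256, 256>>>(data);
__global__ void reverseBlock(int* data)
{
    __shared__ int sharedMem[256];                    // size fixed at compile time

    const int tid = threadIdx.x;
    const int globalId = blockIdx.x * blockDim.x + tid;

    sharedMem[tid] = data[globalId];                  // stage the segment
    __syncthreads();                                  // the whole block must finish writing

    data[globalId] = sharedMem[blockDim.x - 1 - tid]; // read it back in reverse order
}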

  • To allocate memory whose size is not known at compile time, use dynamic shared memory. CUDA supports only a single dynamic shared memory allocation per block, which means you should declare extern __shared__ only once per kernel.

If you need multiple dynamically allocated memory regions within a block, you must use pointers to offsets within a single shared memory allocation.

// kernel: carve two regions out of a single dynamic shared memory allocation
__global__ void kernel(/* ... */)
{
    extern __shared__ int sharedMem[];
    int* v1 = &sharedMem[0];   // first 10 ints
    int* v2 = &sharedMem[10];  // next 10 ints
    // ...
}

// specify the total memory size in bytes as the 3rd parameter on kernel launch
kernel<<<BLOCKS, THREADS, 20 * sizeof(int)>>>(...);

Also, if you need more than 48 KB of dynamic shared memory, you have to call cudaFuncSetAttribute before launching the kernel:

cudaFuncSetAttribute(KERNEL,
                     cudaFuncAttributeMaxDynamicSharedMemorySize,
                     SIZE);
  • KERNEL — kernel name
  • SIZE — size of shared memory in bytes per block (up to shared memory capacity per SM)
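
For example, opting a kernel into 64 KB of dynamic shared memory could look like this (the kernel name and sizes below are illustrative):

// Illustrative: allow myKernel to use 64 KB of dynamic shared memory per block
constexpr int kSharedBytes = 64 * 1024;

cudaFuncSetAttribute(myKernel,
                     cudaFuncAttributeMaxDynamicSharedMemorySize,
                     kSharedBytes);

// the launch still passes the size as the 3rd configuration parameter
myKernel<<<blocks, threads, kSharedBytes>>>(/* ... */);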

Occupancy

When configuring an application, it’s crucial to consider the resources available on each Streaming Multiprocessor (SM). One important resource to manage is shared memory, which is partitioned among the thread blocks resident on the SM. If a thread block requires more shared memory than is currently free on an SM, the block won’t be scheduled there and will wait until an SM with enough free shared memory becomes available.
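
To see how shared memory usage limits occupancy, you can ask the runtime how many blocks of a kernel can be resident on one SM. A small sketch (myKernel, the block size, and the shared memory amount are placeholders):

int numBlocks = 0;
const int blockSize = 256;
const size_t dynamicSmemBytes = 32 * 1024;   // 32 KB of dynamic shared memory per block

// How many blocks of myKernel fit on one SM with this configuration?
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, myKernel,
                                              blockSize, dynamicSmemBytes);

printf("Resident blocks per SM: %d\n", numBlocks);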

Warp & Banks

Shared memory in CUDA consists of 32 banks, organized such that successive 32-bit words map to successive banks. Each bank has a bandwidth of 32 bits per clock cycle.

A warp is a group of 32 threads within the same thread block. When the threads of a warp access shared memory, the access pattern across the warp can impact performance due to bank conflicts.

Bank Conflict

A bank conflict can occur only within a single warp.

A bank conflict occurs when multiple threads within the same warp attempt to access memory locations that belong to the same memory bank simultaneously. This can lead to serialization of memory accesses, reducing memory bandwidth utilization and potentially slowing down the kernel’s performance.

Example (8 banks, warp size = 8, data size = 2x8):

To avoid this when accessing a 2D array by column, we can add an extra (padding) column to the matrix. The column doesn’t hold useful data, but it shifts each row so that the threads of a warp access different memory banks, as shown in the sketch below.
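
A sketch of the idea (the tile size and the kernel are illustrative; a matrix transpose is the classic case of column access). The “+ 1” shifts each row by one bank, so a warp reading a column touches 32 different banks:

constexpr int TILE = 32;

// Illustrative transpose of an n x n matrix, launched with dim3(TILE, TILE) threads
// per block and n divisible by TILE.
__global__ void transposeTile(const float* in, float* out, int n)
{
    __shared__ float tile[TILE][TILE + 1];             // padding column: no bank conflicts

    const int x = blockIdx.x * TILE + threadIdx.x;
    const int y = blockIdx.y * TILE + threadIdx.y;

    tile[threadIdx.y][threadIdx.x] = in[y * n + x];    // write by row
    __syncthreads();

    const int tx = blockIdx.y * TILE + threadIdx.x;
    const int ty = blockIdx.x * TILE + threadIdx.y;
    out[ty * n + tx] = tile[threadIdx.x][threadIdx.y]; // read by column, conflict-free
}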

Nvidia Nsight

A simple example that accesses shared memory columns with a stride of 8:

sharedData[block.thread_rank() * 8] += data[globalId];

You can detect and analyze bank conflicts using Nvidia Nsight Compute. In the image below, you can see that the source line exhibits 8-way bank conflicts, meaning that eight threads of a warp access the same bank. This increases the number of load/store wavefronts, which are serialized and processed on different cycles, eight in this case.

A wavefront is the maximum unit that can pass through that pipeline stage per cycle. If not all cache lines or sectors can be accessed in a single wavefront, multiple wavefronts are created and sent for processing one by one, i.e. in a serialized manner.

Not OK: 896,000 bank conflicts

A fixed code block:

sharedData[block.thread_rank()] += data[globalId];
OK: no bank conflicts

Broadcasting

When two or more threads in a warp access the same address within a bank, it does not cause a bank conflict. The data is broadcast to those threads with no effect on performance.
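
For example (an illustrative fragment inside a kernel):

__shared__ int sharedMem[256];
// Every thread of the warp reads the same 32-bit word:
// the value is broadcast in a single transaction, no bank conflict.
int v = sharedMem[0];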

More than the word size

When dealing with data elements larger than the word size (4 bytes), such as double precision floating-point numbers (8 bytes), the hardware still needs to ensure proper access to the data, and this may involve multiple transactions. Each transaction fetches a chunk of data, often referred to as a “cache line” or “coalesced access unit.”

You don’t need to take care of this yourself; the hardware handles it.

The same applies if the data size is less than the word size (< 32 bits).

L1/Shared Memory

Shared memory shares on-chip storage with the L1 cache. However, shared memory is explicitly controlled by the programmer and used for inter-thread communication and data sharing, while the L1 cache is managed by the GPU hardware and helps improve memory access latency and bandwidth by caching data fetched from global memory.

CUDA provides an API to set the carveout, i.e., the preferred shared memory capacity:

cudaFuncSetAttribute(kernel_name,
                     cudaFuncAttributePreferredSharedMemoryCarveout,
                     carveout);

The carveout is only a hint: the driver may choose a different configuration if required to execute the function or to avoid thrashing. Don’t use it unless you are sure it helps; increasing the carveout can lower the L1 hit rate and, as a result, hurt performance.
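
For reference, the runtime also defines named carveout values; a hypothetical call that prefers shared memory over L1 for myKernel could look like this:

// Illustrative: prefer the maximum shared memory carveout for myKernel.
// The value can also be an integer percentage in the range 0..100.
cudaFuncSetAttribute(myKernel,
                     cudaFuncAttributePreferredSharedMemoryCarveout,
                     cudaSharedmemCarveoutMaxShared);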

Conclusion

Thanks for reading, feel free to comment with corrections or ideas. And don’t forget to clap if you like this article :)

References

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html

https://docs.nvidia.com/nsight-compute
