CUDA Memory Management & Use cases

Dung Le
Distributed Knowledge
10 min read · Jul 29, 2020
Figure 1: Nvidia GeForce RTX 2070 running Turing microarchitecture. Source: Nvidia

In my previous article, Towards Microarchitectural Design of Nvidia GPUs, I dissected a sample GPU architectural design in depth, along with its memory hierarchy. For GPGPU applications to achieve the performance that GPUs promise, a thorough understanding and correct use of each memory space in the hierarchy is a must. In this article, let's discuss how to optimally utilize the different types of GPU memory and cycle through some notable use cases for each memory type.

The content of this article will be organized as follows:

  1. Coalesced & un-coalesced global memory access
  2. Efficient matrix transpose using shared memory
  3. Optimizing convolution with constant memory, shared memory, and memory padding
  4. Takeaways
  5. What’s next?

I. Coalesced & un-coalesced global memory access

The global memory of a CUDA device is implemented with DRAMs. Each time a DRAM location is accessed, a range of consecutive locations that includes the requested location is actually accessed. Each DRAM chip provides many sensors that work in parallel, each sensing the content of one bit within these consecutive locations. Once sensed, the data from all of these consecutive locations can be transferred to the processor at very high speed. The consecutive locations accessed and delivered together are referred to as a DRAM burst.

Recognizing the burst mechanism, current CUDA devices employ a technique that allows the programmers to achieve high global memory access efficiency by organizing memory access of threads into favorable patterns. This technique takes advantage of the fact that threads in a warp execute the same instruction at any given point in time (SIMT). When all threads in a warp execute a load instruction, the hardware detects whether they access consecutive global memory locations. If they do, the hardware combines, or coalesces, all these accesses into a consolidated access to consecutive DRAM locations. Such coalesced access allows the DRAMs to deliver data as a burst.

To optimally utilize global memory, it is important to improve coalescing. There are multiple strategies that can be used. One such strategy is to improve the data access pattern. Let's take as an example a simple kernel that multiplies a matrix by a scalar value. In CUDA, multidimensional-array elements are placed into the linearly addressed memory space according to the row-major convention, as follows:

Figure 2: Linearized-order of matrix elements in increasing address.

Now, let’s consider two ways of writing kernel with two different access patterns:

For the above access pattern, as thread indices along the x-dimension increase within a warp, the threads access consecutive locations in the matrix array m. Therefore, when consecutive threads issue load instructions to global memory to access m, these accesses form a coalesced access pattern.
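
For comparison, a second sketch with the same assumed names simply swaps which thread index walks along rows and which along columns:

```cuda
// Un-coalesced pattern: threadIdx.x now maps to the row index, so two consecutive
// threads of a warp access elements that are dimx locations apart in memory.
__global__ void scale_uncoalesced(float *m, float a, int dimx, int dimy)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;   // consecutive across a warp
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < dimy && col < dimx)
        m[row * dimx + col] *= a;                      // accesses strided by dimx
}
```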

On the other hand, notice that in the kernel above the threads within a warp issue load instructions in column-major order, so two consecutive accesses are dimx locations apart from each other. This results in an un-coalesced access pattern and hence reduces the kernel's effective global memory bandwidth.

Another strategy is to change the data layout to improve locality. Depending on how the program accesses data, an SOA (structure of arrays) or AOS (array of structures) data type can be chosen so that consecutive threads issue load/store instructions to consecutive memory locations. For example, computer vision algorithms that apply filters to an image require the image to be stored in a data structure. With an SOA or AOS data structure, the image can be stored as follows:
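
As a small sketch of the two layouts (the struct names and the fixed size N are illustrative assumptions):

```cuda
#define N (1920 * 1080)          // illustrative image size

// AOS: one struct per pixel; a pixel's channel values sit next to each other.
struct PixelAOS { float r, g, b; };
struct ImageAOS { PixelAOS pixel[N]; };

// SOA: one array per channel; the same channel of consecutive pixels is contiguous.
struct ImageSOA {
    float r[N];
    float g[N];
    float b[N];
};
```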

Because the data members of a struct are laid out contiguously in memory, the right choice depends on the access pattern. If consecutive threads each read the same channel of consecutive pixels (for example, every thread reading its pixel's r value, then g, then b), the SOA layout places those values next to each other in memory and the accesses coalesce. On the other hand, if each thread needs all of its pixel's channel values together as a single unit (for example, through a vectorized load of a whole pixel), the AOS layout keeps a pixel's channels adjacent in memory and serves that pattern better.

II. Tiling matrix transpose using shared memory

In contrast to global memory, which resides in DRAM, shared memory is a type of on-chip memory. This gives shared memory a significantly lower access latency, on the order of just a few cycles per access. One can relate shared memory usage to a CPU cache; however, while a CPU cache cannot be explicitly managed, shared memory can. Shared memory is declared by the programmer with the keyword __shared__, with its size either hard-coded in the kernel code or, when the array is declared with the extern keyword, passed explicitly to the kernel call.
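
As a quick illustration of the two declaration styles (the kernel names and the 256-thread block size are assumptions):

```cuda
// Statically sized shared memory: the size is fixed at compile time.
__global__ void static_smem_kernel(const float *in)
{
    __shared__ float buf[256];              // one element per thread of a 256-thread block
    buf[threadIdx.x] = in[blockIdx.x * blockDim.x + threadIdx.x];
    __syncthreads();
    // ... compute using buf ...
}

// Dynamically sized shared memory: the size in bytes is supplied at launch time.
__global__ void dynamic_smem_kernel(const float *in)
{
    extern __shared__ float buf[];          // size set by the third launch parameter
    buf[threadIdx.x] = in[blockIdx.x * blockDim.x + threadIdx.x];
    __syncthreads();
    // ... compute using buf ...
}

// Launch example: 256 floats of dynamic shared memory per block.
// dynamic_smem_kernel<<<grid, 256, 256 * sizeof(float)>>>(d_in);
```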

With low latency accessing, shared memory is utilized heavily in programs in which memory bound is a problem. One key usage of shared memory comes from the fact that threads within a block can share memory access. Therefore, different threads can use shared variables to hold the data that was reused many times during the computation phase. In order to maximize memory bandwidth, threads can load this data from global memory in a coalesced manner and store it into declared shared memory variables. Threads then can load or write the data in any order due to the fact that shared memory is not affected by un-coalesced read/write order (corner turning technique). For example, in the problem of optimizing matrix transpose performance, this usage of shared memory comes in handily:
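
A minimal sketch of such a tiled transpose, assuming a square TILE_DIM × TILE_DIM thread block (the names out, dimx, and dimy are illustrative):

```cuda
#define TILE_DIM 32

// Transpose a dimy x dimx matrix t into a dimx x dimy matrix out.
__global__ void transpose_tiled(const float *t, float *out, int dimx, int dimy)
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];    // +1 pads away shared-memory bank conflicts

    int x = blockIdx.x * TILE_DIM + threadIdx.x;      // column in the input
    int y = blockIdx.y * TILE_DIM + threadIdx.y;      // row in the input

    // Coalesced read: consecutive threadIdx.x values touch consecutive addresses of t.
    if (x < dimx && y < dimy)
        tile[threadIdx.y][threadIdx.x] = t[y * dimx + x];

    __syncthreads();                                  // wait until the whole tile is loaded

    int to_x = blockIdx.y * TILE_DIM + threadIdx.x;   // column in the output (block indices swapped)
    int to_y = blockIdx.x * TILE_DIM + threadIdx.y;   // row in the output

    // Coalesced write: the un-coalesced "corner turn" happens inside shared memory instead.
    if (to_x < dimy && to_y < dimx)
        out[to_y * dimy + to_x] = tile[threadIdx.x][threadIdx.y];
}
```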

In order to have coalesced reads and writes to global memory, threads load and store a tile of data at consecutive memory locations as threadIdx.x varies. Each block of threads first loads a tile of block-sized float numbers from the global matrix t into the shared memory array, then makes sure all threads have finished loading their elements by calling __syncthreads, a barrier that synchronizes the threads within a block. Each thread then writes to consecutive locations in global memory, since the output column index varies together with threadIdx.x; the transposed, un-coalesced access order is absorbed by the shared memory tile instead.

III. Optimizing convolution with constant memory, shared memory, and memory padding

The convolution operation (or filtering) is another common operation in many applications, especially in image and signal processing. It involves source data and a filter (also known as a mask). By applying the filter to the source data, we obtain the convolved output. Let's divide our use cases into convolution on a 1D array and convolution on a 2D matrix.

1. 1D Convolution

a. Naive convolution kernel

Figure 3: 1D Convolution. Source: Analytics Vidhya

Figure 3 is an example of a convolution operation on a 1D array. For elements near the array bounds, such as c1 and c6, we substitute 0s for the missing cells (or ghost cells) in the sums that produce them. Hence c1, for example, equals 0 * w1 + i1 * w2 + i2 * w3. A naive kernel for 1D convolution may be written like this:
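
A minimal sketch of such a kernel (the names conv1d_naive, in, mask, and out are assumptions):

```cuda
// Naive 1D convolution: each thread computes one output element.
// n: input/output length, mask_width: filter length (assumed odd).
__global__ void conv1d_naive(const float *in, const float *mask, float *out,
                             int mask_width, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    float sum = 0.0f;
    int start = idx - mask_width / 2;                 // leftmost input element under the mask
    for (int j = 0; j < mask_width; j++) {
        // Ghost cells outside the array contribute 0.
        if (start + j >= 0 && start + j < n)
            sum += in[start + j] * mask[j];           // two global reads per multiply-add
    }
    out[idx] = sum;
}
```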

We can make two observations about the naive kernel. First, there will be control flow divergence: the threads that calculate output elements near the left or right end of the array have to handle ghost cells while the other threads do not. The second, and more serious, problem is memory bandwidth. Every multiply-add in the inner loop issues two global memory reads, one for the input element and one for the mask element, so the ratio of floating-point operations to global memory accesses is only about 1.0. Therefore, this simple kernel can only be expected to run at a small fraction of peak performance.

b. Using constant memory for filter

Let’s notice the filter used in convolution operation. First, it’s unchanged throughout the convolution computation. Second, the filter size tends to be small in most cases. Hence, these properties make the filter array a great candidate for constant memory, which is defined with keyword __constant__ in CUDA. Like global memory variables, constant memory variables are also located in DRAM. However, since CUDA knows that constant memory variables are not modified during kernel execution, GPU hardware caches the constant memory data aggressively in L1 or constant cache. In case of a cache hit, data will be served from the cache instead of going down to global memory. Furthermore, the design of caches is typically optimized to broadcast a value to a large number of threads. As a result, when all warp access the same constant memory variable, as in the case of convolution masks, the caches can provide a tremendous amount of bandwidth to satisfy the data needs of threads. Also, as previously stated, the size of the masks is typically small so we can assume that all mask elements are effectively always accessed from caches. Now, let’s turn to the updated kernel code using constant memory for the filter array:

c. Optimize memory bandwidth with shared memory

To further optimize memory bandwidth, notice that each input element is used for more than one output element and is therefore loaded multiple times from global memory. To reduce global memory loads and improve memory bandwidth, we will pre-load all the elements needed for the calculation into shared memory. For example, in figure 3, assuming our thread block calculates the three output elements c1, c2, and c3, the block needs to load all input elements used in those calculations, which are i1, i2, i3, and i4. To generalize this idea, let's look at a sample code snippet:
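
A sketch of such a tiled kernel, assuming blockDim.x equals TILE_SIZE and the same constant-memory mask as in the previous sketch (names are illustrative):

```cuda
#define TILE_SIZE 256
#define MAX_MASK_WIDTH 9

__constant__ float d_mask[MAX_MASK_WIDTH];            // filter in constant memory, as before

// Tiled 1D convolution: the block cooperatively stages its input tile plus the
// halo cells on both sides in shared memory, then every thread reads only from there.
__global__ void conv1d_tiled(const float *in, float *out, int mask_width, int n)
{
    __shared__ float tile[TILE_SIZE + MAX_MASK_WIDTH - 1];

    int idx  = blockIdx.x * blockDim.x + threadIdx.x;
    int halo = mask_width / 2;

    // The last `halo` threads of the block fetch the left halo cells (ghost cells become 0).
    int left = (blockIdx.x - 1) * blockDim.x + threadIdx.x;
    if (threadIdx.x >= blockDim.x - halo)
        tile[threadIdx.x - (blockDim.x - halo)] = (left < 0) ? 0.0f : in[left];

    // Every thread fetches one element of the tile itself.
    tile[halo + threadIdx.x] = (idx < n) ? in[idx] : 0.0f;

    // The first `halo` threads fetch the right halo cells.
    int right = (blockIdx.x + 1) * blockDim.x + threadIdx.x;
    if (threadIdx.x < halo)
        tile[halo + blockDim.x + threadIdx.x] = (right >= n) ? 0.0f : in[right];

    __syncthreads();

    if (idx < n) {
        float sum = 0.0f;
        for (int j = 0; j < mask_width; j++)
            sum += tile[threadIdx.x + j] * d_mask[j];
        out[idx] = sum;
    }
}
```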

Threads within a block collaboratively load the (tile_size + mask_width − 1) input elements needed to compute tile_size output elements into shared memory: the last (mask_width / 2) threads of the block load the left halo cells, the first (mask_width / 2) threads load the right halo cells, and every thread loads one element of the tile itself. This reduces the number of global memory loads, since each input element is loaded only once per block that needs it: in total, roughly array_width + (mask_width − 1) * (array_width / tile_size) elements are read from global memory. In the naive implementation, on the other hand, each thread whose idx maps to a valid index in the output array loads mask_width input elements, for a total of about (mask_width * array_width) global memory loads.

2. 2D Convolution

a. Memory Padding

Now, let’s move on to 2D matrix convolution implementation. 2D matrix convolution is heavily applied in the realm of computer vision since real-world images are represented as 2D matrices and come in all sizes and shapes. These matrices are generally stored in the row-major layout when reading from files to memory. If the width of the image in terms of bytes is not a multiple of the DRAM burst size, the starting point of row 1 and beyond can be misaligned from the DRAm burst boundaries. Such misalignment can result in poor utilization of DRAM bandwidth as there might be the case where a row is spanned between two DRAM bursts, which requires two DRAM bursts to deliver the whole row data instead of one.

Figure 4: An example of 2D Convolution. Source: tex.stackexchange

Therefore, in order to exploit DRAM bursts to deliver data quickly, we pad additional bytes onto each row of the image matrix so that every row ends at a DRAM burst boundary:

Figure 5: A padded image format
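
In practice we don't have to compute the padding by hand. A host-side sketch (illustrative names) that asks the CUDA runtime for a pitched allocation and copies a tightly packed image into it could look like this:

```cuda
#include <cuda_runtime.h>

// Allocate a pitched device image and copy a tightly packed host image into it,
// letting the driver choose the padded row length ("pitch").
float *alloc_padded_image(const float *h_img, int width, int height, int *pitch_elems)
{
    float *d_img = nullptr;
    size_t pitch_bytes = 0;                            // padded row length in bytes

    // Each row holds width * sizeof(float) bytes of payload, padded to an aligned pitch.
    cudaMallocPitch((void **)&d_img, &pitch_bytes, width * sizeof(float), height);

    // 2D copy: source rows are tightly packed, destination rows are pitch_bytes apart.
    cudaMemcpy2D(d_img, pitch_bytes,
                 h_img, width * sizeof(float),
                 width * sizeof(float), height,
                 cudaMemcpyHostToDevice);

    *pitch_elems = (int)(pitch_bytes / sizeof(float)); // pitch in elements, for row * pitch + col indexing
    return d_img;
}
```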

After padding, each row of the image matrix has a length of pitch units. Therefore, when the image is linearized in row-major order in DRAM, we need to use pitch instead of width to access the element at coordinate (row, col): row * pitch + col. However, when we iterate through a row, width must still be used as the loop bound so that we only touch elements that actually exist in the original matrix. That brings us to the kernel code for 2D convolution using memory padding:
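
A sketch of such a kernel, assuming a compile-time MASK_WIDTH, a thread block sized to the input tile (TILE_SIZE + MASK_WIDTH − 1 threads per side), and the mask in constant memory (all names illustrative):

```cuda
#define MASK_WIDTH 5
#define TILE_SIZE  16
#define BLOCK_WIDTH (TILE_SIZE + MASK_WIDTH - 1)       // input-tile (and thread-block) side

__constant__ float d_mask2d[MASK_WIDTH][MASK_WIDTH];   // filter, copied with cudaMemcpyToSymbol

// Tiled 2D convolution over a pitched image. `pitch` is the padded row length in
// elements; `width` and `height` describe the real image.
__global__ void conv2d_tiled(const float *in, float *out,
                             int height, int width, int pitch)
{
    __shared__ float tile[BLOCK_WIDTH][BLOCK_WIDTH];

    int tx = threadIdx.x, ty = threadIdx.y;

    // Indices of the output element this thread is responsible for.
    int row_o = blockIdx.y * TILE_SIZE + ty;
    int col_o = blockIdx.x * TILE_SIZE + tx;

    // Indices of the input element this thread loads (shifted by half the mask).
    int row_i = row_o - MASK_WIDTH / 2;
    int col_i = col_o - MASK_WIDTH / 2;

    // Load internal and halo cells; ghost cells outside the real image become 0.
    if (row_i >= 0 && row_i < height && col_i >= 0 && col_i < width)
        tile[ty][tx] = in[row_i * pitch + col_i];      // index with pitch, bound with width/height
    else
        tile[ty][tx] = 0.0f;

    __syncthreads();

    // Only the first TILE_SIZE x TILE_SIZE threads of the block produce outputs.
    if (ty < TILE_SIZE && tx < TILE_SIZE && row_o < height && col_o < width) {
        float sum = 0.0f;
        for (int i = 0; i < MASK_WIDTH; i++)
            for (int j = 0; j < MASK_WIDTH; j++)
                sum += d_mask2d[i][j] * tile[ty + i][tx + j];
        out[row_o * pitch + col_o] = sum;              // output uses the same pitched layout
    }
}
```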

Each thread of the kernel first calculates the indices of its output element, row_o and col_o. It then calculates the indices of the input element it needs to load into shared memory by subtracting (mask_width / 2) from row_o and col_o. Applying the same idea used for halo cells in 1D convolution, each thread block loads all the halo cells and internal cells needed by its threads for the convolution operation. Remember that when indexing into the padded image we use pitch as the row length instead of width, while width (and height) are still used to bound which elements actually exist in the original image.

So how memory-efficient is this kernel compared to the naive 2D kernel? In a naive kernel, each thread in a thread block performs (mask_width)² accesses to the image array in global memory. Hence, each thread block issues a total of (mask_width)² * (tile_size)² accesses to global memory. In the tiled kernel, all threads in a thread block collectively load one input tile, so the total number of accesses by a thread block to the image array is (tile_size + mask_width − 1)². Hence, the reduction in memory accesses is (mask_width)² * (tile_size)² / (tile_size + mask_width − 1)². The larger the ratio, the more effective the tiled kernel is in reducing the number of memory accesses compared to the naive kernel.
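
For example, with tile_size = 16 and mask_width = 5 (values chosen purely for illustration), the ratio is (25 × 256) / 20² = 16, so the tiled kernel issues roughly one sixteenth of the global memory accesses of the naive version.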

IV. Takeaways

A thorough understanding and correct use of the different memory spaces in the GPU device will deliver a large speedup for your applications. For data that is unchanged throughout kernel execution, consider using constant memory to take advantage of hardware caching for optimized load performance. For data that is accessed repeatedly during kernel execution, consider using shared memory to reduce global memory loads, and utilize the corner turning technique. When you do have to go down to global memory, remember that a coalesced read/write access pattern delivers much better memory bandwidth than an un-coalesced pattern, because it takes advantage of the DRAM burst mechanism.

V. What’s next?

For the next article, let’s take a look at another two types of CUDA memory, texture memory and unified memory, and ways that they can help to boost GPGPU applications.

Stay tuned for the next one!
