Vector — An owning type efficiently accessible to both host and device in CUDA C++

CUDA C++ Modern Unified Memory Programming — Part I

emre avcı
May 12, 2024

In Part I of the series, a Vector data structure in C++ that can be accessed efficiently by both host (CPU) and device (GPU) is loosely implemented; it will be used with the dot product kernel to be implemented in Part II of the series.

One of the challenges of heterogeneous programming is memory management. The usual approach is to allocate memory on both host and device and to copy data explicitly from host to device or from device to host as needed. There are many examples of this approach in [4], which might discourage beginners.

With the inclusion of Unified Memory in the programming model, a “bridge” between CPU and GPU is constructed. Thanks to that bridge,

  • allocating memory on both sides (host and device) is not necessary, because memory is migrated on demand,
  • memory is accessible through a single pointer, whereas with the usual approach host and device pointers differ,
  • the complexity of programs is decreased [2] (a rough before/after sketch follows this list).
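As a rough before/after sketch (not taken verbatim from [2] or [4]; the helper function names are made up for illustration), compare the explicit-copy pattern with a single managed allocation:

#include <cstdlib>

constexpr int N = 1 << 20;

// Traditional approach: separate host and device allocations plus explicit copies.
void traditionalApproach()
{
    float *hostData = static_cast<float*>(std::malloc(N * sizeof(float)));
    float *deviceData = nullptr;
    cudaMalloc(&deviceData, N * sizeof(float));
    // ... fill hostData on the host ...
    cudaMemcpy(deviceData, hostData, N * sizeof(float), cudaMemcpyHostToDevice);
    // ... launch a kernel on deviceData ...
    cudaMemcpy(hostData, deviceData, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(deviceData);
    std::free(hostData);
}

// Unified Memory: one allocation, one pointer; pages migrate on demand.
void unifiedMemoryApproach()
{
    float *data = nullptr;
    cudaMallocManaged(&data, N * sizeof(float));
    // ... both host code and kernels dereference data directly ...
    cudaFree(data);
}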

Last but not least, Unified Memory opens the door to using C++ more extensively in CUDA programs, which is the main point of this article. The importance of using C++ more becomes very evident when the programmer has to write code that is hard to read due to a lack of abstractions. For instance, despite being a very good resource, if not the best, for learning CUDA, the sample codes in the book [3] are written in CUDA C. As a consequence, even indexing a flattened array that represents a row-major-ordered matrix is difficult to follow and understand, especially as the algorithm gets more sophisticated, e.g., the code for thread coarsening in tiled matrix multiplication in [3].

Another advantage of creating classes is the information that can be encapsulated in them, such as the dimensions of Vector or Matrix classes, which can be used in kernels or in class member functions. When classes are not used, as in the code samples of [4], that information is generally passed to kernels by value, increasing the argument count; a rough illustration of the difference in kernel signatures is sketched below. In the following sections, different implementations of the Vector class are presented.
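The following signatures are hypothetical, intended only to illustrate the argument-count difference; they are not taken from [3] or [4].

#include <cstddef>

// Without a class, size information travels as extra kernel arguments.
__global__ void dotRaw(float* partials, const float* lhs, const float* rhs, std::size_t size);

// With a Vector that carries its own size, the argument list stays short.
template<class VectorType>
__global__ void dot(VectorType& partials, const VectorType& lhs, const VectorType& rhs);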

A non-owning type passed by value

A class relying on shallow copy might be used via pass-by-value, as in the code sample in the Shared Memory subchapter of [1]. Such a Vector class might be implemented as follows.

template<class Type>
struct Vector
{
    using ValueType = Type;
    using SizeType = std::size_t;
    // ...
    ValueType *data_ = nullptr;
    SizeType size_ = 0;
};
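The fillPassByValue and dotPassByValue kernels used in the example further below are not shown in this part (the dot product kernel itself is the subject of Part II). As a minimal sketch of the pass-by-value mechanism only, a grid-stride fill kernel might look like the following; the body is an assumption, not the author's exact code.

// The Vector argument is copied by value to the device; the copy is shallow,
// so the kernel sees the same managed data_ pointer and size_ as the host.
template<class VectorType>
__global__ void fillPassByValue(VectorType vector, typename VectorType::ValueType value)
{
    using SizeType = typename VectorType::SizeType;
    const SizeType stride = static_cast<SizeType>(blockDim.x) * gridDim.x;
    for (SizeType i = blockIdx.x * blockDim.x + threadIdx.x; i < vector.size_; i += stride)
    {
        vector.data_[i] = value;
    }
}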

However, that requires users of the class to handle memory themselves, which is not very convenient. Since instances of the class are passed to the kernel by value, the memory cannot be handled by the class itself via RAII (Resource Acquisition Is Initialization).

#include <iostream>
#include <numeric>

// ...

template<class Type>
constexpr Type zero = 0.0;

template<class Type>
constexpr Type one = 1.0;

int main()
{
    constexpr int N = 1024 * 1024 * 256;
    constexpr int blocksPerGrid = 256 * 4 * 2;
    constexpr int threadsPerBlock = 1024;
    using ValueType = float;
    using SizeType = std::size_t;
    using VectorType = Vector<ValueType>;

    VectorType vectorOfOnes;
    vectorOfOnes.size_ = N;
    cudaMallocManaged(&vectorOfOnes.data_, N * sizeof(ValueType));

    fillPassByValue<<<blocksPerGrid, threadsPerBlock>>>(vectorOfOnes, one<ValueType>);

    VectorType partials;
    partials.size_ = blocksPerGrid;
    cudaMallocManaged(&partials.data_, blocksPerGrid * sizeof(ValueType));

    dotPassByValue<<<blocksPerGrid, threadsPerBlock>>>(partials, vectorOfOnes, vectorOfOnes);
    cudaDeviceSynchronize();

    ValueType value = std::accumulate(partials.cbegin(), partials.cend(), zero<ValueType>);
    std::cout << value << std::endl;

    cudaDeviceSynchronize();
    cudaFree(vectorOfOnes.data_);
    cudaFree(partials.data_);
    return 0;
}

An owning type passed by reference relying on managed memory

In order to pass instances of classes that rely on RAII for automatic memory management by reference, the new and delete operators might be overloaded as follows, as in [2].

struct Managed
{
    void *operator new(std::size_t len)
    {
        void *ptr;
        checkCudaError(cudaMallocManaged(&ptr, len));
        checkCudaError(cudaDeviceSynchronize());
        return ptr;
    }

    void operator delete(void *ptr)
    {
        checkCudaError(cudaDeviceSynchronize());
        checkCudaError(cudaFree(ptr));
    }
};
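The checkCudaError helper used above is not defined in the article; a minimal sketch of such a helper might be as follows.

#include <cstdio>
#include <cstdlib>

// Terminate with a readable message if a CUDA runtime call fails
// (a sketch; the author's checkCudaError may differ).
inline void checkCudaError(cudaError_t result)
{
    if (result != cudaSuccess)
    {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(result));
        std::exit(EXIT_FAILURE);
    }
}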

Classes then need to inherit from the Managed class. For instance, the Vector class could be implemented as below.

template<class Type>
class Vector : public Managed
{
public:
    using SelfType = Vector<Type>;
    using ValueType = Type;
    using SizeType = std::size_t;

    Vector(SizeType size) : size_{size} { allocateUnifiedMemory(); }
    Vector(const Vector&) = delete;
    Vector& operator=(const Vector&) = delete;
    Vector(Vector&&) = delete;
    Vector& operator=(Vector&&) = delete;
    ~Vector() { freeUnifiedMemory(); }
    // ...

private:
    // ...
    ValueType *data_ = nullptr;
    SizeType size_ = 0;
};
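The allocateUnifiedMemory and freeUnifiedMemory helpers, as well as the cbegin and cend accessors used later with std::accumulate, are elided above. A minimal sketch of these members, assuming the data buffer is allocated with cudaMallocManaged, might be:

// Possible definitions of the elided members (an assumption, not the author's exact code).

// Public accessors used with std::accumulate on the host.
const ValueType* cbegin() const { return data_; }
const ValueType* cend() const { return data_ + size_; }

// Private helpers: the data buffer itself lives in managed memory.
void allocateUnifiedMemory()
{
    checkCudaError(cudaMallocManaged(&data_, size_ * sizeof(ValueType)));
}

void freeUnifiedMemory()
{
    checkCudaError(cudaFree(data_));
}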

Although the copy constructor above is explicitly deleted, it could also be implemented as in [2]. The following is a possible use case of the implementation above.

#include <iostream>
#include <memory>
#include <numeric>

// ...

template<class Type>
constexpr Type zero = 0.0;

template<class Type>
constexpr Type one = 1.0;

int main()
{
    constexpr int N = 1024 * 1024 * 256;
    constexpr int blocksPerGrid = 256 * 4 * 2;
    constexpr int threadsPerBlock = 1024;
    using ValueType = float;
    using VectorType = Vector<ValueType>;

    auto ptrVectorOfOnes = std::make_unique<VectorType>(N);
    VectorType& vectorOfOnes = *ptrVectorOfOnes;

    fill<<<blocksPerGrid, threadsPerBlock>>>(vectorOfOnes, one<ValueType>);

    auto ptrPartials = std::make_unique<VectorType>(blocksPerGrid);
    VectorType& partials = *ptrPartials;

    dot<<<blocksPerGrid, threadsPerBlock>>>(partials, vectorOfOnes, vectorOfOnes);
    cudaDeviceSynchronize();

    ValueType value = std::accumulate(partials.cbegin(), partials.cend(), zero<ValueType>);
    std::cout << value << std::endl;

    return 0;
}

Although this works fine, there are a few drawbacks to this method. First, heap allocations are expensive. Second, in order to pass an instance of the class by reference, it is necessary to dereference the pointer. To relieve the burden of dereferencing every time, an alias might be created as above, which is just an extra line but still a burden. Third, every class needs to inherit from the Managed class. As noted in [2], even classes that are composed of classes inheriting from the Managed class need to inherit from Managed themselves.

An owning type passed by reference relying on system allocated memory

According to [1], system-allocated memory is part of Unified Memory when the following requirements are satisfied (a runtime check is sketched after the list):

  • a modern Linux system with HMM (Heterogeneous Memory Management),
  • a GPU with compute capability of at least 7.5,
  • CUDA driver version 535+ installed with the Open Kernel Modules.
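Whether a given system satisfies these requirements can be checked at run time. As a small sketch (not taken verbatim from [1]), the cudaDevAttrPageableMemoryAccess device attribute reports whether the GPU can coherently access memory allocated by the system allocator:

#include <cstdio>

int main()
{
    int pageableAccess = 0;
    // Non-zero when device 0 can coherently access pageable (system-allocated)
    // memory, e.g., memory from malloc, new, or stack objects.
    cudaDeviceGetAttribute(&pageableAccess, cudaDevAttrPageableMemoryAccess, 0);
    std::printf("Pageable memory access supported: %d\n", pageableAccess);
    return 0;
}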

The Vector class (or any other class) does not need to inherit from the Managed class anymore. No other change is required to the data structure for now. The real difference is experienced by the users of the class: the code now looks cleaner and much more like native C++, since it is very uncommon to have an object of type std::unique_ptr<std::vector<float>> in ordinary C++ code.

#include <iostream>
#include <numeric>

// ...

template<class Type>
constexpr Type zero = 0.0;

template<class Type>
constexpr Type one = 1.0;

int main()
{
    constexpr int N = 1024 * 1024 * 256;
    constexpr int blocksPerGrid = 256 * 4 * 2;
    constexpr int threadsPerBlock = 1024;
    using ValueType = float;
    using VectorType = Vector<ValueType>;

    VectorType vectorOfOnes(N);
    fill<<<blocksPerGrid, threadsPerBlock>>>(vectorOfOnes, one<ValueType>);

    VectorType partials(blocksPerGrid);
    dot<<<blocksPerGrid, threadsPerBlock>>>(partials, vectorOfOnes, vectorOfOnes);
    cudaDeviceSynchronize();

    ValueType value = std::accumulate(partials.cbegin(), partials.cend(), zero<ValueType>);
    std::cout << value << std::endl;

    return 0;
}

Comparing performance of two owning type implementations

Two different approaches have been presented to pass an instance of the Vector class to a kernel by reference. It is clear that system-allocated unified memory is more convenient to use. However, before discarding managed unified memory, that is, the implementation inheriting from the Managed class, and moving on with system-allocated unified memory, the performance of both approaches must be examined and any necessary changes to the Vector class must be made. The profile of the dot and fill kernels with managed unified memory can be found below.

Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 50.01% 4.1963ms 1 4.1963ms 4.1963ms 4.1963ms void fill<Vector<float>>
49.99% 4.1953ms 1 4.1953ms 4.1953ms 4.1953ms void dot<unsigned long=1024, Vector<float>>
API calls: 86.59% 196.31ms 4 49.076ms 7.0530us 196.14ms cudaMallocManaged
4.28% 9.7063ms 4 2.4266ms 12.572us 9.5718ms cudaFree
3.72% 8.4433ms 9 938.14us 4.3300us 4.1988ms cudaDeviceSynchronize
3.55% 8.0480ms 6 1.3413ms 7.1940us 7.9174ms cudaMemPrefetchAsync
1.72% 3.8931ms 114 34.150us 908ns 2.1667ms cuDeviceGetAttribute
...

Unified Memory profiling result:
Device "(0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
2 8.0000KB 4.0000KB 12.000KB 16.00000KB 6.208000us Host To Device
3 5.3330KB 4.0000KB 8.0000KB 16.00000KB 5.600000us Device To Host
Total CPU Page faults: 3

The profile of the dot and fill kernels with system-allocated unified memory can be found below. Kernel times have increased significantly. When the Unified Memory profiling results are examined, there are GPU page faults, page throttles, and memory thrashes, none of which exist in the profiling result of managed unified memory. There are also significant increases in memory transfers between host and device and in CPU page faults. It is clear that there is a problem with the way system-allocated unified memory is used.

Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 70.42% 11.069ms 1 11.069ms 11.069ms 11.069ms void dot<unsigned long=1024, Vector<float>>
29.58% 4.6486ms 1 4.6486ms 4.6486ms 4.6486ms void fill<Vector<float>>
API calls: 83.27% 194.88ms 2 97.442ms 53.009us 194.83ms cudaMallocManaged
6.77% 15.846ms 5 3.1693ms 4.7490us 11.192ms cudaDeviceSynchronize
4.19% 9.8066ms 2 4.9033ms 124.32us 9.6823ms cudaFree
3.98% 9.3054ms 6 1.5509ms 7.1940us 8.1489ms cudaMemPrefetchAsync
1.66% 3.8820ms 114 34.052us 907ns 2.1730ms cuDeviceGetAttribute
...

Unified Memory profiling result:
Device "(0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
38 8.6309KB 4.0000KB 44.000KB 328.0000KB 117.5360us Host To Device
39 7.4863KB 4.0000KB 44.000KB 292.0000KB 125.3130us Device To Host
91 - - - - 7.705400ms Gpu page fault groups
5 - - - - 2.243987ms Page throttles
23 4.0000KB 4.0000KB 4.0000KB 92.00000KB - Memory thrashes
Total CPU Page faults: 36
Total CPU thrashes: 23
Total CPU throttles: 5

In order to get rid of page faults (at least mostly), a memory advice can be given, as hinted in [1]. Eventually, the data structure is to be used by both host and device. The data of the class is allocated with managed memory, so that should not be the cause of the performance degradation. However, with the system-allocated unified memory approach, instances of the class are automatic variables that are not, by default, expected to be accessed by the device. By adding a simple API call to the constructor and destructor of Vector, this problem can be solved.

template<class Type>
class Vector
{
public:
    using SelfType = Vector<Type>;
    using ValueType = Type;
    using SizeType = std::size_t;

    Vector(SizeType size) : size_{size}
    {
        allocateUnifiedMemory();
        adviseOnMemory();
    }
    Vector(const Vector&) = delete;
    Vector& operator=(const Vector&) = delete;
    Vector(Vector&&) = delete;
    Vector& operator=(Vector&&) = delete;
    ~Vector()
    {
        freeUnifiedMemory();
        adviseAgainstMemory();
    }
    // ...

private:
    // ...
    void adviseOnMemory() const
    {
        checkCudaError(cudaMemAdvise(this, sizeof(SelfType), cudaMemAdviseSetAccessedBy, 0));
    }

    void adviseAgainstMemory() const
    {
        checkCudaError(cudaMemAdvise(this, sizeof(SelfType), cudaMemAdviseUnsetAccessedBy, 0));
    }

    ValueType *data_ = nullptr;
    SizeType size_ = 0;
};

The profiling result after the modification above can be found below. Clearly, there is no longer a performance gap between the implementations. Since using system-allocated memory is more convenient, the latter implementation should be preferred.

Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 50.06% 4.2094ms 1 4.2094ms 4.2094ms 4.2094ms void dot<unsigned long=1024, Vector<float>>
49.94% 4.1998ms 1 4.1998ms 4.1998ms 4.1998ms void fill<Vector<float>>
API calls: 86.19% 196.75ms 2 98.376ms 61.530us 196.69ms cudaMallocManaged
4.28% 9.7681ms 2 4.8841ms 111.40us 9.6567ms cudaFree
3.93% 8.9722ms 5 1.7944ms 4.8190us 4.7941ms cudaDeviceSynchronize
3.74% 8.5434ms 6 1.4239ms 5.7970us 8.0720ms cudaMemPrefetchAsync
1.68% 3.8463ms 114 33.739us 908ns 2.1352ms cuDeviceGetAttribute
...

Unified Memory profiling result:
Device "(0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
2 4.0000KB 4.0000KB 4.0000KB 8.000000KB 5.184000us Host To Device
3 5.3330KB 4.0000KB 8.0000KB 16.00000KB 8.000000us Device To Host
2 4.0000KB 4.0000KB 4.0000KB 8.000000KB - Remote mapping from device
Total CPU Page faults: 2
Total remote mappings to CPU: 2

Please note that the profiled codes also include prefetching to obtain the best performance. The prefetching calls are not included in the code samples provided here, because they are out of the scope of Part I.

[1] CUDA C++ Programming Guide (2024) CUDA Toolkit Documentation 12.4. Available at: https://docs.nvidia.com/cuda/archive/12.4.0/cuda-c-programming-guide/index.html (Accessed: 11 May 2024).

[2] Harris, M. (2013) Unified Memory in CUDA 6, NVIDIA Technical Blog. Available at: https://developer.nvidia.com/blog/unified-memory-in-cuda-6/ (Accessed: 11 May 2024).

[3] Hwu, W.W., Kirk, D.B. and El Hajj, I. (2023) Programming Massively Parallel Processors: A Hands-on Approach. Cambridge, MA: Morgan Kaufmann.

[4] Sanders, J. and Kandrot, E. (2015) CUDA by Example: An Introduction to General-Purpose GPU Programming. Upper Saddle River, NJ: Addison-Wesley/Pearson Education.
