Vector — An owning type efficiently accessible to both host and device in CUDA C++

CUDA C++ Modern Unified Memory Programming — Part I

emre avcı
May 12, 2024

In Part I of the series, a Vector data structure in C++ that can be accessed efficiently by both host (CPU) and device (GPU) is loosely implemented; it will be used with the dot product kernel to be implemented in Part II of the series.

One of the challenges of heterogeneous programming is memory management. The usual approach is to allocate memory on both host and device and to copy data explicitly from host to device or from device to host as needed. There are many examples of this approach in [4], which might discourage beginners.

With the inclusion of Unified Memory in the programming model, a “bridge” between CPU and GPU is constructed. Thanks to that bridge,

  • allocating memory on both sides (host and device) is not necessary, because memory is migrated on demand,
  • memory is accessible through a single pointer, whereas with the usual approach host and device pointers differ,
  • the complexity of programs is decreased [2] (a rough before/after sketch follows this list).
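As a rough before/after sketch (not taken verbatim from [2] or [4]; the helper function names are made up for illustration), compare the explicit-copy pattern with a single managed allocation:

#include <cstdlib>

constexpr int N = 1 << 20;

// Traditional approach: separate host and device allocations plus explicit copies.
void traditionalApproach()
{
    float *hostData = static_cast<float*>(std::malloc(N * sizeof(float)));
    float *deviceData = nullptr;
    cudaMalloc(&deviceData, N * sizeof(float));
    // ... fill hostData on the host ...
    cudaMemcpy(deviceData, hostData, N * sizeof(float), cudaMemcpyHostToDevice);
    // ... launch a kernel on deviceData ...
    cudaMemcpy(hostData, deviceData, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(deviceData);
    std::free(hostData);
}

// Unified Memory: one allocation, one pointer; pages migrate on demand.
void unifiedMemoryApproach()
{
    float *data = nullptr;
    cudaMallocManaged(&data, N * sizeof(float));
    // ... both host code and kernels dereference data directly ...
    cudaFree(data);
}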

Last but not least, Unified Memory opens the door to using C++ more extensively in CUDA programs, which is the main point of this article. The importance of using C++ more becomes very evident when the programmer has to write code that is hard to read due to a lack of abstractions. For instance, despite being a very good resource, if not the best, for learning CUDA, the sample codes in the book [3] are written in CUDA C. As a consequence, even indexing a flattened array that represents a row-major-ordered matrix is difficult to follow and understand, especially as the algorithm gets more sophisticated, e.g., the code for thread coarsening in tiled matrix multiplication in [3].

Another advantage of creating classes is the information that can be encapsulated in them, such as the dimensions of Vector or Matrix classes, which can be used in kernels or in class member functions. When classes are not used, as in the code samples of [4], that information is generally passed to kernels by value, increasing the argument count; a rough illustration of the difference in kernel signatures is sketched below. In the following sections, different implementations of the Vector class are presented.
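The following signatures are hypothetical, intended only to illustrate the argument-count difference; they are not taken from [3] or [4].

#include <cstddef>

// Without a class, size information travels as extra kernel arguments.
__global__ void dotRaw(float* partials, const float* lhs, const float* rhs, std::size_t size);

// With a Vector that carries its own size, the argument list stays short.
template<class VectorType>
__global__ void dot(VectorType& partials, const VectorType& lhs, const VectorType& rhs);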

A non-owning type passed by value

A class relying on shallow copy might be used via pass-by-value, as in the code sample in the Shared Memory subchapter of [1]. Such a Vector class might be implemented as follows.

template<class Type>
struct Vector
{
    using ValueType = Type;
    using SizeType = std::size_t;
    // ...
    ValueType *data_ = nullptr;
    SizeType size_ = 0;
};
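The fillPassByValue and dotPassByValue kernels used in the example further below are not shown in this part (the dot product kernel itself is the subject of Part II). As a minimal sketch of the pass-by-value mechanism only, a grid-stride fill kernel might look like the following; the body is an assumption, not the author's exact code.

// The Vector argument is copied by value to the device; the copy is shallow,
// so the kernel sees the same managed data_ pointer and size_ as the host.
template<class VectorType>
__global__ void fillPassByValue(VectorType vector, typename VectorType::ValueType value)
{
    using SizeType = typename VectorType::SizeType;
    const SizeType stride = static_cast<SizeType>(blockDim.x) * gridDim.x;
    for (SizeType i = blockIdx.x * blockDim.x + threadIdx.x; i < vector.size_; i += stride)
    {
        vector.data_[i] = value;
    }
}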

However, that requires users of the class to handle memory themselves, which is not very convenient. Since instances of the class are passed to the kernel by value, the memory cannot be handled by the class itself via RAII (Resource Acquisition Is Initialization).

#include <iostream>
#include <numeric>

// ...

template<class Type>
constexpr Type zero = 0.0;

template<class Type>
constexpr Type one = 1.0;

int main()
{
    constexpr int N = 1024 * 1024 * 256;
    constexpr int blocksPerGrid = 256 * 4 * 2;
    constexpr int threadsPerBlock = 1024;
    using ValueType = float;
    using SizeType = std::size_t;
    using VectorType = Vector<ValueType>;

    VectorType vectorOfOnes;
    vectorOfOnes.size_ = N;
    cudaMallocManaged(&vectorOfOnes.data_, N * sizeof(ValueType));

    fillPassByValue<<<blocksPerGrid, threadsPerBlock>>>(vectorOfOnes, one<ValueType>);

    VectorType partials;
    partials.size_ = blocksPerGrid;
    cudaMallocManaged(&partials.data_, blocksPerGrid * sizeof(ValueType));

    dotPassByValue<<<blocksPerGrid, threadsPerBlock>>>(partials, vectorOfOnes, vectorOfOnes);
    cudaDeviceSynchronize();

    ValueType value = std::accumulate(partials.cbegin(), partials.cend(), zero<ValueType>);
    std::cout << value << std::endl;

    cudaDeviceSynchronize();
    cudaFree(vectorOfOnes.data_);
    cudaFree(partials.data_);
    return 0;
}

An owning type passed by reference relying on managed memory

In order to pass instances of classes that rely on RAII for automatic memory management by reference, the new and delete operators might be overloaded as follows, as in [2].

struct Managed
{
    void *operator new(std::size_t len)
    {
        void *ptr;
        checkCudaError(cudaMallocManaged(&ptr, len));
        checkCudaError(cudaDeviceSynchronize());
        return ptr;
    }

    void operator delete(void *ptr)
    {
        checkCudaError(cudaDeviceSynchronize());
        checkCudaError(cudaFree(ptr));
    }
};
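The checkCudaError helper used above is not defined in the article; a minimal sketch of such a helper might be as follows.

#include <cstdio>
#include <cstdlib>

// Terminate with a readable message if a CUDA runtime call fails
// (a sketch; the author's checkCudaError may differ).
inline void checkCudaError(cudaError_t result)
{
    if (result != cudaSuccess)
    {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(result));
        std::exit(EXIT_FAILURE);
    }
}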

Classes then need to inherit from the Managed class. For instance, the Vector class could be implemented as below.

template<class Type>
class Vector : public Managed
{
public:
    using SelfType = Vector<Type>;
    using ValueType = Type;
    using SizeType = std::size_t;

    Vector(SizeType size) : size_{size} { allocateUnifiedMemory(); }
    Vector(const Vector&) = delete;
    Vector& operator=(const Vector&) = delete;
    Vector(Vector&&) = delete;
    Vector& operator=(Vector&&) = delete;
    ~Vector() { freeUnifiedMemory(); }
    // ...

private:
    // ...
    ValueType *data_ = nullptr;
    SizeType size_ = 0;
};
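The allocateUnifiedMemory and freeUnifiedMemory helpers, as well as the cbegin and cend accessors used later with std::accumulate, are elided above. A minimal sketch of these members, assuming the data buffer is allocated with cudaMallocManaged, might be:

// Possible definitions of the elided members (an assumption, not the author's exact code).

// Public accessors used with std::accumulate on the host.
const ValueType* cbegin() const { return data_; }
const ValueType* cend() const { return data_ + size_; }

// Private helpers: the data buffer itself lives in managed memory.
void allocateUnifiedMemory()
{
    checkCudaError(cudaMallocManaged(&data_, size_ * sizeof(ValueType)));
}

void freeUnifiedMemory()
{
    checkCudaError(cudaFree(data_));
}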

Although the copy constructor above is explicitly deleted, it could also be implemented as in [2]. The following is a possible use case of the implementation above.

#include <iostream>
#include <memory>
#include <numeric>

// ...

template<class Type>
constexpr Type zero = 0.0;

template<class Type>
constexpr Type one = 1.0;

int main()
{
    constexpr int N = 1024 * 1024 * 256;
    constexpr int blocksPerGrid = 256 * 4 * 2;
    constexpr int threadsPerBlock = 1024;
    using ValueType = float;
    using VectorType = Vector<ValueType>;

    auto ptrVectorOfOnes = std::make_unique<VectorType>(N);
    VectorType& vectorOfOnes = *ptrVectorOfOnes;

    fill<<<blocksPerGrid, threadsPerBlock>>>(vectorOfOnes, one<ValueType>);

    auto ptrPartials = std::make_unique<VectorType>(blocksPerGrid);
    VectorType& partials = *ptrPartials;

    dot<<<blocksPerGrid, threadsPerBlock>>>(partials, vectorOfOnes, vectorOfOnes);
    cudaDeviceSynchronize();

    ValueType value = std::accumulate(partials.cbegin(), partials.cend(), zero<ValueType>);
    std::cout << value << std::endl;

    return 0;
}

Although this works fine, there are a few drawbacks to this method. First, heap allocations are expensive. Second, in order to pass an instance of the class by reference, it is necessary to dereference the pointer. To relieve the burden of dereferencing every time, an alias might be created as above, which is just an extra line but still a burden. Third, every class needs to inherit from the Managed class. As noted in [2], even classes that are composed of classes inheriting from the Managed class need to inherit from Managed themselves.

An owning type passed by reference relying on system allocated memory

According to [1], system-allocated memory is part of Unified Memory when the following requirements are satisfied (a runtime check is sketched after the list):

  • a modern Linux system with HMM (Heterogeneous Memory Management),
  • a GPU with compute capability of at least 7.5,
  • CUDA driver version 535+ installed with the Open Kernel Modules.
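Whether a given system satisfies these requirements can be checked at run time. As a small sketch (not taken verbatim from [1]), the cudaDevAttrPageableMemoryAccess device attribute reports whether the GPU can coherently access memory allocated by the system allocator:

#include <cstdio>

int main()
{
    int pageableAccess = 0;
    // Non-zero when device 0 can coherently access pageable (system-allocated)
    // memory, e.g., memory from malloc, new, or stack objects.
    cudaDeviceGetAttribute(&pageableAccess, cudaDevAttrPageableMemoryAccess, 0);
    std::printf("Pageable memory access supported: %d\n", pageableAccess);
    return 0;
}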

The Vector class (or any other class) does not need to inherit from the Managed class anymore. No other change is required to the data structure for now. The real difference is experienced by the users of the class: the code now looks cleaner and much more like native C++, since it is very uncommon to have an object of type std::unique_ptr<std::vector<float>> in ordinary C++ code.

#include <iostream>
#include <numeric>

// ...

template<class Type>
constexpr Type zero = 0.0;

template<class Type>
constexpr Type one = 1.0;

int main()
{
    constexpr int N = 1024 * 1024 * 256;
    constexpr int blocksPerGrid = 256 * 4 * 2;
    constexpr int threadsPerBlock = 1024;
    using ValueType = float;
    using VectorType = Vector<ValueType>;

    VectorType vectorOfOnes(N);
    fill<<<blocksPerGrid, threadsPerBlock>>>(vectorOfOnes, one<ValueType>);

    VectorType partials(blocksPerGrid);
    dot<<<blocksPerGrid, threadsPerBlock>>>(partials, vectorOfOnes, vectorOfOnes);
    cudaDeviceSynchronize();

    ValueType value = std::accumulate(partials.cbegin(), partials.cend(), zero<ValueType>);
    std::cout << value << std::endl;

    return 0;
}

Comparing performance of two owning type implementations

Two different approaches have been presented to pass an instance of the Vector class to a kernel by reference. It is clear that system-allocated unified memory is more convenient to use. However, before discarding managed unified memory, that is, the implementation inheriting from the Managed class, and moving on with system-allocated unified memory, the performance of both approaches must be examined and any necessary changes to the Vector class must be made. The profile of the dot and fill kernels with managed unified memory can be found below.

Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 50.01% 4.1963ms 1 4.1963ms 4.1963ms 4.1963ms void fill<Vector<float>>
49.99% 4.1953ms 1 4.1953ms 4.1953ms 4.1953ms void dot<unsigned long=1024, Vector<float>>
API calls: 86.59% 196.31ms 4 49.076ms 7.0530us 196.14ms cudaMallocManaged
4.28% 9.7063ms 4 2.4266ms 12.572us 9.5718ms cudaFree
3.72% 8.4433ms 9 938.14us 4.3300us 4.1988ms cudaDeviceSynchronize
3.55% 8.0480ms 6 1.3413ms 7.1940us 7.9174ms cudaMemPrefetchAsync
1.72% 3.8931ms 114 34.150us 908ns 2.1667ms cuDeviceGetAttribute
...

Unified Memory profiling result:
Device "(0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
2 8.0000KB 4.0000KB 12.000KB 16.00000KB 6.208000us Host To Device
3 5.3330KB 4.0000KB 8.0000KB 16.00000KB 5.600000us Device To Host
Total CPU Page faults: 3

The profile of the dot and fill kernels with system-allocated unified memory can be found below. Kernel times have increased significantly. When the Unified Memory profiling results are examined, there are GPU page faults, page throttles, and memory thrashes, none of which exist in the profiling result of managed unified memory. There are also significant increases in memory transfers between host and device and in CPU page faults. It is clear that there is a problem with the way system-allocated unified memory is used.

Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 70.42% 11.069ms 1 11.069ms 11.069ms 11.069ms void dot<unsigned long=1024, Vector<float>>
29.58% 4.6486ms 1 4.6486ms 4.6486ms 4.6486ms void fill<Vector<float>>
API calls: 83.27% 194.88ms 2 97.442ms 53.009us 194.83ms cudaMallocManaged
6.77% 15.846ms 5 3.1693ms 4.7490us 11.192ms cudaDeviceSynchronize
4.19% 9.8066ms 2 4.9033ms 124.32us 9.6823ms cudaFree
3.98% 9.3054ms 6 1.5509ms 7.1940us 8.1489ms cudaMemPrefetchAsync
1.66% 3.8820ms 114 34.052us 907ns 2.1730ms cuDeviceGetAttribute
...

Unified Memory profiling result:
Device "(0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
38 8.6309KB 4.0000KB 44.000KB 328.0000KB 117.5360us Host To Device
39 7.4863KB 4.0000KB 44.000KB 292.0000KB 125.3130us Device To Host
91 - - - - 7.705400ms Gpu page fault groups
5 - - - - 2.243987ms Page throttles
23 4.0000KB 4.0000KB 4.0000KB 92.00000KB - Memory thrashes
Total CPU Page faults: 36
Total CPU thrashes: 23
Total CPU throttles: 5

In order to get rid of page faults (at least mostly), a memory advice can be given, as hinted in [1]. Eventually, the data structure is to be used by both host and device. The data of the class is allocated with managed memory, so that should not be the cause of the performance degradation. However, with the system-allocated unified memory approach, instances of the class are automatic variables that are not, by default, expected to be accessed by the device. By adding a simple API call to the constructor and destructor of Vector, this problem can be solved.

template<class Type>
class Vector
{
public:
    using SelfType = Vector<Type>;
    using ValueType = Type;
    using SizeType = std::size_t;

    Vector(SizeType size) : size_{size}
    {
        allocateUnifiedMemory();
        adviseOnMemory();
    }
    Vector(const Vector&) = delete;
    Vector& operator=(const Vector&) = delete;
    Vector(Vector&&) = delete;
    Vector& operator=(Vector&&) = delete;
    ~Vector()
    {
        freeUnifiedMemory();
        adviseAgainstMemory();
    }
    // ...

private:
    // ...
    void adviseOnMemory() const
    {
        checkCudaError(cudaMemAdvise(this, sizeof(SelfType), cudaMemAdviseSetAccessedBy, 0));
    }

    void adviseAgainstMemory() const
    {
        checkCudaError(cudaMemAdvise(this, sizeof(SelfType), cudaMemAdviseUnsetAccessedBy, 0));
    }

    ValueType *data_ = nullptr;
    SizeType size_ = 0;
};

The profiling result after the modification above can be found below. Clearly, there is no longer a performance gap between the implementations. Since using system-allocated memory is more convenient, the latter implementation should be preferred.

Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 50.06% 4.2094ms 1 4.2094ms 4.2094ms 4.2094ms void dot<unsigned long=1024, Vector<float>>
49.94% 4.1998ms 1 4.1998ms 4.1998ms 4.1998ms void fill<Vector<float>>
API calls: 86.19% 196.75ms 2 98.376ms 61.530us 196.69ms cudaMallocManaged
4.28% 9.7681ms 2 4.8841ms 111.40us 9.6567ms cudaFree
3.93% 8.9722ms 5 1.7944ms 4.8190us 4.7941ms cudaDeviceSynchronize
3.74% 8.5434ms 6 1.4239ms 5.7970us 8.0720ms cudaMemPrefetchAsync
1.68% 3.8463ms 114 33.739us 908ns 2.1352ms cuDeviceGetAttribute
...

Unified Memory profiling result:
Device "(0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
2 4.0000KB 4.0000KB 4.0000KB 8.000000KB 5.184000us Host To Device
3 5.3330KB 4.0000KB 8.0000KB 16.00000KB 8.000000us Device To Host
2 4.0000KB 4.0000KB 4.0000KB 8.000000KB - Remote mapping from device
Total CPU Page faults: 2
Total remote mappings to CPU: 2

Please note that the profiled codes also include prefetching to obtain the best performance. The prefetching calls are not included in the code samples provided here, because they are out of the scope of Part I.

[1] CUDA C++ Programming Guide (2024) CUDA Toolkit Documentation 12.4. Available at: https://docs.nvidia.com/cuda/archive/12.4.0/cuda-c-programming-guide/index.html (Accessed: 11 May 2024).

[2] Harris, M. (2013) Unified Memory in CUDA 6, NVIDIA Technical Blog. Available at: https://developer.nvidia.com/blog/unified-memory-in-cuda-6/ (Accessed: 11 May 2024).

[3] Hwu, W.W., Kirk, D.B. and El Hajj, I. (2023) Programming Massively Parallel Processors: A Hands-on Approach. Cambridge, MA: Morgan Kaufmann.

[4] Sanders, J. and Kandrot, E. (2015) CUDA by Example: An Introduction to General-Purpose GPU Programming. Upper Saddle River, NJ: Addison-Wesley/Pearson Education.
