CUDA by Numba Examples: Streams and Events
Follow part 3 of this series to learn about streams and events in CUDA programming for Python
Part 3 of 4: Streams and Events
Introduction
In the first two installments of this series (part 1 here, and part 2 here), we learned how to perform simple tasks with GPU programming, such as embarrassingly parallel tasks, reductions using shared memory, and device functions. We also learned how to time functions from the host — and why that might not be the best way to time code.
In This Tutorial
In order to improve our timing capabilities, we will introduce CUDA events and how to use them. But before we delve into that, we will discuss CUDA streams and why they are important.
Click here to grab the code in Google Colab.
This tutorial is followed by one more part: Part 4.
Getting Started
Import and load libraries, ensure you have a GPU.
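A minimal setup could look like the following (the exact imports are an assumption about the original notebook):
import numpy as np
from numba import cuda

# Print a summary of the CUDA devices Numba can see.
cuda.detect()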
Streams
When we launch a kernel from the host, its execution is queued on the GPU, to be executed once the GPU has finished all previously launched tasks.
Many tasks that the user launches on the device may depend on previous tasks, so “putting them in the same queue” makes sense. For example, if you are copying data asynchronously to the GPU to process it with a certain kernel, that copy must have finished before the kernel runs.
But what if you have two kernels that are independent of each other? Would it make sense to put them in the same queue? Probably not! For these cases, CUDA has streams. You can think of streams as separate queues, which run independently of each other. They can also run concurrently, that is, at the same time. This can vastly speed up total runtime when running many independent tasks.
Stream Semantics in Numba CUDA
We will take the two tasks we learned so far and queue them to create a normalization pipeline. Given a (host) array a, we will overwrite it with a normalized version of it:
a ← a / ∑a[i]
For that we will use three kernels. The first kernel, partial_reduce, will be our partial reduction from Part 2. It will return a blocks_per_grid-sized array, which we will pass to another kernel, single_thread_sum, which will further reduce it to a singleton array (size 1). This kernel will be run on a single block with a single thread. Finally, we will use divide_by to divide our original array in place by the sum we previously calculated. All of these operations will take place on the GPU, and they should run one after the other.
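As a reminder, a minimal sketch of these three kernels, based on the partial reduction from Part 2 (the launch configuration values are illustrative assumptions), could look like this:
from numba import cuda, float32

threads_per_block = 256
blocks_per_grid = 32 * 20

@cuda.jit
def partial_reduce(array, partial_reduction):
    i_start = cuda.grid(1)
    threads_per_grid = cuda.gridsize(1)

    # Each thread accumulates a strided portion of the array...
    s_thread = 0.0
    for i_arr in range(i_start, array.size, threads_per_grid):
        s_thread += array[i_arr]

    # ...then the block reduces those per-thread sums in shared memory.
    s_block = cuda.shared.array((threads_per_block,), float32)
    tid = cuda.threadIdx.x
    s_block[tid] = s_thread
    cuda.syncthreads()

    i = cuda.blockDim.x // 2
    while i > 0:
        if tid < i:
            s_block[tid] += s_block[tid + i]
        cuda.syncthreads()
        i //= 2

    if tid == 0:
        partial_reduction[cuda.blockIdx.x] = s_block[0]

@cuda.jit
def single_thread_sum(partial_reduction, partial_sum):
    # Runs with a single thread: adds up the per-block partial sums.
    partial_sum[0] = 0.0
    for element in partial_reduction:
        partial_sum[0] += element

@cuda.jit
def divide_by(array, val_array):
    # In-place division of every element by the (single-element) sum.
    i_start = cuda.grid(1)
    threads_per_grid = cuda.gridsize(1)
    for i in range(i_start, array.size, threads_per_grid):
        array[i] /= val_array[0]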
When kernel calls and other operations are not given a stream, they run in the default stream. The default stream is a special stream whose behavior depends on whether one is running legacy or per-thread streams. For our purposes, it suffices to say that if you want to achieve concurrency, you should run tasks in non-default streams. Let’s see how to do that for some operations, such as kernel launches, array copies, and array creation.
Before we can actually talk about streams, we need to talk about the elephant in the room: cuda.pinned. This context manager creates a special type of memory called page-locked or pinned memory, from which CUDA benefits when transferring memory from host to device.
Memory sitting in the host RAM can be paged at any moment, that is, the operating system can surreptitiously move objects from RAM to hard disk. It does this so that objects used infrequently are moved to a slower memory location, leaving the fast RAM available for more urgently needed objects. What matters to us is that CUDA does not allow asynchronous transfers from pageable objects to the GPU. It does this to prevent a constant stream of very slow transfers: disk (paged) → RAM → GPU.
To transfer data asynchronously, we must then ensure that the data always sits in RAM by somehow preventing the OS from sneakily hiding it on disk somewhere. This is where memory pinning comes into play: it creates a context within which the argument will be “page-locked”, that is, forced to be in RAM. See Figure 3.2.
From then on, the code is pretty straightforward. A stream is created, after which it is passed to every CUDA function that we want to operate on that stream. Importantly, the Numba CUDA kernel configuration (square brackets) requires the stream as the third argument, after the grid and block dimensions.
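A minimal sketch of such a pipeline, in the spirit of Example 3.1, assuming the kernels and launch configuration above (the array size is an arbitrary choice for illustration):
# Example host array.
a = np.random.rand(10_000_000).astype(np.float32)

with cuda.pinned(a):  # page-lock the host array so transfers can be async
    stream = cuda.stream()

    dev_a = cuda.to_device(a, stream=stream)
    dev_a_reduce = cuda.device_array((blocks_per_grid,), dtype=dev_a.dtype, stream=stream)
    dev_a_sum = cuda.device_array((1,), dtype=dev_a.dtype, stream=stream)

    partial_reduce[blocks_per_grid, threads_per_block, stream](dev_a, dev_a_reduce)
    single_thread_sum[1, 1, stream](dev_a_reduce, dev_a_sum)
    divide_by[blocks_per_grid, threads_per_block, stream](dev_a, dev_a_sum)

    dev_a.copy_to_host(a, stream=stream)

    # Block the host until everything queued on this stream has finished.
    stream.synchronize()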
WARNING: Generally, passing a stream to a Numba CUDA API function does not change its behavior, only the stream in which it runs. One exception is the copy from device to host. When calling device_array.copy_to_host() (without arguments), the copy happens synchronously. When calling device_array.copy_to_host(stream=stream) (with a stream), the copy will happen synchronously if device_array is not pinned. The copy will only happen asynchronously if device_array is pinned and a stream is passed.
INFO: Numba provides a useful context manager to enqueue all operations within its context; when exiting the context, all operations will be synchronized, including memory transfers. Example 3.1 can also be written as:
with cuda.pinned(a):
    stream = cuda.stream()
    with stream.auto_synchronize():
        dev_a = cuda.to_device(a, stream=stream)
        dev_a_reduce = cuda.device_array((blocks_per_grid,), dtype=dev_a.dtype, stream=stream)
        dev_a_sum = cuda.device_array((1,), dtype=dev_a.dtype, stream=stream)
        partial_reduce[blocks_per_grid, threads_per_block, stream](dev_a, dev_a_reduce)
        single_thread_sum[1, 1, stream](dev_a_reduce, dev_a_sum)
        divide_by[blocks_per_grid, threads_per_block, stream](dev_a, dev_a_sum)
        dev_a.copy_to_host(a, stream=stream)
Decoupling Independent Kernels with Streams
Suppose we want to normalize not one array, but multiple arrays. The operations for normalizing separate arrays are completely independent of each other, so it doesn’t make sense for the GPU to wait until one normalization ends before the next one begins. We should therefore separate these tasks into their own streams.
Let’s see an example of normalizing 10 arrays — each using its own stream.
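A sketch of what this could look like, assuming the kernels and launch configuration above (the number and size of the arrays are arbitrary choices for illustration):
N_arrays = 10
arrays = [np.random.rand(10_000_000).astype(np.float32) for _ in range(N_arrays)]

with cuda.pinned(*arrays):
    # One stream per array: each pipeline is queued independently.
    streams = [cuda.stream() for _ in range(N_arrays)]

    for a, stream in zip(arrays, streams):
        dev_a = cuda.to_device(a, stream=stream)
        dev_a_reduce = cuda.device_array((blocks_per_grid,), dtype=dev_a.dtype, stream=stream)
        dev_a_sum = cuda.device_array((1,), dtype=dev_a.dtype, stream=stream)

        partial_reduce[blocks_per_grid, threads_per_block, stream](dev_a, dev_a_reduce)
        single_thread_sum[1, 1, stream](dev_a_reduce, dev_a_sum)
        divide_by[blocks_per_grid, threads_per_block, stream](dev_a, dev_a_sum)

        dev_a.copy_to_host(a, stream=stream)

    # Wait for every stream to finish before using the results on the host.
    for stream in streams:
        stream.synchronize()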
And now let’s compare to a single stream.
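A sketch of the same pipeline queued on a single stream (for a fair timing comparison you would regenerate the arrays first):
with cuda.pinned(*arrays):
    stream = cuda.stream()

    for a in arrays:
        dev_a = cuda.to_device(a, stream=stream)
        dev_a_reduce = cuda.device_array((blocks_per_grid,), dtype=dev_a.dtype, stream=stream)
        dev_a_sum = cuda.device_array((1,), dtype=dev_a.dtype, stream=stream)

        partial_reduce[blocks_per_grid, threads_per_block, stream](dev_a, dev_a_reduce)
        single_thread_sum[1, 1, stream](dev_a_reduce, dev_a_sum)
        divide_by[blocks_per_grid, threads_per_block, stream](dev_a, dev_a_sum)

        dev_a.copy_to_host(a, stream=stream)

    # A single synchronize suffices: everything sits in the same queue.
    stream.synchronize()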
But which one is faster? When running these examples, I do not get a consistent improvement in total time when using multiple streams. There can be many reasons for this. For example, for streams to run concurrently, there must be enough space in local memory. In addition, we are timing from the CPU. While it can be very hard to know whether there is enough space in local memory, timing from the GPU is relatively easy. Let’s learn how!
INFO: Nvidia provides several tools for debugging CUDA, including tools for debugging CUDA streams. Look into Nsight Systems for more information.
Events
One of the issues with timing code from the CPU is that the measurement will include many operations other than the GPU work itself.
Thankfully, it is possible to time directly from the GPU with CUDA events. An event is simply a time register of when something happened on the GPU. In a way it is similar to time.time and time.perf_counter, but unlike those, we have to deal with the fact that while we are programming from the CPU, we want to time events from the GPU. So, besides just creating timestamps (“recording” events), we need to ensure that events are synchronized with the CPU before we can access their values. Let us examine a simple example.
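A minimal sketch of that idea, assuming the kernels and device arrays from the pipeline above (the kernel being timed is an arbitrary choice):
# Create a pair of events (timing is enabled by default).
event_beg = cuda.event()
event_end = cuda.event()

# Record a timestamp, launch the kernel, record another timestamp.
event_beg.record()
divide_by[blocks_per_grid, threads_per_block](dev_a, dev_a_sum)
event_end.record()

# Wait until all work preceding the "end" event has completed...
event_end.synchronize()

# ...and only then read the elapsed time (in milliseconds).
elapsed_ms = cuda.event_elapsed_time(event_beg, event_end)
print(f"Kernel took {elapsed_ms:.2f} ms")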
Events for Timing Kernel Execution
One useful recipe for timing GPU operations is to use a context manager:
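A minimal sketch of such a context manager; the name cuda_timer is a hypothetical helper for illustration, not part of the Numba API:
from contextlib import contextmanager

from numba import cuda

@contextmanager
def cuda_timer(stream=0):
    # Record an event before and after the body, synchronize on the second
    # event, and expose the elapsed time (in milliseconds) to the caller.
    evt_beg = cuda.event()
    evt_end = cuda.event()
    elapsed = {}
    evt_beg.record(stream)
    try:
        yield elapsed
    finally:
        evt_end.record(stream)
        evt_end.synchronize()
        elapsed["ms"] = cuda.event_elapsed_time(evt_beg, evt_end)

# Usage: time a kernel launched on the default stream.
with cuda_timer() as timing:
    divide_by[blocks_per_grid, threads_per_block](dev_a, dev_a_sum)
print(f"Elapsed: {timing['ms']:.2f} ms")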
Events for Timing Streams
To end this installment of the series, we will use events to get a better, more accurate view of whether our example is benefiting from streams or not.
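A sketch of how this could be done, reusing the arrays, streams, kernels and launch configuration from the multi-stream example above, with one pair of events per stream:
# One pair of events per stream; each pair brackets one array's pipeline.
events_beg = [cuda.event() for _ in range(N_arrays)]
events_end = [cuda.event() for _ in range(N_arrays)]

with cuda.pinned(*arrays):
    for a, stream, evt_beg, evt_end in zip(arrays, streams, events_beg, events_end):
        evt_beg.record(stream)

        dev_a = cuda.to_device(a, stream=stream)
        dev_a_reduce = cuda.device_array((blocks_per_grid,), dtype=dev_a.dtype, stream=stream)
        dev_a_sum = cuda.device_array((1,), dtype=dev_a.dtype, stream=stream)

        partial_reduce[blocks_per_grid, threads_per_block, stream](dev_a, dev_a_reduce)
        single_thread_sum[1, 1, stream](dev_a_reduce, dev_a_sum)
        divide_by[blocks_per_grid, threads_per_block, stream](dev_a, dev_a_sum)

        dev_a.copy_to_host(a, stream=stream)

        evt_end.record(stream)

    for stream in streams:
        stream.synchronize()

# Per-stream elapsed time, measured on the GPU.
for i, (evt_beg, evt_end) in enumerate(zip(events_beg, events_end)):
    print(f"Stream {i}: {cuda.event_elapsed_time(evt_beg, evt_end):.2f} ms")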
Conclusion
CUDA is all about performance. In this tutorial you learned how to accurately measure the execution time of kernels using events, in a way that can be used to profile your code. You also learned about streams and how they can be used to keep your GPU busy, as well as about pinned or mapped arrays and how they can improve memory access.