More Performance for Less with RAPIDS cuDF Python Scalars

Brandon B. Miller · Published in RAPIDS AI · Jan 27, 2021

RAPIDS cuDF is ultimately a tool that lets users get more tabular data processing done for less time and effort. To that end, as engineers working on cuDF, our mission is to continuously look for places to make things more efficient without creating headaches for the programmer. Due to the design of GPUs and their CPU-cooperative nature, we often find exactly that kind of overhead woven naturally into the translation process between the high-level Python operations requested by the user and the far simpler sequence of GPU operations required to obtain the desired result. These are things like extra memory transfers between the CPU and GPU memory spaces, extra GPU synchronization, or unnecessary copying. For performance-critical workflows, these inefficiencies can add up quickly and really matter. That is why, from RAPIDS release 0.17 forward, we support a Python abstraction around a GPU-backed scalar value. This blog post will talk about the technical problem cuDF scalars solve and demonstrate how they can be integrated into a workflow to produce nontrivial speedups for minimal effort.

Translation Overhead

GPU scalars are a way of generating a single value that lives in GPU device memory and persists as long as the Python object lives. While libcudf (the C++ library underlying cuDF Python) has supported scalar objects for some time, up until release 0.17 these scalar values were always generated "just in time" by the Python layer to do their work, and then disposed of. Let's take a simple example and peel back the layers of exactly what happens by default when, for instance, we add a scalar value to a cuDF Series:
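Here is a minimal sketch of that operation (assuming a working cuDF installation), with its output shown after:

```python
import cudf

s = cudf.Series([1, 2, 3])

# Add a plain Python scalar to the Series
print(s + 1)
```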

```
0    2
1    3
2    4
dtype: int64
```
Outline of the internal process for adding a Python scalar to a cuDF Series object. Yellow represents Python objects that live in the host memory space, which may or may not be backed by some associated GPU memory (green). To perform the sum, the cuDF Series code first checks to make sure the two operands can logically be added together. If they can, a temporary scalar value is created in GPU memory and used to calculate the result. In the end, a new Series object is returned to the user, and the intermediate, GPU-based scalar value is disposed of. With a cuDF scalar, however, many of these steps can be skipped, most significantly the overhead from creating and destroying the scalar value in question.

When something is added to a cuDF object, it is the cuDF object itself that examines the other operand, determines if the desired operation is valid, and then selects the procedure for computing the result. In this case, the cudf.Series code recognizes that the other operand is a Python scalar and knows that a scalar value can sensibly be added to a Series. The process starts with cuDF attempting to construct a typed NumPy scalar around the incoming scalar value. Next, it creates a temporary GPU-based scalar value (NOT a cuDF Python scalar) with the appropriate type, which involves allocating device memory and performing a host-to-device copy. Finally, both the Series data and the scalar value are handed to a compiled libcudf routine to compute the result. Apart from a little post-processing and returning the result of that computation to the user, that's the process. But importantly, cuDF has no reason to believe that the GPU scalar is needed after this point, so it's released from GPU memory. If you want to repeat the operation, you need to reallocate memory for it and repeat the transfer. This is the overhead that a cuDF scalar aims to eliminate, by allowing the scalar to persist and be reused for subsequent operations. A simple example of this is something like the following:
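A minimal sketch of the pattern:

```python
import cudf

s = cudf.Series([1, 2, 3])
val = 1  # an ordinary Python int

# Each operation below rebuilds a temporary device scalar from `val`,
# allocating GPU memory and copying host to device both times
result_a = s + val
result_b = s + val
```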

The problem here is that, from an API perspective, cuDF has no way of knowing which values will be reused and which will not for the general case of any code users could write in Python. cuDF can't be concerned with the fate of val without overstepping its domain and mutating an int that it has no conceptual ownership of. cuDF Python scalars, however, provide a way for the user to signal to cuDF that it should hold onto that scalar, tying its lifetime to the lifetime of the Python object. We'll probe the internals of this in the next section, but for now, let's use NVTX to compare the previous code block to the same code using a cuDF scalar in place of the variable val. NVTX is a profiling tool used to diagnose performance bottlenecks, and you can read much more about it in its documentation. It allows one to wrap Python functions with a special decorator that marks when, in time, that function begins and ends execution. Annotating high-level cuDF functions along with their callees allows us to build plots like the one below and create a very powerful visualization of exactly how long a function spends performing its individual tasks. Here I have used it to encapsulate the block of code in the previous example as "SUM OPERATION" (blue), as well as the piece of code in cuDF that is responsible for constructing the correct GPU scalar based on a Python host-side scalar as "SCALAR_TO_DEVICE" (green).
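The user-facing half of that annotation might look like the sketch below, using the nvtx Python package; the "SCALAR_TO_DEVICE" range comes from an annotation inside cuDF itself rather than from user code:

```python
import cudf
import nvtx

s = cudf.Series([1, 2, 3])
val = 1

# Mark the whole block so it shows up as a named range in the profile
with nvtx.annotate("SUM OPERATION", color="blue"):
    result_a = s + val
    result_b = s + val
```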

The routine's CUDA-based callees show up in red under the "CUDA API" range. As expected, the work consists largely of two calls to cuDF's BINARY_OP routine, which is annotated by default inside cuDF's Python layer. And, as expected, the GPU scalar is constructed twice, even though the same Python scalar is reused. Now, let's use the cudf.Scalar API, which is detailed in the next section, to perform the equivalent operation and generate the same NVTX ranges:
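A sketch of the equivalent code, with the Python int swapped for a cudf.Scalar:

```python
import cudf
import nvtx

s = cudf.Series([1, 2, 3])
val = cudf.Scalar(1)  # persists in GPU memory as long as `val` lives

with nvtx.annotate("SUM OPERATION", color="blue"):
    # The device scalar is built once on first use, then reused
    result_a = s + val
    result_b = s + val
```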

In this case, the memory allocation and copying associated with recreating the device scalar are absent, and the whole process is reduced to a tiny green dot (the code still runs, it just has very little actual work to do). The other important but more subtle detail is that in both views, the scalar allocation and copy only start after the binary operation itself starts. This is because the scalar object executes copies between memory spaces on an "as needed" basis, which ends up being important for maintaining performance across more general Python workflows. We'll probe a little bit of the internals later on to see why this is the case.

The cuDF Scalar API

RAPIDS cuDF scalars are designed to walk, talk, and act very much like NumPy scalars. They can be created in one of two ways: a host-side way and a device-side way. The host-side method is what users will work with in almost every case. Here's an example:
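A minimal sketch, with its output shown after:

```python
import cudf

x = cudf.Scalar(1)  # constructed host-side; no GPU work happens yet
print(x)
```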

Scalar(1, dtype=int64)

As might be evident from the above, a scalar is typed and carries a dtype attribute, which can be set upon construction, but not mutated later.
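For instance, sketches like the following produce the representations shown below (constructing the datetime scalar from a string is assumed here):

```python
print(cudf.Scalar(12, dtype="int8"))
print(cudf.Scalar("hello"))
print(cudf.Scalar("2011-01-01", dtype="datetime64[ns]"))
```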

Scalar(12, dtype=int8)
Scalar(hello, dtype=object)
Scalar(2011-01-01T00:00:00.000000000, dtype=datetime64[ns])

Currently, cuDF supports scalars only for non-nested types; as such, there are no cuDF scalars for Categorical, List, or Struct dtypes, though these should come later. The resulting cuDF scalar objects can be used in mathematical operations against cuDF Series, DataFrame, and Index objects as well as other cuDF scalars, and will obey the same set of validity rules. Operations between cuDF scalars, or unary ops on cuDF scalars, are implemented through NumPy for compatibility and will result in a new cuDF scalar. You should get NumPy-like error messages when you try to do something that isn't allowed:
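For example, a sketch like this (assuming a timedelta scalar for the addition) yields the output and error below:

```python
# Adding a timedelta scalar to a datetime scalar yields a new datetime scalar
t = cudf.Scalar("2011-01-01", dtype="datetime64[ns]")
dt = cudf.Scalar(100000000, dtype="timedelta64[ns]")  # 0.1 seconds
print(t + dt)

# A string (object) scalar plus an integer scalar is not allowed
cudf.Scalar("hello") + cudf.Scalar(1)
```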

Scalar(2011-01-01T00:00:00.100000000, dtype=datetime64[ns])
TypeError: __add__ not supported between object and int64 scalars

This is all fine and good except for situations involving control flow, which Python handles specially through a magic method called __bool__; this is what's actually called when you write statements like not x. When code like that executes, the object's __bool__ method is responsible for normalizing the object to a standard Python boolean, which then resolves whatever condition is being evaluated. For this reason, __bool__, as well as __int__ and __float__, will return host-side scalars. One can see the problem if we use a simple class that doesn't do anything and then try to use it in control flow:
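For example:

```python
class DoesNothing:
    """A class with no __bool__ of its own."""
    pass

x = DoesNothing()

# Python's default __bool__ treats any object as truthy, so this
# branch executes even though `x` carries no meaningful value
if x:
    print("uh oh!")
```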

uh oh!

Since we don’t want code underneath statements such as if cudf.Scalar(False): to be executed, we return a host boolean in those cases. A more realistic situation could be one like this:
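One plausible sketch, assuming comparisons between a cudf.Scalar and a Python number behave as they do in NumPy:

```python
import cudf

x = cudf.Scalar(5)

# The comparison yields a boolean cudf.Scalar; the `if` then calls
# __bool__, which syncs that result back to a host-side Python bool
if x > 4:
    print("it works!")
```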

it works!

Examples like the above need to work consistently in the same way other scalar operations do in Python, or else very subtle bugs could occur in existing workflows; hence the choice to return a host scalar. You'll find the same result from calling __int__ or __float__, and it's important to remember that if you're working with a pure device-side scalar, these methods imply a sync. More on this in the next section.

Cached Execution

RAPIDS cuDF Python scalars are designed to be efficient without users needing to do anything special with them or keep track of whether and when they need to be copied back and forth between the host memory space and that of the GPU. To facilitate this, they have a mutable internal state consisting of two flags (one for CPU, one for GPU) and a syncing mechanism. When a user constructs a host-side scalar, the resulting cudf.Scalar object collects, upon construction, all the metadata it would theoretically need to construct the equivalent GPU scalar, without actually creating that scalar. The "trigger" is the property cudf.Scalar.device_value: accessing this property causes the scalar object to allocate memory and perform the actual copy. This value is saved, meaning subsequent calls to device_value will just return the same object again without needing to rebuild it. cuDF is programmed to access this property when performing binary operations with other cuDF objects, which is why the copy action happens as part of the same range as the binary operation in the first example of NVTX ranges.
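A sketch of that caching behavior:

```python
import cudf

x = cudf.Scalar(42)    # metadata only; no device memory allocated yet
d1 = x.device_value    # first access: allocate GPU memory, copy host to device
d2 = x.device_value    # cached: the same device scalar, no new allocation or copy
```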

Similarly, cuDF Python scalars that are constructed from the result of libcudf operations, such as reductions, will initially have a valid .device_value and a blank .value (the equivalent host-side accessor). .value works the same way device_value does, but in reverse: the first access incurs a device-to-host copy and constructs the equivalent Python scalar, saving it as an attribute of the object. Subsequent accesses will give you back this value, with no extra copying required.
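A sketch of the reverse direction, assuming the reduction surfaces a device-backed cudf.Scalar as described above:

```python
import cudf

total = cudf.Series([1, 2, 3]).sum()  # result starts out on the GPU
h1 = total.value   # first access: device-to-host copy builds a Python scalar
h2 = total.value   # cached host value, no extra copy
```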

CAUTION: Printing out a cuDF Scalar implicitly copies the value from the GPU to host memory, and incurs all the overhead required to do so. This is so it can show you what the value actually is — but this might change in the future!

To summarize:

1. Accessing device_value performs a host-to-device copy if necessary.

2. Accessing value performs a device-to-host copy if necessary.

3. Performing a binary operation between a cudf.Scalar and another cuDF object implies the use of device_value.

4. Performing a host-side operation on a cudf.Scalar that was obtained as the result of a cuDF operation, like a reduction, implies the use of .value.

Let’s see how dropping a cuDF scalar into a representative workflow can result in some nearly free performance gains.

Iterative Processes

As an example of where a cuDF scalar might be useful from a performance perspective, let's look at everyone's favorite machine learning topic to blog about: gradient descent. This is a fairly contrived example, but it is instructive for demonstrating a situation where simply using a cuDF scalar in place of a Python scalar can save wall time without conceptually changing the code at all.

Sparing the reader the details of gradient descent, it can be summarized as an iterative process in which a computer attempts to find the minimum value of a function by starting at some point along that function’s surface and iteratively walking “down” — in the direction of steepest descent — using steps of a certain size. We’ll assume this step size is constant for this example, although it’s certainly important to note that in advanced versions of this technique that assumption does not necessarily hold. The notebook below demonstrates a basic walkthrough of this problem and is built around a single function.

The most important thing about the outer function is that it contains a switch that determines whether a cuDF scalar or a standard Python scalar will be used for the constant value (essentially the step size). The key is that when a cuDF scalar is not used, cuDF is forced to allocate an entirely new GPU-based device scalar and recopy the same Python scalar value into that memory location before performing the binary operation between the vector of parameters and the scalar itself. This process takes a fairly nontrivial amount of time. Using a cuDF scalar, however, mitigates these issues substantially by holding onto the same device scalar for reuse, essentially allowing a user to persist certain scalars at will. As the sketch below suggests, this simple change can result in substantial changes in GPU activity.
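A minimal sketch of such a function, with illustrative names and f(x) = x**2 assumed as the objective:

```python
import cudf

def descend(params, n_iter=1000, use_cudf_scalar=True):
    # The step size is the constant value reused on every iteration
    step = cudf.Scalar(0.01) if use_cudf_scalar else 0.01
    for _ in range(n_iter):
        grad = 2 * params              # gradient of f(x) = x**2, elementwise
        params = params - step * grad  # with cudf.Scalar, the device value is reused
    return params

params = cudf.Series([10.0, -5.0, 3.0])
minimized = descend(params)  # values approach the minimum at x = 0
```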

Over the course of a thousand iterations, the difference is clear — the process that uses the cuDF scalar is faster.

Conclusion

If you're using any kind of constant value in your process, you can only benefit from using cuDF scalars. They may save you from forcing the GPU to do unnecessary work in subtle instances where it is hidden within the logic of a particular workflow. In iterative processes and situations that involve reusing the same constant over and over again, simply sprinkling cuDF scalars into the workflow can result in nontrivial speedups without any other code changes. The cuDF scalar should play nicely with other Python scalars and other libraries without any issues, and it will automatically sync itself to the host or device the minimum number of times necessary for the workflow to execute. This should result in, at worst, the same performance as a standard Python scalar and, at best, substantially better performance. We encourage you to try them out, report any bugs or issues you find, and let us know how they perform!
