User Defined Functions in RAPIDS cuDF


By Yi Dong and Nick Becker

cuDF is a GPU DataFrame library that accelerates common data-preparation tasks like loading, filtering, joining, and aggregating. It provides a pandas-like API that should be familiar to data scientists, offering built-in functionality for the data cleaning and munging that needs to happen before model training. But no general-purpose tool covers everything out of the box; sometimes, we need custom data transformations.

cuDF’s DataFrame class has two methods that let users run custom Python functions on GPUs: apply_rows and apply_chunks. In this tutorial, we’ll walk through how to use apply_rows and apply_chunks to create your own UDFs and show how you can implement a GPU-accelerated windowing function.

Creating UDFs with apply_rows and apply_chunks

apply_rows is a special case of apply_chunks that processes each row of the DataFrame independently in parallel. Under the hood, the apply_rows method optimally divides the long columns into chunks and assigns the chunks to different GPU blocks for parallel computation. In the example below, we use apply_rows to calculate the Haversine distance between the two points in each row and also print out the GPU block and grid allocation information.
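The original post embeds this example as a Jupyter notebook Gist, which isn't reproduced here. The following is a minimal sketch of what the kernel and driver might look like: the apply_rows call mirrors cuDF's documented API, but the column names, random test data, and kernel body are illustrative assumptions rather than the original notebook code.

```python
import numpy as np
import cudf
from numba import cuda
from math import cos, sin, asin, sqrt, pi

def haversine_kernel(lat1, lon1, lat2, lon2, out):
    # apply_rows hands column slices to this function and unrolls the
    # loop across GPU threads automatically.
    for i, (x1, y1, x2, y2) in enumerate(zip(lat1, lon1, lat2, lon2)):
        # Print the block/grid allocation information for this thread.
        print('tid:', cuda.threadIdx.x, 'bid:', cuda.blockIdx.x,
              'bdim:', cuda.blockDim.x, 'gdim:', cuda.gridDim.x)
        x1 = pi / 180 * x1  # degrees -> radians
        y1 = pi / 180 * y1
        x2 = pi / 180 * x2
        y2 = pi / 180 * y2
        dlon = y2 - y1
        dlat = x2 - x1
        a = sin(dlat / 2) ** 2 + cos(x1) * cos(x2) * sin(dlon / 2) ** 2
        out[i] = 6371 * 2 * asin(sqrt(a))  # Earth radius in km

n = 1000
df = cudf.DataFrame({
    'lat1': np.random.normal(10, 1, n),
    'lon1': np.random.normal(10, 1, n),
    'lat2': np.random.normal(10, 1, n),
    'lon2': np.random.normal(10, 1, n),
})
df = df.apply_rows(haversine_kernel,
                   incols=['lat1', 'lon1', 'lat2', 'lon2'],
                   outcols=dict(out=np.float64),
                   kwargs=dict())
```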

In the example above, the printed output shows that the compiler automatically unrolls the for-loop in the kernel function. The computation uses 15 CUDA blocks with 64 threads each. Most threads in a block handle one element of the input array, but some have to handle two, because there are 1,000 rows and only 960 threads (15 blocks * 64 threads per block). Additionally, because of the parallelism, the DataFrame rows are not necessarily processed in order.

We can implement the same distance kernel with the apply_chunks method.
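Again, the original notebook lives in the published post; below is a hedged sketch of the same computation written as an apply_chunks kernel, continuing from the previous snippet's imports and DataFrame. The explicit thread-strided loop and the chunks and tpb arguments are the parts that differ from apply_rows; the specific values are illustrative.

```python
def haversine_chunk_kernel(lat1, lon1, lat2, lon2, out):
    # Each thread strides over the rows of its block's chunk; the
    # loop is not unrolled for us here.
    for i in range(cuda.threadIdx.x, lat1.size, cuda.blockDim.x):
        x1 = pi / 180 * lat1[i]
        y1 = pi / 180 * lon1[i]
        x2 = pi / 180 * lat2[i]
        y2 = pi / 180 * lon2[i]
        dlon = y2 - y1
        dlat = x2 - x1
        a = sin(dlat / 2) ** 2 + cos(x1) * cos(x2) * sin(dlon / 2) ** 2
        out[i] = 6371 * 2 * asin(sqrt(a))

small_df = df[['lat1', 'lon1', 'lat2', 'lon2']].head(100)
small_df = small_df.apply_chunks(haversine_chunk_kernel,
                                 incols=['lat1', 'lon1', 'lat2', 'lon2'],
                                 outcols=dict(out=np.float64),
                                 kwargs=dict(),
                                 chunks=16,  # rows per chunk; one chunk per block
                                 tpb=8)      # threads per block
```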

From the function arguments, we can see that apply_chunks gives us more control than apply_rows. We can specify how to divide the long array into chunks (the chunks argument), so that each chunk maps to a different GPU block, and set the number of threads per block (the tpb argument). The for-loop in the kernel function is also no longer unrolled automatically; instead, each GPU thread executes it explicitly, striding across the chunk.

Each kernel invocation corresponds to a single thread in a block, and every thread has full access to all the elements in its block's chunk of the array. In this example, with chunks=16, we cut the 100 elements uniformly into chunks of size 16, resulting in six blocks with a full 16 rows and a seventh block with the remaining four rows (6 * 16 + 4 = 100). Since we set tpb=8, each block uses eight threads to process its length-16 subarray (length four for the last block).

Window Functions

In the financial services industry, data scientists often need to compute features from time series data. One of the most popular ways to process time series data is to compute a moving average, as if you were sliding a window across your array. In the following example, we’ll show how to utilize apply_chunks to speed up computing moving averages for a long array:
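The published version of this example is also a notebook Gist. The sketch below reconstructs the idea under stated assumptions: the kernel names moving_average_kernel and fill_missing_average_kernel and the trunk_size and average_window parameters come from the description that follows, while the sentinel value, thread count, and test data are illustrative.

```python
import numpy as np
import cudf
from numba import cuda

def moving_average_kernel(in_arr, out_arr, average_window):
    # First pass: each block sees one chunk of trunk_size rows, and i
    # is local to the chunk. Rows whose window would reach before the
    # chunk start are flagged with a sentinel for the second pass.
    for i in range(cuda.threadIdx.x, in_arr.size, cuda.blockDim.x):
        if i < average_window - 1:
            out_arr[i] = np.inf  # assumed sentinel for "missing"
        else:
            s = 0.0
            for j in range(i - average_window + 1, i + 1):
                s += in_arr[j]
            out_arr[i] = s / average_window

def fill_missing_average_kernel(in_arr, out_arr, average_window):
    # Second pass over chunks shifted by average_window: rows flagged
    # in the first pass now have their history inside the chunk. The
    # very first rows of the whole array stay flagged, since they
    # genuinely have no history.
    for i in range(cuda.threadIdx.x, in_arr.size, cuda.blockDim.x):
        if out_arr[i] == np.inf and i >= average_window - 1:
            s = 0.0
            for j in range(i - average_window + 1, i + 1):
                s += in_arr[j]
            out_arr[i] = s / average_window

average_window, trunk_size, n = 10, 10_000, 100_000_000
df = cudf.DataFrame({'in_arr': np.random.rand(n)})

# First pass: uniform chunks of trunk_size rows, one chunk per block.
df = df.apply_chunks(moving_average_kernel,
                     incols=['in_arr'],
                     outcols=dict(out_arr=np.float64),
                     kwargs=dict(average_window=average_window),
                     chunks=trunk_size, tpb=64)

# Second pass: explicit chunk boundaries shifted by average_window.
# out_arr is passed as an *input* column and mutated in place.
offsets = [0] + list(range(average_window, n, trunk_size))
df = df.apply_chunks(fill_missing_average_kernel,
                     incols=['in_arr', 'out_arr'],
                     outcols={},
                     kwargs=dict(average_window=average_window),
                     chunks=offsets, tpb=64)
```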

In the above code, we divide the array into subarrays of trunk_size rows and send each subarray to a GPU block, where moving_average_kernel computes the moving average. However, the elements at the beginning of each subarray have no history inside their own chunk. To account for this, we shift the chunk division by an offset of average_window and then call fill_missing_average_kernel to compute the moving average for only those missing records.

Note that when we call fill_missing_average_kernel, we don't declare an output column in the outcols argument of apply_chunks, since doing so would allocate a new GPU memory buffer and overwrite the old out array. Instead, we pass the existing out array as an input column and write into it. For an array of 100 million records, cuDF takes about 0.6 seconds to do this computation on an NVIDIA Tesla V100 GPU.

Conclusion and Next Steps

We’ve shown how cuDF supports GPU-accelerated User Defined Functions through the apply_rows and apply_chunks methods. For many use cases, we can get significant speed increases by leveraging the GPU DataFrame.

With that said, the code in this post is not optimized for performance. There are a couple of things we could do to make it faster. First, we could stage the array in shared memory to reduce global-memory traffic during the summation. Second, the threads do a lot of redundant summation; we could maintain a cumulative-sum (prefix-sum) array to eliminate much of that redundancy. Both optimizations are outside the scope of this tutorial.

Because of how useful GPU-accelerated window functions are, we’re actively developing native cuDF support for rolling window calculations mirroring pandas’s DataFrame.rolling API. While we develop native support, you can use a numba-based extension to the cuDF Python API for rolling window functions and exponential weighted moving averages.

Want to get started with cuDF and RAPIDS? Check out cuDF on Github and let us know what you think! You can download pre-built Docker containers from NGC or Dockerhub to get started, or install it yourself via Conda or pip.
