Run Your Python User Defined Functions in Native CUDA Kernels with RAPIDS cuDF

Combining Python/CUDA JIT Compilation for Flexible Acceleration in RAPIDS

Jiqun Tu
RAPIDS AI
4 min read · Jul 1, 2020


In this blog, we introduce our design and implementation of a framework within RAPIDS cuDF that compiles Python user-defined functions (UDFs) and inlines them into native CUDA kernels. The framework uses the Numba Python compiler and the Jitify CUDA just-in-time (JIT) compilation library to give cuDF users the flexibility of Python with the performance of CUDA as a compiled language. An essential part of the framework is a parser that takes the function compiled from the Python UDF, at one of the intermediate representation (IR) stages of the CUDA compilation pipeline, and converts it into an equivalent CUDA device function that can be inlined into native CUDA C++ kernels. Our approach makes it possible for Python users without CUDA programming knowledge to extend optimized dataframe operations with their own Python UDFs, enabling more flexible and general high-performance computation on dataframes in RAPIDS.

We start by giving examples of how to use the feature, followed by the goals we intend to achieve. Finally, we explain how things work behind the scenes to make the feature possible.

How to Use the Feature

The feature is built into RAPIDS cuDF and is easy to use. Once a dataframe is created, simply call one of the interfaces that support this feature with a user-defined Python function. Currently, the supported interfaces include:

  1. `applymap`, which applies a UDF to each of the elements.
  2. `rolling`, which applies a range-based UDF to each of the windows.

In the following, we give examples with `applymap` and `rolling`.

The `applymap` example:

>>> import numpy as np
>>> from cudf import Series
>>> a = Series([9, 16, 25, 36, 49], dtype=np.float64)
>>> a.applymap(lambda x: x ** 2)
0      81.0
1     256.0
2     625.0
3    1296.0
4    2401.0
dtype: float64
>>> a.applymap(lambda x: 1 if x in [9, 44] else 2)
0    1
1    2
2    2
3    2
4    2
dtype: int64

The `rolling` example, using a window size of 3, a minimum window size of 1, and `center=False`:

>>> def foo(A):
...     sum = 0
...     for a in A:
...         sum = sum + a
...     return sum
...
>>> a.rolling(3, 1, False).apply(foo)
0      9.0
1     25.0
2     50.0
3     77.0
4    110.0
dtype: float64

What the Feature Intends to Achieve: Flexibility and Performance

Ahead-of-Time Compilation

Traditionally, with ahead-of-time compilation, CUDA kernels are compiled into SASS machine code at compile time and launched at run time. In cases where operator functions need to be called by kernels, the function-pointer or stack-frame overhead that usually hurts performance is avoided by inlining the operator function.
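The following minimal CUDA C++ sketch shows what the inlined result looks like; the kernel and operator are simplified stand-ins, not cuDF's actual internals:

// The body of the operator udf(x) = x * x has been inlined directly into
// the kernel, so no function-pointer call or extra stack frame remains.
__global__ void transform_kernel(const double* in, double* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i] * in[i];  // inlined operator body
    }
}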

Code after inlining. Note that this is just an illustration: the actual inlining happens at NVVM IR level.

Performance is achieved, but at the price of flexibility. Often the operator function is not known at compile time: in most cases, the program does not reach end users until run time, and it is the users who decide which operator function is needed. With ahead-of-time compilation, users cannot write their own operator functions without recompiling the whole program if they want maximum performance.

Just-in-Time Compilation

Just-in-time (JIT) compilation, or run-time compilation, helps. Using CUDA runtime compilation (NVRTC) and the Jitify library, the source string of an operator function written at run time can be inlined into the source string of the kernel before the combined code is compiled, then launched with the same performance as a corresponding ahead-of-time-compiled native CUDA kernel. Flexibility and performance are both achieved, with the only overhead being the time needed to perform the run-time compilation.
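As a minimal sketch of this mechanism (using CuPy's NVRTC-backed `RawKernel` purely for illustration; cuDF itself goes through Jitify on its native kernels), the operator's source string is spliced into the kernel's source string, and the combined code is compiled at run time:

import cupy as cp

# The operator arrives as a plain source string at run time.
udf_src = "__device__ double udf(double x) { return x * x; }\n"

# Pre-defined kernel source with the operator spliced in before compilation.
kernel_src = udf_src + """
extern "C" __global__
void transform(const double* in, double* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = udf(in[i]);
}
"""

transform = cp.RawKernel(kernel_src, "transform")
x = cp.asarray([9.0, 16.0, 25.0], dtype=cp.float64)
y = cp.empty_like(x)
transform((1,), (32,), (x, y, cp.int32(x.size)))  # NVRTC compiles on first launch
print(y)  # [ 81. 256. 625.]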

Combine Python and CUDA

Combining the flexibility of Python as an interpreted language with the performance of CUDA as a compiled language covers even more ground. A Python UDF can be written without any knowledge, or even awareness, of CUDA, then compiled, inlined into carefully optimized pre-defined CUDA kernels, and launched on GPUs with maximum performance, as shown in the usage examples above.
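A glimpse of the first step of this pipeline, assuming a recent Numba version (the UDF below is illustrative): `numba.cuda.compile_ptx` turns a plain Python function into PTX, which cuDF's parser then converts into a device function that can be inlined into its native kernels.

from numba import cuda, float64

def udf(x):
    return x ** 2

# Compile the Python function to a PTX device function; the user writes
# no CUDA code. cuDF parses output like this and inlines the result into
# its pre-defined native C++ kernels.
ptx, return_type = cuda.compile_ptx(udf, (float64,), device=True)
print(return_type)           # float64
print(ptx.splitlines()[0])   # first line of the generated PTX text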

For more information about how Python is added to the workflow on top of NVRTC/JITIFY framework, check out my NVIDIA DevBlog on the topic.

A Performance Benchmark of `applymap`

We compare the performance of `pandas.apply` with `cudf.applymap` on dataframes with large numbers of rows, and the latter achieves significant speedups over the former. The following benchmark was measured on an Intel Xeon Gold 6128 CPU and an NVIDIA Quadro GV100 GPU. Note that these results do not include the overhead of JIT compilation, which is a one-time cost paid only on the first execution of a given UDF.

Benchmark results.
Python code used to produce the benchmark.
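The original script accompanied the figure above; a minimal sketch of such a comparison, with an illustrative UDF and row count, might look like the following (the warm-up call pays the one-time JIT cost so that it is excluded from the timing):

import time

import numpy as np
import pandas as pd
import cudf

n = 10_000_000
data = np.random.random(n)

def udf(x):
    return x ** 2

cpu_series = pd.Series(data)
gpu_series = cudf.Series(data)

# Warm-up: the first call with a given UDF pays the JIT compilation cost.
gpu_series.applymap(udf)

t0 = time.time()
cpu_series.apply(udf)
t1 = time.time()

t2 = time.time()
gpu_series.applymap(udf)
t3 = time.time()

print(f"pandas.apply:  {t1 - t0:.3f} s")
print(f"cudf.applymap: {t3 - t2:.3f} s")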

Conclusion

Drawing on the benefits of both Python and CUDA, the combined Python/CUDA JIT compilation in RAPIDS cuDF lets users apply their own Python functions to dataframes on NVIDIA GPUs with great flexibility while achieving maximum performance. The idea of combining Python with just-in-time compilation reaches beyond dataframe extract, transform, load (ETL) workloads and potentially has many more use cases.
