Extending Numba’s CUDA target with the high-level API
Numba is the Just-In-Time compiler used in RAPIDS cuDF to implement high-performance User-Defined Functions (UDFs) by turning user-supplied Python functions into CUDA kernels. It supports many types and functions commonly used in CUDA kernels out of the box. For those types and library functions it doesn’t support, Numba provides extension APIs for describing how to interpret the data structures behind the types, and how to generate the code that implements compiled functions.
Numba’s CUDA target has supported the low-level extension API for many releases. This is a powerful mechanism for adding compilation support for new types and functions, and it underpins core parts of cuDF’s support for nullable data, strings, and rows in UDFs. However, building extensions with the low-level API takes significant expertise and effort.
Since the release of Numba 0.56, the CUDA target also supports the high-level extension API. This is a quicker and simpler way to write extensions that requires only minimal knowledge of Numba’s internals. It can be used to implement compilation of new functions, attributes, and methods by writing plain Python functions, avoiding the need for handwritten generation of LLVM Intermediate Representation (IR) code.
Examples
For this post we’ll take a quick look at a couple of examples — for further notes on the implementations and more examples, check out the accompanying notebook.
1D Array sum
The following demonstrates a simple use of the high-level API to implement support for the sum() method of 1D arrays:
The semantics of the sum() method are defined in sum_impl(), which is itself compiled down to PTX through Numba’s pipeline — notice how the extension is written entirely in Python, yet results in machine code.
This method can be used straight away in a kernel — for example:
Implementing Intrinsics
Some operations and computations can’t be expressed in pure Python — for these, the high-level API provides a way to write intrinsics that generate the IR or PTX that implements them.
This example uses an intrinsic to implement the CUDA clock64() function, which requires the emission of inline PTX that reads a special register:
The intrinsic returns two things — the typing signature of the function it implements, and a function that builds the LLVM IR implementing the function. The codegen() function here uses a fairly standard pattern for emitting inline PTX. For further information about LLVM IR in Numba’s pipeline, see The Life of a Numba Kernel.
We can now use cuda_clock64() in a kernel:
Next steps
How do you go from here to implementing your own extensions using the high-level API?
- See the accompanying notebook for a set of examples that demonstrate how to implement functions, methods, attributes, and intrinsics with an explanation of each implementation.
- Once you’ve understood the examples, begin to implement support for your own libraries or functions — pick a simple piece of functionality, try to adapt one of the examples to implement it, and build from there.
- If you get stuck or need help, you can consult the references below for detailed information on the API, Numba internals, and LLVM, and you can ask questions in the Numba Gitter channel and on the Numba Discourse forum.
- When you’ve made some progress, tell us about your extensions! Post a link to them on the Discourse forum, contribute them in a PR to numba-examples, or share them in a GitHub repo or a blog post! Follow us on Twitter @rapidsai.
References / further reading
- The accompanying notebook — contains examples of implementing support for a function, method, attribute, and an intrinsic.
- Numba High-level extension API documentation — an overview of the high-level API and some examples.
- llvmlite IR layer documentation — a useful reference to the API used when implementing intrinsics.
- LLVM Language Reference manual, version 7 — useful for looking up details that are not covered in the llvmlite documentation — v7 matches the version of IR used in the CUDA target.
- The Life of a Numba Kernel — an end-to-end explanation of Numba’s compilation pipeline from Python source all the way to execution on a GPU.