The Life of a Numba Kernel: A Compilation Pipeline Taking User Defined Functions in Python to CUDA Kernels
Numba is the Just-in-time compiler used in RAPIDS cuDF to implement high-performance User-Defined Functions (UDFs) by turning user-supplied Python functions into CUDA kernels — but how does it go from Python code to CUDA kernel? In this post, we’ll take a look at Numba’s compilation pipeline.
If you enjoy diving into Numba’s internals, check out the accompanying notebook that shows each stage in more depth, with code provided to get at the internal representations with each pipeline stage, along with the linking, loading, and kernel launch process.
The Challenge, and an Overview of the Pipeline
Compiling Python source code to machine code is challenging because the two representations are quite different. Some of the differences are:
This comparison is not exhaustive, but it highlights the problem — Numba needs to take a very expressive, dynamic language, and translate it into one that uses very simple, specific instructions specialized for particular types. Numba’s pipeline does this with a sequence of stages, each moving the representation further from the Python source, and closer to executable machine code:
We’ll walk through seven stages of the pipeline:
- Bytecode Compilation
- Bytecode Analysis
- Numba Intermediate Representation (IR) Generation
- Type Inference
- Rewrite Passes
- Lowering Numba IR to LLVM IR
- Translating from LLVM to PTX with NVVM
(1/7) Bytecode Compilation
Numba doesn’t start with the source code itself; first, it uses the Python interpreter’s bytecode compiler to generate bytecode, which is much simpler to interpret than the source. For a Python function like:
A disassembly of the bytecode produced is:
Bytecode uses a few primitive operations (
BINARY_ADD, etc.) that operate on a stack — there are no variables as such, but the disassembly does annotate the instructions with the variable names from the source code.
This is a good starting point for a Python compiler because bytecode has only a small number of different kinds of instructions and opcodes to handle. However, to get to the machine code, we still have the following problems to solve:
- There is still no type information in the bytecode representation.
- Although real machines can execute code using a stack for all operands, it is very inefficient to do so — all machines have registers that can be used to hold temporary values (variables) which are much quicker to access.
- The Python bytecode instructions do not have a close correspondence with machine code instructions.
(2/7) Bytecode Analysis
Numba needs to translate the bytecode into the Numba Intermediate Representation (Numba IR) that it can use to work out the types of all values in the function. The first step towards building the Numba IR is to gather some information about the structure of the bytecode — how control can transfer from one block of the function to another through branches and loops, and how data flows from one variable to another.
This is determined by the Control Flow Analysis (CFA) and Data Flow Analysis (DFA), which produce a Control Flow Graph (CFG) and Data Flow Graph (DFG) respectively. The CFG for our example consists of:
The CFG is built up from the Basic Blocks of the function — a Basic Block is a sequence of instructions that executes in a straight line with no branches. In this function there are three basic blocks, numbered
48. The adjacency list gives the control flow between the basic blocks — the existence of an edge implies control can flow from the source block to the destination block.
Knowing which nodes dominate which other nodes are also useful for various analyses — a node
A dominates the node
B if all control flow paths to
B must pass through
A. Post-dominators are a bit like the reverse of a dominator — a node
C post-dominates a node
D if all paths starting at
D must pass through
C to exit the graph.
The text-based representation contains lots of information but is not that easy to interpret — Numba also provides a visual rendering of the CFG which is easier to understand:
The graph, in this case, is rather simple — in block
0, we call
cuda.grid(1) — then, if
i < len(r), the control flow goes to block
48, otherwise, it goes straight to
48 (the exit block).
We can also examine the DFG. The complete analysis is long, so we will just look at one block’s worth:
For each instruction in the block (
8, etc.) the variable usages (
‘item’) and the variable definitions (
‘res’) are recorded.
(3/7) Numba IR Generation
Numba uses the bytecode, the CFG, and the DFG to construct its own representation of the function. The IR for our example looks like:
Examining the IR shows some correspondences between it and:
- The Python bytecode — we can see some similar semantics, for example, the loading of globals and methods, and the calling of methods
- The names of the variables in the IR and the names in the Data Flow Analysis
- The Control Flow Analysis — the structure of the control flow of the Numba IR matches that in the bytecode and produced by the CFA.
However, it is also a little different from the previous items:
- The Numba IR abstract machine is not a stack machine, but instead has an infinite number of variables (
- It doesn’t explicitly hold the results of analyses, like the dominator and post-dominator analysis.
We can also view the Numba IR as a graph:
(4/7) Type Inference
Type inference starts with the types of the arguments passed into the function and propagates this information through the function until everything has been typed. It uses:
- Internally stored mappings of input types to output types. These prescribe rules such as “adding a
float64will result in a
- The DFG.
When there are two incoming types for a variable, e.g. when two different values can be assigned to it because of branching, Numba constructs a set of possible types for the variable, and attempts to unify them — that is, choose a type that all of the types in the set can be safely coerced to. Our
axpy running example has no unification, so we’ll consider unification using a different example:
In the above example
r is typed as
float32 in the taken branch, and
int32 in the not-taken branch. On the line with the return, the control flow from the
else re-joins, so the types of
r from both sides of it are used to construct the set of types. Since the set of types contains different types (
float32), Numba uses a built-in rule for unification — in this case, to ensure the best representation of values from all branches, it chooses
(5/7) Rewrite Passes
At this point, Numba runs some rewrite passes on the IR to optimize and simplify it. Some of these rewrites are:
- Some constant propagation, e.g. where one side of a binary operation has a constant value.
getitems where the index is known to be a constant.
raiseexpressions with constant arguments so they can be used in nopython mode.
These rewrites are mostly quite specific to Python code — for more general optimization, LLVM is used in the next stage.
(6/7) Lowering Numba IR to LLVM IR
The next stage is Lowering, which generates the LLVM Intermediate Representation (LLVM IR). The LLVM IR can be thought of as a machine-independent assembly language, and it has a relatively small instruction set of fairly low-level/primitive instructions that compute things like add, multiply, load, and so on. LLVM IR can be translated into many different target assembly languages — these include x86, ARM, PowerPC, PTX, and several others.
The process of translating the typed Numba IR into LLVM IR is relatively mechanical — there are many functions registered inside Numba for lowering the Typed Numba IR into LLVM IR, which mostly translate one IR instruction at a time, generating the equivalent LLVM IR. Once the LLVM IR is produced, it is run through LLVM’s own optimizers — this saves implementing many optimizations in Numba.
An excerpt of the LLVM IR for our BLAS L1-like function:
We can see a similar structure to the bytecode, CFG, and Numba IR with three basic blocks (
B48). The code inside the blocks is a little different — it can be seen that it is much closer to an assembly language than either the Python bytecode or Numba IR, with simple instructions operating on an infinite set of variables (
(7/7) Going from LLVM to PTX with NVVM
PTX is the Instruction Set Architecture (ISA) that provides a stable programming model and instruction set for CUDA GPUs. LLVM has a built-in JIT Compiler that is used for the CPU target, but for generating PTX it uses NVVM, which is an LLVM-based PTX generator from NVIDIA with much better support for CUDA devices than the in-tree LLVM PTX backend.
The body of the PTX emitted for our example is:
We can see the three basic blocks — in the first, the equivalent of
cuda.grid(1) is computed using intrinsics such as
%tid.x — then a branch jumps to the end of the function if this thread would access beyond the end of the array (the
bra instruction) with the following two basic blocks implementing the a * x +y computation, and the return at
At this point, Numba’s compilation pipeline is finished, and it is over to Numba’s Driver API wrapper to link, load, and execute the PTX. These operations are all covered in the accompanying notebook and maybe the subject of a follow-up blog post.
The Numba compilation pipeline starts with Python source code and takes it through the following stages to generate PTX code for CUDA GPUs:
- Bytecode Analysis generates the Control Flow Graph (CFG) and Data Flow Graph (DFG)
- The CFG, DFG, and bytecode are used to generate the Numba Intermediate Representation (Numba IR)
- Type Inference determines the type of every value in the Numba IR by propagating type information from the arguments through to the return values
- Rewrite passes optimize Typed Numba IR
- Lowering translates the Typed Numba IR into LLVM IR
- LLVM optimizes the IR
- The optimized LLVM IR is translated to PTX using NVVM
At this point, the Driver API is invoked to link, load, and launch the kernel.
Each step from Python source code to kernel launch is detailed with code exposing the internals at each step of the way in the accompanying notebook.
There are various future improvements to the pipeline for the CUDA target in the works:
- In the upcoming release of Numba 0.50, the pipeline can be invoked directly to compile a Python function to PTX, so that the pipeline output can be linked with output from another compiler, like JITify or NVRTC.
- Support for extending the CUDA target by adding new type definitions to the Type Inference stage and translations of Numba IR to LLVM IR to the lowering stage will be added after 0.50 — this will enable users to use custom data types with the CUDA target such as interval types and masked arrays.
- The CUDA pipeline is currently quite separated from the CPU pipeline — to get more features into the CUDA target, its pipeline will be evolved to share more infrastructure with the CPU pipeline in the future, so that new features added to Numba are more easily exploited from CUDA. The first steps towards this are in PR 5709.
The CUDA target has an active community and receives regular feature requests — if you’d like to start contributing to Numba:
- Head over to the Github Issues and check out the list of “Good First Issues” for the CUDA target.
- If you would like to get started but need some pointers, drop into the Numba Gitter channel and post your questions — someone will try and help you get moving forward quickly.
- Also, have a look at the Numba Developer Documentation and its Architecture Documentation.
- Some understanding of LLVM can be helpful when contributing to Numba. The Language Reference documentation provides a good overview of all LLVM IR instructions
- Numba accesses LLVM through llvmlite, a Python implementation of the IR-building instructions, and a wrapper for the LLVM libraries. The documentation for llvmlite is also helpful for working with Numba.