JIT fast! Supercharge tensor processing in Python with JIT compilation

The Lockheed SR-71 Blackbird holds the airspeed record for the fastest air-breathing aircraft to this date. The top velocity, 3,530km/h (Mach 2.88), was achieved on 28 July 1976 at an undisclosed location near Beale AFB, California, by an SR-71 Blackbird piloted by later Major General Eldon Joersz, USAF, and his RSO, Major George T. Morgan Jr., USAF. Designed to fly fast and high to evade detection and anti-aircraft missiles, but without offensive capabilities, the SR-71 was a pinnacle of engineering, but quickly became redundant: half a year after Joersz’s speed record, the National Reconnaissance Office launched the first Block I KH-11 KENNEN electro-optical reconnaissance satellite, ushering in an era that had little use for manned reconnaissance overflights and the attendant risk of losing pilots and aircraft. Speed isn’t everything — but it can take you a pretty long way.

At Starschema, we’re constantly looking for ways to speed up some of the computationally intensive tasks we’re dealing with. Since a good amount of our work involves image processing, this means that we’re in particular interested in anything that makes matrix computations — sometimes over fairly large tensors, e.g. high-resolution satellite or biomedical imagery––easier and faster. Because imagery often comes in multi-channel or even hyperspectral forms, anything that helps process them faster is a boon, shaving valuable seconds off that over large data sets can easily make days of difference.

Until relatively recently, it was not uncommon to write development code in a high-level language with good data science and machine learning support, like Python, but rewrite and deploy it in C or C++, for raw speed (indeed, one of the motivations behind Julia was to develop a language that would be fast enough not to require this!). Python is great for putting your quantitative ideas clearly and succinctly, but interior loops in Python have always been slow due to the absence of type information. Python’s duck typing system really comes to bite when this absence of typing creates unnecessary code and indirection, leading to relatively slow inner loops. Recently, however, solutions were envisaged to get around this problem. The first of these was Cython — injecting C types into your Python code. It is, on the whole, a rather painstaking method of speeding up your code, albeit a lot of computationally intensive code is written in Cython, including code you’ve almost definitely used — much of the SciPy stack, for instance, and almost all of SageMath, were written in Cython.

The problem is that ‘Cythonising’ your code can be time consuming, and often fraught with challenges that require a profound knowledge of C to solve. What if we had a better way to get efficient bytecode from our slow-but-intelligible Python code?

Enter Numba.

Numba is what is called a JIT (just-in-time) compiler. It takes Python functions designated by particular annotations (more about that later), and transforms as much as it can — via the LLVM (Low Level Virtual Machine) compiler — to efficient CPU and GPU (via CUDA for Nvidia GPUs and HSA for AMD GPUs) code. While in Cython, you got the tools to use C types directly, but had to go out of your way to actually be able to do so, Numba does most of the heavy lifting for you.

The simplest way to get started with Numba is as easy as affixing the @numba.jit decorator to your function. Let’s consider the following function, performing a simple and pretty clumsy LU factorisation:

import numpy as np
def numpy_LUdet(A: np.ndarray):
y = [1.0]
n = A.shape[0]

with np.errstate(invalid = 'ignore'):
for i in range(n):
y[0] = y[0] * A[i, i]
            for j in range(i+1, n):
A[j][i] = A[j][i]/A[i][i]
A[j][i+1:] = A[j][i+1:] - (A[j][i] * A[i][i+1:])

Note that as this is a measuring function, it does not return a value, it merely calculates the decomposition. As you can see, for an n x n square matrix, the runtime will be on the order of n², due to the nested iteration. What’s the best way to speed up this code?

We could, of course, rewrite it in Cython. Numba, on the other hand, offers us the convenience of simply imposing a decorator:

import numpy as np
import numba
@numba.jit()
def numba_LUdet(A: np.ndarray):
y = [1.0]
n = A.shape[0]

with np.errstate(invalid = 'ignore'):
for i in range(n):
y[0] = y[0] * A[i, i]
            for j in range(i+1, n):
A[j][i] = A[j][i]/A[i][i]
A[j][i+1:] = A[j][i+1:] - (A[j][i] * A[i][i+1:])

Through that simple decoration, the code already runs significantly faster (once, that is, the code has had a chance to compile in the first run) — approximately 23 times faster than NumPy code for a 10 x 10 matrix.

OK, so how does it work?

Unlike for Cython, we did not have to re-cast our code at all. It’s almost as if Numba knew what we wanted to do and created efficient precompiled code. It turns out that’s largely what it does: it analyses Python code, turns it into an LLVM IR (intermediate representation), then creates bytecode for the selected architecture (by default, the architecture the host Python runtime is running on). This allows additional enhancements, such as parallelisation and compiling for CUDA as well––given the near-ubiquitous support for LLVM, code can be generated to run on a fairly wide range of architectures (x86, x86_64, PPC, ARMv7, ARMv8) and a number of OSs (Windows, OS X, Linux), as well as on CUDA and AMD’s equivalent, ROC.

The drawback is that Numba by definition only implements a strict subset of Python. Fortunately, Numba handles this, in two ways:

  • Numba has very wide support for NumPy functions (see list here) and Python features (see list here) — although notably, it does not support context handlers (with expressions) and exception handling (try, except, finally) .
  • Unless running in nopythonmode (see below), Numba will attempt to generate optimised bytecode and, failing to do so, simply try to create a Python function (this is known as ‘object mode’ within Numba).

Object mode vs nopython mode

In general, the biggest boon of Numba is that unlike with Cython, you don’t need to rewrite your whole function. All you need to do is to prefix it with the jit decorator, as seen above. This puts Numba on autopilot, allowing it to determine whether it can do something about the code, and leave the function as it was written if it cannot. This is known as ‘object mode’ and means that if JIT compilation fails because some or all of the function body is not supported by Numba, it will compile the function as a regular Python object. Chances are, the result will still be faster as it may be able to optimise some loops using loop-lifting, however, so it’s definitely worth the try.

But where Numba really begins to shine is when you compile using nopython mode, using the @njit decorator or @jit(nopython=True). In this case, Numba will immediately assume you know what you’re doing and try to compile without generating Python object code (and throw an exception if it cannot do so). The difference in terms of execution time between object and nopython mode can range from 20% to 40 times (!).

In practice, I’ve found the best approach is to refactor and extract purely optimisable code, and optimise it in nopython mode. The rest can be kept as pure Python functions. This maximises overall optimisation gains without expending compilation overhead (more about which in the next section) unnecessarily.

Where object code is generated, Numba still has the ability to ‘loop-lift’. This means to ‘lift out’ a loop automatically from an otherwise non-JITtable code, JIT compile it, and treat it as if it had been a separate nopython JITted function. While this is a useful trick, it’s overall best to explicitly do so yourself.

Compilation overhead

Because Numba’s JIT compiler has to compile the function to bytecode, there will be an inevitable overhead — often indicated by a pretty slow first run followed by tremendously faster subsequent runs. This is the time cost of JIT compiling a function. While compilation is almost always worth it and needs to be done only once, in performance-critical applications it makes sense to reduce compilation overhead. There are two principal ways to accomplish it with Numba: caching and eager compilation.

The @jit decorator accepts a cache boolean argument. If set to True, it will cache the function it compiled into a file-based cache. In general, every time you open and run a Python script, everything that needs to be compiled by Numba gets compiled at that time. However, if you cache the compilation result, subsequent runs will be able to read the bytecode from the cache file. In theory, you can also distribute the cache file, but since Numba optimizes to your specific architecture (and supports a bewildering array of architectures, as described above), it may not work persistently. It nonetheless remains a good idea to cache functions, compile them once and use them all the time.

Eager compilation is a different way of solving the same problem. Admittedly, the naming is a little misleading — most of the time, these terms are used to indicate when something is compiled (at call time, i.e. lazy, vs. well in advance, i.e. eager). In this case, it refers to a related notion, but one that stretches over what is being compiled, too. Consider the following example:

import math
import numba
@numba.njit
def lazy_hypotenuse(side1: int, side2: int) -> float:
return math.sqrt(math.pow(side1, 2) + math.pow(side2, 2))

This is lazy compilation because––the Python typing annotations notwithstanding––we have not provided any information to Numba about the function’s possible arguments, and therefore, it will compile code at time of call depending on the type the values of side1 and side2 are taking. Eager compilation, on the other hand, rests on telling Numba well ahead of time what types to expect:

import math
from numba import njit, float32, int32
@numba.njit(float32(int32, int32))
def eager_hypotenuse(side1: int, side2: int) -> float:
return math.sqrt(math.pow(side1, 2) + math.pow(side2, 2))

The format @jit(<return>(<argument1>, <argument2>,...)) (or its @njit equivalent) will allow the Numba JIT compiler to determine types (check out the documentation for the type system in Numba), and based on that, pre-generate compiled bytecode. Note that if you have an eager compiled function and your arguments cannot be coerced into the format you specify, the function will throw a TypingError.

Invoking other JITted functions

As a general rule, Numba will not do recursive optimisation for you. In other words, if you invoke other functions you yourself defined from a JITted function, you must mark those for JITting separately — Numba will not JIT them just because they’re invoked in a JITted function. Consider the following example:


import numpy as np
from numba import njit, float32
from typing import List
def get_stdev(arr: List[float]):
return np.std(np.array(arr))
@njit(float32(float32[:]))
def get_variance(arr: List[float]):
return get_stdev(arr)**2

In this case, the computationally inexpensive second function will benefit from JIT, but all it does is a simple exponentiation. The computationally more expensive first function has not been annotated, and therefore will be run as a Python function — that is, much slower. To get the most out of Numba, the get_stdev() function should also have been provided with a JIT decorator (preferably @njit, since NumPy’s numpy.std() is implemented by Numba).

Just how fast is it?

To demonstrate the benefits of JIT, I’ve run a benchmark, in which I used a somewhat clumsy LU decomposition of square matrices from 10 x 10 to 256 x 256. As you can see, Numba-optimised NumPy code is at all times at least a whole order of magnitude faster than naive NumPy code and up to two orders of magnitude faster than native Python code. Directly invoked LAPACK code, written in FORTRAN 90 (via SciPy’s scipy.linalg.lu_factor(), a wrapper around the *GETRF routine in LAPACK), emerges as the clear winner at larger matrix sizes, and Cython’s performance turns out to be only slightly inferior to the optimised NumPy code.

LU decomposition benchmark for native Python, NumPy, optimised NumPy, LAPACK and Cython code. LAPACK was invoked using a SciPy wrapper. Optimised NumPy code is about an order of magnitude faster than ordinary NumPy code throughout, while up to two orders of magnitude faster than native Python. Cython code, on the other hand, is not significantly faster, whereas FORTRAN code only begins to lap optimised NumPy at relatively large matrix sizes. The ‘bang for a buck’ factor of optimising with Numba is clearly the highest — the NumPy code (orange) and the optimised NumPy code (crimson) differ only by the application of a single decorator.

Of course, Numba has its limitations. Importantly, it only helps to optimise a particular kind of problem — namely, processes where loops or other repetitive structures are included. For tensor operations and other nested loop/high cyclomatic complexity workloads, it will make a significant difference. Even where you need to restructure your code to fit in with Numba’s requirements, such restructuring is a lot easier in my experience than having to rewrite the whole thing in Cython. Acting at the same time as an interface to quickly generate not just faster CPU code but also GPU enabled code (via PyCuda) for a slightly more limited subset of functionalities (NumPy array math functions are not supported on CUDA, nor are NumPy math functions in general), Numba is worth exploring if your work involves nested loops and/or large or repetitive tensor operations. For writing numerical code, image processing algorithms and certain operations involving neural networks, it is rapidly becoming my tool of choice for writing heavily optimised, fast code.

There’s more to Numba than speed

Numba’s main job, of course, is to speed up functions. But it also does an excellent job at several other things. Perhaps my favourite among these is the @vectorize decorator, which can turn any old function into a NumPy universal function (often just called a ‘ufunc’). If you have a background in R, you might from time to time find yourself to be wistfully reminiscing about R’s ability to vectorise functions without much ado. A ufunc is a vectorized wrapper that generalises a function to operate on tensors represented as n-dimensional NumPy arrays (ndarrays), supporting tensor logic like broadcasting, internal buffers and internal type casting. An example is the function numpy.add(), which generalises the addition function (invoked via the addition operator, +) for tensors of any size — including tensors that are not the same size, where NumPy’s broadcasting logic is used to reconcile the tensors of different size.

Magic with Numba’s vectorisation decorator: a simple elementwise function can be generalised to higher-order tensors by nothing more than wrapping it in a decorator. This is, of course, rather inefficient, as failing to specify a signature for possible vectorisations means some type-specific optimisation cannot be carried out. For details on writing good vectorizable code in Numba, please refer to the documentation’s chapter on the vectorisation decorator.

Consider, for instance, the math.log10 function. This is an unvectorised function, intended to operate on single values (size-1 arrays, as the error message quoth). But by simply prepending Numba’s @numba.vectorize decorator, we can generalise the math.log10 function into a function operating elementwise over NumPy ndarrays representing tensors of pretty much any order (dimensionality).

@numba.guvectorize takes this one step further, allowing vectorisation that does not have to be element-wise: while @numba.vectorize is great for easily generalising a function that takes a single scalar input and returns a single scalar output to higher-rank tensors, @numba.guvectorize generalises this further by allowing functions operating on lower-order tensors to operate on higher-order tensors. An example would be to extend a 1-dimensional weighted moving average to operate on higher-order tensors (see the worked example in the Numba documentation). You can read more about @numba.guvectorize here. Unlike @numba.vectorize, @numba.guvectorize works via tensor pre-allocation, thus taking an argument of a tensor of a desired size, which it then ‘fills’ with the results, rather than returning the results at the end.

Numba’s GPU parallelism framework compiles Python code to an NVVM (equivalent to LLVM) intermediate representation (IR) and PTX (compiled code). Both are serialised and sent to each cluster node. If the client and the cluster node have the same CUDA architecture (i.e. the same type of streaming multiprocessor (SM) on the GPU), the deserialised PTX code is translated into the CUDA binary. If the architecture differs, the NVVM IR is recompiled into PTX and eventually the CUDA binary.

Numba also has parallelisation features, including parallelising a process to different CUDA architectures. While still experimental, this is one of Numba’s most interesting features, and relies on a clever framework to build up a cluster and transmit GPU-enabled functions to cluster workers, and Numba’s explicit support for serialisation (using cloudpickle) of functions. While this feature lacks the maturity of Numba’s JIT compiler and other core features, it is no doubt one of its most interesting future applications overall, and point towards Numba as the cornerstone of a distributed GPU enabled architecture.


Coda

Whether you work on computer vision, financial forecasting, weather models or deep neural networks: if your work in Python involves tensors and tensor operations, chances are, Numba’s JIT compiler can make a difference to your operations with as little as a single decorator. But Numba is also a rapidly developing ecosystem of primitives for high performance computing applications in Python that can be implemented in CUDA or a range of processor architectures and operating systems. With that in mind, it is no doubt one of the most interesting projects to date. Whether you come for the speed-up and stay for the HPC fun, or are just there to see how to accelerate Python code in quantitative applications, especially where tensors are involved, without massive coding overhead, Numba is something every data scientist working in Python should take a look at.