JIT fast! Supercharge tensor processing in Python with JIT compilation

The Lockheed SR-71 Blackbird holds the airspeed record for the fastest air-breathing aircraft to this date. The top velocity, 3,530km/h (Mach 2.88), was achieved on 28 July 1976 at an undisclosed location near Beale AFB, California, by an SR-71 Blackbird piloted by later Major General Eldon Joersz, USAF, and his RSO, Major George T. Morgan Jr., USAF. Designed to fly fast and high to evade detection and anti-aircraft missiles, but without offensive capabilities, the SR-71 was a pinnacle of engineering, but quickly became redundant: half a year after Joersz’s speed record, the National Reconnaissance Office launched the first Block I KH-11 KENNEN electro-optical reconnaissance satellite, ushering in an era that had little use for manned reconnaissance overflights and the attendant risk of losing pilots and aircraft. Speed isn’t everything — but it can take you a pretty long way.

Enter Numba.

Numba is what is called a JIT (just-in-time) compiler. It takes Python functions designated by particular annotations (more about that later), and transforms as much as it can — via the LLVM (Low Level Virtual Machine) compiler — to efficient CPU and GPU (via CUDA for Nvidia GPUs and HSA for AMD GPUs) code. While in Cython, you got the tools to use C types directly, but had to go out of your way to actually be able to do so, Numba does most of the heavy lifting for you.

import numpy as npdef numpy_LUdet(A: np.ndarray):
y = [1.0]
n = A.shape[0]

with np.errstate(invalid = 'ignore'):
for i in range(n):
y[0] = y[0] * A[i, i]
for j in range(i+1, n):
A[j][i] = A[j][i]/A[i][i]
A[j][i+1:] = A[j][i+1:] - (A[j][i] * A[i][i+1:])
import numpy as np
import numba
@numba.jit()
def numba_LUdet(A: np.ndarray):
y = [1.0]
n = A.shape[0]

with np.errstate(invalid = 'ignore'):
for i in range(n):
y[0] = y[0] * A[i, i]
for j in range(i+1, n):
A[j][i] = A[j][i]/A[i][i]
A[j][i+1:] = A[j][i+1:] - (A[j][i] * A[i][i+1:])

OK, so how does it work?

Unlike for Cython, we did not have to re-cast our code at all. It’s almost as if Numba knew what we wanted to do and created efficient precompiled code. It turns out that’s largely what it does: it analyses Python code, turns it into an LLVM IR (intermediate representation), then creates bytecode for the selected architecture (by default, the architecture the host Python runtime is running on). This allows additional enhancements, such as parallelisation and compiling for CUDA as well––given the near-ubiquitous support for LLVM, code can be generated to run on a fairly wide range of architectures (x86, x86_64, PPC, ARMv7, ARMv8) and a number of OSs (Windows, OS X, Linux), as well as on CUDA and AMD’s equivalent, ROC.

  • Unless running in nopythonmode (see below), Numba will attempt to generate optimised bytecode and, failing to do so, simply try to create a Python function (this is known as ‘object mode’ within Numba).

Object mode vs nopython mode

In general, the biggest boon of Numba is that unlike with Cython, you don’t need to rewrite your whole function. All you need to do is to prefix it with the jit decorator, as seen above. This puts Numba on autopilot, allowing it to determine whether it can do something about the code, and leave the function as it was written if it cannot. This is known as ‘object mode’ and means that if JIT compilation fails because some or all of the function body is not supported by Numba, it will compile the function as a regular Python object. Chances are, the result will still be faster as it may be able to optimise some loops using loop-lifting, however, so it’s definitely worth the try.

Compilation overhead

Because Numba’s JIT compiler has to compile the function to bytecode, there will be an inevitable overhead — often indicated by a pretty slow first run followed by tremendously faster subsequent runs. This is the time cost of JIT compiling a function. While compilation is almost always worth it and needs to be done only once, in performance-critical applications it makes sense to reduce compilation overhead. There are two principal ways to accomplish it with Numba: caching and eager compilation.

import math
import numba
@numba.njit
def lazy_hypotenuse(side1: int, side2: int) -> float:
return math.sqrt(math.pow(side1, 2) + math.pow(side2, 2))
import math
from numba import njit, float32, int32
@numba.njit(float32(int32, int32))
def eager_hypotenuse(side1: int, side2: int) -> float:
return math.sqrt(math.pow(side1, 2) + math.pow(side2, 2))

Invoking other JITted functions

As a general rule, Numba will not do recursive optimisation for you. In other words, if you invoke other functions you yourself defined from a JITted function, you must mark those for JITting separately — Numba will not JIT them just because they’re invoked in a JITted function. Consider the following example:


import numpy as np
from numba import njit, float32
from typing import List
def get_stdev(arr: List[float]):
return np.std(np.array(arr))
@njit(float32(float32[:]))
def get_variance(arr: List[float]):
return get_stdev(arr)**2

Just how fast is it?

To demonstrate the benefits of JIT, I’ve run a benchmark, in which I used a somewhat clumsy LU decomposition of square matrices from 10 x 10 to 256 x 256. As you can see, Numba-optimised NumPy code is at all times at least a whole order of magnitude faster than naive NumPy code and up to two orders of magnitude faster than native Python code. Directly invoked LAPACK code, written in FORTRAN 90 (via SciPy’s scipy.linalg.lu_factor(), a wrapper around the *GETRF routine in LAPACK), emerges as the clear winner at larger matrix sizes, and Cython’s performance turns out to be only slightly inferior to the optimised NumPy code.

LU decomposition benchmark for native Python, NumPy, optimised NumPy, LAPACK and Cython code. LAPACK was invoked using a SciPy wrapper. Optimised NumPy code is about an order of magnitude faster than ordinary NumPy code throughout, while up to two orders of magnitude faster than native Python. Cython code, on the other hand, is not significantly faster, whereas FORTRAN code only begins to lap optimised NumPy at relatively large matrix sizes. The ‘bang for a buck’ factor of optimising with Numba is clearly the highest — the NumPy code (orange) and the optimised NumPy code (crimson) differ only by the application of a single decorator.

There’s more to Numba than speed

Numba’s main job, of course, is to speed up functions. But it also does an excellent job at several other things. Perhaps my favourite among these is the @vectorize decorator, which can turn any old function into a NumPy universal function (often just called a ‘ufunc’). If you have a background in R, you might from time to time find yourself to be wistfully reminiscing about R’s ability to vectorise functions without much ado. A ufunc is a vectorized wrapper that generalises a function to operate on tensors represented as n-dimensional NumPy arrays (ndarrays), supporting tensor logic like broadcasting, internal buffers and internal type casting. An example is the function numpy.add(), which generalises the addition function (invoked via the addition operator, +) for tensors of any size — including tensors that are not the same size, where NumPy’s broadcasting logic is used to reconcile the tensors of different size.

Magic with Numba’s vectorisation decorator: a simple elementwise function can be generalised to higher-order tensors by nothing more than wrapping it in a decorator. This is, of course, rather inefficient, as failing to specify a signature for possible vectorisations means some type-specific optimisation cannot be carried out. For details on writing good vectorizable code in Numba, please refer to the documentation’s chapter on the vectorisation decorator.
Numba’s GPU parallelism framework compiles Python code to an NVVM (equivalent to LLVM) intermediate representation (IR) and PTX (compiled code). Both are serialised and sent to each cluster node. If the client and the cluster node have the same CUDA architecture (i.e. the same type of streaming multiprocessor (SM) on the GPU), the deserialised PTX code is translated into the CUDA binary. If the architecture differs, the NVVM IR is recompiled into PTX and eventually the CUDA binary.

Coda

Whether you work on computer vision, financial forecasting, weather models or deep neural networks: if your work in Python involves tensors and tensor operations, chances are, Numba’s JIT compiler can make a difference to your operations with as little as a single decorator. But Numba is also a rapidly developing ecosystem of primitives for high performance computing applications in Python that can be implemented in CUDA or a range of processor architectures and operating systems. With that in mind, it is no doubt one of the most interesting projects to date. Whether you come for the speed-up and stay for the HPC fun, or are just there to see how to accelerate Python code in quantitative applications, especially where tensors are involved, without massive coding overhead, Numba is something every data scientist working in Python should take a look at.

VP of Special Projects at Starschema, clinical computational epidemiologist, rower. Passionate about the potential of data science to improve the world.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store