Speeding up your code (4): just-in-time compilation with Numba

Vincenzo Lavorini
3 min readMar 6, 2018


From this series:

  1. The example of the mean shift clustering in Poincaré ball space
  2. Vectorizing the loops with Numpy
  3. Batches and multithreading
  4. Just-in-time compilation with Numba (this post)

In the previous posts we used our wits to speed up a (relatively) simple algorithm. There are probably other smart ways to squeeze out more execution time, but nothing very interesting came to my mind.

So, it’s time to turn to brute force. Still, I will avoid using GPUs or TPUs. As I already showed in this other post, you often lose so much time moving data from system memory to the GPU (or TPU) that, in the end, the whole process turns out slower.

But there are other ways of exploiting the brute force, i.e. of improving the results without squeezing too much our brains.

One of them is Numba. It’s a tool that takes our functions and “compiles” them, meaning it translates them into low-level code which (roughly speaking) speaks the same language as the CPU. And so it’s faster.

Numba at work

Asking Numba to compile a function is as easy as writing four characters, “@jit”, before the definition of the function:

This syntax is known in Python as a ‘function decorator’. Decorators tell the interpreter to alter the behavior of the function being defined. In our case, the decorator tells the interpreter to modify the function following Numba’s instructions. You can find more information about such niceties here.

Numba actually ships with lots of very useful tools. One example: it can parallelize code automatically. Of course, an ‘automatic’ tool is in general unable to handle every possible kind of parallelization in a function, so don’t expect miracles from it. Since we already parallelized the code manually, we will not use this functionality.

So do we just have to add the decorator before the function definitions? Almost. Since Numba is still under heavy development, not all the NumPy functions we use are supported. So we have to make some minor modifications.

In particular, the np.tile function is not yet supported. An easy workaround is to create an empty array of the desired shape and fill it with copies of the vector we need:

You can see that we feed the decorator with two parameters:

  • The first one, nopython=True, tells Numba to actually compile the function. It may seem redundant, and in our case it is. But when Numba is unable to compile a function, it falls back to the usual interpreted Python execution, and we lose the speedup we are after. With this parameter, Numba raises an error in such cases instead, so we can dig in and understand where compilation fails. It is good practice to add it anyway.
  • The second one, nogil=True, tells the Python interpreter to release the Global Interpreter Lock. A description of how the GIL works goes beyond the scope of this post, but in short: if we keep it while running the compiled code, we lose the parallelization, so we have to release it.

The rest of the code is exactly as in the previous post, with the only exception of adding the decorator with the two arguments, @jit(nopython=True, nogil=True), to the other two functions being called (__shift, gaussian). Note that the main one, meanshift_parallel, cannot be compiled as it is currently written.

Performance improvement

Now let’s look at the numbers. How much speed have we gained?

Well, we did gain execution time. But there is a saturation effect: as the number of vectors grows, the execution times of the two versions approach each other more and more.

This is the final post from this series.

Thank you for reading!
