Improving Python’s Performance

I’m sure every one of us wants to combine Python’s expressivity with the performance and control of C/C++ and obtain great results. So let’s look into how we can achieve some of that with Cython and Pypy.

Below is a naive divisors counting function, implemented in Python:

Profiling

Now, let’s do some benchmarking and see how the above function performs. To do this we’ll take advantage of the iPython magic function: %timeit

In [2]: %timeit -r 3 -n 1000 count_divisors(10 ** 9)
1000 loops, best of 3: 3.37 ms per loop

As we can see, the best execution took 3.37 ms per loop for count_divisors(10 ** 9).

PyPy

Here’s a link to PyPy’s documentation on how to get started with it. Note that on mac os systems, it can be installed with homebrew:

brew install pypy
# We can also install ipython for pypy
pypy -m easy_install ipython
# For convenience, I've added an alias to pypy's ipython in my ~/.bash_profile
alias ipypython='/usr/local/share/pypy/ipython'

Now let’s try the same thing in pypy. Note that no modifications to the code are needed to run the method in pypy (it uses the same syntax as python).

In [2]: %timeit -r 3 -n 1000 count_divisors(10 ** 9)
1000 loops, best of 3: 422 µs per loop

Cython

Let’s see how we can improve these results by compiling the above function with Cython, without any modification at first. Remember that most Python code is Cython valid.

In order to be able to benchmark, we first have to load the Cython extension with %load_ext Cython and then use the %%cython just like below:

In [4]: %load_ext Cython
In [5]: %%cython
...:def count_divisors(n):
...: if n == 1:
...: return 1
...:
...: i, divisors = 2, 2
...: while i * i < n:
...: if n % i == 0:
...: divisors += 2
...: i += 1
...:
...: if i * i == n:
...: divisors += 1
...:
...: return divisors

Benchmarking:

In [6]: %timeit -r 3 -n 1000 count_divisors(10 ** 9)
1000 loops, best of 3: 1.96 ms per loop

As mentioned previously, Cython understands Python code, but to obtain even better performance we can convert the dynamically typed variables to the statically typed Cython ones; we use the cdef Cython statement to declare the statically typed variables just as in C. Cython’s documentation talks about how to choose which variables to statically define. In order to keep the code expressive and readable, you shouldn’t statically define everything!

For this example, we only statically defined n and i as C ints, seeing as arithmetic with those two makes up most of the CPU heavy lifting. Note that this means the two methods are not equivalent! You now have to be careful not to overflow int when doing arithmetic.

In [9]: %timeit -r 3 -n 1000 count_divisors(10 ** 9)
1000 loops, best of 3: 126 µs per loop

Conclusion

Simply using Cython with the Python code almost doubled the performance. That said, PyPy is clearly the best choice if we want to improve performance without modifying the Python code. While the small above benchmark might not seem to provide enough proof, it does support that the PyPy team’s claims that PyPy is 7x faster than Cython. Moreover, we’ve experimented with them on a bigger application, and we’ve observed similar results.

The best performance was obtained with Cython when static types were used. The statically typed version proven to be 15x faster than other Cython implementation and 26x faster that the normal Python version.

Though we managed to improve performance greatly with static types in Cython, there are some caveats to this approach. First, the code becomes less readable, and it takes time to figure out which variables are used in the bottleneck areas of the code to only use static types on those. Secondly, one has to be very careful with C types: arithmetic on them can overflow.


Liked the article?

We’re always looking for a great project to work on, so don’t hesitate drop us a line at hello[at]bluepixel.io or visit our website: http://bluepixel.io