Writing and integrating a Python native module into your project is time-consuming; is it worth it?
First, you need to choose the implementation framework. With an AOT approach (Ahead-Of-Time generation of machine code), you can, for example, write your extension with a standard tool like Cython, choose a C/C++ binding generator (e.g., Boost.Python, pybind11, CFFI), or try a newcomer like Rust with the PyO3 framework.
Next, you need to decide what to optimize: find the critical code paths and determine how much of this code to move into the native module.
I will not cover the available tools for building native extensions in this article. However, you can read a short review of several optimization techniques in my previous article.
Once the framework is chosen, the real question is, what do we want to optimize? What are the pros & cons of using a native module?
In this article, I set up a simple baseline code to measure the performance improvements that we can achieve and give some hints on the contexts in which a native call produces more overhead costs than benefits.
What to Measure?
In the following sections, I set up a basic benchmark to test the performance of native modules based on Rust/PyO3 and Cython. For this comparison, I have developed the same algorithm using three approaches:
- A pure Python implementation: the baseline of this benchmark.
- A Cython implementation with type hints.
- A Rust implementation with the PyO3 framework.
With the objectives:
- To measure the performance improvements of a native module vs. a pure Python approach.
- To measure the overhead of calling a native Cython or Rust function from the Python interpreter.
- To understand the tradeoff between the performance improvements of the native extension and the impact of the overhead.
And some limitations:
- The test uses a function with only basic data types. For native modules with structured or custom data types, copying and transforming these objects between the Python layer and the native code layer has a cost.
- The native code of the is_prime function uses only basic test operators and mathematical operations in loop blocks. This is not a heavy computational task that could exploit SIMD/vectorization instructions, parallelism, threading, etc. On complex computation jobs, the performance gains can be much higher.
How?
We start with a straightforward function based on my previous article.
This code provides a Rust-based implementation of the is_prime function. To this baseline, we will add:
- A pure Python implementation of is_prime.
- A Cython implementation of is_prime.
- A function to benchmark thousands of runs and trace them for further analysis.
Our Benchmark Function
Our benchmark will use a simple primality test function, extracted from my previous article, “A Mixed Rust + Python Project.” The function is_prime checks the primality of its input by trial division: for a number num, we test the remainder of the division by each integer between 2 and √num.
→ Rust «is_prime» =
#[pyfunction] (1)
fn is_prime(num: u32) -> bool {
match num {
0 | 1 => false,
_ => {
let limit = (num as f32).sqrt() as u32; (2)
!(2..=limit).any(|i| num % i == 0) (3)
}
}
}
- (1): the Rust macro #[pyfunction] generates the code for the Python binding.
- (2): compute the upper bound of our trial division series.
- (3): generate the trials and test the remainder of each division.
Despite its simplicity, this function is a good candidate for our test because its computational complexity can differ widely between test cases:
- Testing a prime number: for example, the number 12899 involves 112 tests before returning true.
- Testing a non-prime number: for example, the previous number 12898 needs only 1 test before returning false.
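The gap between these two cases can be verified with a small counting sketch (count_trial_divisions is a hypothetical helper mirroring the loop of is_prime, not part of the project):

```python
import math

def count_trial_divisions(num: int) -> int:
    """Count how many trial divisions are performed for `num`."""
    if num in (0, 1):
        return 1
    ntests = 0
    for i in range(2, int(math.sqrt(num)) + 1):
        ntests += 1
        if num % i == 0:
            break  # a divisor was found: num is not prime
    return ntests

print(count_trial_divisions(12899))  # prime: all 112 trials are exhausted
print(count_trial_divisions(12898))  # even: stops after 1 trial
```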
Project Structure
Based on the code and structure defined in the article “A Mixed Rust + Python Project,” I am only adding a bench directory with the pure Python and Cython is_prime implementations.
$ tree mybench
mybench
├── bench
│ ├── bench.py
│ └── cytest.pyx
├── Cargo.toml
├── pyproject.toml
├── src
│ └── lib.rs
└── test
└── test.py
3 directories, 6 files
To our pyproject.toml file, we add a dependency on the click package to manage our command-line arguments.
[project.optional-dependencies]
test = [
"hypothesis",
"sympy",
"click"
]
To achieve the best performance of the Rust code, we compile the project with the --release flag to enable optimizations.
$ cd mybench
$ maturin develop --release --extras test
Pure Python Implementation
Our pure Python implementation is straightforward, with only a “for” loop testing our trials. The <<pure_python_is_prime>> code is integrated into the bench.py file.
→ Python «pure_python_is_prime» =
def is_prime_py(num: int) -> bool:
if num == 0 or num == 1:
return False
else:
limit = math.sqrt(num)
for i in range(2, int(limit) + 1):
if num % i == 0:
return False
return True
We set up an “instrumented” version of this code to count the approximate number of operations performed for a given num input argument.
→ Python «instrumented_is_prime» =
def is_prime_py_instrumented(num):
if num == 0 or num == 1:
return (False, 1) (1)
else:
ntests = 0
limit = math.sqrt(num)
for i in range(2, int(limit) + 1):
ntests += 1
if num % i == 0:
return (False, ntests) (2)
return (True, ntests) (3)
- (1): if our number is 0 or 1, we return the tuple (False, 1): False is the primality test result, and 1 is the number of tests involved.
- (2): if a trial division has a 0 remainder, then num is not a prime number; ntests is the test operations counter.
- (3): final case, all the trial tests are negative: num is a prime number.
Cython Implementation
The Cython code is similar to the Python version; we add type information to speed it up. Cython uses these type declarations to generate optimized native code.
→ Cython «mybench/bench/cytest.pyx» =
import math
def is_prime_cy(int num): (1)
cdef int i (2)
cdef int limit
if num == 0 or num == 1:
return False
else:
limit = int(math.sqrt(num))
for i in range(2, limit + 1):
if num % i == 0:
return False
return True
- (1): we declare our function argument as an int.
- (2): we type our two temporary variables i and limit as int.
Benchmark Function
This function will run the benchmark with two input arguments:
- nb_runs: the number of runs of our function. timeit returns the total time over these runs.
- num: the number to test for primality.
→ Python «bench_number» =
def bench_number(nb_runs: int, num: int):
run_infos = is_prime_py_instrumented(num) (1)
tm_rust = timeit.timeit(
stmt=lambda: mybench.is_prime(num), number=nb_runs
) (2)
tm_cython = timeit.timeit(
stmt=lambda: cytest.is_prime_cy(num), number=nb_runs
) (3)
tm_python = timeit.timeit(stmt=lambda: is_prime_py(num), number=nb_runs) (4)
return run_infos, tm_rust, tm_cython, tm_python
- (1): we run the is_prime_py_instrumented function to estimate our code's complexity (in number of operations).
- (2): we benchmark our Rust is_prime function, running it nb_runs times; timeit returns the total running time over all runs.
- (3) — (4): same logic for the Cython and pure Python versions.
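Since timeit.timeit returns the total elapsed time for all number executions (not a per-call average), the per-call time is obtained by dividing by nb_runs. A minimal sketch:

```python
import timeit

nb_runs = 100_000
# Total seconds spent running the statement nb_runs times
total = timeit.timeit(stmt=lambda: 12899 % 113, number=nb_runs)
per_call = total / nb_runs  # average seconds for a single call
print(f"total: {total:.4f}s, per call: {per_call * 1e9:.0f}ns")
```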
Benchmark Main Program
We finally write our main benchmark program integrating the previous code snippets, importing our Rust and Cython modules, and integrating a simple command-line interface.
→ Python «mybench/bench/bench.py» =
import click
import timeit
import math
import pathlib
import typing
import json
import sys
import pyximport
pyximport.install() (1)
import mybench (2)
import cytest (3)
(4)
<<pure_python_is_prime>>
(5)
<<instrumented_is_prime>>
(6)
<<bench_number>>
(7)
@click.command()
@click.option(
"--nb_runs",
type=int,
default=1000,
help="Number of runs.",
)
@click.option(
"--num",
type=int,
default=12899,
help="The number to test or upper bound \
if output_serie is defined.",
)
@click.option(
"--output_serie",
type=str,
default=None,
help="Output file for serie results.",
)
def bench(nb_runs: int, num: int, output_serie: str):
if output_serie is None: (8)
(
run_infos,
tm_rust,
tm_cython,
tm_python,
) = bench_number(nb_runs, num)
print(
"Running {} primality tests on {}, \
result '{}' with {} tests".format(
nb_runs,
num,
run_infos[0],
run_infos[1],
)
)
print("Bench Rust/PyO3 {:.3f} {}".format(tm_rust, mybench.is_prime(num)))
print("Bench Cython {:.3f} {}".format(tm_cython, cytest.is_prime_cy(num)))
print("Bench Python {:.3f} {}".format(tm_python, is_prime_py(num)))
else: (9)
prime_list = [tst_num for tst_num in range(2, num + 1) if is_prime_py(tst_num)]
with open(output_serie, "w+") as fp:
fp.write("num,nb_runs,result,nb_tests,\
tm_rust,tm_cython,tm_python\n")
for tst_num in prime_list[0::100]: (10)
(
run_infos,
tm_rust,
tm_cython,
tm_python,
) = bench_number(nb_runs, tst_num)
fp.write(
"{},{},{},{},{},{},{}\n".format(
tst_num,
nb_runs,
run_infos[0],
run_infos[1],
tm_rust,
tm_cython,
tm_python,
)
)
if __name__ == "__main__":
bench()
- (1): pyximport is a facility provided by Cython to enable direct compilation and import. We don't have to write a custom setup.py to build with the cython transpiler.
- (2): we import our Rust native module, compiled in release mode.
- (3): we import our Cython native module.
- (4): our pure Python function, defined in the <<pure_python_is_prime>> code block.
- (5): the instrumented version, defined in the <<instrumented_is_prime>> code block.
- (6): our benchmarking function, running our three implementations, defined in the <<bench_number>> code block.
- (7): we define a simple command-line interface for running experiments.
We can run our program in two modes:
- Single experiment mode (8): a single number is chosen for primality testing.
- Series experiment mode (9): we provide the upper bound of the series, which allows testing runs of several complexities. The experiment runs on every 100th element of the generated prime number series below the specified upper bound (10).
Results & Analysis
First, we can test our benchmark function command-line interface.
$ cd mybench
$ python bench/bench.py --help
Usage: bench.py [OPTIONS]
Options:
--nb_runs INTEGER Number of runs.
--num INTEGER The number to test or upper bound if output_serie is
defined.
--output_serie TEXT Output file for serie results.
--help Show this message and exit.
In an initial test, we check the primality of the number 12899 with 1,000,000 runs.
$ cd mybench
$ python bench/bench.py --nb_runs 1000000 --num 12899
> Running 1000000 primality tests on 12899, result 'True' with 112 tests
> Bench Rust/PyO3 0.529 True
> Bench Cython 0.526 True
> Bench Python 4.349 True
This initial test shows that checking the primality of 12899 requires 112 internal tests in our is_prime function. The result is True (12899 is a prime number). The run time for the pure Python version is 4.3s; Cython and Rust deliver roughly the same performance at about 0.52s.
In this case, we have a performance improvement ratio of roughly ×8 between pure Python and the Cython/Rust native extensions.
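The ratio can be checked directly from the printed totals:

```python
# Timings (seconds) taken from the run above
tm_python, tm_cython, tm_rust = 4.349, 0.526, 0.529
print(round(tm_python / tm_cython, 1))  # 8.3
print(round(tm_python / tm_rust, 1))    # 8.2
```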
We can check the cost of the Cython/Rust overhead.
$ cd mybench
$ python bench/bench.py --nb_runs 1000000 --num 1
By testing the special case 1, only one test is executed in our is_prime function. In this case, the cost of the computation is mostly the cost of calling the is_prime function itself: we observe the call overhead of Cython and Rust versus pure Python.
> Running 1000000 primality tests on 1, result 'False' with 1 tests
> Bench Rust/PyO3 0.166 False
> Bench Cython 0.113 False
> Bench Python 0.134 False
We can see that Cython shows no measurable overhead for this specific case. With PyO3, a very small overhead appears.
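Dividing each total by the 1,000,000 runs gives a rough per-call cost (about 166 ns for Rust/PyO3, 113 ns for Cython, and 134 ns for pure Python here):

```python
nb_runs = 1_000_000
# Total times (seconds) taken from the run above
totals = {"Rust/PyO3": 0.166, "Cython": 0.113, "Python": 0.134}
for name, total in totals.items():
    print(f"{name}: {total / nb_runs * 1e9:.0f} ns per call")
```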
We will now try a more general case. First, we test all the prime numbers below a specified threshold and record their timings. We can then plot the performance improvements for several run settings.
$ cd mybench
$ python bench/bench.py --nb_runs 1000000 --num 100000 --output_serie runs.csv
This run generates an output file “runs.csv” with the following format.
num,nb_runs,result,nb_tests,tm_rust,tm_cython,tm_python
2,1000000,True,0,0.1748378580014105,0.17820978199961246,0.3463963739995961
547,1000000,True,22,0.23610526800075604,0.24908054699881177,1.2034144149984058
1229,1000000,True,34,0.2719646179994015,0.2854190660000313,1.6644533799990313
1993,1000000,True,43,0.29658469800051535,0.3141682330006006,1.961818502000824
2749,1000000,True,51,0.32396009900003264,0.33482105799885176,2.2520081020011276
3581,1000000,True,58,0.3524247669993201,0.35497432999909506,2.4976866420001897
4421,1000000,True,65,0.37128741100059415,0.37557477799964545,2.7788527150005393
5281,1000000,True,71,0.3890202149996185,0.394951800999479,3.0130012549998355
6143,1000000,True,77,0.4090290809999715,0.41361685099946044,3.184734868000305
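From these rows, the Python-over-native speedup per complexity level can already be computed; a sketch using a few of the sample rows above:

```python
import csv
import io

# A few rows copied from the runs.csv sample above
sample = """num,nb_runs,result,nb_tests,tm_rust,tm_cython,tm_python
2,1000000,True,0,0.1748378580014105,0.17820978199961246,0.3463963739995961
547,1000000,True,22,0.23610526800075604,0.24908054699881177,1.2034144149984058
6143,1000000,True,77,0.4090290809999715,0.41361685099946044,3.184734868000305
"""

ratios = {}
for row in csv.DictReader(io.StringIO(sample)):
    ratios[int(row["nb_tests"])] = float(row["tm_python"]) / float(row["tm_cython"])

for nb_tests, ratio in ratios.items():
    print(f"{nb_tests} tests: Python/Cython ratio = {ratio:.2f}")
```

The ratio grows with the number of internal tests: the call overhead dominates for trivial inputs and is amortized for heavier ones.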
First, a plot of our run trace. The x-axis shows the number of tests (the complexity of the run); the y-axis shows the run time of the three implementations (totaled over the 1,000,000 runs).
import pandas as pd
df = pd.read_csv("mybench/runs.csv")
sdf = df[["nb_tests", "tm_rust", "tm_python", "tm_cython"]]
sdf.plot(x="nb_tests", kind="line")
Unsurprisingly, the run time of all three implementations scales linearly with increasing complexity. The pure Python implementation's run time increases with a steeper slope than the native implementations'. Cython and Rust/PyO3 performance is comparable.
Second, a plot of the ratio between pure Python runs and native (Rust + Cython averaged) runs.
import pandas as pd
df = pd.read_csv("mybench/runs.csv")
df["tm_native"] = (df["tm_rust"] + df["tm_cython"]) / 2.0
df["ratio"] = df["tm_python"] / df["tm_native"]
sdf = df[["nb_tests", "ratio"]]
sdf.plot(x="nb_tests", y="ratio", kind="scatter")
In this case, we observe that the improvement ratio increases as more compute-heavy code is moved into the native module.
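This linear trend suggests the simple cost model time ≈ call_overhead + cost_per_test × nb_tests. Fitting a least-squares line separates the fixed call overhead from the per-test cost; a sketch on synthetic timings shaped like the measured runs (runs.csv itself is produced locally, so the coefficients below are assumed, for illustration only):

```python
import numpy as np

# Synthetic timings mimicking the measured shape: fixed call overhead
# plus a linear per-test cost (assumed values, for illustration only).
nb_tests = np.array([0, 22, 34, 43, 51, 58, 65, 71, 77], dtype=float)
call_overhead, cost_per_test = 0.17, 0.003  # seconds per 1,000,000 calls
tm = call_overhead + cost_per_test * nb_tests

slope, intercept = np.polyfit(nb_tests, tm, 1)  # degree-1 least-squares fit
print(f"cost per test ≈ {slope:.4f}, call overhead ≈ {intercept:.3f}")
```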
Wrap-up
A simple wrap-up:
- With our test case, the call overheads of Cython and Rust/PyO3 are comparable, with a slight advantage for Cython.
- As more computational complexity is moved into the native extension, the performance improvement increases. For example, if you call a native function within a loop, you gain performance by moving the loop itself into the extension.
- For simple functions without loops, using a native extension can be overkill.
- This test uses simple data types. If you use more complex data types (for example, lists or arrays), moving these objects between Python and the extension comes at a cost. It is therefore helpful to implement these types natively and expose them to the Python interpreter. This is the approach used by NumPy [1]: a raw C memory buffer exposed in the Python interpreter.
References
[1] Van Der Walt, S., Colbert, S. C., & Varoquaux, G. (2011). The NumPy array: a structure for efficient numerical computation. Computing in science & engineering, 13(2), 22–30.