A (Simple) Native Python Extension Benchmark

MatthieuL
10 min read · Apr 17, 2022

Writing and integrating a Python native module into your project is time-consuming; is it worth it?

First, you need to choose the implementation framework. For example, with an AOT approach (Ahead-Of-Time generation of machine code), you can write your extension using a standard tool like Cython or choose a C/C++-based binding generator (e.g., boost::python, pybind11, CFFI) or finally, test a newcomer like Rust with the PyO3 framework.

Next, you need to determine what to optimize, find the critical code paths, and decide how much of this code to move into the native module.

Crossing the Python/Native Module Layer — Image by Author

I will not survey the available tools for building native extensions in this article. However, you can read a short review of several optimization techniques in my previous article.

Once the framework is chosen, the real question is, what do we want to optimize? What are the pros & cons of using a native module?

In this article, I set up a simple baseline code to measure the performance improvements that we can achieve and give some hints on the contexts in which a native call produces more overhead costs than benefits.

What to Measure?

In the following sections, I set up a basic benchmark to test the performance of native modules based on Rust/PyO3 and Cython. For this comparison, I have developed the same algorithm using three approaches:

  • A pure Python implementation: the baseline of this benchmark.
  • A Cython implementation with type hints.
  • A Rust implementation with the PyO3 framework.

With the objectives:

  • To measure the performance improvements of a native module vs. a pure Python approach.
  • To measure the overhead of calling a native Cython or Rust function from the Python interpreter.
  • To understand the tradeoff between the performance improvements of the native extension and the impact of the overhead.

Native Extension Tradeoff — Image by Author

And some limitations:

  • The test uses a function with only basic data types. For native modules with structured or custom data types, copying and transforming these objects between the Python layer and the native code layer has a cost.
  • The native code of the is_prime function uses only basic test operators and mathematical operations in loop blocks. This is not a heavy computational task that can exploit SIMD/vectorization instructions, parallelism, threading, etc. On complex computation jobs, performance gains can be much higher.

How?

We start with a straightforward function based on my previous article.

This code provides a Rust-based implementation of the is_prime function. To this baseline, we will add:

  • A pure Python implementation of is_prime
  • A Cython implementation of is_prime.
  • A function to benchmark thousands of runs and trace them for further analysis.

Our Benchmark Function

Our benchmark will use a simple primality test function. This function is extracted from my previous article, “A Mixed Rust + Python Project.” The function is_prime checks the primality of its input by dividing it by preceding numbers: for a number num, we test the division remainder for each integer between 2 and √num.

→ Rust «is_prime» =

#[pyfunction] (1)
fn is_prime(num: u32) -> bool {
    match num {
        0 | 1 => false,
        _ => {
            let limit = (num as f32).sqrt() as u32; (2)

            (2..=limit).any(|i| num % i == 0) == false (3)
        }
    }
}
  • (1): The Rust macro #[pyfunction] generates code for Python binding.
  • (2): Compute the upper bound of our trial division series.
  • (3): Generate the trials and test the remainder of the division.

Despite its simplicity, this function is a good candidate for our test because its computational cost differs widely across our test cases:

  • Testing a prime number. For example, the number 12899 involves 112 tests before returning true.
  • Testing a non-prime number. For example, the previous number 12898 only needs 1 test before returning false.
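
Once the extension is compiled (we build it with maturin below), the Rust function is callable like any other Python function. A quick sanity check of both cases, assuming the module is importable as mybench:

import mybench

print(mybench.is_prime(12899))  # True: the full series of 112 trial divisions runs
print(mybench.is_prime(12898))  # False: the first trial division (by 2) succeeds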

Project Structure

Building on the code and structure of the article “A Mixed Rust + Python Project,” I only add a bench directory with the pure Python and Cython is_prime implementations.

$ tree mybench

mybench
├── bench
│ ├── bench.py
│ └── cytest.pyx
├── Cargo.toml
├── pyproject.toml
├── src
│ └── lib.rs
└── test
└── test.py

3 directories, 6 files

To our pyproject.toml file, we add a dependency on the click package to manage our command-line arguments.

[project.optional-dependencies]
test = [
    "hypothesis",
    "sympy",
    "click"
]

To achieve the best performance of the Rust code, we compile the project with a --release flag to enable optimization.

$ cd mybench
$ maturin develop --release --extras test

Pure Python Implementation

Our pure Python implementation is straightforward, with only a “for loop” testing our trials. The <<pure_python_is_prime>> code is integrated into the bench.py file.

→ Python «pure_python_is_prime» =

def is_prime_py(num: int) -> bool:
    if num == 0 or num == 1:
        return False
    else:
        limit = math.sqrt(num)
        for i in range(2, int(limit) + 1):
            if num % i == 0:
                return False
        return True
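
Since the project's test extras already pull in hypothesis and sympy, we can cross-check this implementation against a reference primality test. A minimal property-based sketch (not part of the benchmark itself; the input range is arbitrary):

from hypothesis import given, strategies as st
import sympy

@given(st.integers(min_value=0, max_value=1_000_000))
def test_is_prime_py_matches_sympy(num):
    # sympy.isprime serves as the reference implementation.
    assert is_prime_py(num) == sympy.isprime(num)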

We set up an “instrumented” version of this code to count the approximate number of operations performed for a given num input argument.

→ Python «instrumented_is_prime» =

def is_prime_py_instrumented(num):
    if num == 0 or num == 1:
        return (False, 1) (1)
    else:
        ntests = 0
        limit = math.sqrt(num)
        for i in range(2, int(limit) + 1):
            ntests += 1
            if num % i == 0:
                return (False, ntests) (2)
        return (True, ntests) (3)
  • (1): If num is 0 or 1, we return the tuple (False, 1): False is the primality test result, and 1 is the number of tests involved.
  • (2): If a trial division has a remainder of 0, then num is not a prime number. "ntests" is the test operations counter.
  • (3): Final case: all the trial tests are negative, so num is a prime number.
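
As a quick check of the instrumentation on the two cases discussed earlier:

print(is_prime_py_instrumented(12899))  # (True, 112): the full trial series runs
print(is_prime_py_instrumented(12898))  # (False, 1): the first division (by 2) succeeds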

Cython Implementation

The Cython code is similar to the Python version; we add type information to speed it up. Cython uses the type annotations to generate optimized native code.

→ Cython «mybench/bench/cytest.pyx» =

import math

def is_prime_cy(int num): (1)
    cdef int i (2)
    cdef int limit

    if num == 0 or num == 1:
        return False
    else:
        limit = int(math.sqrt(num))
        for i in range(2, limit + 1):
            if num % i == 0:
                return False
        return True
  • (1): we declare our function argument as an int.
  • (2): we type our two temporary variables i and limit as int.
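
The benchmark program below compiles this file on the fly with pyximport, so no build script is required. If you prefer an explicit build step instead, a minimal setup.py sketch could look like this (hypothetical, not part of the project):

# setup.py: optional explicit build, an alternative to pyximport
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("bench/cytest.pyx"))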

Benchmark Function

This function will run the benchmark with two input arguments:

  • nb_runs: the number of runs of our function. timeit reports the cumulative time over these runs.
  • num: the number to test as a prime number.
→ Python «bench_number» =

def bench_number(nb_runs: int, num: int):
    run_infos = is_prime_py_instrumented(num) (1)
    tm_rust = timeit.timeit(
        stmt=lambda: mybench.is_prime(num), number=nb_runs
    ) (2)
    tm_cython = timeit.timeit(
        stmt=lambda: cytest.is_prime_cy(num), number=nb_runs
    ) (3)
    tm_python = timeit.timeit(stmt=lambda: is_prime_py(num), number=nb_runs) (4)
    return run_infos, tm_rust, tm_cython, tm_python
  • (1): We run the is_prime_py_instrumented function to estimate our code's complexity (in number of test operations).
  • (2): We benchmark our Rust is_prime function, running it nb_runs times. The timeit function returns the cumulative running time over all runs.
  • (3) — (4): Same logic for the Cython and pure-Python versions.
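
Note that timeit.timeit(..., number=nb_runs) returns the total wall-clock time over all nb_runs executions, not an average. A quick usage sketch, dividing by nb_runs for a per-call estimate:

run_infos, tm_rust, tm_cython, tm_python = bench_number(1000, 12899)

# timeit reports cumulative times; divide by the number of runs for a per-call figure.
print("Rust/PyO3 per call: {:.2f} µs".format(tm_rust / 1000 * 1e6))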

Benchmark Main Program

We finally write our main benchmark program integrating the previous code snippets, importing our Rust and Cython modules, and integrating a simple command-line interface.

→ Python «mybench/bench/bench.py» =

import click
import timeit
import math
import pathlib
import typing
import json
import sys
import pyximport

pyximport.install() (1)

import mybench (2)
import cytest (3)

(4)
<<pure_python_is_prime>>
(5)
<<instrumented_is_prime>>
(6)
<<bench_number>>


(7)
@click.command()
@click.option(
    "--nb_runs",
    type=int,
    default=1000,
    help="Number of runs.",
)
@click.option(
    "--num",
    type=int,
    default=12899,
    help="The number to test or upper bound \
if output_serie is defined.",
)
@click.option(
    "--output_serie",
    type=str,
    default=None,
    help="Output file for serie results.",
)
def bench(nb_runs: int, num: int, output_serie: str):
    if output_serie is None: (8)
        (
            run_infos,
            tm_rust,
            tm_cython,
            tm_python,
        ) = bench_number(nb_runs, num)
        print(
            "Running {} primality tests on {}, \
result '{}' with {} tests".format(
                nb_runs,
                num,
                run_infos[0],
                run_infos[1],
            )
        )
        print("Bench Rust/PyO3 {:.3f} {}".format(tm_rust, mybench.is_prime(num)))
        print("Bench Cython {:.3f} {}".format(tm_cython, cytest.is_prime_cy(num)))
        print("Bench Python {:.3f} {}".format(tm_python, is_prime_py(num)))
    else: (9)
        prime_list = [tst_num for tst_num in range(2, num + 1) if is_prime_py(tst_num)]
        with open(output_serie, "w+") as fp:
            fp.write("num,nb_runs,result,nb_tests,\
tm_rust,tm_cython,tm_python\n")
            for tst_num in prime_list[0::100]: (10)
                (
                    run_infos,
                    tm_rust,
                    tm_cython,
                    tm_python,
                ) = bench_number(nb_runs, tst_num)
                fp.write(
                    "{},{},{},{},{},{},{}\n".format(
                        tst_num,
                        nb_runs,
                        run_infos[0],
                        run_infos[1],
                        tm_rust,
                        tm_cython,
                        tm_python,
                    )
                )


if __name__ == "__main__":
    bench()
  • (1): pyximport is a facility provided by Cython to enable direct compilation and import. We don’t have to write a custom setup.py to build with the Cython transpiler.
  • (2): we import our Rust native module compiled in release mode.
  • (3): we import our Cython native module.
  • (4): our pure Python function is defined in <<pure_python_is_prime>> code block.
  • (5): instrumented version defined in <<instrumented_is_prime>> code block.
  • (6): our benchmarking function, running our three implementations and defined in <<bench_number>> code block.
  • (7): we define a simple command-line interface for running experiments.

We can run our program using two modes:

  • In the single experiment mode (8), a single number is chosen for primality testing.
  • In the series experiment mode (9), we provide the upper bound of the series. This mode allows testing runs of varying complexity. We run the experiment on every 100th element (10) of the prime-number series generated below the specified upper bound.

Results & Analysis

First, we can test the command-line interface of our benchmark program.

$ cd mybench
$ python bench/bench.py --help

Usage: bench.py [OPTIONS]

Options:
  --nb_runs INTEGER    Number of runs.
  --num INTEGER        The number to test or upper bound if output_serie is
                       defined.
  --output_serie TEXT  Output file for serie results.
  --help               Show this message and exit.

In an initial test, we check the primality of the number 12899 with 1,000,000 runs.

$ cd mybench
$ python bench/bench.py --nb_runs 1000000 --num 12899

> Running 1000000 primality tests on 12899, result 'True' with 112 tests
> Bench Rust/PyO3 0.529 True
> Bench Cython 0.526 True
> Bench Python 4.349 True

This initial test shows that checking the primality of the number 12899 requires 112 internal tests in our is_prime function. The result is True (12899 is a prime number). The run time for the pure Python version is 4.3 s; Cython and Rust deliver roughly the same performance at about 0.53 s.

In this case, we have a performance improvement ratio of roughly 8x (4.349 / 0.53) between pure Python and the Cython/Rust native extensions.

We can check the cost of the Cython/Rust overhead.

$ cd mybench
$ python bench/bench.py --nb_runs 1000000 --num 1

By testing the special case "1", we execute only one test in our is_prime function. In this case, the cost of the computation is mainly the cost of calling the is_prime function: we observe the call overhead of Cython and Rust versus pure Python.

> Running 1000000 primality tests on 1, result 'False' with 1 tests
> Bench Rust/PyO3 0.166 False
> Bench Cython 0.113 False
> Bench Python 0.134 False

We can see that Cython has no measurable call overhead in this degenerate case; it is even slightly faster than pure Python. With PyO3, a small overhead appears.
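
Dividing the reported totals by the 1,000,000 runs gives the approximate per-call cost:

# Per-call cost derived from the totals above (1,000,000 runs each).
for name, total in [("Rust/PyO3", 0.166), ("Cython", 0.113), ("Python", 0.134)]:
    print("{}: {:.0f} ns per call".format(name, total / 1_000_000 * 1e9))
# Rust/PyO3: 166 ns, Cython: 113 ns, Python: 134 ns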

We will now try a more general case. First, we test all the prime numbers below a specified threshold and record their timings. We can then plot the performance improvements for several run settings.

$ cd mybench
$ python bench/bench.py --nb_runs 1000000 --num 100000 --output_serie runs.csv

This run generates an output file “runs.csv” with the following format.

num,nb_runs,result,nb_tests,tm_rust,tm_cython,tm_python
2,1000000,True,0,0.1748378580014105,0.17820978199961246,0.3463963739995961
547,1000000,True,22,0.23610526800075604,0.24908054699881177,1.2034144149984058
1229,1000000,True,34,0.2719646179994015,0.2854190660000313,1.6644533799990313
1993,1000000,True,43,0.29658469800051535,0.3141682330006006,1.961818502000824
2749,1000000,True,51,0.32396009900003264,0.33482105799885176,2.2520081020011276
3581,1000000,True,58,0.3524247669993201,0.35497432999909506,2.4976866420001897
4421,1000000,True,65,0.37128741100059415,0.37557477799964545,2.7788527150005393
5281,1000000,True,71,0.3890202149996185,0.394951800999479,3.0130012549998355
6143,1000000,True,77,0.4090290809999715,0.41361685099946044,3.184734868000305

First, a plot of our run trace. On the x-axis is the number of tests (the complexity of the run); on the y-axis, the cumulative run time of the three implementations.

import pandas as pd

df = pd.read_csv("mybench/runs.csv")
sdf = df[["nb_tests", "tm_rust", "tm_python", "tm_cython"]]
sdf.plot(x="nb_tests", kind="line")
Run Trace Plot — Image by Author

Unsurprisingly, the run time scales linearly with increasing complexity for all three implementations. The pure Python run time increases with a steeper slope than the native implementations. Cython and Rust/PyO3 performances are comparable.

Second, a plot of the ratio between pure Python runs and native (Rust + Cython averaged) runs.

import pandas as pd

df = pd.read_csv("mybench/runs.csv")
df["tm_native"] = (df["tm_rust"] + df["tm_cython"]) / 2.0
df["ratio"] = df["tm_python"] / df["tm_native"]
sdf = df[["nb_tests", "ratio"]]
sdf.plot(x="nb_tests", y="ratio", kind="scatter")
Performance Ratio Plot — Image by Author

In this case, we observe that the improvement ratio increases as more compute-heavy code is moved into the native module.
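
To put a number on this trend, one can fit a line to each time series and compare slopes. A sketch using numpy on the runs.csv generated above:

import numpy as np
import pandas as pd

df = pd.read_csv("mybench/runs.csv")
df["tm_native"] = (df["tm_rust"] + df["tm_cython"]) / 2.0

# Least-squares slope of run time vs. number of internal tests, per implementation.
slope_python = np.polyfit(df["nb_tests"], df["tm_python"], 1)[0]
slope_native = np.polyfit(df["nb_tests"], df["tm_native"], 1)[0]
print("asymptotic speedup ~ {:.1f}x".format(slope_python / slope_native))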

Wrap-up

A simple wrap-up:

  • With our test case, the call overhead of Cython and Rust/PyO3 is comparable, with a slight advantage for Cython.
  • As more computational complexity is moved into the native extension, the performance improvement increases. For example, if you call a native function within a loop, you gain performance by moving the loop itself into the extension.
  • For simple functions without loops, using a native extension can be overkill.
  • This test uses simple data types. If you use more complex data types (for example, lists or arrays), moving these objects between Python and the extension comes with a cost. Therefore, it is helpful to implement these types natively and expose them to the Python interpreter. This is the approach used by NumPy [1]: a raw C memory buffer exposed in the Python interpreter.

References

[1] Van Der Walt, S., Colbert, S. C., & Varoquaux, G. (2011). The NumPy array: a structure for efficient numerical computation. Computing in science & engineering, 13(2), 22–30.
