When Python is not enough. Rust and Cython extensions

Igor Demidov
Published in Exness Tech Blog
11 min read · Sep 23, 2022

Co-authored by Roman Smirnov

Python is your first choice for data analysis and machine learning. In most cases its performance is more than enough. But what if your data volume becomes so huge compared to your resources that it needs a speed-up?

In this article we will solve several simple problems using pure Python and two extension languages: Rust and Cython. To compare these approaches, we will estimate the effort each one takes and see in which cases you should go for something else. We will start with the installation and “Hello World” apps, and then move on to more complex experiments.

What’s wrong with Python?

Python is one of the most popular programming languages, especially in the fields of data analysis, data science, machine learning, deep learning, and AI technologies. Python’s popularity is no surprise since it has a simple syntax and dynamic typing, and there’s no need to manage the system’s RAM.

You also don’t have to declare a variable’s type upon its creation, since Python is dynamically typed and interpreted rather than compiled.

But among these advantages, there are two major drawbacks of Python, and it’s better to know about them before your data volume grows and the processing becomes complex and custom.

Here they are:

  1. Processing speed
  2. Limited parallelization capability due to the Global Interpreter Lock (GIL), a feature of CPython, the reference implementation of the language.

Let’s put aside parallel calculation for now and come back to it in future articles. Here I’d like to discuss Python processing speed and how it may be improved by using the language extensions in Rust and Cython.

Most comparisons of these languages focus on isolated tests against Python only. In this article, we will compare pure Python, Rust, and Cython on several tasks.

Extensions in compiled languages such as Rust and Cython let you move the most computationally expensive operations into separate modules written in another, more efficient programming language. After compilation, these modules can be imported and used within Python code via the classic import procedure. Here’s an example:

import my_module

Where “my_module” is a compiled extension.

Do you need custom extensions for your Python code?

Before rushing to write custom Python extensions in high-performance programming languages, make sure you have actually reached Python’s performance limit for your task. It’s also worth keeping these factors in mind:

  1. The algorithm itself should be close to optimal. Reassembling an inefficient algorithm with another programming language and using it as a Python extension might end up with less performance increase than simple Python optimization;
  2. There is a range of Python libraries that can process data fast enough, for example, NumPy, Pandas, Dask, Vaex, Polars, PyTorch, and others. Try them first to speed up your Python code: they are mostly written in compiled programming languages and work fast. PyTorch stands out here because it’s used not only for neural network design but also for speeding up matrix calculations, as it supports parallel computation on graphics processing units (GPUs). In this context PyTorch is analogous to NumPy, but for GPU computation instead of CPU computation.
  3. And, of course, if the data comes from a relational database, it’s highly recommended to use SQL as much as possible. SQL is very fast at structured queries and can handle data sorting, aggregation, and a range of other useful operations.
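As a quick illustration of the second point, a vectorized library call often removes the need for an extension at all. The sketch below compares a pure-Python loop with its NumPy equivalent; the function names and data are made up for this example:

```python
import numpy as np

def total_deposits_loop(deposits):
    # Pure-Python loop: sums positive values one element at a time.
    total = 0.0
    for d in deposits:
        if d > 0:
            total += d
    return total

def total_deposits_numpy(deposits):
    # Vectorized equivalent: the filtering and summing run in compiled NumPy code.
    arr = np.asarray(deposits, dtype=np.float64)
    return float(arr[arr > 0].sum())

data = [10.0, -3.0, 0.0, 25.5, 7.5]
assert total_deposits_loop(data) == total_deposits_numpy(data) == 43.0
```

On large arrays the NumPy version is typically orders of magnitude faster, with no extra build step.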

Therefore, it’s generally advised to develop custom extensions for Python only if your problem can’t be solved by a couple of lines of code from the aforementioned libraries or by SQL. However, when you need specific data transformations, splitting data into custom intervals, or other complex processing, such extensions can be very useful. There are cases when using Rust or Cython increased algorithm performance by more than 40 times, for example, with long sequence transformations such as audio signals.

At Exness we use such extensions for preprocessing data for machine learning models to improve calculation efficiency in production. For example, we use them to calculate a client lifecycle stage segmentation, splitting a time series into intervals between two deposits, provided that the period between them exceeds 10 days.

What are Rust and Cython?

Within this article, we will consider Rust and Cython, compiled languages capable of creating extensions for Python code. While Cython was created specifically for this, Rust is a standalone, relatively new programming language that is swiftly gaining popularity among programmers. It is used for web development, machine learning (for instance, there is a BERT implementation for Rust), and other applications.

Let’s solve a time series processing problem using pure Python, Cython, and Rust, and compare their effectiveness under different circumstances.

Getting started with Rust and Cython

To begin with, let’s write a simple Python extension, which will return the first element of a passed array.

Rust

To install Rust just follow the official instructions.

To get started let’s create a project folder, named “rust_processing”. The folder looks like this:

rust_processing
- Cargo.toml
- src
- - lib.rs

Cargo.toml describes the extension we’re creating:

[package]
name = "rust_processing"
version = "0.1.0"
authors = ["MyName"]
edition = "2018"
[lib]
name = "rust_processing"
crate-type = ["cdylib"]
[dependencies]
rand = "0.8.4"
[dependencies.cpython]
version = "0.5"
features = ["extension-module"]

Make sure that both “name” values are the same and coincide with the extension’s name. This name will be used later to import the extension into Python.

Besides Cargo.toml we also need a folder named “src” containing the lib.rs script (.rs is the native Rust source file extension). It will hold the code of our extension:

// To merge Rust and Python we need to convert values between the two languages
extern crate cpython;

use cpython::{PyResult, Python, py_module_initializer, py_fn};

py_module_initializer!(rust_processing, |py, m| {
    m.add(py, "__doc__", "This module is implemented in Rust")?;
    m.add(py, "return_first", py_fn!(py, return_first(array: Vec<i32>)))?;
    Ok(())
});

fn return_first(_py: Python, array: Vec<i32>) -> PyResult<i32> {
    Ok(_return_first(&array))
}

fn _return_first(array: &Vec<i32>) -> i32 {
    array[0]
}

It’s important to note that this function may accept either a Python list or a NumPy array; the latter will be faster.

You can see in the code above that Rust is statically typed and requires explicit handling of references. It might look complicated. However, it is getting gradually easier for someone who has never worked with compiled programming languages to start working with Rust: it has an actively growing community, detailed documentation, and descriptive compilation error messages.

To compile the code, open a terminal window in the project directory and run the following command:

cargo rustc --release -- -C link-arg=-undefined -C link-arg=dynamic_lookup

This one will work for macOS. For Linux or Windows it’s even simpler:

cargo rustc --release

Once the compilation is complete, you will find a “target” folder in the project directory. Inside it, in the “release” folder, there will be a librust_processing.dylib file (on Linux it will have the .so extension instead of .dylib; on Windows, .dll). Next you have to rename the file, changing the extension to .so and the name to the one written in Cargo.toml.

These changes must also be done through the terminal:

cp target/release/librust_processing.dylib ./rust_processing.so

Finally, if we place rust_processing.so into the folder with the target Python script and just import it from within, then the return_first function will be available from Python.
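In practice the import can be guarded so that a script still runs while the extension is being built. The fallback pattern below is a suggestion, not part of the original build steps; it assumes rust_processing.so sits next to the script:

```python
try:
    # Compiled extension placed next to this script as rust_processing.so
    from rust_processing import return_first
except ImportError:
    # Pure-Python fallback with the same signature, used until the build exists
    def return_first(array):
        return array[0]

print(return_first([42, 7, 13]))  # → 42
```

The calling code stays identical either way; only the speed changes once the compiled module is present.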

Cython

To use Cython simply install it via pip:

pip install cython

Then create two files: setup.py and cython_processing.pyx. The setup.py file describes the compilation procedure:

from setuptools import setup
from Cython.Build import cythonize
import numpy

setup(
    ext_modules=cythonize("cython_processing.pyx"),
    include_dirs=[numpy.get_include()],
)

And cython_processing.pyx describes the module functionality:

import numpy as np
cimport numpy as np
cimport cython

cpdef return_first(np.ndarray array):
    return array[0]

Looks much more like Python, right?

It’s worth mentioning that Cython doesn’t strictly require declaring all data types, including return types, but doing so can significantly improve computation performance. We’ll show this below.

To compile it just run the following command in the terminal:

python setup.py build_ext --inplace

Now you have a file with the .so extension, which can be imported into the Python code.

Benchmarks

Description

To compare the performance of our extensions let’s solve several problems with arrays for different datasets:

  • Using pure Python,
  • With Rust optimization,
  • With Cython optimization.

And the first one is…

First Element

Let’s start with a very simple task and return the first element of the passed array. Here are the implementations in Python, Cython, and Rust.

Pure Python:

def return_first(numbers):
    return numbers[0]

Cython:

import numpy as np
cimport numpy as np
cimport cython

cpdef double return_first(np.ndarray[np.float64_t, ndim=1] numbers):
    return numbers[0]

Rust:

extern crate cpython;

use cpython::{PyResult, Python, py_module_initializer, py_fn};

py_module_initializer!(rust_processing, |py, m| {
    m.add(py, "__doc__", "The module for Rust processing")?;
    m.add(py, "return_first", py_fn!(py, return_first(numbers: Vec<f32>)))?;
    Ok(())
});

fn return_first(_py: Python, numbers: Vec<f32>) -> PyResult<f32> {
    Ok(_return_first(&numbers))
}

fn _return_first(numbers: &Vec<f32>) -> f32 {
    numbers[0]
}

Now let’s test the performance of these three versions on two randomly generated arrays: 1,000 elements (small) and 10,000 elements (large).
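A measurement harness along these lines can be used (a sketch: only the pure-Python version is shown here, but the compiled Rust and Cython modules are imported and timed the same way):

```python
import random
import timeit

def return_first(numbers):
    # Pure-Python version under test; swap in the compiled module's
    # return_first to benchmark the Rust or Cython build instead.
    return numbers[0]

small = [random.random() for _ in range(1_000)]
large = [random.random() for _ in range(10_000)]

for name, data in [("small", small), ("large", large)]:
    total = timeit.timeit(lambda: return_first(data), number=100_000)
    print(f"{name}: {total / 100_000 * 1e9:.0f} ns per call")
```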

Performance test v.1

Surprisingly, Python is the fastest. How is it possible?

The thing here is that we run everything from Python, so extra time is spent converting data types from one language to the other. The Rust implementation took the most time. Why is that? Rust has specific memory management mechanisms designed for stable and reliable behavior in high-load applications, which is why it may be slower for some simple operations.

Let’s try something more complicated.

Time Series Cutting

At Exness we have all possible data on our clients’ trading and account activities. We use this data mostly for building machine learning models to predict lifetime value, lead scoring, churn, etc. Many of the algorithms require treating the data as time series. For instance, if we analyze deposit and trade data as multiple time series, we need the exact borders of single micro-cycles of a client’s lifetime within the company. Such borders could be determined by deposit activity, so each time series would start at one deposit and end at the next. We don’t want time series that are too short, so let’s require a minimum of 10 days between the two deposits.

In this demonstration, we will simplify the code for better readability. We will pass only a one-dimensional array of daily deposit sums (the deposits variable) and search for pairs of day indexes indicating where a time series starts and ends (the borders array).

Now let’s implement it:

Pure Python:

from typing import List

def get_borders(deposits: List[float],
                min_length: int = 10) -> List[list]:
    borders = []
    for i in range(len(deposits)):
        if deposits[i] > 0:
            if len(borders) == 0:
                borders.append([i])
            else:
                if i - borders[-1][0] >= min_length:
                    borders[-1].extend([i])
                    borders.append([i])
    return borders
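To make the expected output concrete, here is a self-contained run on a toy deposits array (the values and the default min_length of 10 are illustrative):

```python
def get_borders(deposits, min_length=10):
    # Collects [start, end] index pairs for intervals between deposits
    # that are at least min_length days apart.
    borders = []
    for i in range(len(deposits)):
        if deposits[i] > 0:
            if len(borders) == 0:
                borders.append([i])
            elif i - borders[-1][0] >= min_length:
                borders[-1].extend([i])
                borders.append([i])
    return borders

# Deposits on days 0, 5, and 12; day 5 is too close to day 0 to cut there.
deposits = [50.0, 0, 0, 0, 0, 20.0, 0, 0, 0, 0, 0, 0, 80.0]
print(get_borders(deposits))  # → [[0, 12], [12]]
```

The last pair stays open-ended until a later deposit closes it, which mirrors an ongoing client micro-cycle.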

Cython:

cpdef list get_borders(
    np.ndarray[np.float64_t, ndim=1] deposits,
    int min_length=10,
):
    cdef int i
    cdef list borders = []
    cdef int n_borders
    for i in range(len(deposits)):
        n_borders = len(borders)
        if deposits[i] > 0:
            if n_borders == 0:
                borders.append([i])
            else:
                if i - borders[n_borders - 1][0] >= min_length:
                    borders[n_borders - 1].extend([i])
                    borders.append([i])
    return borders

Rust:

fn get_borders(_py: Python, deposits: Vec<f32>, min_length: i32) -> PyResult<Vec<Vec<i32>>> {
    Ok(_get_borders(&deposits, min_length))
}

fn _get_borders(deposits: &Vec<f32>, min_length: i32) -> Vec<Vec<i32>> {
    let mut borders: Vec<Vec<i32>> = Vec::new();
    for i in 0..deposits.len() {
        let n_borders = borders.len();
        if deposits[i] > 0.0 {
            if n_borders == 0 {
                borders.push(vec![i as i32]);
            } else if i as i32 - borders[n_borders - 1][0] >= min_length {
                borders[n_borders - 1].extend(vec![i as i32]);
                borders.push(vec![i as i32]);
            }
        }
    }
    borders
}

Again, let’s test the performance of these three versions on the same two randomly generated arrays: 1,000 elements (small) and 10,000 elements (large).

Performance test v.2

Things have changed. Both Cython and Rust are now notably faster than pure Python. Not stunningly so, but still significantly. And Cython is slightly faster than Rust, which can still be explained by Rust’s more thorough “preparation” for serious computations.

But what if we complicate the problem even further?

Time Series Cutting (Double Loop)

Now we’re going to add another loop inside the first one. In most cases, you would do your best to avoid it, but there are still applications where it’s inevitable. So let’s try it.

To keep it simple, we’ll just add a fictional loop doing nothing useful.

Pure Python:

from typing import List

def get_borders_double_loop(
    deposits: List[float],
    min_length: int = 10,
) -> List[list]:
    borders = []
    for i in range(len(deposits)):
        # The fictional inner loop does nothing useful
        for j in range(len(deposits)):
            temp = 0
        if deposits[i] > 0:
            if len(borders) == 0:
                borders.append([i])
            else:
                if i - borders[-1][0] >= min_length:
                    borders[-1].extend([i])
                    borders.append([i])
    return borders

Cython:

cpdef list get_borders_double_loop(
    np.ndarray[np.float64_t, ndim=1] deposits,
    int min_length=10,
):
    cdef int i
    cdef int j
    cdef int temp
    cdef list borders = []
    cdef int n_borders
    for i in range(len(deposits)):
        # The fictional inner loop does nothing useful
        for j in range(len(deposits)):
            temp = 0
        n_borders = len(borders)
        if deposits[i] > 0:
            if n_borders == 0:
                borders.append([i])
            else:
                if i - borders[n_borders - 1][0] >= min_length:
                    borders[n_borders - 1].extend([i])
                    borders.append([i])
    return borders

Rust:

fn get_borders_double_loop(_py: Python, deposits: Vec<f32>, min_length: i32) -> PyResult<Vec<Vec<i32>>> {
    Ok(_get_borders_double_loop(&deposits, min_length))
}

fn _get_borders_double_loop(deposits: &Vec<f32>, min_length: i32) -> Vec<Vec<i32>> {
    let mut borders: Vec<Vec<i32>> = Vec::new();
    let mut temp: i32;
    for i in 0..deposits.len() {
        // The fictional inner loop does nothing useful
        for _j in 0..deposits.len() {
            temp = 0;
        }
        let n_borders = borders.len();
        if deposits[i] > 0.0 {
            if n_borders == 0 {
                borders.push(vec![i as i32]);
            } else if i as i32 - borders[n_borders - 1][0] >= min_length {
                borders[n_borders - 1].extend(vec![i as i32]);
                borders.push(vec![i as i32]);
            }
        }
    }
    borders
}

Here is the execution speed measurement:

Execution speed measurement

Interesting, isn’t it?

The Python implementation slowed down quite expectedly: a more than thousandfold performance drop on the large dataset. Cython also slowed down, though only about 300 times relative to its own single-loop result; compared to pure Python, its speed advantage exceeded 12 times. Not bad.

But look at Rust. It struggled to compete with Cython on the single loop, yet adding the second loop barely changed its timing at all. As a result, Rust shows roughly the same performance as Cython on the single loop but a 245-fold advantage on the double loop.

That’s because Rust enforces strict memory management that, on one hand, leads to longer initialization times, but on the other, means recursion and nested loops are processed much faster.

Cython Note

While pure Python doesn’t demand variable typing at all, Rust demands strict typing of every variable. Cython is somewhat flexible here: you can choose any point between fully dynamic and fully static typing, and that choice greatly influences performance.

Look at the example above. If we “forget” to declare the return type, performance drops by a factor of about 2–3 (120 μs instead of 80 μs for the smaller dataset and 1.5 ms instead of 0.6 ms for the bigger one).

If your Cython algorithm doesn’t improve performance as much as you expected, just try double-checking all the variable definitions. Maybe you missed something or assigned some variable to an inefficient type.

Conclusion

What can we conclude? Cython and Rust are capable of improving your algorithm performance quite dramatically. But as always there are pros and cons.

Python

+ Easy to use

+ Same code as the main program. No need to jump between languages

- Poor performance for high load computations

Cython

+ Syntax is similar to Python

+ Gradual performance increase with the code elaboration

+ Significant performance increase

- No compulsory typing: it’s easy to miss a variable type and lose performance dramatically

- Still, more complex syntax compared to Python

- Performs worse than Rust in nested loops and recursions

Rust

+ Significant performance increase even compared to Cython on nested loops and recursion

- Much more complicated syntax

What to choose? As always, it’s up to you. But first, make sure you have close to an optimal algorithm in Python. Maybe you don’t even need any boost. Then assess how complex the algorithm is. The closer it is to a simple loop, the more arguments you have for choosing Cython. If it contains nested loops, recursion, or other computations with parallelization potential, then Rust is probably the best choice.

What’s next?

Making loops for efficient processing is good. But are there fast ways to process whole datasets? We will try to figure it out next time by comparing different data processing frameworks.

Also, we will look into the efficiency of parallel processing for different languages.

Benchmark Stand

In this article, all code examples and execution speed benchmarks were run on a MacBook Pro with an Apple M1 Pro chip (16 GB RAM), using Python 3.9.7 for the ARM64 architecture.
