Pythran: Python at C++ speed!

Case Study: Simulated Annealing applied to the TSP

Olivier Borderies
14 min read · Jan 24, 2019

Python is a high level, versatile language that is almost as easy to read and write as pseudo code. On top of it, Numpy is a ubiquitous package for scientific computing that eases the use of N-dimensional array objects. Together they are ideal for implementing algorithms concisely, and Numpy array operations are fast because they are often compiled to C or Fortran.

Pythran helps convert Python+Numpy functions to native code with only a few annotations and a relatively simple installation process. More precisely its core developer, serge_sans_paille (whose commitment to the project and reactivity are quite impressive, by the way), describes it as “an ahead of time compiler for a subset of the Python language, with a focus on scientific computing. It takes a Python module annotated with a few interface description and turns it into a native Python module with the same interface, but (hopefully) faster. It is meant to efficiently compile scientific programs, and takes advantage of multi-cores and SIMD instruction units. Pythran supports Python 2.7 and also has a decent Python 3 support.” It works on Linux, macOS, and Windows.

In this article we try to show how simple it is to use in a relatively generic case: searching for solutions to the Traveling Salesman Problem using the Simulated Annealing class of algorithms.

All the code demonstrated in this article is available in this gitlab repo: oscar6echo/tsp-pythran.

Traveling Salesman Problem (TSP)

The well known Traveling Salesman Problem (TSP) poses this question: “Given a list of N cities and the distances between each pair of cities, what is the shortest possible route that visits each city and returns to the origin city?”. It is an NP-hard problem with a huge candidate solution space of (N-1)!/2 routes. Consequently, attempting to solve it is very computing intensive.

Let us consider the problem of 100 cities randomly located in the [0, 1]x[0, 1] plane, generated by the demo-tsp-pythran notebook.

100 random cities in the [0, 1]x[0, 1] plane

Simulated Annealing (SA)

Simulated Annealing (SA) is a heuristic for approximating the global optimum of a given function. It is often used when the search space is discrete, and works in a fixed amount of time. Conveniently it does not impose any prerequisite on the function to optimise.

# Objective
#   Find a state with low, hopefully minimum, energy
# Inputs
#   energy(): function state -> energy
#   mutate(): function state -> state - typically random
#   temperature T reduction profile from T_max down to T_min
# Start
s = init_state
for T from T_max to T_min:
    E_s = energy(s)
    m = mutate(s)  # random
    E_m = energy(m)
    dE = E_m - E_s
    if rand(0, 1) < exp(-dE/T):
        # always true for dE < 0
        # sometimes true for dE > 0, all the more so if T is high
        s = m
return s  # should be a low energy state
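The pseudocode above can be turned into a minimal runnable Python sketch. The toy energy and mutate functions below (minimising f(x) = (x - 3)² over the reals) are purely illustrative and not the article's TSP code:

```python
import math
import random

def simulated_annealing(init_state, energy, mutate,
                        t_max=10.0, t_min=1e-3, cooling=0.99):
    """Generic SA loop following the pseudocode above, tracking the best state."""
    s = init_state
    best, best_e = s, energy(s)
    t = t_max
    while t > t_min:
        m = mutate(s)
        d_e = energy(m) - energy(s)
        # always accept downhill moves; accept uphill moves with
        # probability exp(-dE/T), i.e. more often when T is high
        if d_e < 0 or random.random() < math.exp(-d_e / t):
            s = m
            if energy(s) < best_e:
                best, best_e = s, energy(s)
        t *= cooling  # geometric cooling, one common temperature profile
    return best

# toy usage: minimise f(x) = (x - 3)^2
random.seed(0)
res = simulated_annealing(
    0.0,
    energy=lambda x: (x - 3.0) ** 2,
    mutate=lambda x: x + random.uniform(-0.5, 0.5),
)
```

With this geometric profile the loop runs roughly 900 steps; at high T the walk explores freely, and as T drops the acceptance rule becomes essentially greedy descent.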

The general idea is that instead of moving only to lower energy states, the search is allowed to explore states whose energy is not necessarily lower than the current best, in the hope of reaching regions containing lower optima. It is a bit like a high energy particle that can move out of a local potential well, travel to lower potential regions, and ultimately settle there when its energy (i.e. temperature) subsides.

This is illustrated in the graph below.

from https://www.researchgate.net/publication/308786233_A_Review_of_Optimization_Techniques_in_Artificial_Networks

As well known as this algorithm is, it may not be as famous as it deserves: its extreme simplicity, many tuning parameters and predictable runtime make it applicable to a wide range of real world problems.

As you can see there is a lot to tailor in this description:

  • The energy(state) function
  • The mutate(state) function
  • The temperature reduction profile

The most important one is probably the mutate() function, which must be designed to generate an interesting candidate state from the current one. Obviously this function is intimately tied to the problem to solve. You can see it as a kind of genetic mutation, which is either kept or discarded.

In our TSP case, the mutations applied at each step are — for i, j random path positions:

  • Reverse path between cities in i-th and j-th position
  • Move city in i-th position to j-th position
  • Swap city in i-th position with city in j-th position

Which of these 3 mutations occurs at each step is also random.
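A minimal sketch of these three mutations, on a path stored as a Python list (the function name and the equal probabilities are illustrative, not the repo's exact code):

```python
import random

def mutate(path, rng=random):
    """Return a mutated copy of a city path using one of three random moves."""
    n = len(path)
    i, j = sorted(rng.sample(range(n), 2))  # two distinct random positions, i < j
    p = list(path)
    r = rng.random()
    if r < 1 / 3:
        p[i:j + 1] = reversed(p[i:j + 1])   # reverse the sub-path between i and j
    elif r < 2 / 3:
        p.insert(j, p.pop(i))               # move city at position i to position j
    else:
        p[i], p[j] = p[j], p[i]             # swap cities at positions i and j
    return p
```

Each move preserves the set of cities, so the result is always a valid tour.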

If evaluating the energy of a state is expensive, then it may be more efficient to keep track of the states already visited and avoid re-evaluating the same energy twice.

In the case of the TSP the unique signature of a state is the tuple of cities (mapped to integers from 0 to N) starting with 0 such that the second element is less than the last — to account for the fact that a city tour has no direction.

The TSP energy() function is extremely simple and fast to compute, so the overhead of keeping track of signatures is not worth it (as experience shows). But in other cases the trade off may be in favour of checking signatures before evaluating a state.
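As a sketch, such a signature could be computed as follows (my own minimal version; the repo's actual code may differ):

```python
def signature(path):
    """Canonical signature of a tour: start at city 0, then pick the
    direction in which the second element is smaller than the last."""
    k = path.index(0)
    rotated = path[k:] + path[:k]  # rotate so the tour starts at city 0
    if rotated[1] > rotated[-1]:
        # reverse the direction of travel, keeping city 0 first
        rotated = [rotated[0]] + rotated[:0:-1]
    return tuple(rotated)
```

Tours that visit the same cities in the same cyclic order, in either direction, map to the same tuple, which can then serve as a dict key in the store.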

For a more precise description see tsp_compute_single_threaded.py: the store_stats dict keeps track of how often signature collisions take place:

if check_signature:
    store_stats = {
        'nb_step': nb_step,
        'store_len': len(store),
        'ratio': len(store)/nb_step,
        'compute_energy': compute_energy,
    }
    return (best_energy, best_path, store_stats)
else:
    return (best_energy, best_path, {})

Pythran Annotations

When writing any algorithm it is a good idea to start in pure Python (and Numpy if it can be concisely expressed in matrix form), then benchmark and fine tune it until it works as expected. Indeed, “premature optimisation is the root of all evil (or at least most of it) in programming” (Donald Knuth).

Well, Pythran is totally compatible with this approach: the only tiny addition you have to make to the source code is an annotation (a regular Python comment line starting with “# pythran”) for each numerical function you want to accelerate. From this interface information Pythran turns the module into C++ source code, compiles it and creates a native Python module.

Here is an example of annotation:

# pythran export search_for_best(int, float list list, int, float, int, float, float)
def search_for_best(seed,
                    cities,
                    nb_step,
                    beta_mult=1.005,
                    accept_nb_step=100,
                    p1=0.2,
                    p2=0.6):
    """exported function"""
    # etc

More generally Pythran annotations are allowed anywhere in a module, but most likely at the top of the file or just above the functions to compile. They have the following syntax:

# pythran export function_name(argument_type*)

where function_name is the name of a function defined in the module, and argument_type* is a comma separated list of argument types, composed of any combination of basic types and constructed types.

Below is the list of basic types. Constructed types are either tuples, introduced by parentheses, like (int, (float, str)), or lists (resp. sets), introduced by the list (resp. set) keyword.

Here is the full grammar:

argument_type = basic_type
| (argument_type+) # this is a tuple
| argument_type list # this is a list
| argument_type set # this is a set
| argument_type []+ # this is a ndarray, C-style
| argument_type [::]+ # this is a strided ndarray
| argument_type [:,...,:]+ # this is a ndarray, Cython style
| argument_type [:,...,3]+ # this is a ndarray, some dimension fixed
| argument_type:argument_type dict # this is a dictionary
basic_type = bool | int | float | str | None
| uint8 | uint16 | uint32 | uint64 | uintp
| int8 | int16 | int32 | int64 | intp
| float32 | float64 | float128
| complex64 | complex128 | complex256

It is rich enough to accommodate the vast majority of numerical functions. A key point is that ndarrays are accepted, which enables regular Python + Numpy users to accelerate their code “as is”. They only have to isolate the compute intensive operations in some fixed type interfaced functions, which is natural to do anyway.
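For illustration, here are a few annotations built from this grammar (the function names are hypothetical, not from the repo):

```python
# pythran export pairwise_dist(float64[:, :])
# takes a 2-D float64 ndarray

# pythran export tour_length(int list, float list list)
# takes a list of ints and a list of lists of floats, as in search_for_best

# pythran export count_hits(str:int dict, str)
# takes a dict mapping str to int, plus a str
```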

Install Pythran

Pythran can be installed on the 3 main OS families: Linux, macOS, Windows. Below I assume miniconda3 is installed. Here are the steps that worked for me:

Linux

Here are the steps that worked on a cloud VM. Also check the script_setup_linux.sh.

  • Install compiler
# ubuntu
$ sudo apt-get update
$ sudo apt-get -y install g++
  • Install Pythran in a dedicated conda env:
$ conda create -n pythran python=3
$ source activate pythran
$ pip install numpy
$ pip install pythran

macOS

Here are the steps that worked on macOS (HighSierra and Mojave). Also check the script_setup_macos.sh.

  • Install compiler
$ brew install llvm
  • Install Pythran in a dedicated conda env:
$ conda create -n pythran python=3 -y
$ source activate pythran
$ conda install -c conda-forge pythran -y

The alternate way, pip install pythran, misses some low level dependencies like blas, so I would recommend the conda install for simplicity.

  • SDK Headers

On macOS Mojave you may have to install SDK Headers for macOS 10.14. To do that:

# sudo required
$ xcode-select --install

Then go to /Library/Developer/CommandLineTools/Packages/. There is the package macOS_SDK_headers_for_macOS_10.14.pkg. Install it (double click or Cmd+O).

  • Create a ~/.pythranrc file:
[compiler]
cflags=-std=c++11 -fno-math-errno -w
ldflags=-L/usr/local/opt/llvm/lib # from `brew info llvm`
CC=/usr/local/opt/llvm/bin/clang # brew installed clang path
CXX=/usr/local/opt/llvm/bin/clang++ # brew installed clang++ path

Windows7

Here are the steps that worked on a Windows7 desktop.

  • Install compiler

Install Visual Studio Community 2017 workload “Desktop Development with C++"

  • Install Pythran in a dedicated conda env:
$ conda create -n pythran python=3
$ source activate pythran
$ pip install numpy
$ pip install pythran
  • Make sure ~/.pythranrc does not exist or is empty or is as follows:
[compiler]
defines=
undefs=
include_dirs=
libs=
library_dirs=
cflags=/std:c++14
ldflags=
blas=blas
CC=
CXX=
ignoreflags=

Compile and Run with Pythran

Compile

After the code is annotated as explained above I just compiled the 2 files, each containing one annotated function:

# regular compilation - under pythran env
# in my case it's safe to use -Ofast (affects precision)
$ pythran -Ofast -march=native tsp_compute_single_threaded.py

# compilation with omp - under pythran env
# the compilation flags activate OMP and vectorization using
# https://github.com/QuantStack/xsimd
$ pythran -DUSE_XSIMD -fopenmp -march=native tsp_compute_multi_threaded_omp.py

In each case, if Pythran is properly installed, it silently creates a native module named as follows:

  • on macOS: tsp_compute_[xxx].cpython-37m-darwin.so
  • on Linux: tsp_compute_[xxx].cpython-37m-x86_64-linux-gnu.so
  • on Windows: tsp_compute_[xxx].cp37-win_amd64.pyd

Run

The native module behaves like a regular Python module to the user who imports it — except it runs faster! If both the regular and native modules exist, the native module is imported.

Parallelism

Looking for speed does not stop at accelerating numerical operations. Going parallel where possible is also an important source of improvement. In our TSP example, SA is a probabilistic search, so it can be parallelised. Here we have tested 2 techniques:

OMP

OMP (Open Multi-Processing, usually written OpenMP) is an API that enables multi-threaded programming in C++ through relatively simple compiler directives, introduced with lines starting with “#pragma omp”.

Pythran allows a Python developer to write such directives in Python code as comments starting with “# omp” and passes them to the compiler.

OMP is powerful but as for most low level stuff, if anything goes wrong, it may be tricky to know why and correct it.

See the file tsp_compute_multi_threaded_omp.py. The two annotations enabling OMP are:

  • “parallel for”:
# omp parallel for
for x in range(n_run):
    beta = 1.0
    n_accept = 0
    # etc.
  • “critical”:
# omp critical

concurrent.futures

For coarser parallelism across all the CPUs of a machine at Python level, the concurrent.futures module is a very efficient alternative for this algorithm.

See the search_concurrent() function in tsp_wrapper.py.

It uses ProcessPoolExecutor to spawn sub processes and as_completed to collect the results, as demonstrated in this example in the official concurrent.futures documentation.
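The pattern can be sketched as follows. This is a generic skeleton with a dummy deterministic task; the real search_concurrent() submits the Pythran-compiled search function with different seeds, and the executor_cls parameter is my own addition:

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

def one_search(seed):
    """Stand-in for one SA run; the real code calls the compiled search."""
    # dummy deterministic "best energy" so the example is self-contained
    return seed, min((seed * k) % 97 for k in range(1, 50))

def search_concurrent(seeds, max_workers=4, executor_cls=ProcessPoolExecutor):
    """Run one_search for each seed in parallel and collect the results."""
    results = {}
    with executor_cls(max_workers=max_workers) as pool:
        futures = {pool.submit(one_search, s): s for s in seeds}
        for fut in as_completed(futures):  # yields futures as they finish
            seed, best = fut.result()
            results[seed] = best
    return results
```

Making the executor class a parameter also allows swapping in ThreadPoolExecutor for tasks that release the GIL.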

Demo package and notebook

This demo-tsp-pythran notebook contains the user interface for the tsp-pythran demo package. It allows you to:

  • Compile the tsp_compute_(single|multi)_threaded.py functions into native modules — IMPORTANT: Restart kernel after compiling to load native module
  • Remove the native modules to go back to Python modules
  • Generate a random set of cities and set search params
  • Run both search implementations: “concurrent” and “OMP”, with and without signature check (see below)
  • Visualise and save the results

Here are the 3 best routes found for several distribution of 100 cities — each search used 32 concurrent runs with 1e6 steps without signature check:

3 good solutions (random seed 54321)
3 good solutions (random seed 12345)
3 good solutions (random seed 33333)
3 good solutions (random seed 11111)

Performance

I ran the TSP search:

  • in 2 versions: “concurrent” and “OMP”, with/without “signature check”
  • on 3 types of machines: laptop (macOS), desktop (iMac), Linux (VM)
laptop: macOS MacBook Pro 2017 2.3 GHz Intel Core i5
desktop: macOS iMac 2014 4.0 GHz Intel Core i7
VM GCP: Linux n1-highcpu-64 — Ubuntu:18.04

The signature check did not prove efficient for the TSP problem, as anticipated. So I focused on the simpler version without signature check.

The table below shows the results for a TSP of 100 cities, each task with 1e6 steps.

It has been created by running the demo-tsp-pythran-perf-measures.ipynb notebook on the 3 machines, and aggregating results with the demo-collect-perfs.ipynb notebook.

Note: The numbers were produced for only one run in each case so expect some variability if you re-run the notebooks.

The number of cores taken into account is:

  • for macOS: the physical (not logical) number of cores
  • for the Linux VM: the number of CPUs (64) divided by the number of threads per core

With this definition (i.e. half of the “marketing” count for these 3 machines), several concurrent tasks run in about the same time as one task, provided the number of concurrent tasks does not exceed the number of cores — which validates this “definition” of the number of cores.

The Pythran speed up is ~x16 on macOS and ~x32 on the Linux VM.

Adding concurrency to Pythran, the increase in the number of steps per unit of time is ~x32 on the laptop, ~x70 on the desktop, and a whopping ~x800 on the Linux VM!

It must be emphasised that this speed up is obtained with:

  • minuscule extra work for Pythran!
  • a reasonable effort for concurrent futures
  • a decent price (e.g. ~$2.3/hour on GCP — $0.8 if preemptible)

The OMP version's performance is just below that of the concurrent version on macOS. But the extra work to use OMP is quasi zero with Pythran (2 annotations), while using concurrent futures is more substantial — though not daunting.

What puzzles me is that the promise of OMP crashes catastrophically on the Linux VM. For some reason (unknown to me), it does not handle OMP as well as macOS… This is a pity because if it did, the ratio speed up / effort would beat any other solution.

Now, the remarks above consider raw speed only. In the case of the SA algorithm, search breadth (number of threads) and depth (number of steps in one single threaded search) are not equivalent. It turns out that depth is much more critical to finding good solutions. So practically the benefit of the Linux VM is that it searches more broadly and consequently is more likely to find good quality solutions in a slightly shorter time.

However, for other problems where the search can be split more efficiently, the speed up offered by a combination of Pythran acceleration and concurrency on many-core single machines can yield great immediate benefits to the practitioner.

Packaging/Distributing

When it comes to packaging/distributing, Pythran’s intelligent design proves particularly convenient. Depending on your objectives, you can choose to distribute:

  1. the Python source code (.py)
  2. the C++ generated code (.cpp)
  3. the native module (.so/pyd)

First, Pythran provides a distutils extension that enables standard use in setup.py. Cf. the distutils integration section in the official documentation. For a working example, see the “Distributing Python source code (.py)” section below.
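Based on that documentation, a minimal setup.py could look like the sketch below (the package metadata is illustrative; check the distutils section of the Pythran docs for the current API):

```python
from distutils.core import setup
from pythran.dist import PythranExtension, PythranBuildExt

setup(
    name="tsp-pythran",  # illustrative metadata
    version="0.1.0",
    ext_modules=[
        # compiles tsp_compute_single_threaded.py into a native module
        PythranExtension(
            "tsp_compute_single_threaded",
            sources=["tsp_compute_single_threaded.py"],
        ),
    ],
    cmdclass={"build_ext": PythranBuildExt},  # drives the Pythran compilation
)
```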

If you distribute the Python source code, your users must have Pythran and a compiler on their computers, which is not always the case (e.g. Visual Studio is long to download and requires admin rights to install). If you are only concerned with privacy and do not want to share the source code, you can distribute the C++ generated code, simply obtained with the `-E` flag as follows. It is heavily templated, cryptic code which is difficult to reverse engineer.

# generate the .cpp file
$ pythran -E tsp_compute_single_threaded.py

Finally, if you just want your users to use the native module, you can distribute the .so/pyd files instead of the corresponding .py files. But then the native module must be compiled on the same platform as the users’ — often the case in medium sized organisations.

Here is how to package in these 3 cases:

First generate the .cpp and .so/pyd files from the .py files.

Distributing Python source code (.py)

Use this setup.py and this MANIFEST.in file to:

  • Exclude the .cpp and .so/pyd files from the bundle (MANIFEST.in: see recursive-exclude)
  • Launch Pythran compilation from .py upon installation (setup.py: see ext_modules)

Distributing C++ generated code (.cpp)

Use this setup_2.py and this MANIFEST_2.in file to:

  • Exclude the .py and .so/pyd files from the bundle (MANIFEST.in: see recursive-exclude)
  • Launch Pythran compilation from .cpp upon installation (setup.py: see ext_modules)

Distributing native module (.so/pyd)

Use this setup_3.py and this MANIFEST_3 file to:

  • Exclude the .py and .cpp files from the bundle (MANIFEST.in: see recursive-exclude)

Additionally you can embed compiling instructions in the source files, and also expose them as a module function: see compiler.py. This may have the benefit of raising the awareness of the end user — if it is relevant.

It is also important to mention that Pythran generated C++ code can be deployed “as is”. There are always these conflicting requirements: fast prototyping in a high-level, human-friendly language (e.g. Python) and deploying in a fast, low-level, machine-friendly language (e.g. C++) are difficult to reconcile. See the article Pythran as a bridge between fast prototyping and code deployment for a more thorough development.

Well Pythran helps bridge this eternal chasm!

Note: Distributing native code compiled without an active effort to obfuscate it does not guarantee the confidentiality of the source code. To achieve that goal you may explore obfuscator-llvm or epona.

Remote Jupyter

To run a Jupyter notebook server on a cloud ☁️ VM and thereby benefit from more computing power (in general: CPU, GPU, #cores, memory; in our TSP case, CPU speed and #cores), see the step-by-step setup guides, from zero to playing the demo-tsp-pythran notebook on a cloud VM (GCP, Azure).

Conclusion

In this article we have:

  • installed Pythran on Linux (Ubuntu), macOS, Windows
  • run a Jupyter notebook off a cloud VM (GCP, Azure)
  • briefly presented the Simulated Annealing Algorithm (SA)
  • applied the latter to the Traveling Salesman Problem (TSP) as a sample problem
  • implemented the SA in Python — with signature checking (if evaluating energy state is expensive)
  • accelerated it by up to 3 orders of magnitude with:
  1. Pythran to convert Python to C++, with annotations, incl. OMP directives
  2. concurrent futures to run parallel tasks (on local cores) from Python
  • and reviewed the various ways to distribute Pythran modules

Overall the convenience, performance and flexibility of Pythran are simply eye-popping! 👀 Undoubtedly it is a very valuable productivity booster.

Kudos to the developers !! 👏
