Calling WebAssembly from Python 25x Faster

Moving hot loops from Python to WASM won’t be feasible without this trick

muayyad alsadi
8 min read · Mar 5, 2023
Python WebAssembly Interoperability

Sometimes in a large Python project you come across a hot loop (a busy loop, an iteration over large data) in some algorithm, which is something you typically want to avoid writing in Python. In Python you usually focus on your high-level business logic, not low-level hot iterations. Python comes batteries-included: in many cases you don’t need to write low-level hot loops, and if it’s not built in, you can usually find an optimized Python package that does it (Pillow, NumPy, TensorFlow, etc.).

But what if you can’t find a package for your special need, or you are evaluating a new algorithm? In that case you will take a performance hit, because Python is an interpreted language without a JIT.

Examples might include:

  • implementing a custom image filter
  • implementing a custom hashing function like the djb hash used by CDB
  • implementing a custom compression algorithm like zstd
  • a fallback for any of the above when the native C library (compiled, ctypes, or CFFI) is not available
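As a concrete illustration of such a hot loop, here is a rough pure-Python sketch of the djb-style hash used by CDB (the function name and details here are my own; the point is the per-byte loop, which is exactly the kind of code Python executes slowly):

```python
def djb_hash(data: bytes) -> int:
    """CDB-style djb hash: h = ((h << 5) + h) ^ byte, starting at 5381."""
    h = 5381
    for b in data:
        # (h << 5) + h is h * 33; mask keeps it a 32-bit value
        h = ((h * 33) ^ b) & 0xFFFFFFFF
    return h

# Hashing a large buffer runs this loop once per byte: a classic Python hot loop.
print(djb_hash(b"hello"))
```

Hashing a few megabytes this way means millions of interpreted iterations, which is the cost we are trying to escape.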

A typical way of doing this is to write it in C, but that means the Python package would need to be recompiled for each platform, architecture, Python version, etc.

WebAssembly to the rescue

You can write the hot loop in Rust, C, or WAT, compile it with WASM as the compile target, then distribute the WASM file along with your pure Python module.

In this article we are going to evaluate this new technology using a toy problem that is representative.

Our Toy Problem: Histogram Matching

We want to take an input image and a reference image, and transform the input image to have a color signature similar to the reference image by matching histograms. Online in-browser histogram matching can be tried here.

Our toy problem: take a histogram from the reference image (in the middle) and apply its colors to the input image

This toy problem involves the following:

  • iterating over large multi-dimensional data (~10 MB)
  • multi-dimensional array flattening
  • flipping axis (from/to color channel first/last), matrix transpose
  • counting, accumulating (CDF), floating-point math
  • per item lookup mapping

The Algorithm

  • take the input image (to be transformed) in the form of W×H×C and a reference image (to be matched/imitated)
  • for each image, construct a CDF for each color channel: one CDF for the red channel, another for green, and a last one for blue. First we calculate the frequency of each value, then we add each value to the sum of the previous values to make it cumulative.
  • the CDF of each channel in each image maps a uint8 value (0–255) to a float in 0.0–1.0
  • we need to construct a mapping from 0–255 in the input image to 0–255 in the output image such that:

map[Vin] = least Vout having CDFin[Vin] ≤ CDFref[Vout]

  • for each pixel value Vin in each color channel of the input image, we change it to Vout = map[Vin]
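The steps above can be sketched in pure Python for a single color channel (the helper names here are mine, not from the repo; histo_match.c implements the same idea in C):

```python
def build_cdf(channel):
    """Frequency count of each 0-255 value, then a running (cumulative)
    sum, normalized to the 0.0-1.0 range."""
    freq = [0] * 256
    for v in channel:
        freq[v] += 1
    total = len(channel)
    cdf = [0.0] * 256
    running = 0
    for i in range(256):
        running += freq[i]   # add only the current bin to the running sum
        cdf[i] = running / total
    return cdf

def build_mapping(cdf_in, cdf_ref):
    """mapping[v_in] = least v_out having cdf_in[v_in] <= cdf_ref[v_out]."""
    mapping = [0] * 256
    v_out = 0  # both CDFs are monotonic, so v_out never moves backward
    for v_in in range(256):
        while v_out < 255 and cdf_ref[v_out] < cdf_in[v_in]:
            v_out += 1
        mapping[v_in] = v_out
    return mapping

# Tiny demo on one channel: every input pixel goes through the lookup table.
channel_in = [0, 0, 1, 1]
channel_ref = [1, 1, 1, 1]
mapping = build_mapping(build_cdf(channel_in), build_cdf(channel_ref))
out = [mapping[v] for v in channel_in]
```

The real problem repeats this per color channel and iterates over millions of pixels, which is where pure Python falls over.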

If we want to make it a camera filter, all of this needs to happen in less than 50 ms (~20 fps). The same mapping array could be reused from frame to frame, but that is not part of this article, as we want to test all the steps. This is doable, since the browser version does it in ~44 ms.

I’ve implemented it in C in a file called “histo_match.c”, then compiled it into a WASM file named “histo_match.wasm” (1.3 KB gzipped, 3.4 KB extracted) using LLVM. Details are in the C file itself. I’ve used many tricks to make this WASM file small and fast, but that’s a topic for another article.

Pure Python Attempt

The source code can be found in my GitHub repo; the file is named histo_py_slow.py. Transforming a single image took ~2 seconds, broken down like this:

  • calculating the CDFs of both the input image and the reference image took ~550 ms, which is too much; it has to iterate over each pixel in the 3 color channels, that is 1086×688×3 = 2,241,504 plus 1920×1080×3 = 6,220,800, about 8.5 million values
  • constructing the mapping array took no time (it only does 256 iterations)
  • iterating over input pixels and applying the mapping lookup to construct the output image took ~120 ms
  • at this point we have 3 output channels that need to be merged again to form the RGB image, so that for each x,y location in W×H we have adjacent RGB values (channel-last format); this took ~200 ms
  • total time from beginning to end was ~2,000 ms

We have done several tricks to make this as fast as possible, yet it’s very far from the real-time filter objective, as 2,000 ms is 0.5 FPS. Among the tricks:

  • construct fixed-size arrays with h=[0]*256 instead of growing the list by appending
  • to calculate the cumulative distribution, we add only the previous data point instead of re-summing all previous points
  • use efficient types like “bytearray” when possible
  • avoid creating intermediate objects or recreating lists

We also tried several alternative methods.

Enter Wasm3 for Python — interpreter inside interpreter is 10x better

Wasm3 is a very portable WebAssembly interpreter with no JIT, making it simple and portable; it can even run on IoT boards with a handful of KB of RAM. We installed it with:

pip install --user wasm3

Our Wasm3-based Python script, named “histo_wasm3.py”, was able to achieve 190 ms, that is ~5 FPS, which is 10x better than the pure Python implementation. Most of that time (~170 ms) was taken by our WASM function, and only a small fraction by creating the Pillow image or passing things around.

One might think that an “interpreter inside an interpreter” should be slower than the interpreter it runs on, but this is not the case, mainly because Wasm3 is written in C and wrapped in Python. It lives inside Python, not on top of Python.

Enter Wasmer for Python — JIT with shocking bottleneck

Wasmer is one of the most popular WASM/WASI runtimes, with a focus on server-side applications. We installed it with:

pip install --user wasmer wasmer-compiler-singlepass

Our Wasmer-based Python script, named “histo_wasmer_slow.py”, was able to execute the histogram matching function in a blazing fast 44 ms, on par with the web version and (as expected) beating the 170 ms of the interpreted Wasm3, but with a huge catch: passing the data took ~400 ms, and the entire process end-to-end took ~470 ms.

Remember, the only reason we are trying to escape pure Python is the hot loop that iterates over large data. If passing the large data takes too much time, then this solution is useless.

The bottleneck seems to be that their WASM memory implementation suffers on the slice assignment operation “__setitem__”: it’s either moving bytes one by one instead of as an entire slice/block, or the slice assignment involves do-nothing endianness reformatting.
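To see why this matters, here is a plain-Python comparison (on an ordinary bytearray, not Wasmer’s memory object) of one block-level slice assignment versus the per-byte loop that a slow __setitem__ effectively degenerates into:

```python
import time

size = 1_000_000
src = bytes(range(256)) * (size // 256)  # ~1 MB of data to move
dst = bytearray(size)

# Block copy: one slice assignment, handled by memmove-like C code.
t0 = time.perf_counter()
dst[0:len(src)] = src
block_time = time.perf_counter() - t0

# Per-byte copy: one interpreted __setitem__ call per byte.
t0 = time.perf_counter()
for i, b in enumerate(src):
    dst[i] = b
loop_time = time.perf_counter() - t0

print(f"block: {block_time * 1000:.2f} ms, per-byte: {loop_time * 1000:.2f} ms")
```

On a typical machine the block copy is orders of magnitude faster; our images are several million bytes, so the difference dominates the end-to-end time.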

I’ve opened a ticket for it, issue #700, if you want to keep an eye on it.

Enter WasmTime — Faster JIT with same bottleneck

WasmTime is another WASM/WASI runtime, from the Bytecode Alliance, with JIT support. I’ve installed its Python binding using:

pip install --user wasmtime

Our WasmTime Python script is named “histo_wasmtime_slow.py”. In my first attempt it was blazing fast, running the histogram matching function in less than 20 ms, but it fell behind in the throughput of passing data: end-to-end it took more than 1,000 ms. The reason is that their memory implementation does not implement slice assignment at all, which means you need to pass data one byte at a time, and in our example we are passing more than 6 million bytes. I was about to report this issue but found it was already reported as #81; I tried the proposed solutions and all of them were very slow. I also tried to use memoryview but could not, due to a strange endianness problem.

WasmTime eliminating the bottleneck

I was able to eliminate the bottleneck with a simple hack. As you can see in my faster attempt, named “histo_wasmtime_fast.py”, I’ve used “ctypes.memmove” to copy the entire data from/to WASM memory. It’s a thin Python wrapper over the memcpy/memmove standard C library functions.

# `store`, `memory` and `memory_ptr` come from the Wasmtime instance;
# memory_ptr is a ctypes pointer to the start of the WASM linear memory.

def set_slice(val, start=0, end=None):
    """Copy `val` (a writable buffer, e.g. bytearray) into WASM memory at `start`."""
    size = memory.data_len(store)
    val_size = len(val)
    if end is None:
        end = start + val_size
    if end - start > val_size or end > size:
        raise IndexError("out of memory size")
    src_ptr = (ctypes.c_ubyte * val_size).from_buffer(val)
    dst_ptr = ctypes.addressof(memory_ptr.contents) + start
    ctypes.memmove(dst_ptr, src_ptr, val_size)


def get_slice(start=0, end=None):
    """Copy `end - start` bytes out of WASM memory into a new bytearray."""
    size = memory.data_len(store)
    if end is None:
        end = size
    if end > size:
        raise IndexError("out of memory size")
    val_size = end - start
    val = bytearray(val_size)
    dst_ptr = (ctypes.c_ubyte * val_size).from_buffer(val)
    src_ptr = ctypes.addressof(memory_ptr.contents) + start
    ctypes.memmove(dst_ptr, src_ptr, val_size)
    return val

I’ve opened pull request #134 to make this part of the WasmTime Python binding; I hope it will be merged or will inspire a solution.

With this trick, the end-to-end process took ~45 ms (which is ~22 FPS), of which ~20 ms was the actual function, ~6 ms the slice assignment, and the rest creating the Pillow image.

UPDATE: fast access is now implemented in its own class

from wasmtime_fast_memory import FastMemory
fast_mem = FastMemory(memory, store)
# ... getting data in
fast_mem[0:] = a_ref
fast_mem[len(a_ref):] = a_in
# ... taking data out
dst = fast_mem[len(a_ref)+len(a_in): len(a_ref)+len(a_in)+len(a_in)]
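For reference, here is a minimal sketch of what such a FastMemory class might look like. This is my own illustration, not the actual wasmtime_fast_memory implementation; a plain ctypes buffer stands in for the Wasmtime linear memory:

```python
import ctypes

class FastMemory:
    """Slice access to a raw memory region using ctypes.memmove
    (one block copy) instead of per-byte reads/writes."""

    def __init__(self, base_address: int, size: int):
        self._base = base_address
        self._size = size

    def _bounds(self, key):
        # Resolve a slice against the region size; only contiguous slices.
        start, end, step = key.indices(self._size)
        if step != 1:
            raise ValueError("only contiguous slices are supported")
        return start, end

    def __setitem__(self, key, value):
        start, end = self._bounds(key)
        value = bytes(value)[: end - start]
        ctypes.memmove(self._base + start, value, len(value))

    def __getitem__(self, key):
        start, end = self._bounds(key)
        out = bytearray(end - start)
        dst = (ctypes.c_ubyte * len(out)).from_buffer(out)
        ctypes.memmove(dst, self._base + start, len(out))
        return bytes(out)

# Demo against a plain ctypes buffer standing in for WASM memory:
buf = ctypes.create_string_buffer(16)
mem = FastMemory(ctypes.addressof(buf), 16)
mem[0:5] = b"hello"
print(mem[0:5])  # b'hello'
```

The real class would take the wasmtime Memory and Store and derive the base address from memory.data_ptr(store), as in the set_slice/get_slice functions above.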

Zero-Copy NumPy Buffer Protocol (both Wasmer and Wasmtime)

Matt McCormick suggested using the NumPy buffer protocol to get direct access to WASM memory, which in Wasmer can be used like this:

np_mem = np.frombuffer(instance.exports.memory.buffer, dtype=np.uint8, offset=hb)
np_mem[0:len(a_ref)]=a_ref
np_mem[len(a_ref): len(a_ref)+len(a_in)]=a_in
# ...
a_dst = np_mem[len(a_ref)+len(a_in):len(a_ref)+len(a_in)+len(a_in)]

and in Wasmtime it goes like this

ptr_type = ctypes.c_ubyte * (memory.data_len(store))
memory_ptr = (ptr_type).from_address(ctypes.addressof(memory.data_ptr(store).contents))
np_mem = np.frombuffer(memory_ptr, dtype=np.uint8, offset=hb)
np_mem[0:len(a_ref)] = a_ref
np_mem[len(a_ref):len(a_ref)+len(a_in)] = a_in
# ...
dst = np_mem[len(a_ref)+len(a_in): len(a_ref)+len(a_in)+len(a_in)]

This avoids wasting time copying from/to WASM memory.
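The zero-copy behavior is easy to verify with a plain bytearray standing in for WASM memory: np.frombuffer shares the underlying storage, so writes on either side are visible on the other:

```python
import numpy as np

backing = bytearray(8)                      # stands in for the WASM linear memory
view = np.frombuffer(backing, dtype=np.uint8)  # shares storage, no copy

view[0:4] = [1, 2, 3, 4]    # write through the NumPy view...
print(bytes(backing[:4]))   # ...it shows up in the backing buffer

backing[4] = 255            # write to the buffer...
print(view[4])              # ...it shows up in the view: 255
```

With real WASM memory the same applies: the guest writes its output into linear memory, and the host reads it through the view without any intermediate copy.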

Goodbye WASM — Enter the superhero NumPy

Remember the “Speed up your Python using Rust” article (and its Reddit follow-ups), which tried to speed up pure Python code by writing a blazing fast compiled Rust module and binding it to Python? The Rust attempt was beaten time after time by the performance of the heavily tuned Python module NumPy.

So if you are struggling with performance, just “import antigravity” as in the xkcd.

import antigravity xkcd #353

The faster pure Python NumPy-based script, named “histo_py_fast.py”, took ~28 ms end-to-end (40 FPS), most of which was Pillow creating the image.

  • reshaping NumPy ndarrays takes no time
  • transposing axes, moving the color channel from first to last and vice versa, also takes no time (“a.T”)
  • the CDF took ~6 ms, done using “np.cumsum()” over Pillow’s histogram
  • the actual mapping was done using “np.take()”, which took ~6 ms
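Those pieces fit together roughly like this single-channel sketch (my own reconstruction; histo_py_fast.py may differ in details, for example it takes the histogram from Pillow rather than np.bincount):

```python
import numpy as np

def match_channel(chan_in: np.ndarray, chan_ref: np.ndarray) -> np.ndarray:
    """Histogram-match one uint8 channel of chan_in against chan_ref."""
    # CDFs via bincount + cumsum, normalized to 0.0-1.0
    cdf_in = np.cumsum(np.bincount(chan_in.ravel(), minlength=256)) / chan_in.size
    cdf_ref = np.cumsum(np.bincount(chan_ref.ravel(), minlength=256)) / chan_ref.size
    # mapping[v] = least value whose reference CDF >= input CDF at v;
    # clamp to 255 in case of floating-point edge cases
    mapping = np.minimum(np.searchsorted(cdf_ref, cdf_in), 255).astype(np.uint8)
    # per-pixel lookup in one vectorized call
    return np.take(mapping, chan_in)

chan_in = np.array([[0, 0], [1, 1]], dtype=np.uint8)
chan_ref = np.full((2, 2), 1, dtype=np.uint8)
print(match_channel(chan_in, chan_ref))
```

Every loop from the pure Python version collapses into a vectorized NumPy call, which is why the hot-loop penalty disappears.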

Summary

The NumPy-based script was twice as fast as the best WASM attempt and 70x faster than the first pure Python attempt.
  • If you can do it with NumPy, just do it and keep it simple.
  • If your use case can’t be done with NumPy, then make a C/Rust module with a WASM fallback to keep it portable.
  • Use WasmTime with the given trick to pass large data to the WASM module instead of passing it one byte at a time; this makes WASM 25x faster.
  • Copying data is unavoidable in WASM because of memory safety (its memory is not shared with the host), but it should not be that slow if you use the given trick.
