Python at the speed of light
Even if every new version dramatically improves performance, Python is still a high-level, dynamically typed, interpreted language; therefore, its speed differs from low-level languages such as C. Moreover, the GIL (Global Interpreter Lock) makes Python de facto single-threaded: you can have multiple threads, but only one of them executes Python bytecode at any given time.
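To see this in practice, here is a minimal sketch (the function name and counts are mine, and timings will vary by machine): a CPU-bound task split across two threads takes roughly as long as running it serially, because the GIL lets only one thread execute bytecode at a time.
import threading
import time

def count_down(n: int) -> None:
    # Pure CPU-bound work: no I/O, so threads cannot overlap under the GIL.
    while n > 0:
        n -= 1

# Serial baseline: 50 million decrements in a single call.
start = time.perf_counter()
count_down(50_000_000)
print(f"serial: {time.perf_counter() - start:.2f} s")

# The same work split across two threads finishes in about the same time.
start = time.perf_counter()
threads = [threading.Thread(target=count_down, args=(25_000_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"two threads: {time.perf_counter() - start:.2f} s")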
Luckily, the Python community is active, and numerous specialized packages, most of which are implemented in C, offer relevant speed gains and work around the GIL.
Let's start with a trivial example: filling a list with n integers.
def populate_python(size: int) -> list:
    b = []
    for i in range(size):
        b.append(i)
    return b
This simple function receives an integer and returns a list filled with integers. I have used type hints, which are extremely useful for checking the code and making it more readable, but they do not constrain you to the requested type: Python will accept any value for the argument and will only raise an error at execution time if you pass an incompatible one.
The first problem is that a Python list can contain values of any type: integers, floats, strings, or even other lists. This is really convenient, but it prevents optimization, so list operations in Python are relatively slow. The second issue is the for loop, which in Python tends to be slow (even if recent versions have made giant steps forward) and cannot be parallelized. Running the function with size = 10,000,000 took, on average, 765 ms on my machine. Not bad after all. Can we do better?
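For reference, this is roughly how such a measurement can be reproduced with the standard library (a sketch; the exact figures will differ on your machine):
import timeit

# Average the call over a few runs; the lambda lets timeit repeat it.
elapsed = timeit.timeit(lambda: populate_python(10_000_000), number=5) / 5
print(f"average: {elapsed * 1000:.0f} ms")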
Using numpy
A numpy array is different from a Python list: every element must be of the same type. This makes memory management more straightforward and computation faster. Moreover, numpy does most of its work in C internally. In our case, though, there is no gain; on the contrary, the speed decreases.
import numpy as np

def populate_numpy(size: int) -> np.ndarray:
    b = np.empty(size, dtype=np.int64)
    for i in range(size):
        b[i] = i
    return b
Running the function with the same number of elements took 964 ms. As a matter of fact, the function does not take advantage of numpy vectorization: it keeps the Python-level loop and only adds the overhead of indexing into the array.
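For comparison, the idiomatic numpy approach replaces the Python loop with a single vectorized call; a minimal sketch (the function name is mine):
def populate_numpy_vectorized(size: int) -> np.ndarray:
    # np.arange allocates and fills the array entirely in C, with no Python loop.
    return np.arange(size, dtype=np.int64)
This version pushes all the work into numpy's compiled code and is typically far faster than the explicit loop.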
Using numba
Luckily, we can rely on numba, a just-in-time compiler, to speed things up.
from numba import njit, prange

@njit(parallel=True)
def populate_numba(size: int) -> np.ndarray:
    b = np.empty(size, dtype=np.int64)
    for i in prange(size):  # prange distributes the loop iterations across threads
        b[i] = i
    return b
As we can see, the function is almost the same. I just added the decorator (with parallel=True) and replaced range with prange, numba's range that runs the loop in parallel. This time the computing time was just 16 ms: nearly 50 times faster than bare Python. That is an impressive gain.
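One caveat when timing numba: the first call triggers the JIT compilation, so warm the function up before measuring.
populate_numba(10)  # the first call compiles the function; exclude it from timing
result = populate_numba(10_000_000)  # subsequent calls run the compiled code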
Julia
Julia is another language that is gaining traction. It aims to offer almost the same flexibility and clear syntax as Python, but with the speed of compiled code.
function populate_array(size::Int)::AbstractVector{Int64}
    b = Vector{Int64}(undef, size)
    Threads.@threads for i in 1:size
        b[i] = i
    end
    return b
end
Julia has no GIL, so the threads really run in parallel, provided you start Julia with more than one thread (for example, julia --threads=auto). It took just 12 ms.
Mojo
Mojo is a new language under active development. According to the website: "Mojo combines the usability of Python with the performance of C, unlocking unparalleled programmability of AI hardware and extensibility of AI models."
At present, Mojo is not yet publicly available, but you can request access to the playground, a JupyterLab-like environment where you can try the new language. Even though you can run pure Python in Mojo, to take full advantage of its superior speed you must use a different syntax, closer to C than to Python.
from Pointer import DTypePointer
from Random import rand, random_ui64
from DType import DType
from Range import range
from Functional import parallelize
from Memory import memset_zero
import SIMD

# A minimal vector of unsigned 64-bit integers backed by raw memory.
struct Vect:
    var data: DTypePointer[DType.uint64]
    var rows: Int

    fn __init__(inout self, rows: Int):
        self.data = DTypePointer[DType.uint64].alloc(rows)
        self.rows = rows

    fn __del__(owned self):
        self.data.free()

    @always_inline
    fn len(self) -> UInt64:
        return self.rows

    fn zero(inout self):
        memset_zero(self.data, self.rows)

    @always_inline
    fn __getitem__(self, x: Int) -> UInt64:
        return self.data.load(x)

    @always_inline
    fn __setitem__(self, x: Int, val: UInt64):
        self.data.store(x, val)

fn populate_mojo(b: Vect):
    @parameter
    fn process_row(i: Int):
        b[i] = i
    # parallelize calls process_row for every index, spreading the work across cores.
    parallelize[process_row](b.rows)
The above function took just 7 ms on the playground, about 110 times faster than Python and faster than Julia as well. I could not test it on my local machine, where it would presumably have been even faster, and I am not sure I am using Mojo at its best. This is Python at the speed of light!