Beating NumPy performance by extending Python with C

Yan Yanchii · Published in Analytics Vidhya · Oct 21, 2019 · 14 min read

Embedding C into Python for a performance speedup.


Have you ever wondered how NumPy performs its complex computations so fast? The secret is that it uses the lower-level programming language C: it passes data to C, performs the calculations there, and returns the results back to Python. What could be easier, you might ask? In fact, this is not as trivial a task as you might think.

In this article I would like to show how you can extend your Python code with C for better performance, and how you can actually make calculations faster than NumPy. This assumes that you are already familiar with some Python and C fundamentals.

Comparing performance of pure Python dot product to NumPy

For the purposes of this article I will compare the speed of performing a dot product on two arbitrary matrices.

The simplest and most straightforward way to compute the dot product of two matrices in Python, or any other programming language, is with three nested for loops. Consider the following example.

We will start off by importing all the necessary modules and creating helper functions for comparing performance in a python_dot_product.py file:

# python_dot_product.py
import time
import random as rn

import numpy as np


def timer(func):
    # measures how long the wrapped function takes, returns (duration, result)
    def wrapper(*args, **kwargs):
        before = time.time()
        result = func(*args, **kwargs)
        after = time.time()
        return after - before, result
    return wrapper


def generate(size, range_):
    # builds a pair of size x size matrices filled with random integers
    arr = [[[rn.randrange(*range_) for _ in range(size)] for _ in range(size)] for _ in range(2)]
    return arr

where timer is a simple decorator that measures the time taken when executing our functions, and generate provides us with a pair of 2D arrays of random numbers.

For convenience, I will be using square matrices of size MxM throughout this tutorial.

The next step is to create the Python and NumPy implementations of the dot product and apply our timer decorator to them.

@timer
def python_implementation(arr1, arr2):
    # result has len(arr1) rows and len(arr2[0]) columns
    result = [[0 for _ in range(len(arr2[0]))] for _ in range(len(arr1))]
    for i in range(len(arr1)):
        for j in range(len(arr2[0])):
            for k in range(len(arr2)):
                result[i][j] += arr1[i][k] * arr2[k][j]
    return result


@timer
def numpy_implementation(arr1, arr2):
    return np.array(arr1).dot(arr2)

And finally, compare the results. I will test our functions with a matrix size of 500x500 to show you a clear difference in performance.

if __name__ == '__main__':
    data = generate(size=500, range_=(1, 100))
    numpy_time_taken, numpy_result = numpy_implementation(*data)
    python_time_taken, python_result = python_implementation(*data)
    print(f"time taken with numpy: {numpy_time_taken} seconds")
    print(f"time taken with python: {python_time_taken} seconds")

Output:

time taken with numpy: 0.16844797134399414 seconds
time taken with python: 32.14348006248474 seconds

Wow, it turns out that NumPy is roughly 190 times faster than the naive Python implementation of the dot product.

So how can we speed up our code? Our implementation has a time complexity of O(n³), which is very inefficient: for n = 500 that is 125 million multiply and add steps executed in pure Python. Can we reduce it? We can tweak python_implementation to do somewhat less work, but no matter how much you minimize the operations done in Python, it will never come even close to NumPy's performance.

So how do we get closer and even beat NumPy? This is where we have to dig deeper and start extending our Python code with C. Python is interpreted and C is compiled, so the same logic written in C will run much faster.

Implementing C module for Python to interact with

In a c_extension.c file we will implement the actual logic of our dot product function in C, which will later be callable from Python.

First things first: imports.

// c_extension.c
#include "Python.h"
#include "stdlib.h"
#include "omp.h"

The #include "Python.h" line imports the Python development header, which will allow us to receive, process, and return Python data types and objects. If you have issues with importing the Python.h header, please visit here for a solution.

We will need it since everything in Python is an object, and C has no such concept.

stdlib.h gives us access to the memory allocation functions.

Don't worry about the omp.h header for now if you don't know what it is; we will get there later.

We will start off by writing a prototype of our dot_product function.

PyObject* dot_product(PyObject* self, PyObject* args);

According to the official tutorial on extending Python with C, this is how a C function that receives and returns Python objects should look.

So we simply create a C function that can receive and return Python objects. Now we can implement the actual logic.

PyObject* dot_product(PyObject* self, PyObject* args) {
    PyObject *mat1, *mat2;
    if(!PyArg_ParseTuple(args, "OO", &mat1, &mat2)) {
        return NULL;
    }
}

Now let's go line by line through what is actually going on here.

  • PyObject* self is a default parameter that has to appear in any C function that wants to interact with Python. You can think of it as the self parameter in Python object methods.
  • PyObject* args is where all the parameters arrive, packed as a tuple.
  • PyObject *mat1, *mat2 creates pointers of type PyObject where we are going to store our 2D arrays.
  • PyArg_ParseTuple is where all the magic happens. It accepts the tuple of arguments passed to the C function (args in our case) and parses them so you can assign them to variables and refer to them later. Its first parameter is the tuple to parse (args). The second is a format string telling the function what data types or objects to expect: "OO" means two required Python objects (in our case those objects will be two Python lists). Note that a "|" in the format string marks every following argument as optional, so the "O|O" you sometimes see would make the second object optional; since we always need both matrices, "OO" is the right pattern here. For more info see this. The remaining parameters are the variables the parsed arguments will be placed into. Returning NULL from the if statement raises an exception on the Python side when the arguments do not match the format string. A short, hedged sketch of the format string syntax follows this list.
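As a quick illustration of the format string syntax (this function is my own sketch, not part of our module, and its names are hypothetical):

// Hypothetical example: one required object, one required C int, one optional object.
static PyObject* example(PyObject* self, PyObject* args) {
    PyObject *required_obj;
    PyObject *optional_obj = NULL;  // default used when the optional argument is omitted
    int number;
    if (!PyArg_ParseTuple(args, "Oi|O", &required_obj, &number, &optional_obj)) {
        return NULL;  // the parser has already set a TypeError for us
    }
    Py_RETURN_NONE;
}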

Now that we have received the Python lists in C, we can perform any kind of operation on them. Getting their dimensions will be the first thing to do.

PyObject* dot_product(PyObject* self, PyObject* args) {
    //previous code omitted

    //getting dimensions of our lists
    int mat1_rows, mat1_columns, mat2_rows, mat2_columns;
    mat1_rows = PyObject_Length(mat1);
    mat1_columns = PyObject_Length(PyList_GetItem(mat1, 0));
    mat2_rows = PyObject_Length(mat2);
    mat2_columns = PyObject_Length(PyList_GetItem(mat2, 0));
}

Here we simply get the length of each list with the PyObject_Length function, which gives us the number of rows. The PyList_GetItem(object, index) function fetches an element of a list, just as list[index] would in Python. Taking the length of the first element then gives us the number of columns of the matrix.
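Note that this code assumes both arguments really are non-empty lists of lists. A defensive variant (my own addition, not in the original) could verify that before asking for dimensions:

// Hypothetical input validation, for illustration only.
if (!PyList_Check(mat1) || !PyList_Check(mat2) ||
    PyObject_Length(mat1) == 0 || PyObject_Length(mat2) == 0) {
    PyErr_SetString(PyExc_TypeError, "expected two non-empty lists");
    return NULL;
}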

Now that we have the dimensions and the arrays themselves, we can implement the dot product logic.

PyObject* dot_product(PyObject* self, PyObject* args) {
    //previous code omitted

    PyObject *pyResult = PyList_New(mat1_rows);
    PyObject *item, *mat1_current_row;
    int total;
    for(int i = 0; i < mat1_rows; i++) {
        item = PyList_New(mat2_columns);
        mat1_current_row = PyList_GetItem(mat1, i);
        for(int j = 0; j < mat2_columns; j++) {
            total = 0;
            for(int k = 0; k < mat2_rows; k++) {
                total += (int)PyLong_AsLong(PyList_GetItem(mat1_current_row, k)) *
                         (int)PyLong_AsLong(PyList_GetItem(PyList_GetItem(mat2, k), j));
            }
            PyList_SetItem(item, j, PyLong_FromLong(total));
        }
        PyList_SetItem(pyResult, i, item);
    }
}

This is the same logic we already implemented in Python. Let's clarify a few points here:

  • PyList_New speaks for itself: we are creating a Python list with a length of mat1_rows.
  • PyLong_AsLong converts a Python int object into a C long so that you can operate on it in C, as we do when adding it to the C total variable. It always returns a long, so we cast the result to int, since that is the type of total.
  • PyList_SetItem speaks for itself as well. It is the equivalent of Python's list[index] = value, written as the function call PyList_SetItem(listObject, index, value).
  • PyLong_FromLong is PyLong_AsLong in reverse: it converts a C long into a Python int object. A minimal round trip between the two is sketched right after this list.
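For reference, here is that round trip in isolation (my own sketch; note that the article's code skips reference counting for brevity, which the Py_DECREF below hints at):

// Hypothetical round trip between C and Python integers.
PyObject *py_num = PyLong_FromLong(42L);  // C long -> Python int
long c_num = PyLong_AsLong(py_num);       // Python int -> C long
Py_DECREF(py_num);                        // release the reference we own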

After these manipulations the dot product is done and we are ready to return the result to Python.

PyObject* dot_product(PyObject* self, PyObject* args) {
    //previous code omitted

    return Py_BuildValue("O", pyResult);
}

This is how you return a Python object from C to Python. Py_BuildValue, as you could already guess, builds a Python object from the value passed to it. The first parameter indicates what type of object we want to build. In our case this is still a list object, so we pass the "O" pattern, which stands for an arbitrary object (for a primitive data type like int, the pattern would be "i", for example).
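A few more Py_BuildValue patterns, as a hedged illustration (these lines are my own and not part of the module):

// Hypothetical examples of Py_BuildValue format strings.
PyObject *num = Py_BuildValue("i", 42);        // builds the Python int 42
PyObject *pair = Py_BuildValue("(ii)", 3, 4);  // builds the Python tuple (3, 4)
PyObject *same = Py_BuildValue("O", pyResult); // passes an existing object through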

Last thing to do is add the following code:

// c_extension.c
// previous code omitted

static PyMethodDef module_methods[] = {
    {"dot_product", (PyCFunction) dot_product, METH_VARARGS, "Calculates dot product of two matrices"},
    {NULL, NULL, 0, NULL}  // sentinel marking the end of the table
};

static struct PyModuleDef cModPyDem = {
    PyModuleDef_HEAD_INIT,
    "c_extension",
    "",
    -1,
    module_methods
};


PyMODINIT_FUNC PyInit_c_extension(void) {
    return PyModule_Create(&cModPyDem);
}

where module_methods is the method table in which we define all the methods callable from our extension (note the NULL entry that must terminate the table), and PyInit_c_extension is the initialization function called when the module is loaded. I will not go into the details, but you can check out this official guide on how to set up the method table and the initialization function. The only thing to remember here is that PyInit_ must be followed by your extension's name (c_extension in our case).

Building Python extension

Now that we are done with C, it is time to get back to Python. In order to use our c_extension.c file we have to build it as a Python module. Create a setup.py file with the following code:

# setup.py
from setuptools import setup, Extension
import os


os.environ["CC"] = "gcc-9"
ext = Extension(
    'c_extension',
    sources=['path/to/your/c_extension.c'])

setup(name='c_extension',
      version='1.0',
      description='This is a demo package',
      ext_modules=[ext])

Use os.environ["CC"] only if you want to explicitly tell Python which compiler to use. I use gcc-9 for my own convenience since my default compiler is clang.

Before building the extension I recommend creating a virtual environment in your project if you don't have one yet. If you do, activate it before running one of the following commands from the directory where your setup.py file is located:

python setup.py build_ext --inplace

or

pip install .

This should install the c_extension module into your virtualenv.

Testing C extension

Now let's get back to python_dot_product.py, import our extension, and create a function to test it.

# python_dot_product.py
import c_extension


@timer
def c_implementation(arr1, arr2):
    return c_extension.dot_product(arr1, arr2)

And run:

if __name__ == '__main__':
    data = generate(size=500, range_=(1, 100))
    numpy_time_taken, numpy_result = numpy_implementation(*data)
    python_time_taken, python_result = python_implementation(*data)
    c_time_taken, c_result = c_implementation(*data)
    print(f"time taken with numpy: {numpy_time_taken} seconds")
    print(f"time taken with python: {python_time_taken} seconds")
    print(f"time taken with c: {c_time_taken} seconds")

Output:

time taken with numpy: 0.1746230125427246 seconds
time taken with python: 33.78820300102234 seconds
time taken with c: 2.838326930999756 seconds

Wow! Just look at how significantly we have improved performance: from 32-33 seconds down to just 2.8! Now you can see how powerful this technique can be. That is nearly 12 times faster, just by using C.

But we can still do better. Our c_extension.c code is not optimized yet: we have not made any memory-related optimizations in the matrix multiplication itself.

Memory arrangement and access

In order to understand how to optimize our code, we have to understand a few things about memory arrangement and how the processor accesses it.

Heap vs Stack

When we talk about memory allocation and access, we have to know where data is stored and how fast the processor can reach it. The Stack and the Heap are the two places where a program's data lives at runtime. The key difference between them is that data on the Stack is accessed directly, while data on the Heap is accessed indirectly. But what does this mean?

When you create a variable of a predefined size, like an int, or allocate memory of a constant size, the compiler knows at compile time how much memory to reserve, so this data goes on the Stack, where it is accessed directly when needed. But when you use memory allocation functions like malloc, calloc, etc., where the size of the allocated memory is known only at runtime, the data goes on the Heap and is accessed via pointers (indirectly). Accessing data on the Heap is therefore slower than accessing it on the Stack.
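A minimal sketch of the two cases (my own illustration, not from the article's code):

#include <stdlib.h>

void example(void) {
    int on_stack[500];                        // size fixed at compile time -> lives on the Stack
    int *on_heap = malloc(500 * sizeof(int)); // size decided at runtime    -> lives on the Heap
    on_stack[0] = 1;  // accessed directly
    on_heap[0] = 1;   // accessed indirectly, through the pointer
    free(on_heap);    // Heap memory must be released manually
}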

Memory arrangement

Every N-dimensional array is laid out sequentially in memory, and when it comes to memory access, the computer likes related data close together. Reading sequentially is the faster way to make calculations, since there are no jumps between distant memory locations every time a specific value is needed. The two standard ways of representing 2-dimensional arrays sequentially are row-major and column-major order.

If we flatten our arrays from 2D to 1D, the memory jumps are eliminated and we also reduce cache misses.
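The index arithmetic is simple. A small sketch (my own, using the same formulas as the transform functions we will write below):

// For a matrix with `rows` rows and `columns` columns flattened to 1D:
int rows = 3, columns = 4;
int i = 1, j = 2;
int row_major_index = i * columns + j;    // element (1, 2) lands at index 6
int column_major_index = j * rows + i;    // element (1, 2) lands at index 7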

Now that we know the difference between accessing data on the Stack and the Heap, and how memory is arranged, we can start optimizing our code!

Optimizing C code

Since we now know the difference between accessing data on the Stack and the Heap, we can create arrays of predefined size in c_extension.c.

// c_extension.c
#define MAX_SIZE 500*500

typedef double BASE_TYPE;

BASE_TYPE row_major[MAX_SIZE];
BASE_TYPE column_major[MAX_SIZE];

Now we can fill our predefined arrays with actual values, but first one quick note.

Python API functions like PyList_SetItem and PyList_GetItem are in fact very slow. So before doing any actual calculations, it is better to move all the data from the Python objects into plain C arrays. This will speed up the process.

// c_extension.c
BASE_TYPE **init_result_array(int total_rows, int total_columns) {
    //creating 2D array for copying Python list object into
    BASE_TYPE **result_array = (BASE_TYPE **)calloc(total_rows, sizeof(BASE_TYPE *));
    for(int row = 0; row < total_rows; row++) {
        result_array[row] = (BASE_TYPE *)calloc(total_columns, sizeof(BASE_TYPE));
    }
    return result_array;
}

BASE_TYPE **convert(PyObject *ndimarray, int rows, int columns) {
    //Unwraps Python list into C pointer to 2D array
    BASE_TYPE **c_array = init_result_array(rows, columns);
    PyObject *current_row;
    for (int i = 0; i < rows; ++i) {
        current_row = PyList_GetItem(ndimarray, i);
        for (int j = 0; j < columns; ++j) {
            c_array[i][j] = (BASE_TYPE)PyLong_AsLong(PyList_GetItem(current_row, j));
        }
    }
    return c_array;
}

Also create the row-major and column-major transformation functions.

void transform_row_major(BASE_TYPE **ndimarray, int rows, int columns) {
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < columns; j++) {
            row_major[i * columns + j] = ndimarray[i][j];
        }
    }
}


void transform_column_major(BASE_TYPE **ndimarray, int rows, int columns) {
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < columns; j++) {
            column_major[j * rows + i] = ndimarray[i][j];
        }
    }
}

And finally, a function that converts the C array back into a Python 2D list object.

PyObject* build_python_array(BASE_TYPE **result_array, int rows, int columns) {
    // Building Python result object from C 2D array pointer
    PyObject *item;
    PyObject *pyResult = PyList_New(rows);
    for (int i = 0; i < rows; ++i) {
        item = PyList_New(columns);
        for (int j = 0; j < columns; ++j) {
            PyList_SetItem(item, j, PyLong_FromLong(result_array[i][j]));
        }
        PyList_SetItem(pyResult, i, item);
    }
    return pyResult;
}

Now let's implement the dot_product_optimized function.

PyObject* dot_product_optimized(PyObject* self, PyObject* args) {
    PyObject *mat1;
    PyObject *mat2;

    if (!PyArg_ParseTuple(args, "OO", &mat1, &mat2)) {
        return NULL;
    }
    int mat1_rows, mat1_columns, mat2_rows, mat2_columns;
    mat1_rows = PyObject_Length(mat1);
    mat1_columns = PyObject_Length(PyList_GetItem(mat1, 0));
    mat2_rows = PyObject_Length(mat2);
    mat2_columns = PyObject_Length(PyList_GetItem(mat2, 0));

    BASE_TYPE **mat1_c = convert(mat1, mat1_rows, mat1_columns);
    BASE_TYPE **mat2_c = convert(mat2, mat2_rows, mat2_columns);
    transform_row_major(mat1_c, mat1_rows, mat1_columns);
    transform_column_major(mat2_c, mat2_rows, mat2_columns);
    BASE_TYPE **result = init_result_array(mat1_rows, mat2_columns);
    int tot, iOff, jOff;
    for (int i = 0; i < mat1_rows; i++) {
        iOff = i * mat1_columns;
        for (int j = 0; j < mat2_columns; j++) {
            tot = 0;
            jOff = j * mat2_rows;
            for (int k = 0; k < mat2_rows; k++) {
                tot += row_major[iOff + k] * column_major[jOff + k];
            }
            result[i][j] = tot;
        }
    }
    // note: mat1_c, mat2_c and result are never freed here; a production
    // version should release them to avoid leaking memory on every call
    return Py_BuildValue("O", build_python_array(result, mat1_rows, mat2_columns));
}

And add it to the method table.

static PyMethodDef module_methods[] = {
    {"dot_product", (PyCFunction) dot_product, METH_VARARGS, "Calculates dot product of two matrices"},
    {"dot_product_optimized", (PyCFunction) dot_product_optimized, METH_VARARGS, "Optimized version of dot_product"},
    {NULL, NULL, 0, NULL}  // sentinel
};

Build the extension and wire up the new function in python_dot_product.py.

I will show only the new solution plus the NumPy implementation.

# python_dot_product.py
@timer
def c_optimized_implementation(arr1, arr2):
    return c_extension.dot_product_optimized(arr1, arr2)


if __name__ == '__main__':
    data = generate(size=500, range_=(1, 100))
    numpy_time_taken, numpy_result = numpy_implementation(*data)
    c_optimized_time_taken, c_optimized_result = c_optimized_implementation(*data)
    print(f"time taken with numpy: {numpy_time_taken} seconds")
    print(f"time taken with optimized c: {c_optimized_time_taken} seconds")

Output:

time taken with numpy: 0.18149495124816895 seconds
time taken with optimized c: 0.42600393295288086 seconds

See how memory optimizations alone significantly reduce the execution time: from 2.8 seconds to 0.43! That is almost 7 times faster than the straightforward C implementation, but we are still about 2.4 times slower than NumPy.

We have come a long way, but we are not done yet. There is one more technique we can apply to overtake NumPy.

Process parallelization

This is where we are going to beat NumPy. Parallel processing allows us to split our for loops into n parts, where every thread calculates its own part, and the OpenMP library does all the work for us. This is why we need the #include "omp.h" line. To make this happen we have to modify our code slightly.

PyObject* dot_product_optimized_parallel(PyObject* self, PyObject* args) {
    PyObject *mat1;
    PyObject *mat2;

    if (!PyArg_ParseTuple(args, "OO", &mat1, &mat2)) {
        return NULL;
    }
    int mat1_rows, mat1_columns, mat2_rows, mat2_columns;
    mat1_rows = PyObject_Length(mat1);
    mat1_columns = PyObject_Length(PyList_GetItem(mat1, 0));
    mat2_rows = PyObject_Length(mat2);
    mat2_columns = PyObject_Length(PyList_GetItem(mat2, 0));
    BASE_TYPE **mat1_c = convert(mat1, mat1_rows, mat1_columns);
    BASE_TYPE **mat2_c = convert(mat2, mat2_rows, mat2_columns);
    transform_row_major(mat1_c, mat1_rows, mat1_columns);
    transform_column_major(mat2_c, mat2_rows, mat2_columns);
    BASE_TYPE **result = init_result_array(mat1_rows, mat2_columns);
    #pragma omp parallel num_threads(6)
    {
        // declared inside the parallel block, so each thread
        // gets its own private copies of tot, iOff and jOff
        int tot, iOff, jOff;
        #pragma omp for
        for(int i = 0; i < mat1_rows; i++) {
            iOff = i * mat1_columns;
            for(int j = 0; j < mat2_columns; j++) {
                tot = 0;
                jOff = j * mat2_rows;
                for(int k = 0; k < mat2_rows; k++) {
                    tot += row_major[iOff + k] * column_major[jOff + k];
                }
                result[i][j] = tot;
            }
        }
    }
    return Py_BuildValue("O", build_python_array(result, mat1_rows, mat2_columns));
}

The same as before, except the loops are now wrapped in a #pragma omp parallel block, which tells the program that this section will be executed in parallel. With num_threads you define how many threads will be created; I recommend using the same number as the number of cores in your processor. #pragma omp for distributes the iterations of our outer for loop across the threads; without it you would have to split all the work between the threads manually. For example, if our i went from 0 to 36, it would be split this way:

  • Thread 1: iterations 0 to 6
  • Thread 2: iterations 6 to 12
  • Thread 3: iterations 12 to 18
  • Thread 4: iterations 18 to 24
  • Thread 5: iterations 24 to 30
  • Thread 6: iterations 30 to 36

All of this is handled by #pragma omp for.
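To see this scheduling in action, here is a tiny standalone program (my own sketch, assuming OpenMP is installed; the file name demo.c is arbitrary) that prints which thread handles which iteration:

// demo.c, a hypothetical standalone example; compile with: gcc -fopenmp demo.c
#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel num_threads(6)
    {
        #pragma omp for
        for (int i = 0; i < 36; i++) {
            printf("thread %d handles iteration %d\n", omp_get_thread_num(), i);
        }
    }
    return 0;
}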

In fact, there is much more that OpenMP can provide. You can check out the official documentation.

To make it work we have to modify our setup.py file to tell the compiler that we want to use the OpenMP library.

from setuptools import setup, Extension
import os


os.environ["CC"] = "gcc-9"
ext = Extension(
    'c_extension',
    sources=['path/to/your/c_extension.c'],
    extra_compile_args=['-fopenmp'],
    extra_link_args=['-lgomp'])

setup(name='c_extension',
      version='1.0',
      description='This is a demo package',
      ext_modules=[ext])

You need to add the extra_compile_args and extra_link_args parameters so the compiler knows that you want to use the OpenMP library and takes all the #pragma statements into account. This is where I recommend using the gcc compiler if your default compiler is clang (Mac users).

Now build the extension and implement a test function.

# python_dot_product.py
@timer
def c_optimized_parallel_implementation(arr1, arr2):
    return c_extension.dot_product_optimized_parallel(arr1, arr2)

And run:

if __name__ == '__main__':
    data = generate(size=500, range_=(1, 100))
    numpy_time_taken, numpy_result = numpy_implementation(*data)
    c_optimized_time_taken, c_optimized_result = c_optimized_implementation(*data)
    c_optimized_parallel_time_taken, _ = c_optimized_parallel_implementation(*data)
    print(f"time taken with numpy: {numpy_time_taken} seconds")
    print(f"time taken with optimized c: {c_optimized_time_taken} seconds")
    print(f"time taken with optimized parallelized c: {c_optimized_parallel_time_taken} seconds")

Output:

time taken with numpy: 0.1738600730895996 seconds
time taken with optimized c: 0.4289052486419678 seconds
time taken with optimized parallelized c: 0.06976795196533203 seconds

That is something, huh? We are now roughly 2.5 times faster than NumPy!

In fact, you can use this approach to extend your Python code with C for any task that involves heavy numerical computing. It gives you a real advantage in speed.

Visualizing performance

Here is a visualization of the performance of the different implementations. I will test them on matrices from size 100x100 up to 1400x1400 in steps of 100, using matplotlib:

# python_dot_product.py
import matplotlib.pyplot as plt


def writeStats(stats, f, data, size, implementation):
    time_taken, _ = f(*data)
    stats[implementation].append({'size': size, 'time_taken': time_taken})
    return stats


if __name__ == '__main__':
    implementations = [f for f in globals().keys() if callable(globals()[f]) and '_' in f and f != 'python_implementation']
    stats = {k: [] for k in implementations}

    for i in range(100, 1500, 100):
        data = generate(i, range_=(1, 100))
        for implementation in implementations:
            stats = writeStats(stats, globals()[implementation], data, i, implementation)

    for implementation, stats_ in stats.items():
        plt.plot([n['size'] for n in stats_], [n['time_taken'] for n in stats_], label=implementation)
    plt.legend()
    plt.xlabel('Matrix size')
    plt.ylabel('Time taken')
    plt.show()

Output:

[Plot: time taken in seconds against matrix size for each implementation]

Conclusion

Now you know how to extend your Python code with C. This can come in handy in many situations; the dot product example is just a demo of how powerful the technique can be.
