Creating C Extensions for Python with Numpy and CUDA [Part 2]

Part 2/2: If you didn’t read Part 1, which covered the basic setup of a Python C module, click here. A full Github repository containing all this code and setup instructions is here.

We’re going to dive right away into how to parse Numpy arrays in C and use CUDA to speed up our computations. Every Numpy object is internally a PyArrayObject, with a data pointer, a shape tuple (think arr.shape in Numpy), and a stride tuple that defines how to access the nth entry along each axis. While we don’t need to get into the weeds, the stride is a tuple which specifies how many bytes you need to step along each axis to get to the next element, say from arr[0] to arr[1] or arr[:, 0] to arr[:, 1]. For a one-dimensional array, the stride will just be a one entry tuple whose size is the size of the datatype (8 in the case of doubles, 4 for 32bit floats, 4 for integers, etc). For 2-dimensional arrays, this can be more complicated, depending on whether the array is stored in row-major (the default) or column-major order. An excellent explanation can be found here!

Now what does that mean for us? Let’s try writing a function to parse two Numpy float arrays and add them together (same as the addition function in Numpy):

To actually compile this, we’re also going to have to make a few changes to our setup code. Numpy requires us to include the numpy/arrayobject.h header file, and we need to makes sure our compiler can find that path. We can manually specify include paths in, by adding an include_dirs kwarg to the ext_modules list. The include path should point to your Numpy installation, for example

Note that we find the Numpy include path using the np.get_include() function. The core code is not too complicated. We parse two PyObjects as PyArrayObjects with the “O!” flag, and check to make sure they are one-dimensional and have the right type. Then we malloc the output array, add the two arrays together, and repackage them. We cast them to (double *) because this has the correct stride of 8-bytes. We could also have cast them to (char *) and then used array1 -> strides[0] * i as our offset. The PyArray_SimpleNewFromData function takes (num_dimensions, shape tuple, data_type, data_ptr) and creates an appropriate PyObject. Now try rebuilding this and running it! Pretty cool. This can be extended to arbitrary dimensions and arbitrary datatypes as well. Now you can run this code from Python with:

A few words to the wise: when using the Numpy array classes, you have to call import_array() in your PyInit function. You also need to be careful about stride. It’s easy to get nonsense answers if your arrays are accessed the wrong way. Multi-dimensional arrays are complicated when stride is involved.

Now we’ve finally come to CUDA. CUDA can operate on the unpackaged Numpy arrays in the same way that we did with our for loop in the last example. The only difference is writing the vectorAdd kernel and linking the libraries. We are going to more or less copy the example from the official CUDA samples, adapting it to take the vectors as input. We’re also going to compile the CUDA function as a static library and link it into our module at compile time. Let’s start with the CUDA kernel:

I will exclude error checking for space considerations, but the entire codebase can be found on my Github. This is a simple program that takes two pre-allocated double arrays (extracted from our Numpy arrays), copies them to the GPU, and then adds them using a kernel and copies them back. Nothing too complicated. This is taken from the CUDA samples, so you can find more documentation there. We copy the provided host arrays into CUDA memory using cudaMemcpy after allocating the memory with cudaMalloc. We compute the correct number of threadsPerBlock and blocksPerGrid using the (numElements + threadsPerBlock — 1) / threadsPerBlock formula and launch the kernel.

To incorporate this into our module, we will take the previous Numpy example verbatim, except we’ll substitute our for-loop addition with a call to this function. We also have to declare the prototype somewhere. We replace the malloc call with a call to this function.

You probably get the idea by now. Now all that remains is to compile the script to a static library and link it. This will be heavily dependent on your exact installation, but the following Makefile ought to work with minor alterations:

Note that the main script calls a python script scripts/ This replaces the LDSHARED value in the Makefile under the python tag using the script

Make sure this is in the appropriate location. For some unfortunate reason, gcc fails to link the library with nvcc, so we must use g++ or another C++ compiler to perform linking. What we’re doing is compiling the .cu code as a static library and then linking it with the module we’re writing. In your script, you’ll also have to specify all the libraries you want to link. You can do that with:

Basically, we need to include the CUDA headers and link dynamically with the library. Here we track down all the dependencies we need and configure the script. Now you should be able to run

and then follow the above example with vector_add_cuda to add two Numpy arrays together. This too can be arbitrarily extended to multi-dimensional Numpy arrays and other data types.

I hope this was helpful. There’s so much more you can do, but struggling through the installation and setup procedure is the big bottleneck. Now you can look into Numpy internals, stride, multi-dimensional arrays, more complicated CUDA kernels, persistent CUDA objects, and more! Good luck!



Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Jacob Austin

Student at Columbia University. Interested in artificial intelligence, software engineering, and music.