Parallelizing Custom CuPy Kernels with Dask

Peter Andreas Entschev
Published in RAPIDS AI
Aug 15, 2019

Summary

Some time ago, Matthew Rocklin wrote a post on Numba Stencils with Dask, demonstrating how to use them for both CPUs and GPUs. This post will present a similar approach to writing custom code, this time with user-defined custom kernels in CuPy. The motivation for this post comes from a recent question about how one can use Dask to distribute such custom kernels, a question that will surely arise again in the future.

Sample Problem

Let's get started by defining the problem we will look at in this post. The problem is merely for demonstration purposes: it is not very complicated and could be solved in many other ways, including directly (and quite easily) with CuPy's built-in functions.

Assume you have two arrays: a 2D array x and a 1D array y, with dimensions MxN and N, respectively. We will add those arrays together; in other words, we will add y to each of the M rows of x. As mentioned before, the kernel we will look at is quite simple, but the data organization is deliberately more complicated than this example requires, allowing us to demonstrate how to use map_blocks in the more complex situations that are likely to arise in practice.

This example using CuPy alone would look something like the following:
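A minimal sketch of that, assuming the 4096x1024 and 1024 array sizes used later in this post and float32 data (the type expected by the raw kernel sketched below), could look like:

    import cupy as cp

    M, N = 4096, 1024

    # float32 because the raw kernel sketched later operates on float data.
    x = cp.random.random((M, N)).astype(cp.float32)
    y = cp.random.random(N).astype(cp.float32)

    # Broadcasting adds y to each of the M rows of x.
    res_cupy = x + y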

Writing a custom kernel with CuPy

For those familiar with CUDA kernels, this section will be easy to follow. If you're not familiar with them, feel free to skip this section; the upcoming section entitled "Using Dask's map_blocks" may nevertheless serve as a useful reference for parallelizing custom code with Dask.

To write a user-defined kernel, we will use the cupy.RawKernel function; CuPy also contains specialized functions for elementwise kernels and reduction kernels that we will not cover in this post. RawKernel is a very simple function, taking only three arguments: code, name and options. As the names imply, the first two arguments define the kernel implementation and its name, while the last one passes optional arguments to NVRTC (NVIDIA's runtime compiler) and will not be discussed further in this post.
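As a sketch, assuming a hypothetical kernel named add_broadcast that adds y to each row of x and writes into an output array z, the definition could look like this (the two stride arguments are explained below):

    add_broadcast_kernel = cp.RawKernel(r'''
    extern "C" __global__
    void add_broadcast(const float* x, const float* y, float* z,
                       const int xdim0, const int zdim0)
    {
        // One thread per output element: i indexes columns, j indexes rows.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;

        // xdim0 and zdim0 are the leading dimensions (row strides in
        // elements) of x and z, which may differ when x is a view.
        z[j * zdim0 + i] = x[j * xdim0 + i] + y[i];
    }
    ''', 'add_broadcast')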

We will also need a dispatching function for that kernel. This function will define a block size, compute the grid size from the input sizes and that block size, call the kernel we created before, and finally return the result.
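A possible version of that dispatch function, under the hypothetical name dispatch_add_broadcast and with a fixed 32x32 block size (assuming all dimensions are multiples of it), is sketched below:

    def dispatch_add_broadcast(x, y):
        # Zero-sized inputs may be passed when Dask infers metadata;
        # return early so the kernel is never launched with an empty grid.
        if x.size == 0:
            return cp.empty(x.shape, x.dtype)

        # Fixed block size; the grid size is derived from the input shape.
        block_size = (32, 32)
        grid_size = (x.shape[1] // block_size[0], x.shape[0] // block_size[1])

        # Output array where the kernel stores its result.
        z = cp.empty(x.shape, x.dtype)

        # Leading dimensions in elements, computed from the strides because
        # x may be a non-contiguous view of a larger array.
        xdim0 = x.strides[0] // x.strides[1]
        zdim0 = z.strides[0] // z.strides[1]

        add_broadcast_kernel(grid_size, block_size,
                             (x, y, z, cp.int32(xdim0), cp.int32(zdim0)))
        return z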

We can now use that dispatch function to compute the sum of x and y. Note that we create a third array z in the function above; that is where the output of the function will be stored.
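With the names assumed above, that could be:

    res_kernel = dispatch_add_broadcast(x, y)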

Note in the code above that we calculate the length of the internal dimension from the strides of the array. That is important because each of the Dask array blocks is going to be a view of the original CuPy array, so for x the stride will not match the block's shape. For z we could have simply used its shape, since we create a new array in the function instead of getting a view of an existing one.

The function above is equivalent to the previous res_cupy = x + y.
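One way to check that equivalence, assuming the res_kernel name from above, is:

    # Expected to print True.
    print(cp.allclose(res_cupy, res_kernel))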

Using Dask’s map_blocks

The map_blocks function in Dask is a very powerful tool, allowing developers to write custom code and parallelize it over the blocks of a Dask array. Not only can custom code be written with map_blocks; existing functions from libraries such as NumPy can also be parallelized with its help, when there is no Dask implementation of such a function yet. In fact, various Dask functions do exactly that, using map_blocks to parallelize NumPy operations.

So how do we use map_blocks? First, let's create Dask arrays from the CuPy arrays x and y that we created previously, calling them dx and dy, respectively.
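A sketch of that step, passing asarray=False so the blocks stay as CuPy arrays and using the chunk sizes discussed just below, could be:

    import dask.array as da

    # 4 blocks along the first dimension of x, 2 along the second;
    # y is split into 2 matching blocks.
    dx = da.from_array(x, chunks=(1024, 512), asarray=False)
    dy = da.from_array(y, chunks=(512,), asarray=False)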

What is important to note here is the need to match array and block sizes properly. In this example x has size 4096x1024 and y has length 1024, so the length N must match. The same is true for block sizes: here we create 4 blocks along the first dimension of x and 2 along the second, thus we need to ensure that y is also split into 2 blocks.

The only thing left to do now is to create a map_blocks task for the operation we want to execute. For this example, we will pass our dispatch function, the two Dask arrays, and the output dtype.
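Under the assumptions above, that could look like:

    # Nothing is computed yet; dres is a lazy Dask array describing the work.
    dres = da.map_blocks(dispatch_add_broadcast, dx, dy, dtype=dx.dtype)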

If we compute the Dask array and print its output, we should see a result matching that of the CuPy array:
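A sketch of that check, with the hypothetical names used above:

    res_dask = dres.compute()
    print(res_dask)

    # Expected to print True.
    print(cp.allclose(res_cupy, res_dask))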

And now we’re done!

Conclusion

The topic discussed in this post may seem like a lot of work, and it actually is if we consider that we could do the same in about three lines of code. But now consider writing a small, custom, highly optimized CUDA kernel: the alternative would probably be to create a C++ file, write both the host code and the CUDA kernel, compile it, and expose the API through CPython (or something similar). The latter is much more work than the former.

This post explained how CuPy and Dask allow you to extend their capabilities. This lets users write self-contained and easily maintainable applications, even when lower-level code (such as CUDA) is needed to deliver the best performance.

Complete Code
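The following is a self-contained sketch combining the pieces above, using the same hypothetical names (add_broadcast, dispatch_add_broadcast, dres) and the illustrative sizes from the rest of the post:

    import cupy as cp
    import dask.array as da

    add_broadcast_kernel = cp.RawKernel(r'''
    extern "C" __global__
    void add_broadcast(const float* x, const float* y, float* z,
                       const int xdim0, const int zdim0)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        z[j * zdim0 + i] = x[j * xdim0 + i] + y[i];
    }
    ''', 'add_broadcast')


    def dispatch_add_broadcast(x, y):
        # Zero-sized inputs may be passed when Dask infers metadata.
        if x.size == 0:
            return cp.empty(x.shape, x.dtype)

        # Fixed block size; dimensions are assumed to be multiples of it.
        block_size = (32, 32)
        grid_size = (x.shape[1] // block_size[0], x.shape[0] // block_size[1])

        z = cp.empty(x.shape, x.dtype)

        # Leading dimensions in elements, from the strides (x may be a view).
        xdim0 = x.strides[0] // x.strides[1]
        zdim0 = z.strides[0] // z.strides[1]

        add_broadcast_kernel(grid_size, block_size,
                             (x, y, z, cp.int32(xdim0), cp.int32(zdim0)))
        return z


    M, N = 4096, 1024
    x = cp.random.random((M, N)).astype(cp.float32)
    y = cp.random.random(N).astype(cp.float32)

    # Reference results: CuPy broadcasting and the raw kernel on full arrays.
    res_cupy = x + y
    res_kernel = dispatch_add_broadcast(x, y)

    # Dask arrays backed by the CuPy arrays, with matching block sizes.
    dx = da.from_array(x, chunks=(1024, 512), asarray=False)
    dy = da.from_array(y, chunks=(512,), asarray=False)

    dres = da.map_blocks(dispatch_add_broadcast, dx, dy, dtype=dx.dtype)
    res_dask = dres.compute()

    # Both should print True.
    print(cp.allclose(res_cupy, res_kernel))
    print(cp.allclose(res_cupy, res_dask))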
