Writing extensible and hardware-agnostic GPU libraries

Published in techburst · Nov 6, 2017

When I started programming in Julia around 4 years ago, GPU support and the ability to easily extend libraries were big factors in my decision to choose Julia. Ironically, Julia didn't even have great GPU support at the time, but the fact that it is built on LLVM, which many GPU projects and vendors use as their go-to compiler framework, and that it maps straightforwardly from high-level code to native code, gave me hope that it would eventually offer excellent GPU support!

Now we’re finally at a point, where the kind of GPU support I imagined is coming along.

Before I go back to the topic of this article, let me reiterate why I believe that Julia has great potential for scripting the GPU:

  1. Predictable performance, since subsets of your program can be written to be completely statically inferable
  2. Great metaprogramming and code generation facilities to implement even the most tedious optimizations elegantly
  3. A simple C interface without overhead
  4. Interactive high-level programming rivaling languages like Python
  5. A type system, inheritance, functional constructs and duck typing make existing libraries easy to extend and hack; together with Julia's GitHub-based package manager, this creates a lively open source community

I believe that these are also the cornerstones for creating an active ecosystem around Julia's GPU support.

So, let us see how one can write generic libraries for GPUs using Julia — or in other words, what can be achieved when GPU compilation is a natural part of a high level language!

I’m the developer of GPUArrays, CLArrays and Transpiler so I will focus on these in this article.

The core library for writing hardware-independent GPU code is GPUArrays. It defines the basics needed to launch kernels and offers some generic array algorithms and types. It is completely abstract, not relying on any particular compiler framework, and doesn't need a GPU to be present, which is great news when you want to include GPUArrays as a dependency in your project.

To actually run a computation on the GPU, one needs a package like CuArrays or CLArrays, which inherit from GPUArrays. These libraries interface with the drivers and hook up the compiler backends (namely CUDAnative and Transpiler); as the names suggest, they target CUDA and OpenCL respectively.

The most fundamental function exposed by GPUArrays is gpu_call. It can be called as gpu_call(kernel, A::GPUArray, args) and will call kernel on the GPU with the arguments (state, args...). state is a backend-specific object used to implement functionality like getting the thread index. A GPUArray must be passed as the second argument to dispatch to the correct backend and to supply defaults for the launch parameters.

Let's use this to implement a simple map kernel:
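
A minimal sketch of such a kernel, assuming the gpu_call/linear_index interface described above (the exact signatures in GPUArrays may differ slightly):

    using GPUArrays

    # Every GPU thread processes one element. `state` is the backend-specific
    # launch state; `linear_index` turns it into a 1-based global index.
    function map_kernel(state, f, out, A)
        i = linear_index(state)
        if i <= length(A)
            @inbounds out[i] = f(A[i])
        end
        return
    end

    # `out` (a GPUArray) picks the backend and supplies default launch parameters.
    function Base.map!(f, out::GPUArray, A::GPUArray)
        gpu_call(map_kernel, out, (f, out, A))
        out
    end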

Now one must use CL-/CuArrays to get a concrete implementation of this function and run it on NVIDIA, Intel or AMD hardware. linear_index is a high-level function wrapping the well known thread index intrinsics (CUDA's threadIdx, OpenCL's get_local_id) in a hardware-independent way. To see the (still incomplete) range of hardware-agnostic functions specific to the GPU execution model, have a look at the GPUArrays documentation.

Now, with the above map! implementation, we can start implementing more high-level functions:
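
For example, a replace_with function that swaps every occurrence of one value for another can be written entirely in terms of map!; a sketch:

    # Replace every occurrence of `from` with `to`, in place.
    # Deliberately typed against AbstractArray; see the note below.
    function replace_with(A::AbstractArray, from, to)
        map!(x -> x == from ? to : x, A, A)
    end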

Note that this time I typed the arguments as AbstractArray and not GPUArray. This makes the code work for any AbstractArray that implements map!, which includes the normal Julia Array:

    replace_with([1, 2, 3, 4, 5], 2, 42) # normal julia arrays work!

This decouples our library function from the need for a GPU, which is great for debugging, CI, computers without a GPU, or simply people who want to try out the library after failing to correctly set up their GPU drivers (<rant> yes, it's 2017 and it's still not trivial to set up CUDA/OpenCL on all platforms </rant>).

Another thing to note is that this works for any immutable datatype. To demonstrate this, let's create a fixed-size string type for the GPU:
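
A sketch of such a type (FixedString is an illustrative name, not a type shipped by GPUArrays); the key point is that it is an immutable bitstype, so it can be stored inline in a GPU buffer:

    # A fixed-size string: an immutable wrapper around an NTuple of Chars.
    struct FixedString{N}
        data::NTuple{N, Char}
    end

    # Pad (or truncate) to exactly N characters so all values have the same size.
    function FixedString{N}(str::AbstractString) where N
        chars = collect(str)
        FixedString{N}(ntuple(i -> i <= length(chars) ? chars[i] : ' ', N))
    end

Because the type is an immutable bitstype, the default == falls back to ===, which compares the raw bits, exactly what we want here.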

Now we can use this custom type for our little toy demo:
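
Something along these lines, assuming a working OpenCL setup via CLArrays:

    using CLArrays

    A = CLArray([FixedString{8}("hey"), FixedString{8}("julia"), FixedString{8}("hey")])

    # replace_with dispatches to the GPU map! defined earlier, so every
    # occurrence of "hey" gets rewritten on the device:
    replace_with(A, FixedString{8}("hey"), FixedString{8}("ho"))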

So these two functions already cover a great amount of functionality by being truly generic. In this manner, GPUArrays implements lots of generic functions, like broadcast with syntactic loop fusion, map, mapreduce, constructors etc., and all of them work for custom types and functions. It also wraps BLAS in an abstract way with the help of Julia's AbstractArray interface, which already routes a lot of high-level routines to BLAS calls. So CLArrays only needs to overload these functions:
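
Roughly like this; the function names here illustrate the abstract interface and may not match the exact GPUArrays API:

    import GPUArrays, CLBLAS

    # Tell the abstract BLAS wrapper which BLAS library this backend uses...
    GPUArrays.blas_module(::CLArray) = CLBLAS

    # ...and how to view a CLArray as a raw buffer that CLBLAS can consume.
    GPUArrays.blasbuffer(A::CLArray) = buffer(A)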

This gets us a huge range of BLAS functionality for free: A * B maps directly to a call to CLBLAS.clblasDgemm when A and B are 2D matrices, or to CLBLAS.clblasCgemv when B is a 1D vector. CU-/CLFFT is wrapped in the same way, and we're working on wrapping more existing high-performance GPU libraries. Note that with the above-mentioned broadcast with syntactic loop fusion, one gets fusion of kernels for free, even for user-defined functions:
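
For example (u is an arbitrary user-defined function; the whole dotted expression compiles to a single kernel):

    u(x, y) = sqrt(x) + y * y   # a user-defined function

    A = CLArray(rand(Float32, 10^6))
    B = CLArray(rand(Float32, 10^6))

    # All dot calls fuse: one kernel evaluates u.(A, B) .+ 2f0 element-wise,
    # without allocating intermediate arrays.
    C = u.(A, B) .+ 2f0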

To get a better idea of what else is possible with Julia on the GPU, I recommend reading Mike Innes' great blog post Generic GPU Kernels. It shows a few more tricks and, for example, how generic kernels aren't only fun to write but also make it possible to do automatic differentiation for all kinds of functions without any performance penalty on the GPU! This is heavily used in Flux.jl, a machine learning library written entirely in Julia. Flux also uses GPUArrays to decouple its design from any particular GPU framework, while making it easy to optionally add one in to accelerate the algorithms.

Now you may think that with closures, passing functions around, and untyped generic code, this must surely come with severe performance drawbacks!

Luckily, this isn’t true. After all, Julia is the master of zero cost abstractions! I don’t want to go into details here, since this could be a complete blog post of its own. But the magic happens via aggressive specialization, inlining and a compiler, that, together with Julias type inference, can basically statically compile functions at runtime! Also the LLVM compilation of GPU code is highly competitive as can be seen in Tim Besards NVIDIA blog post introducing CUDAnative. Benchmarks taken from the blog post show this nicely:

Figure 1. Performance difference between CUDA C++ and CUDAnative.jl implementations of several benchmarks from the Rodinia benchmark suite. CUDA code has been compiled with CUDA 8.0.61, for an NVIDIA GeForce GTX 1080 running on Linux 4.9 with NVIDIA driver 375.66, comparing against CUDAnative.jl 0.4.1 running on Julia 0.6 with LLVM 3.9.1.

I started writing a more generic benchmark suite for all kinds of GPU libraries, which also shows neat performance for the OpenCL code generated by the Transpiler. Here is an excerpt from GPUBenchmarks:

(Embedded gists don't seem to have active links, so here is the original.)

As a last example, I ported a smoke simulation to GPUArrays. It is part of the GPUShowcase suite, which I recently started putting together. I haven't fully benchmarked and optimized the code yet, but it already got around 15x faster, making it usable for real-time simulations even with a large number of particles.

Rendered with GLVisualize, a GPU-accelerated visualization library I'm developing.

I’m quite happy with the code, considering that one can easily swap out the array and element types. This makes it easy to run this on the CPU or GPU by just changing the array type, and one can try out different precisions and see how it changes the outcome of the simulation.

All in all, I hope this blog post illustrates how nice it can be to write GPU code using Julia! I also hope it helps people write fast GPU code for AMD, NVIDIA and Intel hardware. Heck, probably even for FPGAs or other interesting hardware that OpenCL supports!

Let’s create the first open source ecosystem of generic high performance GPU libraries, which are as easy to install as Pkg.add("GPUArrays")!
