Rather than reinvent the wheel and rebuild a context with every kernel, use an application server to manage everything so we can focus purely on device code.

All Things GPU: Part 3

John Boero
HashiCorp Solutions Engineering Blog
14 min read · Nov 17, 2019


Intro to Mawenzy: Simplified GPU App Server

In Part 2 we showed the basics of how to enable and list our OpenCL platforms and their devices. We also showed external examples of how to set up a context and command queue on a device. Once you have OpenCL installed and the clinfo command shows a valid device of at least version 1.2, we can set up Mawenzy, which automates all of this for us. The name Mawenzy comes from “Mawenzi,” the lesser-known sister peak of Mount Kilimanjaro, which is an apt analogy for your CPU and your GPU. In this section we need to build some basics before we can jump into more fun use cases. Before we push the limits of our hardware and do something impressive, we need to start with some basic Hello Worlds and dig into bits with a hex calculator.

A while back I pitched Mawenzy to YCombinator, who put me through a few remote interviews to learn about this idea. I was asked a great question: “why hasn’t anyone else done this already?” I answered that today’s generation of coders cares more about Python and web applications than GPU tech, which is niche and very different from Python. I was flown from London to Mountain View for a 10-minute interview slot with a fresh interviewer who, without even viewing the demo, said it was a good idea but that I should come back when I could make it run Python.

A GPU will not run Python, which isn’t designed for parallelization.

A while back I saw a public GitHub repo where someone tried to start a Python-on-GPU project, but it was obviously futile as the language doesn’t suit parallel computation. That doesn’t mean the GPU can’t process web content, as we’ll see later.

Mawenzy is a service instead of a one-shot application, so it can easily be scheduled by Nomad, which supports native CUDA features as well as generic OpenCL features when exposing a GPU from the host. It can also be meshed via Consul if exposed via TCP. Reading and writing GPU buffers is as simple as copying a file, since Mawenzy presents a filesystem directly on the GPU, similar to tmpfs — or more accurately, ramfs. As these files/buffers are backed by VRAM rather than main memory, their backing store is faster than that of any existing filesystem, including ramfs and tmpfs, though in practice the buffers are bottlenecked by the CPU and main memory they pass through. FUSE kernel mode switches generally render FUSE filesystems impractical for block storage, and I fully agree with Linus Torvalds that FUSE is a toy, but toys can still be useful. In this case they can also be run without root and locked down to our user only.

Mawenzy is built for Linux with support for AMD, NVidia, and Intel GPUs that have an OpenCL installable client driver (ICD) supporting version 1.2+. Note that NVidia has little interest in further supporting OpenCL given their CUDA focus, so the latest version most CUDA devices support is likely 1.2. That said, NVidia’s OpenCL 1.2 support is pretty solid in my experience. Some mobile or embedded devices like the Raspberry Pi also support an embedded subset of OpenCL. Whatever device you use, we’re going to live on the GPU and do as much as we can with as little CPU intervention as possible.

COPR Repo

The installer is available via the COPR build system here: https://copr.fedorainfracloud.org/coprs/boeroboy/mawenzy/. It can be installed on EL 7/8 or any RPM-based distro:

$ dnf copr enable boeroboy/mawenzy
$ dnf install mawenzy

To keep it simple, let’s use only one device. Select the platform and device ID you’d like to use. Then we’ll configure those as environment variables and run the Mawenzy Server daemon, or “mawd.” The simplest configuration is platform 0 and device 0. In my case I’ve enabled only the NVidia platform, and device zero is my primary graphics card, an NVidia GTX 1060. Normally we’d run this as a Nomad service or a systemd service, but here we’ll run it manually with debug output. Remember Mawenzy is also a FUSE filesystem, so we need to specify where to mount it. I will mount my instance at /mnt/gpu, a directory I’ve made.

Running the Mawenzy daemon in debug mode from bash. FUSE params -s (single thread) and -f (foreground) simplify debugging as we learn. Socket at default /tmp/mawd.sock. Later we’ll discuss other useful options including wsize, rsize, and other environment variables for a customized experience.
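If the screenshot is hard to read, the launch amounts to something like the following. The FUSE flags and mount point come straight from the caption above, but the environment variable names here are purely hypothetical placeholders for whatever mawd actually reads:

$ export MAW_PLATFORM=0 MAW_DEVICE=0   # hypothetical variable names
$ mawd -s -f /mnt/gpu                  # single thread, foreground, mounted at /mnt/gpu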

Now the service has a context on our device. To stop the server we simply unmount the filesystem via “fusermount -u /mnt/gpu” at any time. We can focus on the bit we care about — writing and running GPU code. Mawenzy is split into two parts — a server (MawD) and a client (Mawsh). The good news is that the protocol is designed to run kernels as simply as you’d run a script on the host. We can start with two basic hello world examples in Mawsh. Remember basic OpenCL is based on C99 and looks like C with no “main” required. Here is a file called hello1.mawsh:

#!/usr/bin/mawsh
void kernel hello()
{
    printf("Hello, world!\n");
}

That’s all! By using a shebang on the first line (#!/usr/bin/mawsh) we make the file executable as a script. In this case, executing it will call the mawsh client installed at /usr/bin/mawsh with our file path as an argument, along with any extra args we pass. I also keep a dev copy of mawsh in $HOME/bin/, as some examples show. This makes it super simple to write GPU code in just a few lines. This simple illustration gives us the most basic demonstration of how code is run:

  1. Execution of ./hello1.mawsh
  2. Mawsh client runs with our file (hello1.mawsh) and args (none).
  3. Mawsh sends inode+mtime to MawD service socket.
  4. If MawD does not have this program already, Mawsh sends it to compile.
  5. MawD creates a command queue on our selected device.
  6. MawD sets up all program kernels and their args.
  7. Our command queue runs all of them top to bottom.
  8. Here the only kernel is “hello” with no dimensions (default 1).
  9. MawD returns any result to Mawsh client.

This may seem complicated but it simplifies things quite a bit. If we run a new program or modify our program, MawD automatically recompiles it or responds with any syntax error present. Further runs of this script will use the already compiled code, saving time. Some compilers also cache programs in your home directory, which can create double-caching issues where changes to your code don’t take effect. In this case (NVidia) you can disable the native caching with the environment variable CUDA_CACHE_DISABLE=1.
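For example, to run once with NVidia’s native cache disabled:

$ CUDA_CACHE_DISABLE=1 ./hello1.mawsh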

Not many OpenCL guides cover this basic hello world example, but I think it’s an important one. The reason nobody covers it is its horrible inefficiency. It’s a super simple first kernel to run, but printf is a terrible performance drag. As such you should never use printf except for debugging. Instead we need a better mechanism for variable-length output.

Strangely, Mawsh will not show any output from this run. Since we’re using OpenCL’s built-in printf command, the MawD process itself shows the “Hello, world!” output. As contexts can’t be shared across processes, all of our worker threads share the same stdout, which is unfortunate and highly inefficient due to its synchronization barrier.

Two runs of hello1.mawsh. On the first run we can see Mawenzy compile and run it on the GPU. The second attempt only runs it, as it’s already on the GPU. Note stdout appears in the daemon output, not our own terminal, because we used printf. Also notice kernel times are reported and vary between runs.

MawProc and <mawsh.cl> subsystems

What we need is our own buffer space. Instead of a global printf for the whole process, we need buffer space tied to our command queue and thread state. Ideally we don’t rely on printf, which ties up the context and carries a massive performance penalty. Instead we need an automatic buffer and environment that lives through the entire program. That’s what MawD gives us with mawproc. The struct “mawproc” is defined in our system header #include <mawsh.cl> and includes buffer space for us to write to and share between kernels if necessary. Any kernel in Mawsh may use this implicit parameter, and MawD will create, assign, read, and destroy it automatically. You can assign your own buffer to “p” if you want to manage and persist it yourself. OpenCL 2.0 introduced pipes for passing data between kernels, but Mawenzy supports OpenCL 1.2+, as is the case with the NVidia card I’m using in this demo. Later we’ll discuss other headers that Mawenzy includes in /usr/include/mawsh for other utilities, such as image management, vector math, and text analysis.

For now let’s try a second version of hello world where we actually make use of Mawenzy’s own mawproc subsystem instead. Here is hello2.mawsh:

#!/usr/bin/mawsh
#include <mawsh.cl>

void kernel hello(global mawproc* p)
{
    printmc(&p->out, "Hello, world!\n");
}

This time we use a Mawenzy print function, printmc (Mawenzy print constant). This sends our output to a mawproc buffer on the GPU managed by Mawenzy and shows it in our shell.

Now we have a multi-threaded service that can execute multiple kernels in the background even while this GPU is driving the display. Note both print functions are pretty big performance hits — especially in this case, tying up a compute unit with a single work item. These examples are simple starters that can be run identically on Intel, AMD, or NVidia GPUs or CPUs. In fact, here is the exact same code running on a Skylake NUC’s Gen9 Intel Graphics. Oddly it even has a faster top speed at 9,250ns vs 19,520ns; the Intel GPU has fewer cores but faster clock speeds. Don’t treat this simple kernel as a benchmark: large parallel jobs will vastly outperform on a powerful external GPU, especially one not also driving a display.

Same hello2.mawsh running on Intel Skylake Gen 9 embedded GPU (48 CU x 8 cores = 384 cores). Writing GPU code is now as simple as writing a script!

Note that if you touch or edit a mawsh file and rerun it, Mawenzy will automatically rebuild it or tell you if you have a syntax error. Recompiling does not apply to external files that you #include, so be sure to touch the script or update its mtime if you want to force a program to rebuild.
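For example, after editing an included header you can force the main script to rebuild:

$ touch hello2.mawsh   # bump mtime so MawD recompiles on the next run
$ ./hello2.mawsh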

Mawsh client will rebuild after edits and instantly pass on syntax errors if you’ve goofed.

Enough of hello world. Isn’t the GPU supposed to be good at massive parallel jobs? Rather than use a single work item, let’s now build a bigger wavefront and replicate one of my favourite examples by Erik Smistad: basic vector addition. Vector arithmetic is useful for risk or financial analysis and HFT, where comparing price or quantity changes over time is made super simple on GPUs. It’s good to check the original source code, which also includes all the host setup. We don’t need that host code since Mawenzy has our context already.

What about parameters and buffers? If we’re going to add 10,000 numbers to 10,000 other numbers and get 10,000 sums, we’ll need buffers to store them. Instead of the sequential numbers used in the example, we’ll generate random numbers from the host. First we’ll build on the hello2.mawsh code and add new kernels adapted from Erik Smistad’s example above.

Evil vs. good: a for loop in a task (a kernel with dimension 1) versus a parallel kernel Mawsh runs with dimension 1024. Even better, mawsh.cl has dimension shortcuts. The mawproc is unnecessary here as we’re not printing output.

Let’s start with a bad version. We’ll use a single work item and a for loop to add up 1024 integers. This is a perfectly valid kernel that performs OK on a CPU device but abysmally on a GPU device. Then we’ll optimize it for GPUs with kernel dimension 1024, adding everything in parallel in what’s called a wavefront. We can compare the speed of a single nasty for loop versus a wavefront that activates up to 1024 GPU cores at once.

Make your GPU cores dance in a “mawsh pit”

The key to spinning up thousands of parallel work items quicker than your CPU can spawn a few threads can be found below:

An OpenCL program in Mawsh that contains two versions of vector addition. The first kernel runs a for loop in one work item, also called a task. The second kernel does the same thing in a wavefront of dimension 1024, essentially broadcasting it in parallel to available compute units and their cores.
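Since the program above appears only as a screenshot, here is a rough reconstruction of what those two kernels look like. The parameter names are my assumption; the kernel names and dimensions come from the description below:

#!/usr/bin/mawsh
#include <mawsh.cl>

// Evil: a single work item (a task) loops over all 1024 elements.
void kernel vectoradd(global int* a, global int* b, global int* c)
{
    for (int i = 0; i < 1024; i++)
        c[i] = a[i] + b[i];
}

// Good: a wavefront of 1024 work items each adds one element in parallel.
void kernel vectoradd_1024(global int* a, global int* b, global int* c)
{
    int i = get_global_id(0);
    c[i] = a[i] + b[i];
}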

We’ve added kernels named “vectoradd” and “vectoradd_1024,” which brings us to work dimensions. The “_1024” means we want this kernel to run with an ND range of 1024. Incidentally, Mawsh kernels can’t include an underscore “_” in the name, as that’s reserved to indicate the start of work dimensions. We’ll broadcast this 1024 times in parallel as hardware allows. This Quadro RTX 4000 has 2304 cores divided across 37 compute units (~64 cores per CU), so the job should take 1024/64 = 16 compute units at once if they’re available. Mawenzy reads all the kernels out of our program and checks for dimensions in the name. It also assigns all parameters as read from the command line. If I need the parallel equivalent of a nested for loop that operates on a two- or three-dimensional buffer or image (say 1024x1024 pixels or elements), I can write it as “vectoradd_1024x1024” and so on. I can also parameterize dimensions by using a single-letter variable such as “vectoradd_$a” or “vectoradd_$ax$b”, which lets us specify work dimensions as a parameter without recompiling the program.

So how do we create the (1024 x 4-byte int) buffers for vectors a and b? We need to make use of Mawenzy’s filesystem for this. We can copy files to the GPU or just copy 4k from /dev/urandom to a new GPU buffer in VRAM simply using dd:

$ dd if=/dev/urandom of=/mnt/gpu/testa bs=4k count=1
$ dd if=/dev/urandom of=/mnt/gpu/testb bs=4k count=1

Now that we’ve created two 4k GPU buffer objects full of random integers from the CPU, we can run vectoradd.mawsh. But what about the destination buffer c? We can create that during the run itself. Using parameters with mawsh follows a pretty simple set of rules, with an example run after this list:

  1. You can define numbers or buffer parameters inline during runs.
  2. Parameter names trickle down consistently to all kernels in the sequence.
  3. Buffer references start with “@” at filesystem base: c=@testc
  4. If you want a buffer to be created if it doesn’t exist, you can specify its size in brackets: c=@[4096]testc.
  5. You can abbreviate “k” or “m” for KB/MB: c=@[4k]testc
  6. You can use directories in the filesystem but all paths are relative to the MawD filesystem root, not system root: c=@dir/testc, not c=@/mnt/gpu/dir/testc.
  7. You have the option to specify or override the path to the MawProc buffer “p” with your own buffer if you want to persist your stdout or use an extra-large process buffer. Otherwise MawD will create and destroy its own temporary buffer p.
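Putting rules 1 through 5 together, a run might look something like this (the parameter names a, b, and c match the sketch above):

$ ./vectoradd.mawsh a=@testa b=@testb c=@[4k]testc
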
First we create two 4k random buffers, then we run vectoradd.mawsh, specifying the buffers testa and testb and creating testc with the result. For feedback, we leave in a print task (a kernel with default dimension 1) to say hello and goodbye. Kernel runtimes are inconsistent, but the wavefront of 1024 is always much faster.

This is all well and good, but we’d need a hex editor or binary calculator to check our math on the host. GPUs are great at speaking binary and less so at speaking ASCII, Unicode, or variable-length human formats of any kind. Still, we’re going to need a better way to exchange meatspace data. If we leave our data on the GPU for later, we don’t need to worry about any PCI-E transfer slowing us down, but eventually we’ll need to get a more useful result out. In the next section we will explore a technique for exchange similar to JSON but using OpenCL. Call it OpenCL Object Notation, or CLON (technically C99ON, but CLON sounds better, right?).

We can view these raw buffers directly on the GPU with the hexedit command. They check out. Note all values are 4-byte little-endian signed integers. But enough hex!
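For instance, to inspect the result buffer straight out of VRAM:

$ hexedit /mnt/gpu/testc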

Now that we’ve covered basic parallel OpenCL with Mawsh, it’s worth noting that mawsh.cl includes a few other shortcuts. Let’s try one more example and replicate the SAXPY examples from section 2. SAXPY stands for “Single-Precision A·X Plus Y” and is the classic parallel computing use case. It’s very similar to the vector addition above except it uses floats instead of integers, with y = a*x + y as the formula. We’ll simplify the use of get_global_id with a #define included in mawsh.cl: #define G0 get_global_id(0), #define G1 get_global_id(1), etc. Instead of a long-winded get_global_id for our dimension, we can just use G0:
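The actual code appears in the screenshot below, but reconstructed roughly it might look like this (the kernel name and the constant qualifier on x are my assumptions based on the notes that follow):

#!/usr/bin/mawsh
#include <mawsh.cl>

// y = a*x + y, one element per work item; G0 expands to get_global_id(0).
void kernel saxpy_1024(float a, constant float* x, global float* y)
{
    y[G0] = a * x[G0] + y[G0];
}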

A basic SAXPY example with Mawsh’s G0 shortcut, shown in Visual Studio Code with the OpenCL extension modified to include the *.mawsh extension. As this is the primary GPU and it’s running with adaptive power management, it can take a few runs to hit max performance. There is no extra kernel for feedback.

The examples repo also includes a two-dimensional version of this using 1024x1024 buffers: 1024 x 1024 x 4-byte floats = 4MB buffers required. According to clinfo, the maximum work dimensions for the RTX 4000 are 1024x1024x64. Also note the constant qualifier for float *x. Constant, or read-only, address space is much faster to access than global, but most devices have a limit on the size of a constant buffer parameter. The limit of this device is sadly only 64KB, so any buffer bigger than that will need to be global in this case.
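As a rough sketch, the two-dimensional variant would flatten its two global IDs into one index, and since a 4MB buffer far exceeds the 64KB constant limit, x drops back to global (again, the names are assumptions):

// 1024x1024 work items: G0 and G1 index the two dimensions.
void kernel saxpy_1024x1024(float a, global float* x, global float* y)
{
    int i = G1 * 1024 + G0;
    y[i] = a * x[i] + y[i];
}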

We’ll cover optimization later, but in general you can tell a problem is an ideal candidate for a parallel algorithm if the data elements are fixed length and independent, and a work item’s input/output factor looks like the following:

We’ll dive into this in the next sections.

Nomad Jobs for GPUs

What about running this from Nomad if we want to scale? With NVidia devices we can run it simply using Nomad’s included nvidia-gpu plugin. For other devices we can run tasks from a container or exec only if it has access to the GPU device nodes:

  1. /dev/dri/* for Intel
  2. /dev/kfd for AMD
  3. /dev/nvidia* for NVidia

As /dev paths aren’t made available inside the exec driver’s chroot, we’ll fall back to raw_exec. Note it’s important to schedule the task as the user who will access the service. The socket /tmp/mawd.sock can be granted permissions for others, but the FUSE filesystem (/mnt/gpu) will not be available to any other users, including root. This setting can be changed at the system level, but that is highly discouraged.

A Nomad job using the raw_exec driver to schedule a potentially large cluster of GPU services such as a render farm or crypto mining cluster.
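In text form, such a job might look roughly like this sketch. The mawd flags mirror the debug run earlier; the user and environment values are placeholders to adapt for your own deployment:

job "mawenzy" {
  datacenters = ["dc1"]
  type        = "service"

  group "gpu" {
    task "mawd" {
      driver = "raw_exec"
      user   = "jboero"  # schedule as the user who will access the service

      config {
        command = "/usr/bin/mawd"
        args    = ["-s", "-f", "/mnt/gpu"]
      }

      env {
        # Hypothetical platform/device selection variables.
        MAW_PLATFORM = "0"
        MAW_DEVICE   = "0"
      }
    }
  }
}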

Conclusion

In this section we’ve gotten a basic introduction to Mawenzy and simplified how you can manage kernels and buffers. Note that Mawenzy isn’t 100% hardened and bulletproof, so if you’ve hit any errors or issues, don’t despair. The most fun part of experimental software is when it surprises you. Have you customized any of the examples yet? Have you tried writing an infinite loop on your GPU? See what happens (at your own risk). Most OpenCL platforms have a timeout before they kick the hardware and sometimes even reset your GPU, causing a brief flicker of the screen before returning to normal. I’ve crashed a few GPUs many times — sometimes on purpose. Driver implementation varies, but I’ve never seen any damage.

Any of the examples in this section can be run on a CPU device for comparison. All you need to do is switch your environment variables for platform and/or device to a CPU and rerun MawD. The bad kernel with a single work item in a for loop may run faster than a parallel wavefront kernel on some CPUs. If you write a specialized kernel and it performs better on a CPU than on a GPU, at least you know it needs tuning or should only be run on CPU devices.
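With the same hypothetical variable names as before, that switch might look like:

$ MAW_PLATFORM=1 MAW_DEVICE=0 mawd -s -f /mnt/gpu   # select a CPU platform/device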

Some existing kernels in the wild can be run on MawD with minor modifications. Later we’ll try to enqueue the ethash.cl kernels to mine cryptocurrency hashes directly from the source.

Ideally we can replace independent for loops with wavefronts, and nested for loops with wavefronts of multiple dimensions. While it’s not ideal to schedule a kernel with a dimension of 1, also known as a task, it can sometimes be useful for setup and feedback. Now that we’ve covered the basics, the next sections will jump into some more fun use cases and utilities:

  1. Introduce CLON, object notation to simplify readable input.
  2. Introduce an image file package you can use to easily render image files; we’ll use it to create a basic ray tracer in under 100 lines of code.
  3. Introduce an HTML package that allows you to serve web responses directly from the GPU.
  4. Upgrade OpenCL 1.2 -> 2.0. The Intel device supports OpenCL 2.0, which adds some fun features. Most importantly, 2.0 includes device-side commands.

PART 4: Use Cases - The Ugly
