From real-time video and ray tracing to general-purpose compute beasts, we need to cover a brief history of the programmable pipeline and what options are available for programming it before we dig into setting up a platform and running code.

All Things GPU: Part 2

John Boero
HashiCorp Solutions Engineering Blog
13 min read · Nov 9, 2019


Intro to CUDA and OpenCL

Back when the GPU was built solely for graphics, hardware had a fixed pipeline. Rendering algorithms and frameworks like OpenGL had to be supported in hardware and couldn't be updated. It became apparent that GPUs needed to be more flexible and run a programmable pipeline. Instead of saying "given the polygons in my model, calculate the color for each of these million pixels," the model became "given the polygons in my model, run my shader code for each of a million pixels." This took a few forms, such as GLSL, OpenGL's shading language. A simple bit of code could take inputs like surface light vectors, 2D textures, and camera transformations and spit out fragment or pixel details based on X,Y coordinates. This is still used in real-time animation and games to generate complex environments of lighting and shadows pixel by pixel.

// Source: https://learnopengl.com/Getting-started/Shaders
#version 330 core
layout (location = 0) in vec3 aPos;   // vertex position attribute
layout (location = 1) in vec3 aColor; // vertex color attribute

out vec3 ourColor;

void main()
{
    gl_Position = vec4(aPos, 1.0);
    ourColor = aColor;
}

This main function isn't actually a single operation but a broadcast inside the GPU to as many cores as it can use simultaneously, based on the driver and the number of vertices or fragments/pixels. There are also shaders for computing vertex attributes and complex meshes.

Classic WebGL with shaders to calculate interactive wave propagation, reflection, refraction, and caustics in real-time. These intensive calculations on a CPU would take a long time and much more energy. http://madebyevan.com/webgl-water/

The demoscene makes heavy use of shaders in artistic demo contests. Different categories of demos compete for prizes each year at events like Revision. You can compete in the 4k or 64k challenges, where you cram your most amazing artistic app, graphics and sound included, into a binary of at most 4 KB or 64 KB. Amazing short films can often be crammed into 4k using just graphics shaders. Another great resource is Shadertoy, where you can browse some amazing examples of applications running purely on the GPU.

Snapshot from Happy Jumping shader by Inigo Quilez. Motion blurred ray marching in just 499 lines of code.

The scientific and HPC community quickly seized on this technique to render data sets and tackle large analysis problems instead of games and pixels. If a GPU can instantly project vertices and textures into images on a screen, why can't it project weather data and climate models into weather forecasts? Going further, why can't it process thousands of images into machine learning models?

It may have started out as a hacky use case, but it quickly became apparent that a general-purpose language on a flexible pipeline would be useful for exactly this purpose, with or without graphical output. CUDA was NVidia's answer to this problem, first released in 2007 and based on the C language with GPU extensions.

// Source: https://devblogs.nvidia.com/easy-introduction-cuda-c-and-c/
__global__
void saxpy(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a*x[i] + y[i];
}

This is a super simple kernel example called "saxpy" in CUDA C, which simply does some arithmetic and one bit of logic on each work element. The tricky part here is that I haven't included the host code (see the link in the code), the setup that runs on the CPU to initialize appropriate buffers, which is about 10 times longer than the kernel itself. CUDA C tends to require that all code be compiled with the CUDA compiler, NVCC, which compiles and optimizes the CPU code as well as the kernel code it will call. There is a JIT (Just-in-Time) compiler option for other languages, but pre-compiled code is almost always faster. From the host code we can call the kernel with work dimensions:

saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);

If I'm calculating these values for a two-dimensional dataset (say an image), I can send a two-dimensional range to compute. Instead of a serial nested for loop over x,y, I broadcast x*y parallel threads to CUDA, where they execute in groups of 32 called warps. I can also group threads into blocks to optimize memory that may need to be shared by the threads in a block. Note that since these may or may not run in parallel, if any calculation depends on information calculated by another thread, I can't guarantee the order of operations without synchronization. Synchronization is generally detrimental to performance because all active threads need to wait together.

OpenCL

By contrast, Open Computing Language (OpenCL) was released by the Khronos Group in 2009 and generally relies on external host code that manages buffers and JIT-compiles kernels at runtime. Instead of CUDA's warps, OpenCL (in AMD terminology) has wavefronts. Instead of threads and blocks, OpenCL uses work items and work groups. Kernels are usually JIT-compiled, but a pre-compiled intermediate representation called SPIR-V is available for speed. OpenCL drivers are available for many devices, including GPUs and CPUs. The kernel language is based on a subset of C99; the related SYCL standard layers single-source C++11-style programming on top, and OpenCL 2.2 adds an OpenCL C++ kernel language. Both are out of scope here (hopefully a later addition). Have a look at the above saxpy sample converted to OpenCL:

kernel void saxpy(int n, float a, global float* x, global float* y)
{
    const int i = get_global_id(0);
    if (i < n)
        y[i] = a*x[i] + y[i];
}

Anything performed via OpenCL requires a context. An OpenCL context is analogous to a container: it holds all of our buffers and command queues and can utilize one or more devices within a single platform. One of the tricky bits of OpenCL is context configuration. A context also can't be shared between processes, which makes IPC tricky. Since it's an open standard, you'll find vendor support is mixed. You'll also find examples of code that use different host languages: some samples are written in Python, some in C, some in C++, and so on. All of these options require you to know two languages, one to perform host setup on the CPU and one to actually execute on the GPU as OpenCL code. Most code is also stateless, consisting of the following steps (a minimal host-code sketch in C follows the list):

1. Set up OpenCL context with selected device(s).
2. Create a command queue.
3. Load and compile a CL program.
4. Allocate and load required buffers.
5. Enqueue your kernel(s) from the program.
6. Read results.
7. Destroy buffers, queue, and context.
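
To make those steps concrete, here is a minimal host-code sketch in C. It's illustrative rather than production-ready: error checking is mostly omitted, it simply grabs the first platform and device it finds, and it runs the saxpy kernel from above.

// Minimal OpenCL host sketch in C (steps 1-7 above, error checks omitted).
#define CL_TARGET_OPENCL_VERSION 120
#include <stdio.h>
#include <CL/cl.h>

static const char *src =
    "kernel void saxpy(int n, float a, global float* x, global float* y)"
    "{ const int i = get_global_id(0); if (i < n) y[i] = a*x[i] + y[i]; }";

int main(void)
{
    enum { N = 1024 };
    float x[N], y[N];
    for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    /* 1. Context on the first device of the first platform. */
    cl_platform_id plat; cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);

    /* 2. Command queue. */
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    /* 3. Load and JIT-compile the CL program. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "saxpy", NULL);

    /* 4. Allocate and load buffers. */
    cl_mem dx = clCreateBuffer(ctx, CL_MEM_READ_ONLY  | CL_MEM_COPY_HOST_PTR, sizeof x, x, NULL);
    cl_mem dy = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, sizeof y, y, NULL);

    /* 5. Enqueue the kernel over N work items. */
    int n = N; float a = 2.0f;
    clSetKernelArg(k, 0, sizeof n,  &n);
    clSetKernelArg(k, 1, sizeof a,  &a);
    clSetKernelArg(k, 2, sizeof dx, &dx);
    clSetKernelArg(k, 3, sizeof dy, &dy);
    size_t global = N;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);

    /* 6. Read results back to the host. */
    clEnqueueReadBuffer(q, dy, CL_TRUE, 0, sizeof y, y, 0, NULL, NULL);
    printf("y[0] = %f\n", y[0]);   /* expect 4.0 */

    /* 7. Destroy buffers, queue, and context. */
    clReleaseMemObject(dx); clReleaseMemObject(dy);
    clReleaseKernel(k); clReleaseProgram(prog);
    clReleaseCommandQueue(q); clReleaseContext(ctx);
    return 0;
}

Compiled with something like gcc saxpy_host.c -lOpenCL, this repeats all seven steps on every run, which is exactly the overhead problem described next.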

Most of the overhead is in the repeated setup, buffer writing/reading, and teardown, which negates the performance gain of the GPU and highlights one of OpenCL's adoption pains. Take a look at this basic but verbose vector addition example from Oak Ridge National Laboratory: https://www.olcf.ornl.gov/tutorials/opencl-vector-addition/

Address Space

Note the special keywords on kernel parameters. Inside the kernel, what looks like a pointer into GPU memory is actually a pointer mapped only for the duration of that kernel execution. In fact, if you store a handle/pointer into a buffer for use in later kernels, the resulting pointer will be invalid, since mappings get reassigned on each run. Don't store buffer pointers! Instead, you'd need to store simple offsets or use kernel pipes, which are an advanced topic.
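
As a hypothetical illustration of the offset approach (the kernel and argument names are invented for the example), store an index into the buffer rather than the pointer itself, then rebuild the pointer inside the next kernel:

// Hypothetical example: persist an offset between kernel launches, never a pointer.
kernel void record_pick(global float* data, global int* saved_offset)
{
    // BAD:  storing &data[42] would only be valid for this launch's mapping.
    // GOOD: remember the index instead.
    *saved_offset = 42;
}

kernel void use_pick(global float* data, global const int* saved_offset)
{
    // Rebuild a valid pointer from the stored offset in the current mapping.
    data[*saved_offset] *= 2.0f;
}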

Constant address space is reserved for read-only memory and has very fast access, since each compute unit can cache a local copy without worrying about synchronization issues. Unfortunately, constant address space on devices is fairly limited, about 64 KB on my devices. Most read/write buffers tend to be global, which isn't best for performance but helps with atomic operations and synchronization. You'll sometimes see our OpenCL 1.2 code define multiple copies of the same function because we need different address-space variants. OpenCL 2.0 added the generic address space, which lets one function compile for all address spaces automatically. To keep things simple as you learn, use the global address space until you get comfortable.
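
For example, here's a hypothetical OpenCL 1.2 helper that has to be written twice because a pointer's address space is part of its type:

// In OpenCL 1.2, a helper taking a global pointer can't accept a constant
// pointer (or vice versa), so the same function gets duplicated per space.
float sum3_global(global const float* v) { return v[0] + v[1] + v[2]; }
float sum3_constant(constant float* v)   { return v[0] + v[1] + v[2]; }

kernel void demo(global const float* g, constant float* c, global float* out)
{
    out[get_global_id(0)] = sum3_global(g) + sum3_constant(c);
}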

Architecture and Features

A GPU obviously won't run standard x86. Each device has its own architecture and specifications, including endianness and register width. If you exchange binary data with a device that has an endian mismatch, hold off on publishing that dissertation. Certain devices also handle different native types, such as fp16 and fp64, and most importantly, each device may have its own alignment rules. Where you think you could just write a struct into a buffer from the CPU, you may find that alignment causes fields to be misaligned, often with catastrophic results or segfaults. We'll explore this all in depth in section 4: The Ugly.

Mixed architectures mean we may need to jury rig buffers to align and crunch that particle accelerator data.
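
As a quick, hypothetical taste of the alignment problem (the struct is invented for the example), vector types on the device have stricter alignment than their plain-C lookalikes on the host:

#include <CL/cl_platform.h>  /* host-side cl_* types that mirror device types */

/* On the device, the OpenCL C type float3 occupies 16 bytes with 16-byte
 * alignment, so this naive host struct (12 + 4 = 16 bytes) does NOT match
 * the layout of struct { float3 position; float mass; } on the GPU. */
struct particle_naive {
    float position[3];   /* 12 bytes on the host                 */
    float mass;          /* at offset 12 on the host...          */
};                       /* ...but the device expects offset 16. */

/* Using the cl_* types keeps both sides in sync. */
struct particle_safe {
    cl_float3 position;  /* 16 bytes, 16-byte aligned, matches float3 */
    cl_float  mass;
    cl_float  pad[3];    /* explicit tail padding, no surprises       */
};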

Device Partitioning and CGroups

A context by default can utilize the entirety of the devices allocated to it. What if you need to share a device without hogging all of its resources? Device partitions, or sub-devices, introduced in OpenCL 1.2, allow you to request an upper limit on your context's resources. This can be used to prevent someone claiming all of your device's memory or compute units. It's especially useful if you're running long kernels on your primary display driver. You may find that by accident you schedule an infinite loop or long-running kernel that can't be stopped with a simple SIGHUP. The GPU may not have enough time to update your 60 Hz refresh and will instead freeze the screen. Inevitably, most drivers will reset the GPU after a predefined timeout to prevent locking up the system completely. The clinfo command can be used to find out how many partitions or sub-devices your device supports:

$ clinfo | egrep "Device Name|sub-devices"
  Device Name                     Intel(R) HD Graphics
  Max number of sub-devices       0
  Device Name                     Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
  Max number of sub-devices       8
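
When a device does report sub-device support, the host can carve off a partition and build its context against that instead of the whole device. Here's a hedged sketch in C, assuming the device supports partitioning by equal compute-unit counts:

/* Sketch: request sub-devices of 4 compute units each (OpenCL 1.2). */
#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>

cl_device_id first_subdevice(cl_device_id dev)
{
    const cl_device_partition_property props[] = {
        CL_DEVICE_PARTITION_EQUALLY, 4, 0   /* 4 CUs per partition, 0-terminated */
    };

    cl_uint count = 0;
    clCreateSubDevices(dev, props, 0, NULL, &count);   /* how many partitions? */

    cl_device_id subs[count];                          /* assumes count > 0     */
    clCreateSubDevices(dev, props, count, subs, NULL);

    /* Pass this to clCreateContext() instead of the whole device. */
    return subs[0];
}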

Device partitioning will be further enhanced by a new kernel patch in Linux ~5.3+ that introduces GPU resource limits via standard cgroups. This will be an instant win for container users, and for Nomad in particular, which already uses cgroups to constrain exec jobs in addition to containers.

Context Simplification

During my early days of learning OpenCL, I found the overhead of the constant context build/destroy cycle frustrating, so I created a platform to automate it completely. After all, the end goal is to run kernels on a context, not to constantly build and tear down a context, and certainly not to rewrite and recompile host code for every run. Also, when nanoseconds count, even the PCIe bus may not be fast enough to read and write buffers in time. Why do we need to read and write data every time we use it? Why can't we just keep it on the GPU for later?

Now disassemble your context and continue!

A few years back I wrote an OpenCL application server called Mawenzy. I'm happy to share this platform now so that we can dive into GPU code together without worrying (or even flame-warring) about various host languages. Since GPU devices don't have an operating system or a familiar interrupt/event loop, it's beneficial to have an application server persist and maintain context programs and buffers. I managed this via two primary mechanisms: a service socket and a buffer filesystem. Constantly recompiling and managing a host program in another language for context management is really inconvenient, so we'll take care of this automatically. If you have specialty requirements later, you can custom-code your own niche Python, JS, or C/C++ loaders. Incidentally, if you've never written any OpenCL for your GPU before, I recommend building one of the examples above to understand how tedious it can be.

Setup

The first thing we'll need to do is set up OpenCL drivers for our platform. Intel, AMD, and NVidia all offer their own platforms, which can be tricky. If you're on a Mac, I'm sorry to say it seems Apple has thrown in the towel on OpenCL, even though they were instrumental in its creation. That doesn't mean OpenCL and OpenGL don't still work perfectly well on other platforms; they can even be run from within VMs. I'd be curious to hear feedback from anybody on a Mac who can configure a context. In that case, I'd need to add macOS support.

In my case, I'm running Fedora Linux locally; Ubuntu or Enterprise Linux distributions work as well. You'll need at least OpenCL 1.2 properly configured for your distribution such that the clinfo command lists your available platforms and devices. If you're using cloud GPU instances, you can usually find a pre-built AMI with CUDA or OpenCL ready to roll. Locally I'm running four platforms: NVidia's CUDA OpenCL, AMD GPU, Intel CPU/GPU, and the open source POCL CPU option, which in my case exposes 24 compute units (24 logical cores on 2x6-core physical Xeons). The AMD device supports OpenCL 2.0 but requires an older LTS or EL Linux to work, as its drivers have been pretty rocky for years. I also have a laptop with NVidia and a NUC with Intel CPU+GPU platforms. The NVidia platform is a single GPU, and the Intel platform offers both the Intel CPU and Intel GPU as separate devices. To share multiple devices in one context, they must belong to the same platform driver set. For AMD support I'll use the raymanfx/amdgpu-pro COPR repo until AMD can provide solid upstream support. Otherwise, AMD devices in the cloud will work better than local ones.

A list of available OpenCL platforms on my dev workstation. My Quadro RTX 4000 has 36 compute units with 2304 cores, 64 per CU. The AMD card is a smaller Radeon Pro WX 2100 with 512 cores across 8 CUs.

This isn’t a high-end environment by any means but it’s great for development and it’s pretty energy efficient. I always focus more on having the latest generation of hardware than the most power. High-end enterprise GPUs often consume more electricity than their entire host. In the case of a laptop embedded GPU, I may get all I need out of just a few watts, whereas using a discrete GPU can often drain a battery much quicker.

Remember the Intel GPU die had 192 cores in 24 compute units divided into 3 groups of 8? This is the core config of that device. Core configs can be found on Wikipedia for Intel, AMD, and NVidia. A compute unit can usually run just one kernel at a time, no matter how many of its cores you use in parallel. I can submit a kernel with dimensions (NDRange) of 1, but an entire compute unit will be occupied until that task finishes, even if it has 128 cores. Conversely, if I run a kernel with dimensions of 64x1 or 8x8, I may be able to use all 64 cores of one Quadro RTX 4000 compute unit in the same amount of time. Also, groups of cores may have different shared caches that optimize parallel read-only or global access.
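
To illustrate, here's a small host-side fragment (assuming a valid command queue q and kernel k, as in the earlier sketch) that launches the same kernel as a 1D range of 64, as an 8x8 2D range, and as a single work item:

/* Illustrative only: q and k are assumed to be a valid queue and kernel. */
size_t global1d[1] = { 64 };
size_t local1d[1]  = { 64 };     /* one work-group of 64 work items          */
clEnqueueNDRangeKernel(q, k, 1, NULL, global1d, local1d, 0, NULL, NULL);

size_t global2d[2] = { 8, 8 };
size_t local2d[2]  = { 8, 8 };   /* the same 64 work items as an 8x8 group   */
clEnqueueNDRangeKernel(q, k, 2, NULL, global2d, local2d, 0, NULL, NULL);

size_t one = 1;                  /* a single work item still ties up a whole CU */
clEnqueueNDRangeKernel(q, k, 1, NULL, &one, NULL, 0, NULL, NULL);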

Whilst writing this I'm using an HP Z640 with both NVidia and AMD GPUs. I also use a Skylake Intel NUC with 8th-generation Intel graphics capable of OpenCL 2.0 using just a few watts of electricity. I'll also be using cloud instances with high-performance NVidia Tesla and AMD resources, which allow flexible testing at low cost. AMD is a little slower with their driver releases, and I've spent days troubleshooting kernel modules on current Fedora offerings, so I'll use cloud instances with supported RHEL releases for comparable AMD/NVidia use cases.

If you're on a yum/dnf-based distribution, the simple way to see which OpenCL vendors are available is to list all of the Installable Client Driver (ICD) files available in your active repos. The OpenCL ICD loader searches the /etc/OpenCL/vendors/ directory for text files that list the libraries providing OpenCL devices. Installing any of these will generally pull in the OpenCL runtime as a dependency. Note that some implementations are less than functional and can break the OpenCL platform subsystem when enabled, so it's best to disable all unnecessary ICD files (by moving them out of /etc/OpenCL/vendors/) when you select a platform to work with. I move every unused platform's file from /etc/OpenCL/vendors/ to /etc/OpenCL/vendors/disabled/, which works fine.

$ sudo dnf provides '/etc/OpenCL/vendors/*'
beignet-1.3.2-5.fc29.x86_64 : Open source implementation of the OpenCL for Intel GPUs
Repo : @System
Matched from:
Filename : /etc/OpenCL/vendors/intel-beignet.icd

intel-opencl-r5.0-63503.x86_64 : OpenCL ICD loader and ICD for Intel Iris Graphics
Repo : @System
Matched from:
Filename : /etc/OpenCL/vendors/intel.icd

mesa-libOpenCL-19.0.2-3.fc30.i686 : Mesa OpenCL runtime library
Repo : fedora
Matched from:
Filename : /etc/OpenCL/vendors/mesa.icd

mesa-libOpenCL-19.0.2-3.fc30.x86_64 : Mesa OpenCL runtime library
Repo : fedora
Matched from:
Filename : /etc/OpenCL/vendors/mesa.icd

mesa-libOpenCL-19.1.8-1.fc30.i686 : Mesa OpenCL runtime library
Repo : updates
Matched from:
Filename : /etc/OpenCL/vendors/mesa.icd

mesa-libOpenCL-19.1.8-1.fc30.x86_64 : Mesa OpenCL runtime library
Repo : updates
Matched from:
Filename : /etc/OpenCL/vendors/mesa.icd

pocl-1.2-4.20190221gita0b083a1b47a738.fc30.i686 : Portable Computing Language - an OpenCL implementation
Repo : fedora
Matched from:
Filename : /etc/OpenCL/vendors/pocl.icd

pocl-1.2-4.20190221gita0b083a1b47a738.fc30.x86_64 : Portable Computing Language - an OpenCL implementation
Repo : @System
Matched from:
Filename : /etc/OpenCL/vendors/pocl.icd

pocl-1.2-4.20190221gita0b083a1b47a738.fc30.x86_64 : Portable Computing Language - an OpenCL implementation
Repo : fedora
Matched from:
Filename : /etc/OpenCL/vendors/pocl.icd

xorg-x11-drv-nvidia-340xx-cuda-1:340.107-4.fc30.i686 : CUDA libraries for xorg-x11-drv-nvidia-340xx
Repo : rpmfusion-nonfree
Matched from:
Filename : /etc/OpenCL/vendors/nvidia.icd

xorg-x11-drv-nvidia-340xx-cuda-1:340.107-4.fc30.x86_64 : CUDA libraries for xorg-x11-drv-nvidia-340xx
Repo : rpmfusion-nonfree
Matched from:
Filename : /etc/OpenCL/vendors/nvidia.icd

xorg-x11-drv-nvidia-390xx-cuda-3:390.116-2.fc30.x86_64 : CUDA driver for xorg-x11-drv-nvidia-390xx
Repo : rpmfusion-nonfree
Matched from:
Filename : /etc/OpenCL/vendors/nvidia.icd

xorg-x11-drv-nvidia-390xx-cuda-3:390.129-1.fc30.x86_64 : CUDA driver for xorg-x11-drv-nvidia-390xx
Repo : rpmfusion-nonfree-updates
Matched from:
Filename : /etc/OpenCL/vendors/nvidia.icd

xorg-x11-drv-nvidia-cuda-3:418.56-1.fc30.x86_64 : CUDA driver for xorg-x11-drv-nvidia
Repo : rpmfusion-nonfree
Matched from:
Filename : /etc/OpenCL/vendors/nvidia.icd

xorg-x11-drv-nvidia-cuda-3:430.40-1.fc30.x86_64 : CUDA driver for xorg-x11-drv-nvidia
Repo : rpmfusion-nonfree-updates
Matched from:
Filename : /etc/OpenCL/vendors/nvidia.icd

Device-side Queuing

When OpenCL reached 2.0 it added a particularly juicy feature we'll cover: device-side queuing. While it's not an exciting conversation starter, to me device-side queuing blows the GPU wide open. Whereas previously you needed a CPU to tell a GPU everything it should do, with OpenCL 2.0 a GPU can tell itself what to do. It can do so without any input from the CPU, so a properly crafted task sent to the GPU can trigger a cascade of subtasks from a single operation. If we're going to rely solely on the GPU for complex tasks, this will be a key feature, but sadly we'll have to leave behind devices that only support OpenCL 1.2 and probably won't be updated, having been abandoned in favor of CUDA. All the same, we can accomplish quite a bit with OpenCL 1.2.
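
As a small, hedged illustration of what device-side queuing looks like in OpenCL C 2.0 (the kernel and buffer names are invented for the example), a parent kernel can enqueue follow-up work onto the device's default queue with no CPU involvement:

// OpenCL C 2.0 device-side enqueue (illustrative; build with -cl-std=CL2.0).
kernel void parent(global float* data, int n)
{
    // One work item decides to launch n items of follow-up work, GPU-only.
    if (get_global_id(0) == 0) {
        ndrange_t range = ndrange_1D((size_t)n);
        enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_NO_WAIT, range,
                       ^{ data[get_global_id(0)] *= 2.0f; });
    }
}

On the host side this assumes an on-device default queue was created for the context (a command queue with the on-device and default properties) before the parent kernel runs.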

Enough talk, let's get coding. In the next part we'll start a Mawenzy service and write a few basic kernels to test it out with just a few lines of code. No need to worry about any platform code or management; we'll just be writing kernels. Make sure your platform is configured with valid OpenCL libraries and ICDs for your distribution if you want to follow along.

PART 3: Introduction to Mawenzy
