One of Nomad’s most popular features is its NVIDIA GPU support. But since Nomad can also schedule exec and raw-exec jobs, did you know it can run services that take advantage of any GPU or hardware device? What kind of workloads would you deploy? If you’re not sure, or you have no GPU plans for your development, you may be missing out. In fact, as I was writing this series, Intel announced a plan to shift its focus from the CPU to a new CPU/GPU hybrid future it calls the XPU. Are you ready for the future of hybrid parallel compute?
This begins a six-part series on how anybody can get started with GPU development using simple tools. We will experiment with computing by tying our hands behind our backs: what could you accomplish if all you had was a GPU? We will maroon ourselves on GPU island. It turns out you can do quite a lot more than crypto hashes, discrete cosine transforms, and pixels. The ideal audience for this series is someone who has some programming experience but is fairly new to GPUs. Seasoned GPU and HPC experts will balk at some of the use cases we try, including suboptimal examples and things that generally shouldn’t be done on a GPU. If the GPU isn’t part of your application strategy, it should be. We will demonstrate where GPUs can help and where they can’t. At one point, we’ll even crash a GPU on purpose to see how it recovers, in an effort to better understand how things work. We’ll also explore optimization and the important energy metric of compute performance per watt.
While combining OpenCL with graphics applications using OpenGL can yield some pretty amazing results, that is out of scope here and may be covered later.
History of the GPU and the Progression to General Purpose Computing (GPGPU)
In honour of Intel’s upcoming re-entry into the discrete GPU space, I’d like to do a series of posts on my favourite topic and what is likely your most underused resource on-prem and in the cloud: the wonderful GPU. This is a reminder that NVIDIA and AMD make great GPU products, but they aren’t the only players in a market dominated by the big three: Intel, NVIDIA, and AMD. This series will show, at an introductory level, how easily you can utilize a GPU, whether it’s an enterprise cloud instance, a local server, or even a laptop or mobile device with embedded graphics. Together we will walk through some example problems you never thought your GPU could handle, as well as some horrifying but fun examples that illustrate the good, bad, and ugly of GPU performance. All of our benchmarks will be run using HashiCorp Nomad, which will allow us to simultaneously schedule tests on a local Intel CPU and GPU as well as cloud AMD and NVIDIA devices. We will implement everything from text analysis to data sorting and even a basic ray tracer.
When thinking of GPU market share, you may wonder whether it’s dominated by AMD or NVIDIA, but it may surprise you that simple embedded graphics have dominated the market for years. Even recent cryptocurrency trends, which have packed multiple discrete GPUs into some systems, have barely altered the overall market. While not as powerful, embedded devices and APU combinations can accomplish quite a bit in an efficient, low-energy footprint.
When I’m not out enabling our excellent global partner network and community abroad, I like to spend some of my free time on hobby projects. For me, the best hobby project is anything GPU related. Since the release of OpenGL 1.1, I’ve been fascinated by GPUs and how to maximize computing performance with limited resources. My favourite thing in the world is to approach any problem with a “GPU-first” mindset and ask, “can this be enhanced by using my GPU?” The good news is that you can offload many daily CPU tasks onto the GPU. The caveat is that it isn’t always the best move. In this series, I will demonstrate through examples where GPU acceleration makes sense and where it doesn’t. Better still, OpenCL exposes a common GPGPU programming model on almost any GPU, from the embedded graphics in your laptop to advanced discrete NVIDIA or AMD devices.
This blog series will be broken into six parts, starting with this one. The first three parts cover history and setup. There is normally quite a bit of setup required to get started with OpenCL; we’re going to simplify all of that so you can get up and running quickly. The final three parts will present example use cases and GPU code in ascending order of effectiveness, ranging from “don’t try this at home” to “I didn’t know a GPU could do that” to, finally, “effective GPGPU strategy.”
- History of the GPU and progression to General Purpose or GPGPU.
- Intro to CUDA and OpenCL.
- Intro to Mawsh: Simplified GPU Application Server.
- The Ugly: Don’t Try This at Home.
- The Bad: A GPU Can Do That, but Should It?
- The Good: Great fit GPU Applications.
Fixed Pipeline to CUDA and OpenCL
In the early days, GPU devices were hardcoded to render graphics as fast as possible, usually for games, and they did this very well. Look back at the Nintendo Entertainment System (NES): it managed to render NTSC full-screen, full-motion graphics with a single 8-bit CPU at about 1.79 MHz. Today’s mobile phone (2019) is thousands of times faster than that, has 4 or 8 cores, and uses a 64-bit architecture. So why does it sometimes struggle to play full-screen video? The answer is that clock speed is not the only measure of performance in a system. Even the NES had a custom graphics co-processor (at the time called a Picture Processing Unit) capable of displaying much more than the CPU alone could. How does this work?
A CPU usually has a few robust, general-purpose execution cores, while a GPU often packs thousands of cores into the same area of a microchip. How does that work? GPU cores are greatly simplified and share many resources, which lets them take up less space but comes with architectural caveats. GPU cores also aren’t designed to run continuously; instead, they act as co-processors waiting for tasks from the CPU. When using discrete graphics on a dedicated card, the GPU often has access to high-speed GDDR5 or HBM memory that is many times faster than the PCI-e bus and even main memory. HBM4 is designed for data rates up to 4 TB/s, performance that can only be realized from within the GPU itself. For comparison, DDR3 main memory typically maxes out at about 10–25 GB/s.
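To make the difference in execution models concrete, here is a minimal sketch in plain C (illustrative only, not real device code; the function names are my own). The CPU formulation walks a buffer serially on one core, while the GPU formulation expresses the same work as a per-element function that the hardware would run across thousands of cores at once.

```c
#include <stddef.h>

/* CPU style: one core walks the entire buffer serially. */
void scale_serial(float *data, size_t n, float factor) {
    for (size_t i = 0; i < n; i++)
        data[i] *= factor;
}

/* GPU style: the work is written per element. On a real device,
   one instance of this function runs for every index at once. */
void scale_element(float *data, size_t i, float factor) {
    data[i] *= factor;
}

/* Stand-in for the hardware scheduler: on a GPU, these iterations
   would execute in parallel across the available cores. */
void emulate_dispatch(float *data, size_t n, float factor) {
    for (size_t i = 0; i < n; i++)
        scale_element(data, i, factor);
}
```

Both routines produce the same result; the point is that the per-element form contains no loop of its own, which is exactly what lets the hardware fan it out.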
The Case of the ALU and the Potato
Without getting too far into microprocessor architecture, the Arithmetic Logic Unit (ALU) is the core of most processors. A basic instruction set is built from integer arithmetic operations and logic branches: if condition X is true, apply arithmetic Y to registers A and B. This is the classic basis of algorithms no matter what language you’re writing in. To optimize arithmetic, you need to minimize branching, since logic complicates execution pipelines. CPU design has advanced by leaps and bounds to minimize this logic bottleneck, with speculative execution and branch prediction adding millions of transistors to designs. Older CPUs had a simple pipeline in which branches and logic had no choice but to slow down execution, so the study of algorithm optimization was paramount for developers to achieve any performance.
When developing on a GPU, you need to keep your algorithmic thinking cap on. GPUs have hundreds or thousands of execution cores, but they don’t have the luxury of logic optimization: in the name of parallel gains, they are optimized for arithmetic far more than for logic. You can broadcast a thousand arithmetic or floating-point operations across GPU cores at once where a CPU can tackle only a handful, but if you introduce too many branches or loops into a GPU kernel, performance will suffer and require optimization.
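As a small illustration of trading logic for arithmetic (shown in plain C here, though the same shape is what GPU-friendly kernels tend toward; the function names are hypothetical), the branchless version below folds its comparisons into arithmetic so that every input follows the same instruction path.

```c
/* Branchy version: the classic CPU formulation, full of if logic. */
int clamp_branchy(int x, int lo, int hi) {
    if (x < lo) return lo;
    if (x > hi) return hi;
    return x;
}

/* Branchless version: each comparison yields 0 or 1, and those
   flags are folded into arithmetic. Every input takes the same
   instruction path, so parallel cores never diverge. */
int clamp_branchless(int x, int lo, int hi) {
    int below = x < lo;   /* 1 if x is under the range, else 0 */
    int above = x > hi;   /* 1 if x is over the range, else 0 */
    return lo * below + hi * above + x * (1 - below - above);
}
```

On a CPU the branchy version is perfectly fine; the branchless form matters when a thousand cores must all execute the same instruction stream in lockstep.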
Imagine you’re a cook who needs to cut potatoes into chips (fries). You can use a really fast knife to cut individual chips, checking for bad or rotten spots along the way and cutting them out. On the other hand, you could use a bulk chip cutter to cut everything at once, but every chip will be cut uniformly, and if you encounter a bad or rotten part of the potato, there’s nothing you can do.
So, with this classic dichotomy, what should we do with a GPU? NVIDIA has given us CUDA, a fantastic platform for developing GPU code exclusively on their hardware, while the rest of the industry developed OpenCL as an open standard alternative. OpenCL works across GPU vendors and can even offer the CPU itself as a compute device if you wish to compare. In OpenCL and CUDA terms, a device’s potatoes are memory buffers, and the blades or slicing algorithms are small pieces of code called kernels. Kernels are a bit like parallel microservices for the GPU: they often aren’t more than a few lines, but they can be very powerful.
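To give a feel for how small a kernel can be, here is a hypothetical OpenCL kernel that doubles every element of a buffer (device code only; the host-side setup that compiles and enqueues it is omitted). Each parallel instance asks `get_global_id(0)` which element it owns:

```c
// Hypothetical OpenCL kernel: one instance runs per element,
// and get_global_id(0) tells each instance which index it owns.
__kernel void double_elements(__global float *buf) {
    size_t i = get_global_id(0);
    buf[i] = buf[i] * 2.0f;
}
```

Note there is no loop: the host enqueues this kernel with a global work size equal to the element count, and the device fans the instances out across its cores. We will cover that host-side setup later in the series.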
In this series of posts I will focus on OpenCL so that anyone with a laptop, a cloud GPU, or even a mobile GPU can set up a development environment and work through our series of problems.