CUDA, ROCm, oneAPI? — Running Code on a GPU, Any GPU

Why knowing multiple vendors’ GPU programming models is a necessary evil…or is it?

TonyM
8 min read · Dec 27, 2022

Introduction

In my last two posts about parallel and accelerator programming, I talked about the basics and some of the programming concepts required to ensure the correctness of your code. The interesting question for a programmer learning accelerator programming is: “Do I really need to learn CUDA, ROCm and SYCL to program NVIDIA, AMD and Intel accelerators?”

When planning this series, I expected to dive into the similarities and differences between the hardware execution and memory models of each architecture. I was going to talk about warps, wavefronts and workgroups, the respective constructs CUDA, ROCm and SYCL use to group execution on hardware threads. However, the more I thought about it, the more that seemed like too much for a blog post. If you are interested in those topics, see the nice SYCL for CUDA developers guide from Codeplay.

Instead, today I’m going to focus on getting you from no code to running on a GPU, any GPU, as quickly as possible. Nothing is as fun as seeing the performance benefits of your accelerator right away. For me, the timing is also fun because it coincides with the release of the Codeplay oneAPI for NVIDIA GPUs and oneAPI for AMD GPUs (beta) toolchains.

SYCL and multiple vendor GPUs

SYCL is a programming model that has compiler support for NVIDIA, AMD and Intel GPUs. You can write your code in SYCL and then build and run it on any of those vendors’ GPUs. Previously this required using various community compilers, or building the Intel open source compiler, depending on your GPU target. With the release of the latest Codeplay toolchains, you can now quickly and easily get your code running through a single, prebuilt toolchain. For more details, you can read the blog post on the new release by Ruyman Reyes, Codeplay’s CTO.

One question you may be asking: can I actually build my code once and then decide at runtime which GPU to run on? The answer is YES! This is the nice thing about this support: you build your code once, and at runtime you can choose your target manually or let the runtime library choose for you automatically.

Since you’ve gotten this far, I’ll assume this is of interest to you, and my goal is to get you up and running on your GPU of choice. For my purposes, I’m testing all three vendors’ GPUs on Ubuntu 22.04 and installing via APT.

[Image: Test system, courtesy of the author]

Installing the Codeplay toolchain

First, we set up some basic system packages:

sudo apt update
sudo apt -y install cmake pkg-config build-essential

Next, we grab the Intel® oneAPI Base Toolkit 2023.0, which is used by the Codeplay toolchain. You can find instructions for installing via APT at the Intel® oneAPI Toolkits Installation Guide for Linux* OS page. Here are the steps I used, but refer to the original page if you have issues:

# download the key to system keyring
wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \
| gpg --dearmor | sudo tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null

# add signed entry to apt sources and configure the APT client to use Intel repository:
echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list

sudo apt update
sudo apt install intel-basekit
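
If the install succeeded, you should be able to load the oneAPI environment and find the DPC++ compilers. Here is a quick sanity check, assuming the default /opt/intel/oneapi install location:

. /opt/intel/oneapi/setvars.sh
icpx --version   # should report the Intel oneAPI DPC++/C++ Compiler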

Vendor-Specific Setup

Make sure you have the base drivers/frameworks you need for each GPU you want to target; again, I’m linking the Ubuntu instructions:

Intel GPU Driver Installation

Instructions for Intel GPUs on different OSes can be found here. Because I’m using an Intel Arc on Ubuntu 22.04, I followed these instructions.
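
With the driver installed and the oneAPI environment loaded (via setvars.sh as above), the sycl-ls utility that ships with the toolkit should list the card as a Level Zero GPU device; the exact entry varies by driver and toolkit version:

sycl-ls   # look for an entry such as [ext_oneapi_level_zero:gpu:0]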

oneAPI for NVIDIA Installation

You can go here for the latest official instructions on installing CUDA; I’m inlining what I use to get it up and running:

1. Install the CUDA keyring and CUDA, then reboot:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb

sudo apt-get update
sudo apt-get install cuda-11-7
sudo apt-get install nvidia-gds
sudo reboot

2. Set paths to make sure my terminal can find CUDA:

export PATH=/usr/local/cuda-11.7/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-11.7/lib64\
${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

3. Download the oneAPI for NVIDIA GPUs installer from Codeplay here.

4. Run the installation script

sh oneapi-for-nvidia-gpus-2023.0.0-linux.sh

5. Before each build in your terminal or in your .profile, make sure the build environment is set up to enable oneAPI and to find CUDA.

. /opt/intel/oneapi/setvars.sh --include-intel-llvm
export PATH=/usr/local/cuda-11.7/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-11.7/lib64\
${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
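
Before moving on, it is worth a quick sanity check that both the CUDA driver and the SYCL plugin can see the card (output varies by GPU and driver version):

nvidia-smi   # the driver should report the GPU and CUDA version
sycl-ls      # look for a CUDA entry such as [ext_oneapi_cuda:gpu:0]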

oneAPI for AMD (beta) Installation

You can go here for the latest official instructions on installing ROCm; again, I’m inlining the steps to make things a little easier for you:

1. Install the AMD amdgpu-install program via APT:

sudo apt-get update
wget https://repo.radeon.com/amdgpu-install/5.4.1/ubuntu/jammy/amdgpu-install_5.4.50401-1_all.deb
sudo apt-get install ./amdgpu-install_5.4.50401-1_all.deb

2. Install the latest ROCm Driver using the amdgpu-install program

sudo amdgpu-install --usecase="dkms,graphics,opencl,hip,hiplibsdk"

3. Download the oneAPI for AMD GPUs (beta) installer from Codeplay here.

4. Run the installation script

sh oneapi-for-amd-gpus-2023.0.0-linux.sh

5. Before each build in your terminal or in your .profile, make sure the build environment is set up to enable oneAPI and to find ROCm.

. /opt/intel/oneapi/setvars.sh --include-intel-llvm

export PATH=/PATH_TO_ROCM_ROOT/bin:$PATH
export LD_LIBRARY_PATH=/PATH_TO_ROCM_ROOT/lib:$LD_LIBRARY_PATH
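
As with the other backends, a quick check confirms that both ROCm and the SYCL plugin can see the card (output varies by GPU and driver version):

rocminfo | grep -i gfx   # the GPU agent should show up with its gfx architecture
sycl-ls                  # look for a HIP entry such as [ext_oneapi_hip:gpu:0]

If the GPU does not show up, one common cause is permissions: the ROCm install guide has you add your user to the render and video groups, then log out and back in.

sudo usermod -a -G render,video $USER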

Sample Code and Supporting Multiple GPUs at Once

There are a lot of different code samples you can try in the open source oneAPI samples GitHub repository.

From the Codeplay getting-started example, you can see they created this simple-sycl-app.cpp:

#include <iostream>
#include <sycl/sycl.hpp>

int main() {
  // Creating buffer of 4 ints to be used inside the kernel code
  sycl::buffer<sycl::cl_int, 1> Buffer(4);

  // Creating SYCL queue
  sycl::queue Queue;

  // Size of index space for kernel
  sycl::range<1> NumOfWorkItems{Buffer.size()};

  // Submitting command group (work) to queue
  Queue.submit([&](sycl::handler &cgh) {
    // Getting write-only access to the buffer on a device
    auto Accessor = Buffer.get_access<sycl::access::mode::write>(cgh);
    // Executing kernel
    cgh.parallel_for<class FillBuffer>(
        NumOfWorkItems, [=](sycl::id<1> WIid) {
          // Fill buffer with indexes
          Accessor[WIid] = (sycl::cl_int)WIid.get(0);
        });
  });

  // Getting read-only access to the buffer on the host.
  // Implicit barrier waiting for queue to complete the work.
  const auto HostAccessor = Buffer.get_access<sycl::access::mode::read>();

  // Check the results
  bool MismatchFound = false;
  for (size_t I = 0; I < Buffer.size(); ++I) {
    if (HostAccessor[I] != I) {
      std::cout << "The result is incorrect for element: " << I
                << " , expected: " << I << " , got: " << HostAccessor[I]
                << std::endl;
      MismatchFound = true;
    }
  }

  if (!MismatchFound) {
    std::cout << "The results are correct!" << std::endl;
  }

  return MismatchFound;
}

Building for a specific GPU

Typically, to build for the various targets, we use command lines that look like this:

NVIDIA:

clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda simple-sycl-app.cpp -o simple-sycl-app

AMD:

clang++ -fsycl -fsycl-targets=amdgcn-amd-amdhsa -Xsycl-target-backend --offload-arch=<ARCH> simple-sycl-app.cpp -o simple-sycl-app

Intel:

clang++ -fsycl -fsycl-targets=spir64 simple-sycl-app.cpp -o simple-sycl-app

Then I would just run ./simple-sycl-app, and assuming I have the right GPU in my system, my application would run. Sure, that’s great, but I could probably do the same thing with CUDA or ROCm by themselves using their vendor-specific programming models.

Building for multiple vendor GPUs

Let’s say I want to support NVIDIA and Intel GPUs at the same time. I can actually create a fat binary that will run on either an NVIDIA or an Intel GPU. The trick is setting the build command-line parameters to specify multiple SYCL targets.

clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda,spir64 simple-sycl-app.cpp -o simple-sycl-app

You can see I simply set -fsycl-targets to both the NVIDIA and Intel GPU targets, separated by a comma. Now when I run my program, I can do something like this:

tonym@LianLi-Linux:~/dev/2023.0Test$ clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda,spir64 simple-sycl-app.cpp -o simple-sycl-app

tonym@LianLi-Linux:~/dev/2023.0Test$ SYCL_DEVICE_FILTER=cuda SYCL_PI_TRACE=1 ./simple-sycl-app
SYCL_PI_TRACE[basic]: Plugin found and successfully loaded: libpi_cuda.so [ PluginVersion: 11.15.1 ]
SYCL_PI_TRACE[all]: Selected device: -> final score = 1500
SYCL_PI_TRACE[all]: platform: NVIDIA CUDA BACKEND
SYCL_PI_TRACE[all]: device: NVIDIA GeForce RTX 3080 Ti
The results are correct!
tonym@LianLi-Linux:~/dev/2023.0Test$

tonym@LianLi-Linux:~/dev/2023.0Test$ SYCL_DEVICE_FILTER=ext_oneapi_level_zero:gpu:0 SYCL_PI_TRACE=1 ./simple-sycl-app
SYCL_PI_TRACE[basic]: Plugin found and successfully loaded: libpi_level_zero.so [ PluginVersion: 11.15.1 ]
SYCL_PI_TRACE[all]: Selected device: -> final score = 1550
SYCL_PI_TRACE[all]: platform: Intel(R) Level-Zero
SYCL_PI_TRACE[all]: device: Intel(R) Graphics [0x56a0]
The results are correct!
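
In principle, the same approach should extend to all three vendors in a single fat binary by adding the AMD target, which still needs its architecture flag. I could not verify this on my machine (see Future Work below), so treat it as an untested sketch:

clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda,amdgcn-amd-amdhsa,spir64 \
  -Xsycl-target-backend=amdgcn-amd-amdhsa --offload-arch=<ARCH> \
  simple-sycl-app.cpp -o simple-sycl-app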

Looking back at the two traces, you can see that I dynamically selected my target using the SYCL_DEVICE_FILTER environment variable on the command line before running my program. You can also select the device programmatically, and there is documentation describing how to do that.
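
To give a flavor of the programmatic route, here is a minimal sketch of my own (not code from the documentation) using the SYCL 2020 callable-selector API: enumerate the visible GPUs, then hand the queue a scoring function.

#include <iostream>
#include <sycl/sycl.hpp>

int main() {
  // List every GPU the runtime can see, across all installed backends.
  for (const auto &Dev :
       sycl::device::get_devices(sycl::info::device_type::gpu)) {
    std::cout << "Found GPU: "
              << Dev.get_info<sycl::info::device::name>() << std::endl;
  }

  // SYCL 2020 accepts any callable that scores devices; the highest
  // non-negative score wins. This selector simply prefers GPUs.
  auto PreferGpu = [](const sycl::device &Dev) {
    return Dev.is_gpu() ? 1 : -1; // negative means "never select"
  };

  sycl::queue Queue{PreferGpu}; // throws if no device scores >= 0
  std::cout << "Queue is using: "
            << Queue.get_device().get_info<sycl::info::device::name>()
            << std::endl;
  return 0;
}

The same scoring idea lets you prefer a particular backend or device name, which is how you would pin work to, say, the CUDA device inside a fat binary.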

Future Work

Some astute readers may notice I did not include an AMD example in my multivendor GPU example. This is because I recently replaced my AMD 6800XT GPU with a brand new AMD RX 7900XT GPU. Unfortunately, ROCm does not currently install properly on my Linux system with the RDNA3 card, regardless of the kernel I use. Once this is fixed, I will come back and provide instructions with an example running on all three cards.

Note that a version of this flow previously worked for the 6800XT using the open source DPC++ flow documented here, but that was before the Codeplay software was released.

Conclusion

The oneAPI for NVIDIA GPUs toolchain from Codeplay allowed me to easily create binaries that run on NVIDIA or Intel GPUs. The setup took about 10 minutes on top of the time to install the Intel oneAPI Base Toolkit, and that included installing CUDA and the Codeplay software.

The new Codeplay software unlocks a brand new option for individuals and companies considering multivendor GPU support. Previously, the options were open source, community-supported compilers. Those are fantastic, and they are what I typically use for day-to-day work.

However, for those who need longer-term support, that is, businesses that rely on toolchains for their livelihood, having an officially supported option from Codeplay and Intel can provide peace of mind.

If you want to see what random tech news I’m reading, you can follow me on Twitter. Also, check out Code Together, an Intel podcast for developers that I host where we talk tech.

Tony is a Software Architect and Technical Evangelist at Intel. He was the architect of Intel VTune Profiler and other performance tools, and most recently led the software engineering team that built the data center platform that enabled Habana’s scalable MLPerf solution.
