AMD, ROCM, PyTorch, and AI on Ubuntu: The Rules of the Jungle
Are you the awkward, lanky kid trying to join the cool athletes on the field? Or perhaps you went out to buy an NVIDIA video card, only to realize 2GB of VRAM won’t get you very far? And, better yet, “upgraded” to an AMD? And then, when you got it home and started hacking away on spaCy and Stable Diffusion, you realized it’s a nightmare?
Yeah, that’s me. As I’m writing this, there’s still a side project I’m trying to get working with PyTorch + CuPy. But let me untangle this mess. In this guide I’ll walk through what the stack looks like and its various components, in hopes that it sheds light on what you might be up against.
AMD ≠ Nvidia
Duh.
Men and women are different too.
But what I mean is that the underlying drivers are not the same.
In short, Nvidia uses CUDA, and AMD uses ROCM.
The current tech industry relies heavily on CUDA. Nvidia isn’t sharing their tech with AMD, so AMD is essentially creating a software layer that says to the kernel, “I really am CUDA 😉, trust me on this. 😉😉”
If you’re looking for a video card with easy integration with AI libraries, go with NVIDIA. Unfortunately AMD is lagging a bit behind.
Environment Variables
These variables are crucial to getting your environment working.
DRI_PRIME
This is like your “device selector.” If you have a discrete video card (meaning, you had to crack open your computer and install it yourself), this is likely 1. I’d recommend using your on-board video card for most day-to-day activities to leave your GPU’s full VRAM for the heavy lifting. Typically, AI libraries will try to detect a GPU before falling back to the CPU, but some programs may need this set to give them an extra nudge.
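Here’s a minimal sketch of both ways to use it (`myapp` is a hypothetical stand-in for whatever program you’re launching):

```shell
# One-off: run a single program on the discrete card (device 1) while
# the desktop stays on the on-board GPU:
#   DRI_PRIME=1 myapp
# Session-wide: export it so everything launched from this shell uses
# the discrete card:
export DRI_PRIME=1
```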
HSA_OVERRIDE_GFX_VERSION
Try saying that 10 times fast! But really, this flag has a couple of purposes. This environment variable provides a sort of “reverse mask.” Sometimes libraries may see your ROCM version (we’ll get to that in a second) as not compiled with HIP. This essentially tells the library: “No, really, you can use this.” You know you need to set this variable if (1) rocminfo isn’t showing anything, or (2) rocminfo finds the GPU but your program does not (setting this will not always resolve #2). You can derive this value from the rocminfo command (see below).
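As a sketch, for a gfx803 card (e.g. an RX 580, per the rocminfo output later in this article) the GFX version is 8.0.3; adjust to match your own card’s gfx number:

```shell
# gfx803 -> GFX version 8.0.3; substitute your card's value
export HSA_OVERRIDE_GFX_VERSION=8.0.3
```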
ROCM_HOME
Tells AI libraries where ROCM is stored. On *nix systems this is typically somewhere in /opt.
LD_LIBRARY_PATH
C developers should be used to seeing this, but it also comes up a lot when developing with AI Python libraries. Typically this is set to $ROCM_HOME/lib. An indication you’re missing this variable is importing PyTorch and seeing an error like undefined reference to...
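A minimal sketch of both variables together, assuming a versioned ROCM install under /opt (the 5.4.3 path is an example; adjust to whatever you installed):

```shell
# Point libraries at the ROCM install (example version):
export ROCM_HOME=/opt/rocm-5.4.3
# Let the dynamic linker find ROCM's shared libraries, preserving any
# existing search path:
export LD_LIBRARY_PATH="$ROCM_HOME/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```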
Command Line Utilities
amdgpu-install
amdgpu-install [docs] is the Linux command-line utility for installing driver versions. Although the docs warn about it in boldface letters, it’s still possible to install multiple versions of amdgpu-install. Just be careful about it.
What I like to do with amdgpu-install is set an alias:
alias ainst='sudo amdgpu-install -y --rocmrelease=${ROCMV} --usecase=rocm,rocmdevtools,lrt,hip,hiplibsdk,mllib,mlsdk,dkms'
Make sure it’s in single quotes (') so that ${ROCMV} isn’t expanded when the alias is defined.
Now, to install a specific release, just set ROCMV to the ROCM version number and run ainst.
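As a sketch, here’s what the alias expands to for a given release (we echo the expanded command rather than run it, since the real thing needs sudo; 5.4.3 is just an example version):

```shell
# Pick a release; single quotes in the alias mean ${ROCMV} expands at
# invocation time, not definition time.
ROCMV=5.4.3
echo sudo amdgpu-install -y --rocmrelease=${ROCMV} --usecase=rocm,rocmdevtools,lrt,hip,hiplibsdk,mllib,mlsdk,dkms
```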
If you come across a bunch of Does not have a release file errors, check /etc/apt/sources.list.d/amdgpu-install and /etc/apt/sources.list.d/rocm and change the ROCM version and OS code name.
And note that, as of February 2023, you will get errors installing ROCM v5.0 or earlier due to legacy C libraries. So it’s best to stick with 5.3.0 and above if you want to do versioned installs.
rocminfo
rocminfo is AMD’s equivalent of nvidia-smi. It provides system info on what type of video card you have.
The Name is something libraries may require you to set. For example, if you build CuPy yourself, you’ll have to set HCC_AMDGPU_TARGET. So for me…
$ rocminfo | grep 'Name:'
Name: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
Marketing Name: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
Vendor Name: CPU
Name: gfx803
Marketing Name: Radeon RX 580 Series
Vendor Name: AMD
Name: amdgcn-amd-amdhsa--gfx803
$ export HCC_AMDGPU_TARGET='gfx803'
Also, note that 803 translates to 8.0.3, which is what I need to set HSA_OVERRIDE_GFX_VERSION to. If you’re not seeing anything with rocminfo, try unsetting the variable.
Now, for the Common AI Libraries…
All libraries will try to find either CUDA or ROCM. By far, CUDA is the first priority when it comes to support. ROCM support is often experimental, as is the case with CuPy (as of February 2023 the author [that’s me!] has gotten CuPy to work with ROCM 5.4.3 by building from source).
Torch does fare a bit better. They provide instructions on how to install torch for ROCM (for Linux, at least). They say they support ROCM 5.2, but I’ve been able to get PyTorch to work on 5.4 with no issues.
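A quick sanity check, guarded so it still prints something useful if torch isn’t installed yet. ROCM builds of PyTorch reuse the torch.cuda namespace (HIP masquerading as CUDA), so the usual CUDA checks apply:

```shell
# Ask Python whether the ROCM build of PyTorch can see the GPU.
TORCH_STATUS="$(python3 - <<'PY'
try:
    import torch
    # ROCM wheels report a version like "1.13.1+rocm5.2"
    print("torch", torch.__version__, "gpu available:", torch.cuda.is_available())
except ImportError:
    print("torch not installed")
PY
)"
echo "$TORCH_STATUS"
```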
Note that if you run into any library issues (e.g. with spaCy), make sure to install PyTorch + CuPy first, then reinstall the dependent libraries, i.e.
$ # oh nose! I need to reinstall spacy!
$ pip uninstall torch torchaudio torchvision spacy -y
$ pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/rocm5.2 --no-cache-dir
$ pip install cupy # or install from source
$ pip install spacy
Again the installation order goes:
- ROCM
- torch / cupy
- libraries that rely on torch / cupy
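Once everything is reinstalled in that order, spaCy should report the GPU via CuPy. Here’s a guarded sketch of the check (it degrades gracefully if spaCy isn’t installed yet):

```shell
# spacy.prefer_gpu() returns True only if cupy can actually see a GPU.
SPACY_GPU="$(python3 - <<'PY'
try:
    import spacy
    print("spacy gpu:", spacy.prefer_gpu())
except ImportError:
    print("spacy not installed")
PY
)"
echo "$SPACY_GPU"
```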
End of the Quick Walk-through
Hopefully this provided a quick overview of how to get started on an AMD GPU. I wish I had a “this is the answer” kind of article for you, but getting AMD working requires a lot of monkeying around to get things to work. Instructions for one AMDGPU version on one project will fail for another AMDGPU version on another project. I hope I’ve provided a framework to help you deduce what might be wrong with your setup, though (or at least shorten the afternoon you’ll spend debugging).
And, yeah, AMD does need to step up their game. As of this writing I still can’t get a project to work with pytorch, even though pytorch is working perfectly fine with another project.
👉 Did you like this article? Share it with 3 of your friends or colleagues!
📢 Comment below: What’s your experience with AMD? Are you sick of it? Or sticking with it?
💓 Subscribe to DamnGoodTech on Ko-Fi. Help support more of these articles.