Ditching CUDA for AMD ROCm for more accessible LLM training and inference.

Rafael Manzano Masapanta
11 min read · Oct 31, 2023


I’m writing this entry mostly as a reference for those looking to train an LLM locally on a laptop with a Ryzen processor (with integrated graphics). It’s obviously not the optimal environment, but it can work as a last resort or for at-home experimenting.

In this initial entry, we’ll discuss ROCm, AMD’s answer to CUDA, which has been in development for years now; NVIDIA’s software stack is so well established that, until recently, it seemed to be the only viable option for model training and inference in local environments.

I mention local environments because an ‘alternative’ would be to use the TPUs Google provides in its cloud. However, Google barely supports PyTorch on its ASICs, an issue it has only recently started to address. Even so, the sanest option in that case would be falling back to TensorFlow (RIP).

I had found the development of ROCm interesting for a few years, since the compatible hardware was affordable; however, the technology wasn’t mature enough. And it’s not just my opinion; here are some threads on YCombinator’s Hacker News, for example, that don’t look so good:

People really don’t like ROCm, and with good reason.

Obtaining decent performance with NVIDIA hardware requires a fairly significant investment. Of course, for all that money we get a wide range of advantages, such as hardly having to worry about environment configuration, full integration in Windows, and not having to meet the requirement of a CPU and motherboard compatible with PCIe Atomics.

Seeking Alternatives

One should not be mistaken: even if it seems that AMD’s GPGPU offering gives us more wiggle room, that’s not entirely the case. Unlike NVIDIA, which enables CUDA on almost all of its graphics cards, ROCm only supports a limited selection of current AMD GPUs. According to AMD, the supported hardware at this moment includes the following chips:

https://rocm.docs.amd.com/en/docs-5.7.1/release/gpu_os_support.html

As the list shows, the minimum requirement for ROCm, according to AMD, is the gfx906 platform, sold under the commercial name AMD Instinct MI50. This GPU provides 13.3 TFLOPs in FP32 operations and 6.6 TFLOPs in FP64 operations.

At the time of writing, it represents a minimum investment of about 600 dollars.

eBay listing. Specifications courtesy of TechPowerUp

While in a professional setting they may seem like a tempting option, for the budgets of students, researchers, enthusiasts, and my own, they are simply out of range.

And not just because of the price of the card itself, but also because of its power consumption and the platform required to make it work: we’ll need a motherboard and CPU that support PCIe Atomics, which in practice means a relatively recent processor (most x86 platforms from 2017 onward).
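
If you’re unsure whether a given machine exposes PCIe Atomics, the PCIe capability dump is one quick place to look. This is only a rough check (it needs root and a reasonably recent pciutils), and the fields only appear on devices and ports that actually report the capability:

# Look for AtomicOpsCap / AtomicOpsCtl on the GPU and the root port above it
sudo lspci -vvv | grep -i -A1 "atomicops"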

Seeking (more) alternatives

After investigating a bit, there was something that didn’t quite fit: the last GPU listed on the official site was not supported, despite being marketed as a GPGPU oriented towards AI/ML:

The AMD Instinct MI25, with 16GB of HBM2 VRAM, was a consumer chip repurposed for compute environments, the same silicon sold at the time as the AMD Radeon RX Vega 56/64. It was a relative success thanks to its raw computing power, offering around 12.5 TFLOPs in single precision. Despite those numbers, it falls short in double precision, providing about 768 GFLOPs, roughly a ninth of what its more capable sister, the MI50, offers.

Apparently, earlier versions of ROCm did support the MI25, but as of ROCm 5.0 that support ended. Or rather, in AMD’s case, ‘not supported’ means ‘it will not receive any kind of support from us’; it doesn’t imply complete incompatibility with ROCm. The card still works with some tricks that will be explained later and that extend to similar hardware.

At the time of writing, the AMD Instinct MI25 represents a minimum investment of 70 dollars + import taxes:

Is this chip a reliable option for model training and inference?

Absolutely not.

However, if this alternative fails us, we don’t end up with a paperweight: since the card lacks firmware checks preventing it, we can flash it with the BIOS of an AMD Vega 64 ($150), the consumer version.

That consumer version supports DirectX, OpenGL, and Vulkan, so we could end up with a fairly competent graphics card for intensive 3D work in software like SketchUp, or for donating compute to Folding@Home, for example.

Or at least, that would be the case if we proceeded with this card.

Seeking an alternative for the alternative:

Knowing that ROCm 5.7 is compatible with older GPUs with a bit of tinkering, the most logical option would be to use the MI25. But, reading the title of the post, it’s easy to guess the direction we’re going to take and the hardware we’ll use this time.

In this scenario we’ll use an APU, since most laptops with a Ryzen CPU include an iGPU; specifically, this post should work with iGPUs based on the “GCN 5.0” architecture, or “Vega” for friends. We’ll use an AMD Ryzen 3 2200G in this post, an entry-level processor equipped with 4C/4T and an integrated GPU.

This integrated GPU, the Vega 8, can provide around 1.8 TFLOPs of single-precision (FP32) compute when running at a core frequency of 1.6 GHz.

Quite capable for what it is.

At the time of writing, it requires an investment of 40 dollars (don’t):

What we are going to try is:

  1. Make ROCm detect our iGPU.
  2. Install the latest version of ROCm and get it up and running.
  3. Run PyTorch.

To start, since it’s an iGPU, it carves out a portion of our system RAM to use as VRAM; the amount depends on how much we allocate to it in the BIOS/UEFI. In our case, we will assign the maximum limit, which here is 2GB; some motherboards may allow allocating more.

Now we will install Linux, specifically Ubuntu 22.04 for this test, which is officially supported by AMD. In theory it should also work on other distros, with some adjustments to the repositories and the appropriate dependencies installed.

Let’s get to work.

Installing ROCm:

First, we’ll install ROCm. We head to AMD’s official documentation (https://rocm.docs.amd.com/en/latest/deploy/linux/quick_start.html), where we need to add keys, repositories, and then we can install ROCm with apt.
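
For reference, on Ubuntu 22.04 the flow looks roughly like the following; the installer version and repository path change between ROCm releases, so treat the URL below as a placeholder and copy the current one from the linked quick-start page:

# Grab AMD's installer helper package (version shown is illustrative)
wget https://repo.radeon.com/amdgpu-install/5.7.1/ubuntu/jammy/amdgpu-install_5.7.50701-1_all.deb
sudo apt install ./amdgpu-install_5.7.50701-1_all.deb
sudo apt update
# Pull in the ROCm userspace stack and kernel driver
sudo amdgpu-install --usecase=rocm
# Let our user talk to the GPU without root
sudo usermod -aG render,video $USER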

Easy.

We run ‘sudo rocminfo’ and ‘sudo clinfo’ to confirm that ROCm detects our graphics card and, surprisingly, it does.
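
While we’re at it, we can double-check how much of the memory reserved in the BIOS the amdgpu driver actually exposes as VRAM. The sysfs node below is where amdgpu reports it; the card index may differ on systems with more than one GPU:

# Reported in bytes; a 2GB allocation shows up as 2147483648
cat /sys/class/drm/card0/device/mem_info_vram_total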

Not so fast though.

We (attempt to) use PyTorch:

Windows ROCm builds are always late.

If we try to install PyTorch using its official method, everything seems correct until it comes time to perform operations.
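
For context, the ‘official method’ at the time of writing is simply installing the prebuilt ROCm wheels from PyTorch’s package index; the ROCm version in the URL tracks their releases, so it may differ by the time you read this:

# Prebuilt ROCm wheels (the index URL below is the one current when this was written)
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.6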

>>> import torch
>>> print(torch.cuda.is_available())
False

Why?

While ROCm detects our APU and its iGPU, the limitation in this case lies with PyTorch: for it to recognize our graphics card through ROCm, we must recompile it ourselves with new parameters. Officially, ROCm 5.0+ doesn’t support the gfx900 platform, but earlier versions like ROCm 3.0 did.

The PyTorch repository still retains build instructions targeting “gfx900”, i.e. the officially unsupported AMD Instinct MI25; that isn’t our iGPU, but both accelerators share the same architecture. So we have to recompile PyTorch from source, defining new parameters to make it compatible with our Vega 8.

Installing extra libraries and preparing for compilation.

It turns out that if we have to compile PyTorch, our initial installation of ROCm and build-essential is not enough. We need a plethora of additional libraries and dependencies that, unfortunately, are not listed in the documentation and are not pulled in by AMD’s amdgpu-install script even when the ‘--usecase’ option is given every possible value. We turn to the Makefiles to find out which libraries will be asked for.

We have to acquire the dependencies ourselves; fortunately, many of them are in the AMD repository and can be easily installed using the package manager.

  1. Dependencies related to the compilation process:

sudo apt install python3 python3-pip gcc g++ libatomic1 make \
cmake doxygen graphviz texlive-full

  2. Dependencies that may not be included in the AMDGPU installer:

sudo apt install libstdc++-12-dev rock-dkms rocm-dev rocm-libs \
miopen-hip rccl rocthrust hipcub roctracer-dev cmake

After this, we need to install Magma, which we’ll also compile. However, this step requires installing conda, so we start from there.

Installing Miniconda/Conda:

If you don’t have Conda/Miniconda installed yet, just run this script, and it will install it in your user directory.

Keep note of the installation path in case you change it, as we’ll use it later:

mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh

We initialize the Miniconda environment with:

~/miniconda3/bin/conda init bash

Then we reload the terminal.
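
One extra note before building: the Magma makefile we are about to use (make.inc.hip-gcc-mkl) links against Intel MKL through the MKLROOT variable, which is why it ends up pointing at the Conda prefix. If MKL isn’t already present in your base environment, it can be pulled in with Conda first; this mirrors PyTorch’s own from-source instructions and is an assumption on my part if your MKL lives elsewhere:

# Install MKL into the active Conda environment so MKLROOT resolves correctly
conda install -y mkl mkl-include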

Compiling and installing Magma:

To facilitate the Magma installation process, we can use a script. However, we need some information beforehand:

  1. The LLVM target for a GPU similar to yours that is supported by ROCm.
  2. The installation path of Conda/Miniconda.

Obtaining your LLVM target is straightforward: you can look up your model at https://llvm.org/docs/AMDGPUUsage.html or run ‘sudo rocminfo’ and navigate to the GPU section. For example, my Vega 8 reports ‘gfx902’, and the closest supported target is ‘gfx900’, which actually corresponds to the MI25, not the Vega 8.

However, if I used ‘gfx902’ here, the build would fail, since no implementation was ever made for this chip.
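
A quick way to pull out just the relevant lines from rocminfo (the grep is only a convenience; reading the full output works just as well):

# Shows the ISA the runtime reports for the iGPU; mine prints gfx902
sudo rocminfo | grep -i "gfx"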

Once you have that information, modify the following script, setting PYTORCH_ROCM_ARCH and MKLROOT to your LLVM target and your Conda installation path respectively, and then execute it:

export PYTORCH_ROCM_ARCH=gfx900

# "install" hipMAGMA into /opt/rocm/magma by copying after build
git clone https://bitbucket.org/icl/magma.git
pushd magma
# Fixes memory leaks of magma found while executing linalg UTs
git checkout 5959b8783e45f1809812ed96ae762f38ee701972
cp make.inc-examples/make.inc.hip-gcc-mkl make.inc
echo 'LIBDIR += -L$(MKLROOT)/lib' >> make.inc
echo 'LIB += -Wl,--enable-new-dtags -Wl,--rpath,/opt/rocm/lib -Wl,--rpath,$(MKLROOT)/lib -Wl,--rpath,/opt/rocm/magma/lib' >> make.inc
echo 'DEVCCFLAGS += --gpu-max-threads-per-block=256' >> make.inc
export PATH="${PATH}:/opt/rocm/bin"
if [[ -n "$PYTORCH_ROCM_ARCH" ]]; then
  amdgpu_targets=`echo $PYTORCH_ROCM_ARCH | sed 's/;/ /g'`
else
  amdgpu_targets=`rocm_agent_enumerator | grep -v gfx000 | sort -u | xargs`
fi
for arch in $amdgpu_targets; do
  echo "DEVCCFLAGS += --amdgpu-target=$arch" >> make.inc
done
# hipcc with openmp flag may cause isnan() on __device__ not to be found; depending on context, compiler may attempt to match with host definition
sed -i 's/^FOPENMP/#FOPENMP/g' make.inc
make -f make.gen.hipMAGMA -j $(nproc)
LANG=C.UTF-8 make lib/libmagma.so -j $(nproc) MKLROOT=/home/rafaelmanzano/miniconda3
make testing/testing_dgemm -j $(nproc) MKLROOT=/home/rafaelmanzano/miniconda3
popd
sudo mv magma /opt/rocm

Compiling our PyTorch:

Now, to compile PyTorch, in addition to cloning it, updating the submodules, and satisfying requirements.txt, we need some additional pip packages to avoid problems during compilation. The preparation process looks like this:

git clone https://github.com/pytorch/pytorch.git  
cd pytorch
git submodule update --init --recursive

sudo pip3 install -r requirements.txt

sudo pip3 install enum34 numpy pyyaml setuptools typing cffi future \
hypothesis typing_extensions CppHeaderParser argparse

After this, we will have gathered the dependencies for the project itself. We then run a script included in the repository, which translates the stock CUDA code into HIP C++ code compatible with our AMD iGPU.

sudo python3 tools/amd_build/build_amd.py

The final step remains, in which we build PyTorch for our architecture.

Here we use the ‘gfx900’ LLVM target once again.

Modify the MAX_JOBS according to the resources of our system to avoid hanging in the middle of the compilation; we will use 50–80% of our processor’s cores to be safe.

sudo PYTORCH_ROCM_ARCH=gfx900 USE_ROCM=1 MAX_JOBS=4 python3 setup.py install

Notes:

- Compilation takes several hours and doesn’t necessarily have to happen on the target PC, as long as the dependencies are in place. For example, I am compiling on a Google Cloud VM.

Moving the pre-compiled result to the actual computer.
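
If you go down that route, one way to hand the build over is to produce a wheel on the build machine and install it on the target. This is just a sketch; the hostname and paths are placeholders, and the Python version and OS must match on both machines for the wheel to install:

# On the build machine: package the build as a wheel instead of installing it
sudo PYTORCH_ROCM_ARCH=gfx900 USE_ROCM=1 MAX_JOBS=4 python3 setup.py bdist_wheel
# Copy it to the APU machine and install it there
scp dist/torch-*.whl user@apu-machine:~/
sudo pip3 install ~/torch-*.whl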

Final touch-ups and testing with our PyTorch:

To ensure PyTorch runs in a stable manner, we need to modify the power-saving policies of the AMDGPU driver.

First, we’ll check the configuration values of the driver with:

find /sys/module/amdgpu/parameters/ -type f -name '*' -exec sh -c 'filename=${1%.*}; echo "File: ${filename##*/}"; cat "$1"' sh {} \;

We will get a fairly large list; we can ignore most of it, as we are only interested in the ‘ppfeaturemask’ entry.

We look it up, check its current value, and change it to 0xfff73fff with the following command:

sudo modprobe amdgpu ppfeaturemask=0xfff73fff
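
Note that modprobe only applies the parameter when the module is (re)loaded, and amdgpu is normally already loaded at boot. A more reliable way to make the setting stick across reboots is to pin it in a modprobe.d file and rebuild the initramfs; the filename below is arbitrary:

# Persist the ppfeaturemask override and regenerate the initramfs
echo "options amdgpu ppfeaturemask=0xfff73fff" | sudo tee /etc/modprobe.d/amdgpu-ppfeaturemask.conf
sudo update-initramfs -u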

Finally, we restart and we are ready to test some PyTorch examples.

git clone https://github.com/pytorch/examples.git
cd examples/mnist
sudo pip3 install -r requirements.txt

: 'We use an override to make PyTorch use the code intended
for the Radeon Instinct MI25 (9.0.0) with our Vega 8
and finally, we run our test.
'

sudo HSA_OVERRIDE_GFX_VERSION=9.0.0 python3 main.py --verbose

We got it 👏

I recommend setting “HSA_OVERRIDE_GFX_VERSION=9.0.0” permanently, so PyTorch keeps recognizing the iGPU even from the command line.
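
A simple way to make the override permanent for your user is to append it to your shell profile (use whichever profile file your shell actually reads):

echo 'export HSA_OVERRIDE_GFX_VERSION=9.0.0' >> ~/.bashrc
source ~/.bashrc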

ROCm on the Python commandline

Unsurprisingly, it’s hard to judge the stability of this setup, but I wouldn’t push it too hard, especially given the 2GB of VRAM…

In a few days, I will cover DeepSparse, developed by Neural Magic, which lets us run inference on the CPU alone, around 5 times faster than the current method with ONNX.

As it doesn’t require a GPU, we can integrate DeepSparse within AWS Lambda functions or any other serverless solution that only relies on a processor.

Additionally, it will be interesting to compare the cost between using DeepSparse on Lambda/Function instances and using CUDA on a specialized computing VM which includes a GPU, just to see which solution is more cost-effective.

Until next time! 👋
