Compiling Llama.cpp for Nvidia on Pop!_OS

Duane Johnson
3 min read · Aug 28, 2023


Incredibly, running a local LLM (large language model) on just the CPU is possible with Llama.cpp! However, it can be pretty slow: I get about 1 token every 2 seconds with a 34-billion-parameter model on an 11th-gen Intel Framework laptop with 64GB of RAM.

I have an external Nvidia GPU connected to my Pop!_OS laptop, and I’ve used the following technique to successfully compile Llama.cpp to use CLBlast (an OpenCL BLAS library) to speed up various LLMs (such as codellama-34b.Q4_K_M.gguf). As a rough estimate, the speed-up I get is about 5x on my Nvidia RTX 3080 Ti.

Unfortunately, it’s difficult to use either Ubuntu’s native CUDA deb package (it’s out of date) or Nvidia’s Ubuntu-specific deb package (it’s out of sync with Pop’s Nvidia driver). So we have to resort to compiling GPU-enabled code inside a docker container. Here’s how.

First, follow System76’s instructions to set up docker with access to the GPU:

# install docker, git, and Nvidia's container toolkit
sudo apt install nvidia-container-toolkit docker.io git
# let your user run docker without sudo
sudo usermod -aG docker $USER
# boot with cgroup v1 so the Nvidia container runtime can access the GPU
sudo kernelstub --add-options "systemd.unified_cgroup_hierarchy=0"

This changes the kernel to use cgroup v1 instead of cgroup v2; note, however, that cgroup v1 has less stringent device security in some cases (see this issue for more details). As far as I understand, this is still necessary in 2023 (I’d love to be corrected).
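
One quick way to confirm which cgroup version is actually active (after the reboot in the next step) is:

# prints "tmpfs" under cgroup v1, "cgroup2fs" under cgroup v2
stat -fc %T /sys/fs/cgroup/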

Reboot. Then:

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
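
Before moving on, it’s worth a quick sanity check that docker can actually see the GPU (using the same CUDA image we’ll pull in a moment):

docker run --rm --runtime=nvidia --gpus all \
  nvidia/cuda:12.2.0-devel-ubuntu22.04 nvidia-smi
# your GPU should appear in the nvidia-smi table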

Now, let’s clone the Llama.cpp repository:

cd # i.e. go to your home dir, or another suitable location
git clone git@github.com:ggerganov/llama.cpp.git
# or: git clone https://github.com/ggerganov/llama.cpp.git

Next (with a nod to Docker Hub’s hosting), let’s download and interactively enter Nvidia’s most recent Ubuntu container, built specifically to give us access to CUDA and other GPU libraries:

docker run -it --rm -v ~/llama.cpp:/llama \
--runtime=nvidia --gpus all \
nvidia/cuda:12.2.0-devel-ubuntu22.04 bash

Note that we’ve bind-mounted (-v) the $HOME/llama.cpp directory we cloned above to a directory (volume) INSIDE the container called /llama. This will let us run the compiled llama.cpp main executable later without having to be inside the docker container, since the compiled output is written straight to the host directory.
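
If you want to convince yourself the bind mount is working, a quick check from inside the container:

# you should see the repository's files (README.md, CMakeLists.txt, and so on)
ls /llama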

Here’s the equivalent using podman 4.x:

podman run -it --rm -v ~/llama.cpp:/llama \
--device nvidia.com/gpu=all \
docker.io/nvidia/cuda:12.2.0-devel-ubuntu22.04 bash

(If your version of podman is lower than 4.2, you can follow the Ubuntu install instructions here, and add CDI support for your Nvidia GPU here.)
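
For the CDI part, the rough idea (the exact steps depend on your nvidia-container-toolkit version) is to generate a CDI spec that podman’s --device nvidia.com/gpu=all flag can reference:

# generate a CDI spec describing your Nvidia GPU(s)
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
# list the device names the spec exposes
nvidia-ctk cdi list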

Ok, now that we’re inside the docker container, let’s install all of our prerequisites, configure cmake, and build:

cd /llama
mkdir build && cd build
# build prerequisites: cmake, git, CLBlast, and Nvidia's OpenCL headers
apt update && apt install cmake git libclblast-dev nvidia-opencl-dev
# the repo is owned by the host user, so tell git the directory is safe to use
git config --global --add safe.directory /llama
# configure with CLBlast enabled, then compile
cmake -DCMAKE_BUILD_TYPE=Release -DLLAMA_BUILD_TESTS=0 -DLLAMA_CLBLAST=1 ..
cmake --build .

If your Nvidia GPU is plugged in, and the container sees the host system’s GPU, you should see a bunch of executable files created as a result of compilation. The most important of the bunch is the main binary, which you should see listed as it builds.
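
Once the build finishes, the binaries land in build/bin (and, thanks to the bind mount, show up on the host at ~/llama.cpp/build/bin as well). For example, from the build directory inside the container:

ls bin/
# main should be listed alongside other tools such as quantize and perplexity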

Now that this is built, you can exit the container (if you’d like) and run main from your regular shell:

cd ~/llama.cpp
./build/bin/main -m models/codellama-34b.Q4_K_M.gguf
# LLM starts generating hallucinations

At this point, you probably won’t see much of a speed boost, because we haven’t instructed the process to offload any layers to the GPU. That’s what the -ngl flag is for (note the -ngl 28 in the command below):

./build/bin/main -m models/codellama-34b.Q4_K_M.gguf -ngl 28

I had to play with that number (28) for a bit to see how many layers would fit in my 12GB of GPU RAM. With these 28 of 49 layers loaded, I see about a 5x improvement over CPU-only language generation.
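
If you’re tuning that number yourself, it helps to watch GPU memory while main loads the model; in a second terminal:

# refresh nvidia-smi every second while the model loads
watch -n 1 nvidia-smi
# raise -ngl until GPU memory sits just under capacity; lower it if you hit out-of-memory errors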

May you never again have to re-compile anything inside docker. And good luck! :)
