A Simple Guide to Enabling CUDA GPU Support for llama-cpp-python on Your OS or in Containers

Ryan Stewart
Dec 31, 2023


A GPU can significantly speed up training large language models or running inference with them, but just getting an environment set up to use a GPU can be challenging, as the many conversations on StackOverflow and other developer forums attest.

If you want to learn how to enable the popular llama-cpp-python library to use your machine's CUDA-capable GPU, you've come to the right place. Fortunately, it's a very straightforward process once you filter through the noise, which is exactly what I've done for you here. In this guide, I'll walk you through the specific steps required to enable GPU support for llama-cpp-python.

(The steps below assume you have a working Python installation and are at least familiar with llama-cpp-python, or already have llama-cpp-python working for CPU-only inference.)

Step 1: Download & Install the CUDA Toolkit

The first step in enabling GPU support for llama-cpp-python is to download and install the NVIDIA CUDA Toolkit. The CUDA Toolkit includes the drivers and software development kit (SDK) required to compile and run CUDA-accelerated applications.

You can download the CUDA Toolkit installer and find installation instructions for Windows and popular Linux distributions on Nvidia’s website: https://developer.nvidia.com/cuda-downloads.
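
Once the installer finishes, it's worth confirming that everything is in place before moving on. Here's a quick sanity check (assuming the default Linux install location of /usr/local/cuda; adjust the path if yours differs):

# Confirm the NVIDIA driver can see your GPU
nvidia-smi

# Confirm the CUDA compiler (nvcc) is installed; add /usr/local/cuda/bin to your PATH if it isn't found
nvcc --version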

Step 2: Use CUDA Toolkit to Recompile llama-cpp-python with CUDA Support

Once you have installed the CUDA Toolkit, the next step is to compile (or recompile) llama-cpp-python with CUDA support enabled. This involves setting the appropriate environment variables to point to your nvcc installation (which is installed with the CUDA Toolkit) and specifying the CUDA architecture(s) to compile for.

Here’s an example command to recompile llama-cpp-python with CUDA support enabled for all major CUDA architectures:

CUDACXX=/usr/local/cuda/bin/nvcc CMAKE_ARGS="-DGGML_CUDA=on -DCMAKE_CUDA_ARCHITECTURES=all-major" pip install llama-cpp-python --no-cache-dir --force-reinstall --upgrade

NOTE: For older versions of llama-cpp-python, which used the LLAMA_CUBLAS flag instead of GGML_CUDA, you may need to run the command below instead. You DO NOT need to run both commands.

CUDACXX=/usr/local/cuda/bin/nvcc CMAKE_ARGS="-DLLAMA_CUBLAS=on -DCMAKE_CUDA_ARCHITECTURES=all-major" FORCE_CMAKE=1 pip install llama-cpp-python --no-cache-dir --force-reinstall --upgrade

In the example above, we are setting the CUDACXX environment variable to the path of the nvcc compiler executable included with the CUDA Toolkit. We are also setting the CMAKE_ARGS variable to specify that we want to enable CUDA support and compile for all major CUDA architectures.

Other valid values for CMAKE_CUDA_ARCHITECTURES are all (to build for all supported architectures) and native (to build only for the host system's GPU architecture). Note that GPUs are usually not available while building a container image, so avoid using -DCMAKE_CUDA_ARCHITECTURES=native in a Dockerfile unless you know what you're doing.
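
After the install completes, a quick smoke test can confirm that the package was actually built with CUDA support. The snippet below is a minimal sketch; it assumes you have some GGUF model saved at ./model.gguf (a placeholder path). With verbose=True, llama-cpp-python prints backend and layer-offload information while loading the model, so check that output for lines indicating layers were offloaded to the GPU rather than kept on the CPU:

python3 -c "
from llama_cpp import Llama

# n_gpu_layers=-1 asks llama.cpp to offload as many layers as possible to the GPU;
# verbose=True prints backend/offload details during model load.
llm = Llama(model_path='./model.gguf', n_gpu_layers=-1, verbose=True)

# Run a tiny completion to make sure inference works end to end.
print(llm('Q: What is the capital of France? A:', max_tokens=16)['choices'][0]['text'])
"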

Example Dockerfile

To make it easier to run llama-cpp-python with CUDA support and deploy applications that rely on it, you can build a Docker image that includes the necessary compile-time and runtime dependencies. Below is a Dockerfile demonstrating the steps explained above; it builds an image with llama-cpp-python compiled with CUDA support for all major CUDA architectures:

FROM python:3.10-bookworm

## Add your own requirements.txt if desired and uncomment the two lines below
# COPY ./requirements.txt .
# RUN pip install -r requirements.txt

## Install CUDA Toolkit (Includes drivers and SDK needed for building llama-cpp-python with CUDA support)
RUN apt-get update && apt-get install -y software-properties-common && \
    wget https://developer.download.nvidia.com/compute/cuda/12.3.1/local_installers/cuda-repo-debian12-12-3-local_12.3.1-545.23.08-1_amd64.deb && \
    dpkg -i cuda-repo-debian12-12-3-local_12.3.1-545.23.08-1_amd64.deb && \
    cp /var/cuda-repo-debian12-12-3-local/cuda-*-keyring.gpg /usr/share/keyrings/ && \
    add-apt-repository contrib && \
    apt-get update && \
    apt-get -y install cuda-toolkit-12-3

## Install llama-cpp-python with CUDA Support (and jupyterlab)
RUN CUDACXX=/usr/local/cuda-12/bin/nvcc CMAKE_ARGS="-DGGML_CUDA=on -DCMAKE_CUDA_ARCHITECTURES=all-major" FORCE_CMAKE=1 \
pip install jupyterlab llama-cpp-python --no-cache-dir --force-reinstall --upgrade

WORKDIR /workspace

## Run jupyterlab on container startup
CMD ["jupyter", "lab", "--ip", "0.0.0.0", "--port", "8888", "--NotebookApp.token=''", "--NotebookApp.password=''", "--no-browser", "--allow-root"]

In this example, we use a Debian-based Python 3.10 image as our base image. We then install the CUDA Toolkit and compile and install llama-cpp-python with CUDA support (along with jupyterlab). Finally, we set our container’s default command to run JupyterLab when the container starts.
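
With the Dockerfile above saved in an empty directory, you can build the image with something like the following (the tag llama-cpp-python-cuda is just a placeholder; name the image whatever you like). Expect the build to take a while, since it downloads the CUDA Toolkit and compiles llama-cpp-python from source:

docker build -t llama-cpp-python-cuda .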

Alternatively, you can use the nvidia/cuda base image provided by Nvidia which simplifies the installation and results in a much smaller image. For example:

FROM nvidia/cuda:12.3.1-devel-ubuntu22.04

SHELL ["/bin/bash", "-c"]

## Add your own requirements.txt if desired and uncomment the two lines below
# COPY ./requirements.txt .
# RUN pip install -r requirements.txt

RUN apt-get update && apt-get install -y \
    python3-dev python3-pip \
    curl \
    build-essential \
    software-properties-common \
    && rm -rf /var/lib/apt/lists/*

## Install llama-cpp-python with CUDA Support (and jupyterlab)
RUN CUDACXX=/usr/local/cuda/bin/nvcc CMAKE_ARGS="-DGGML_CUDA=on -DCMAKE_CUDA_ARCHITECTURES=all-major" pip install llama-cpp-python jupyterlab --no-cache-dir --force-reinstall --upgrade

WORKDIR /workspace

## Run jupyterlab on container startup
CMD ["jupyter", "lab", "--ip", "0.0.0.0", "--port", "8888", "--NotebookApp.token=''", "--NotebookApp.password=''", "--no-browser", "--allow-root"]

The Dockerfiles above can be used as-is or modified to suit your needs. If you are using a container to run your llama-cpp-python projects, don’t forget to ensure that your GPU is available to your containers at runtime! Otherwise, compiling with CUDA support will be for naught.
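
With Docker, making the GPU available typically means installing the NVIDIA Container Toolkit on the host and passing the --gpus flag when starting the container. Here's a sketch that reuses the placeholder image tag from the build example above and publishes the JupyterLab port defined in the Dockerfiles:

# Requires the NVIDIA Container Toolkit on the host
docker run --rm --gpus all -p 8888:8888 -v "$(pwd)":/workspace llama-cpp-python-cuda

You should then be able to open JupyterLab at http://localhost:8888 and confirm from a notebook that model layers are being offloaded to your GPU.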

Conclusion

At this point, you should be set up to use llama-cpp-python with GPU on your host operating system or in containers. If you’re wondering what to do next, try downloading and using some GGUF models with llama-cpp-python as I’ve explained here — https://medium.com/predict/a-simple-comprehensive-guide-to-running-large-language-models-locally-on-cpu-and-or-gpu-using-c0c2a8483eee.

Let me know what kinds of cool projects you build using llama-cpp-python in the comments below. Happy New Year, and happy coding!
