Install llama-cpp-python with GPU Support
A walkthrough of installing the llama-cpp-python package with GPU capability (cuBLAS) to load models easily onto the GPU.
If you are looking for a step-by-step approach to installing the llama-cpp-python package, you are in the right place. This guide summarizes the steps required for installation.
Before we install, you may be wondering why we need to install this package separately with GPU capability.
This package gives us a class to create a model instance, primarily for pre-trained LLMs (in this guide we use it through LangChain's LlamaCpp wrapper).
By default, even if you have an Nvidia GPU in your system and all the CUDA compilers and packages installed, this package is built with CPU capability only.
Installing it with GPU capability enabled speeds up the computation of LLMs (Large Language Models) by automatically transferring the model onto the GPU.
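As a quick illustration of this difference, here is a minimal sketch using the package's core Llama class (the model path below is hypothetical; use any GGUF model you have):

from llama_cpp import Llama

# Default: n_gpu_layers is 0, so all computation stays on the CPU.
llm_cpu = Llama(model_path="./mistral-7b-v0.1.Q4_K_M.gguf")  # hypothetical path

# With a GPU-enabled build, n_gpu_layers=-1 offloads all layers to the GPU.
llm_gpu = Llama(model_path="./mistral-7b-v0.1.Q4_K_M.gguf", n_gpu_layers=-1)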
In this guide, I will provide the steps to install this package using cuBLAS, a GPU-accelerated library provided by Nvidia.
My System Configuration
- System — Azure VM
- OS — Ubuntu 20.04
- LLM model used — Mistral-7B
Pre-requisites
1. Ensure the Nvidia CUDA toolkit is installed; the minimum required version is 12.2.
- Download the required package from the Nvidia official website (https://developer.nvidia.com/cuda-12-2-0-download-archive) and install it.
- Verify successful installation of the toolkit using the command below, which should detect your GPU.
- nvidia-smi
- Also verify the install location: the /usr/local/ directory should contain a cuda-12.2 directory with all the required files inside.
2. Install the GCC and G++ compilers needed to compile and install the package.
- Add the GCC toolchain repository using the command below.
- sudo add-apt-repository ppa:ubuntu-toolchain-r/test
- Install the gcc and g++ compilers using the command below.
- sudo apt install gcc-11 g++-11
- Update the alternatives using the command below so that version 11 becomes the default (the --slave entry keeps g++ in sync with gcc).
- sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-11 60 --slave /usr/bin/g++ g++ /usr/bin/g++-11
- Check the installed versions of GCC and G++ to confirm correct installation.
- gcc --version # This should print the gcc version as 11.4.0
- g++ --version # This should print the g++ version as 11.4.0
3. Install the langchain and cmake Python packages.
- pip install langchain cmake
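As an optional sanity check that the prerequisites above are in place, here is a small Python sketch (it assumes nvidia-smi, gcc, and g++ are on your PATH and that CUDA was installed under /usr/local/cuda-12.2):

import subprocess
from pathlib import Path

# The driver and both compilers should all respond if the steps above worked.
for cmd in (["nvidia-smi"], ["gcc", "--version"], ["g++", "--version"]):
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(result.stdout.splitlines()[0])  # first line of each tool's output

# The toolkit directory created in step 1 should exist.
print(Path("/usr/local/cuda-12.2").is_dir())  # expect True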
Llama-CPP installation
- By default, the build tries to pick up the lowest CUDA version available on the VM. If multiple CUDA versions are installed, the desired version must be specified explicitly.
- Use the command below to install the package.
CMAKE_ARGS="-DLLAMA_CUBLAS=on -DCUDA_PATH=/usr/local/cuda-12.2 -DCUDAToolkit_ROOT=/usr/local/cuda-12.2 -DCUDAToolkit_INCLUDE_DIR=/usr/local/cuda-12.2/include -DCUDAToolkit_LIBRARY_DIR=/usr/local/cuda-12.2/lib64" FORCE_CMAKE=1 pip install llama-cpp-python --no-cache-dir
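Once the install finishes, you can confirm the package imports cleanly. Recent llama-cpp-python releases also expose a low-level llama_supports_gpu_offload() binding; this is an assumption about your installed version, and if it is missing, the verbose load output in the next section gives the same information:

import llama_cpp

print(llama_cpp.__version__)

# Assumption: your version exposes this binding; it returns True when the
# library was compiled with GPU offload support.
print(llama_cpp.llama_supports_gpu_offload())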
Verifying the installation
Verify the installation by creating an instance of the LLM with the verbose = True parameter enabled.
from langchain.llms import LlamaCpp

model_path = "./mistral-7b-v0.1.Q4_K_M.gguf"  # hypothetical path to your GGUF model
model = LlamaCpp(model_path=model_path, n_gpu_layers=-1, verbose=True)
n_gpu_layers = -1 is the main parameter that transfers the available computation layers onto the GPU. Alternatively, you can set the number of layers you want to transfer, but -1 automatically calculates and transfers all of them. verbose = True prints the model details and parameters during loading.
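If the full model does not fit in your GPU's VRAM, a partial offload is possible. A minimal sketch, where the layer count and model path are illustrative:

from langchain.llms import LlamaCpp

# Offload only 20 of the model's layers to the GPU; the rest run on the CPU.
model = LlamaCpp(model_path="./mistral-7b-v0.1.Q4_K_M.gguf", n_gpu_layers=20, verbose=True)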
In the verbose output, you can see that the GPU is detected (Tesla T4) and that BLAS = 1 is printed. When the BLAS value is 1, GPU capability is enabled and the model is offloaded onto the GPU.
Comparison
LlamaCPP with CPU
- Time taken to load the Mistral-7B model — 1 min (approx.)
- Time taken to generate a response to a query — 20 min (approx.)
LlamaCPP with GPU
- Time taken to load the Mistral-7B model — 30 sec (approx.)
- Time taken to generate a response to a query — 30 sec (approx.)
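These numbers come from my runs; you can reproduce a similar measurement with a small timing sketch like the one below (the model path and prompt are illustrative):

import time
from langchain.llms import LlamaCpp

start = time.time()
model = LlamaCpp(model_path="./mistral-7b-v0.1.Q4_K_M.gguf", n_gpu_layers=-1, verbose=False)
print(f"Load time: {time.time() - start:.1f} s")

start = time.time()
# On older LangChain versions, call model(...) directly instead of model.invoke(...).
print(model.invoke("Explain cuBLAS in one sentence."))
print(f"Response time: {time.time() - start:.1f} s")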
Conclusion
Based on the load time and response-generation time, we can clearly see the performance difference when we use the llama-cpp-python
package with GPU support. Consider installing this package with GPU capability for better performance if you have one or more GPUs attached to your system.