Install llama-cpp-python with GPU Support
A walkthrough of installing the llama-cpp-python package with GPU capability (cuBLAS) to load models easily onto the GPU.
If you are looking for a step-by-step approach to installing the llama-cpp-python package, you are in the right place. This guide summarizes the steps required for installation.
Before we install, you may be wondering why we need to install this package separately with GPU capability.
This package gives us a class to create a model instance, primarily for pre-trained LLMs (in this guide we use it through LangChain's LlamaCpp wrapper).
By default, even if you have an Nvidia GPU in your system and all the CUDA compilers and packages installed, this package is built with CPU capability only.
Installing it with GPU capability enabled speeds up the computation of LLMs (Large Language Models) by automatically transferring the model onto the GPU.
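As a quick illustration of this difference, here is a minimal sketch using the package's core Llama class (the model path below is hypothetical; use any GGUF model you have):

from llama_cpp import Llama

# Default: n_gpu_layers is 0, so all computation stays on the CPU.
llm_cpu = Llama(model_path="./mistral-7b-v0.1.Q4_K_M.gguf")  # hypothetical path

# With a GPU-enabled build, n_gpu_layers=-1 offloads all layers to the GPU.
llm_gpu = Llama(model_path="./mistral-7b-v0.1.Q4_K_M.gguf", n_gpu_layers=-1)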
In this guide, I will provide the steps to install this package using cuBLAS, a GPU-accelerated library provided by Nvidia.
My System Configuration
- System — Azure VM
- OS — Ubuntu 20.04
- LLM model used — Mistral-7B
Pre-requisites
1. Ensure the Nvidia CUDA toolkit is installed; the minimum required version is 12.2.
- Download the required package from the Nvidia official website (https://developer.nvidia.com/cuda-12-2-0-download-archive) and install it.
- Verify successful installation of the toolkit using the command below, which should detect your GPU.
- nvidia-smi
- Also verify the install location: the /usr/local/ directory should contain a cuda-12.2 directory with all the required files inside.
2. Install the GCC and G++ compilers needed to compile and install the package.
- Add the GCC toolchain repository using the command below.
- sudo add-apt-repository ppa:ubuntu-toolchain-r/test
- Install the gcc and g++ compilers using the command below.
- sudo apt install gcc-11 g++-11
- Update the alternatives using the command below so that version 11 becomes the default (the --slave entry keeps g++ in sync with gcc).
- sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-11 60 --slave /usr/bin/g++ g++ /usr/bin/g++-11
- Check the installed versions of GCC and G++ to confirm correct installation.
- gcc --version # This should print the gcc version as 11.4.0
- g++ --version # This should print the g++ version as 11.4.0
3. Install the langchain and cmake Python packages.
- pip install langchain cmake
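As an optional sanity check that the prerequisites above are in place, here is a small Python sketch (it assumes nvidia-smi, gcc, and g++ are on your PATH and that CUDA was installed under /usr/local/cuda-12.2):

import subprocess
from pathlib import Path

# The driver and both compilers should all respond if the steps above worked.
for cmd in (["nvidia-smi"], ["gcc", "--version"], ["g++", "--version"]):
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(result.stdout.splitlines()[0])  # first line of each tool's output

# The toolkit directory created in step 1 should exist.
print(Path("/usr/local/cuda-12.2").is_dir())  # expect True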
Llama-CPP installation
- By default, the build tries to pick up the lowest CUDA version available on the VM. If multiple CUDA versions are installed, the desired version must be specified explicitly.
- Use the command below to install the package.
CMAKE_ARGS="-DLLAMA_CUBLAS=on -DCUDA_PATH=/usr/local/cuda-12.2 -DCUDAToolkit_ROOT=/usr/local/cuda-12.2 -DCUDAToolkit_INCLUDE_DIR=/usr/local/cuda-12.2/include -DCUDAToolkit_LIBRARY_DIR=/usr/local/cuda-12.2/lib64" FORCE_CMAKE=1 pip install llama-cpp-python --no-cache-dir
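Once the install finishes, you can confirm the package imports cleanly. Recent llama-cpp-python releases also expose a low-level llama_supports_gpu_offload() binding; this is an assumption about your installed version, and if it is missing, the verbose load output in the next section gives the same information:

import llama_cpp

print(llama_cpp.__version__)

# Assumption: your version exposes this binding; it returns True when the
# library was compiled with GPU offload support.
print(llama_cpp.llama_supports_gpu_offload())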
Verifying the installation
Verify the installation by creating an instance of the LLM with the verbose = True parameter enabled.
from langchain.llms import LlamaCpp

model_path = "./mistral-7b-v0.1.Q4_K_M.gguf"  # hypothetical path to your GGUF model
model = LlamaCpp(model_path=model_path, n_gpu_layers=-1, verbose=True)
n_gpu_layers = -1 is the main parameter that transfers the available computation layers onto the GPU. Alternatively, you can set the number of layers you want to transfer, but -1 automatically calculates and transfers all of them. verbose = True prints the model details and parameters during loading.
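If the full model does not fit in your GPU's VRAM, a partial offload is possible. A minimal sketch, where the layer count and model path are illustrative:

from langchain.llms import LlamaCpp

# Offload only 20 of the model's layers to the GPU; the rest run on the CPU.
model = LlamaCpp(model_path="./mistral-7b-v0.1.Q4_K_M.gguf", n_gpu_layers=20, verbose=True)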
In the verbose output, you can see that the GPU is detected (Tesla T4) and that BLAS = 1 is printed. When the BLAS value is 1, GPU capability is enabled and the model is offloaded onto the GPU.
Comparison
LlamaCPP with CPU
- Time taken to load the Mistral-7B model — 1 min (approx.)
- Time taken to generate a response to a query — 20 min (approx.)
LlamaCPP with GPU
- Time taken to load the Mistral-7B model — 30 sec (approx.)
- Time taken to generate a response to a query — 30 sec (approx.)
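These numbers come from my runs; you can reproduce a similar measurement with a small timing sketch like the one below (the model path and prompt are illustrative):

import time
from langchain.llms import LlamaCpp

start = time.time()
model = LlamaCpp(model_path="./mistral-7b-v0.1.Q4_K_M.gguf", n_gpu_layers=-1, verbose=False)
print(f"Load time: {time.time() - start:.1f} s")

start = time.time()
# On older LangChain versions, call model(...) directly instead of model.invoke(...).
print(model.invoke("Explain cuBLAS in one sentence."))
print(f"Response time: {time.time() - start:.1f} s")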
Conclusion
Based on the load time and response-generation time, we can clearly see the performance difference when we use the llama-cpp-python
package with GPU support. Consider installing this package with GPU capability for better performance if you have one or more GPUs attached to your system.