Installing llama-cpp-python with NVIDIA GPU Acceleration on Windows: A Short Guide

Piyushbatra · 3 min read · Nov 18, 2023

Project repository: https://github.com/abetlen/llama-cpp-python

Are you a developer looking to harness the power of hardware-accelerated llama-cpp-python on Windows for local LLM development? Look no further! In this guide, I’ll walk you through the step-by-step process, helping you avoid the pitfalls I encountered during my own installation journey.

Prerequisites:

1. Install Visual Studio 2022 with:
  • C++ CMake tools for Windows
  • C++ core features
  • Windows 10/11 SDK

(Screenshot: Visual Studio 2022 Enterprise with the required components installed.)

2. CUDA Toolkit:

  • Install the NVIDIA CUDA Toolkit and make sure CUDA_PATH (e.g., C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.2) is set in your environment variables. A quick pre-flight check is sketched below.
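
Before building anything, it helps to confirm that Windows can actually see the toolkit. Here is a minimal check in plain Python; the v12.2 path in the comment is just the version assumed throughout this guide:

import os
import shutil

# CUDA_PATH should point at the toolkit root, e.g.
# C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.2
print("CUDA_PATH:", os.environ.get("CUDA_PATH", "<not set>"))

# nvcc (the CUDA compiler) must be reachable on PATH for the build
print("nvcc:", shutil.which("nvcc") or "<not on PATH>")

If either line prints a placeholder, fix the CUDA installation before moving on.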

Installation Steps:

Open a new command prompt and activate your Python environment (e.g., using conda). Run the following commands:

set CMAKE_ARGS=-DLLAMA_CUBLAS=on
set FORCE_CMAKE=1
pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir

Add the --verbose flag to the pip install command if you want extra assurance that cuBLAS is being picked up during compilation.

Beware that the install does not fail in this case: if CUDA is not configured correctly, llama-cpp-python will simply be installed without hardware acceleration.
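
Once the install finishes, recent versions of llama-cpp-python expose a low-level helper that reports whether the compiled library can offload to the GPU. Treat this as an assumption about your installed version (older releases may lack the function):

import llama_cpp

print("llama-cpp-python version:", llama_cpp.__version__)
# True only if the underlying llama.cpp build was compiled
# with GPU (e.g., cuBLAS) support
print("GPU offload supported:", llama_cpp.llama_supports_gpu_offload())

If this prints False, the wheel was built CPU-only; revisit the CMAKE_ARGS step above.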

If CUDA is detected but you get a “No CUDA toolset found” error, do the following:

  • Copy the files from:
    C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.2\extras\visual_studio_integration\MSBuildExtensions
    to
    (for the Enterprise edition) C:\Program Files\Microsoft Visual Studio\2022\Enterprise\MSBuild\Microsoft\VC\v170\BuildCustomizations
    or
    (for the Community edition) C:\Program Files\Microsoft Visual Studio\2022\Community\MSBuild\Microsoft\VC\v170\BuildCustomizations

For example, from a command prompt running as administrator:

copy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.2\extras\visual_studio_integration\MSBuildExtensions" "C:\Program Files\Microsoft Visual Studio\2022\Enterprise\MSBuild\Microsoft\VC\v170\BuildCustomizations"

(Adjust the paths to match your CUDA version and Visual Studio edition.)
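
If you would rather script this step, here is a minimal Python equivalent (run it from an elevated prompt; the v12.2 and Enterprise paths are the same assumptions as above):

import shutil
from pathlib import Path

# Assumed locations; adjust for your CUDA version and VS edition
src = Path(r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.2"
           r"\extras\visual_studio_integration\MSBuildExtensions")
dst = Path(r"C:\Program Files\Microsoft Visual Studio\2022\Enterprise"
           r"\MSBuild\Microsoft\VC\v170\BuildCustomizations")

# Copy each CUDA MSBuild integration file into the VS build customizations
for f in src.iterdir():
    if f.is_file():
        shutil.copy2(f, dst / f.name)
        print("copied", f.name)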

Testing

  • Verify the installation by running the following Python code:

from llama_cpp import Llama

# Adjust n_gpu_layers to suit your GPU and model
llm = Llama(model_path="model.gguf", n_gpu_layers=30, n_ctx=3584, n_batch=521, verbose=True)
output = llm("Q: Name the planets in the solar system? A: ", max_tokens=32, stop=["Q:", "\n"], echo=True)
print(output)

(Screenshot: Llama-2-7B-Chat with 30 layers offloaded to the GPU.)

If the installation is correct, you’ll see BLAS = 1 among the model properties in the verbose log output.
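
Beyond the log line, a crude but convincing check is to time the same prompt with and without offloading. A minimal sketch, reusing the assumed model.gguf and layer count from the example above:

import time
from llama_cpp import Llama

def time_generation(n_gpu_layers):
    # Same model and prompt; only the number of offloaded layers varies
    llm = Llama(model_path="model.gguf", n_gpu_layers=n_gpu_layers, verbose=False)
    start = time.perf_counter()
    llm("Q: Name the planets in the solar system? A: ", max_tokens=32)
    return time.perf_counter() - start

print("CPU only :", time_generation(0), "s")
print("30 layers:", time_generation(30), "s")

With offloading working, the second run should be noticeably faster; how much depends on your GPU and the model size.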

Conclusion:

By following these steps, you should have successfully installed llama-cpp-python with cuBLAS acceleration on your Windows machine. This guide aims to simplify the process and help you avoid the common pitfalls.

Now you’re ready to dive into local LLM development with enhanced performance. Happy GPU offloading!
