How to install llama.cpp with CUDA (on Windows)

Kaizin · Dec 13, 2023

As LLMs such as OpenAI's GPT have become popular, there have been many attempts to run LLMs in a local environment.
The best-known LLMs we can install locally are the LLaMA models. However, running LLMs requires a lot of computing power, even for plain text generation, so we need a GPU to speed up generation.

Recently, a C/C++ port of the LLaMA model, llama.cpp, has been developed. Since it is written in C/C++, a high-performance language, it can generate text very quickly on a high-performance computing platform.

Although I don't have such a high-performance machine, I tried installing a llama.cpp model with the GPU enabled.

Zephyr 7B

Zephyr 7B is a fine-tuned version of Mistral 7B, and it shows strong performance on extraction, coding, STEM, and writing compared to other open models such as the LLaMA chat models.
The llama.cpp team introduced a new file format for its models called GGUF.
The repo below hosts the model in GGUF format, and it is the one I used for this install.

https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF
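
If you prefer to fetch the file from Python instead of the browser, a minimal sketch with huggingface_hub looks like this. The filename is an assumption: the Q4_K_M quantization is just one of the GGUF files listed in the repo, so pick whichever fits your VRAM.

from huggingface_hub import hf_hub_download

# filename is an assumption: one of the quantized GGUF files in the repo above
model_path = hf_hub_download(
    repo_id="TheBloke/zephyr-7B-beta-GGUF",
    filename="zephyr-7b-beta.Q4_K_M.gguf",
)
print(model_path)  # local path to the downloaded GGUF file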

To use llama.cpp from Python, the llama-cpp-python package should be installed.
But to use the GPU, we must set an environment variable first.
Make sure there are no spaces or quotation marks ("" or '') in the value when you set the environment variable.

Since I use Anaconda, I ran the commands below in the Anaconda Prompt to install llama-cpp-python.

# on the Anaconda Prompt!
set CMAKE_ARGS=-DLLAMA_CUBLAS=on
pip install llama-cpp-python

# if the install somehow fails and you need to reinstall, run the command below.
# it ignores previously downloaded files and reinstalls with fresh ones.
pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir --verbose

Running the commands above showed no errors, but you still have to check whether the package was built properly.
When you actually run the model (with the verbose option set to True), you can see logs like the ones below, and BLAS must be reported as 1. Otherwise the LLaMA model will not use the GPU.

C/C++ related features are compiled by CMake
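
To reproduce that check, a minimal sketch like this loads the model with verbose=True so the startup log, including the BLAS flag, is printed. The model path is an assumption: wherever the GGUF file downloaded earlier ended up.

from llama_cpp import Llama

# model_path is an assumption: point it at your downloaded GGUF file
llm = Llama(
    model_path="zephyr-7b-beta.Q4_K_M.gguf",
    verbose=True,  # prints the startup log; look for BLAS = 1 in the system info line
)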

I had some trouble installing the package for a while.

The first problem was that CMake ignored the environment variable I had set.
You may encounter the same issue I had; set the environment variable carefully (in the same prompt session, with no spaces or quotes) until the build stops ignoring it. A quick way to check is sketched below.
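
Since pip and CMake run as child processes of the prompt, you can verify from Python that the variable is actually visible. Run this in the same session where you ran the set command:

import os

# should print -DLLAMA_CUBLAS=on; if it prints None, the variable was not set correctly
print(os.environ.get("CMAKE_ARGS"))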

The second problem I had was that the 'CUDA toolset' was not found during the build.

I solved this issue as follows (a scripted version is sketched after the list).
1. Copy all four files from C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\extras\visual_studio_integration\MSBuildExtensions

2. Paste them into
C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\MSBuild\Microsoft\VC\v170\BuildCustomizations

3. Reinstall the package using pip as above.
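
If you want to script steps 1 and 2, a minimal sketch looks like this. Both paths are assumptions based on the default install locations for CUDA v12.3 and the VS 2022 Build Tools; run it from an elevated (administrator) prompt, since both directories live under Program Files.

import shutil
from pathlib import Path

# both paths are assumptions; adjust the CUDA/VS versions to match your setup
src = Path(r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\extras\visual_studio_integration\MSBuildExtensions")
dst = Path(r"C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\MSBuild\Microsoft\VC\v170\BuildCustomizations")

# copy the CUDA MSBuild integration files so the build can find the CUDA toolset
for f in src.iterdir():
    shutil.copy2(f, dst)
    print(f"copied {f.name}")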

The overall instructions can be found below.

Results

With the n_gpu_layers argument, we can set how many layers are placed on the GPU. I set the argument to 50, which is more than the model's actual layer count, so all layers are offloaded to the GPU, as below.
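
Put together, a minimal sketch of the run looks like this (the model path and prompt are placeholders):

from llama_cpp import Llama

llm = Llama(
    model_path="zephyr-7b-beta.Q4_K_M.gguf",  # assumption: the GGUF file downloaded earlier
    n_gpu_layers=50,  # more than the model has, so every layer is offloaded to the GPU
    verbose=True,
)

# a trivial greeting prompt to gauge generation speed
output = llm("Hi!", max_tokens=32)
print(output["choices"][0]["text"])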

But it is not magically fast. I will look into what I should do to speed it up further.

Offload all layers!
Just saying "Hi" and getting a short greeting back takes about 10 seconds.
