How to run LLaMA on Windows 11

Very quick guide

Vladimir Podolian
Mar 16, 2023
[Image: llamas generated by Stable Diffusion]

UPD Dec. 2024: This article is now outdated. It was written before Meta released the models openly, so some things may work differently or not work at all.

Today, we’re going to run the LLaMA 7B 4-bit text generation model (the smallest model, optimised for low VRAM). I’ll try to be as brief as possible to get you up and running quickly. I’m assuming you have an Nvidia graphics card with CUDA support, but it’s also possible to run on AMD cards and on CPU only (people have it running even on a Raspberry Pi); you just need a specific PyTorch build (see the note just below). But enough words, let’s start.
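For reference, the CPU-only route replaces the CUDA-enabled PyTorch install in the steps below with the cpuonly variant. This is the install command pytorch.org published at the time, not something from the original walkthrough:

conda install pytorch torchvision torchaudio cpuonly -c pytorch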

Steps

  • Download and install Visual Studio Build Tools; we’ll need it to build the 4-bit kernels (PyTorch CUDA extensions written in C++). Select the checkboxes as shown on the screenshot below:
[Screenshot: Visual Studio Build Tools installer window]
  • Install miniconda
  • Go to the Start menu and launch x64 Native Tools Command Prompt for VS
[Screenshot: Windows 11 Start menu]
  • Now we need to enable conda in the command prompt window we just opened. For that, execute the following command:
powershell -ExecutionPolicy ByPass -NoExit -Command "& 'C:\Users\$USERNAME\miniconda3\shell\condabin\conda-hook.ps1' ; conda activate 'C:\Users\$USERNAME\miniconda3'"

where $USERNAME is your Windows user name.
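A quick sanity check (my addition, not in the original steps) that conda is now available in this window:

conda --version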

  • Go to the directory where you would like to run LLaMA, for example your user folder
cd C:\Users\$USERNAME
  • Create and activate a conda environment
conda create -n llama
conda activate llama
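If you prefer to pin the Python version up front (optional; 3.10 here is just my suggestion, not from the original steps), you can create the environment with:

conda create -n llama python=3.10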
  • Install git
conda install git
  • Install CUDA Toolkit 11.7
conda install cuda --channel nvidia/label/cuda-11.7.0
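To verify the toolkit (including nvcc, the compiler we’ll need for the kernels) installed correctly:

nvcc --version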
  • Install PyTorch
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
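Before going further, it’s worth checking that this PyTorch build actually sees your GPU. This one-liner is my addition; it should print True plus the CUDA version:

python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"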
  • Clone the Oobabooga text generation web UI git repository
git clone https://github.com/oobabooga/text-generation-webui.git
  • Go inside the cloned directory and create a repositories folder
cd text-generation-webui
mkdir repositories
cd repositories
  • Clone the GPTQ-for-LLaMa git repository; we’ll need it to run the 4-bit model
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git -b cuda
  • Now the hardest part. We’re going to build the 4-bit CUDA kernels with the C++ compiler. Here I ran into a tricky issue that I fought for quite some time. The thing is that after compiling the sources we need the linker to complete the build, and it turns out there are two link.exe files: the MSVC linker and a link.exe in the miniconda environment directory that emulates the Linux link command. Keeping that in mind, we can simply rename miniconda’s link.exe and avoid the substitution. Go to Library\usr\bin\ of the llama env and rename link.exe to link2.exe as on the picture below (or from the command line, as shown right after it)
[Screenshot: renaming the link.exe file in Windows Explorer]
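The same rename from the prompt; the path assumes a default miniconda location and the llama environment created above, so adjust it if yours differs:

Rename-Item "C:\Users\$USERNAME\miniconda3\envs\llama\Library\usr\bin\link.exe" link2.exe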
  • Now let’s build CUDA 4-bit kernels
cd GPTQ-for-LLaMa
$env:DISTUTILS_USE_SDK=1
python setup_cuda.py install
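If the build finished without errors, the compiled extension should import cleanly. The module name quant_cuda is my reading of the repository’s setup_cuda.py, so treat it as an assumption:

python -c "import quant_cuda"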
  • Go back to the text generation web UI root folder
cd ..\..
  • It’s time to get the model files (the --text-only flag downloads only the configuration and tokenizer files, since we don’t need the full-precision weights for the 4-bit model)
python download-model.py --text-only decapoda-research/llama-7b-hf
  • We also need a pre-converted 4-bit model. Download the llama-7b-4bit.pt file and place it into the models folder. Here’s how it should look:
[Screenshot: downloaded model files in the models folder]
  • Run the web UI
python server.py --gptq-bits 4 --model llama-7b-hf
[Screenshot: the web UI running in a terminal window]
  • Open the UI in the browser by clicking on the URL in the terminal window
[Screenshot: the web UI with AI-generated text]

Bonus step: run in chat mode

If you prefer a ChatGPT-like style, run the web UI with the --chat or --cai-chat parameter:

python server.py --gptq-bits 4 --model llama-7b-hf --chat

Wrapping up

That’s it! Now you can dive in and explore bigger models and 8-bit models. You can also try AutoGPTQ instead of GPTQ-for-LLaMa. Please see the links in the references section for further reading.

Happy coding!

References

  1. https://medium.com/@leennewlife/how-to-setup-pytorch-with-cuda-in-windows-11-635dfa56724b
  2. https://github.com/oobabooga/text-generation-webui
  3. https://github.com/qwopqwop200/GPTQ-for-LLaMa
  4. https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model
  5. https://www.reddit.com/r/LocalLLaMA/comments/11o6o3f/how_to_install_llama_8bit_and_4bit/
  6. https://huggingface.co/decapoda-research
  7. https://cocktailpeanut.github.io/dalai
  8. https://github.com/PanQiWei/AutoGPTQ
  9. https://replicate.com/blog/run-llama-locally
