How to run LLaMA on Windows 11

Very quick guide

Vladimir Podolian
Mar 16, 2023
[Image: llamas generated by Stable Diffusion]

UPD Dec. 2024: This article is now outdated. It was written before Meta released the models openly, so some things may work differently or not work at all.

Today, we’re going to run the LLaMA 7B 4-bit text generation model (the smallest model, optimised for low VRAM). I’ll try to be as brief as possible to get you up and running quickly. I’m assuming you have an Nvidia graphics card with CUDA support, but it’s also possible to run on AMD cards and on CPU only (people have it running even on a Raspberry Pi); you just need a specific PyTorch build (see the note just below). But enough words, let’s start.
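For reference, the CPU-only route replaces the CUDA-enabled PyTorch install in the steps below with the cpuonly variant. This is the install command pytorch.org published at the time, not something from the original walkthrough:

conda install pytorch torchvision torchaudio cpuonly -c pytorch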

Steps

  • Download and install Visual Studio Build Tools; we’ll need it to build the 4-bit kernels (PyTorch CUDA extensions written in C++). Select the checkboxes as shown on the screenshot below:
[Screenshot: Visual Studio Build Tools installer window]
  • Install miniconda
  • Go to the Start menu and launch x64 Native Tools Command Prompt for VS
[Screenshot: Windows 11 Start menu]
  • Now we need to enable conda in the command prompt window we just opened. For that, execute the following command:
powershell -ExecutionPolicy ByPass -NoExit -Command "& 'C:\Users\$USERNAME\miniconda3\shell\condabin\conda-hook.ps1' ; conda activate 'C:\Users\$USERNAME\miniconda3'"

where $USERNAME is your Windows user name.
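A quick sanity check (my addition, not in the original steps) that conda is now available in this window:

conda --version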

  • Go to the directory where you would like to run LLaMA, for example your user folder
cd C:\Users\$USERNAME
  • Create and activate a conda environment
conda create -n llama
conda activate llama
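If you prefer to pin the Python version up front (optional; 3.10 here is just my suggestion, not from the original steps), you can create the environment with:

conda create -n llama python=3.10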
  • Install git
conda install git
  • Install CUDA Toolkit 11.7
conda install cuda --channel nvidia/label/cuda-11.7.0
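To verify the toolkit (including nvcc, the compiler we’ll need for the kernels) installed correctly:

nvcc --version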
  • Install PyTorch
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
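Before going further, it’s worth checking that this PyTorch build actually sees your GPU. This one-liner is my addition; it should print True plus the CUDA version:

python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"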
  • Clone the Oobabooga text generation web UI git repository
git clone https://github.com/oobabooga/text-generation-webui.git
  • Go inside the cloned directory and create a repositories folder
cd text-generation-webui
mkdir repositories
cd repositories
  • Clone the GPTQ-for-LLaMa git repository; we’ll need it to run the 4-bit model
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git -b cuda
  • Now the hardest part. We’re going to build the 4-bit CUDA kernels with the C++ compiler. Here I ran into a tricky issue that I fought for quite some time. The thing is that after compiling the sources we need the linker to complete the build, and it turns out there are two link.exe files: the MSVC linker and a link.exe in the miniconda environment directory that emulates the Linux link command. Keeping that in mind, we can simply rename miniconda’s link.exe and avoid the substitution. Go to Library\usr\bin\ of the llama env and rename link.exe to link2.exe as on the picture below (or from the command line, as shown right after it)
[Screenshot: renaming the link.exe file in Windows Explorer]
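The same rename from the prompt; the path assumes a default miniconda location and the llama environment created above, so adjust it if yours differs:

Rename-Item "C:\Users\$USERNAME\miniconda3\envs\llama\Library\usr\bin\link.exe" link2.exe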
  • Now let’s build CUDA 4-bit kernels
cd GPTQ-for-LLaMa
$env:DISTUTILS_USE_SDK=1
python setup_cuda.py install
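If the build finished without errors, the compiled extension should import cleanly. The module name quant_cuda is my reading of the repository’s setup_cuda.py, so treat it as an assumption:

python -c "import quant_cuda"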
  • Go back to the text generation web UI root folder
cd ..\..
  • It’s time to get the model files (the --text-only flag downloads only the configuration and tokenizer files, since we don’t need the full-precision weights for the 4-bit model)
python download-model.py --text-only decapoda-research/llama-7b-hf
  • We also need a pre-converted 4-bit model. Download the llama-7b-4bit.pt file and place it into the models folder. Here’s how it should look:
[Screenshot: downloaded model files in the models folder]
  • Run the web UI
python server.py --gptq-bits 4 --model llama-7b-hf
[Screenshot: the web UI running in a terminal window]
  • Open the UI in the browser by clicking on the URL in the terminal window
[Screenshot: the web UI with AI-generated text]

Bonus step: run in chat mode

If you prefer a ChatGPT-like style, run the web UI with the --chat or --cai-chat parameter:

python server.py --gptq-bits 4 --model llama-7b-hf --chat

Wrapping up

That’s it! Now you can dive in and explore bigger models and 8-bit models. You can also try AutoGPTQ instead of GPTQ-for-LLaMa. Please see the links in the references section for further reading.

Happy coding!

References

  1. https://medium.com/@leennewlife/how-to-setup-pytorch-with-cuda-in-windows-11-635dfa56724b
  2. https://github.com/oobabooga/text-generation-webui
  3. https://github.com/qwopqwop200/GPTQ-for-LLaMa
  4. https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model
  5. https://www.reddit.com/r/LocalLLaMA/comments/11o6o3f/how_to_install_llama_8bit_and_4bit/
  6. https://huggingface.co/decapoda-research
  7. https://cocktailpeanut.github.io/dalai
  8. https://github.com/PanQiWei/AutoGPTQ
  9. https://replicate.com/blog/run-llama-locally
