How to run LLAMA on Windows 11
Very quick guide
UPD Dec. 2024: This article is now outdated. It was written before Meta released the models openly, so some things may work differently or not work at all.
Today, we’re going to run the LLAMA 7B 4-bit text generation model (the smallest model, optimised for low VRAM). I’ll try to be as brief as possible to get you up and running quickly. I’m assuming you have an Nvidia graphics card with CUDA support, but it’s also possible to run on AMD cards and on CPU only (people have it running even on a Raspberry Pi); you just need a specific PyTorch build. But enough words, let’s start.
Steps
- Download and install Visual Studio Build Tools; we’ll need it to build the 4-bit kernels (PyTorch CUDA extensions written in C++). Select the checkboxes as shown on the screenshot below:
- Install miniconda
- Go to Start menu and launch x64 Native Tools Command Prompt for VS
- Now we need to enable conda in the opened command prompt window. For that, execute the following command:
powershell -ExecutionPolicy ByPass -NoExit -Command "& 'C:\Users\$USERNAME\miniconda3\shell\condabin\conda-hook.ps1' ; conda activate 'C:\Users\$USERNAME\miniconda3'"
where $USERNAME is your Windows user name.
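To see what that command expands to on your machine, here is a short Python sketch; the miniconda3 location under C:\Users\&lt;name&gt; is an assumption, so adjust it if you installed miniconda elsewhere:

```python
import getpass

# Substitute the current Windows user name into the activation command.
# The default miniconda3 install path is an assumption.
user = getpass.getuser()
conda_root = rf"C:\Users\{user}\miniconda3"
cmd = (
    "powershell -ExecutionPolicy ByPass -NoExit -Command "
    f"\"& '{conda_root}\\shell\\condabin\\conda-hook.ps1' ; "
    f"conda activate '{conda_root}'\""
)
print(cmd)
```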
- Go to the directory where you would like to run LLAMA, for example your user folder
cd C:\Users\$USERNAME
- Create and activate conda environment
conda create -n llama
conda activate llama
- Install git
conda install git
- Install Cuda Toolkit 11.7
conda install cuda --channel nvidia/label/cuda-11.7.0b
- Install PyTorch
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
- Clone Oobabooga text generation web UI git repository
git clone https://github.com/oobabooga/text-generation-webui.git
- Go inside the cloned directory and create a repositories folder
cd text-generation-webui
mkdir repositories
cd repositories
- Clone the GPTQ-for-LLaMa git repository; we’ll need it to run the 4-bit model
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git -b cuda
- Now the hardest part. We’re going to build the 4-bit CUDA kernels with the C++ compiler. Here I ran into a tricky issue I fought with for quite some time. After compiling the sources we need the linker to complete the build, and it turns out there are two link.exe files: the MSVC linker and a link.exe in the miniconda environment directory that emulates the Linux link command. Keeping that in mind, we can simply rename miniconda’s link.exe and avoid the “substitution”. Go to Library\usr\bin\ of the llama env and rename link.exe to link2.exe as on the picture below
- Now let’s build CUDA 4-bit kernels
cd GPTQ-for-LLaMa
$env:DISTUTILS_USE_SDK=1
python setup_cuda.py install
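Before kicking off the build, a quick pre-flight sketch can catch the two most common slips; the helper is my own, not part of the repo:

```python
import os
from pathlib import Path

def ready_to_build(repo_dir: str, env=os.environ) -> list:
    """Pre-flight checks before running `python setup_cuda.py install`."""
    problems = []
    if not (Path(repo_dir) / "setup_cuda.py").is_file():
        problems.append("setup_cuda.py not found -- are you inside GPTQ-for-LLaMa?")
    if env.get("DISTUTILS_USE_SDK") != "1":
        problems.append("set DISTUTILS_USE_SDK=1 so distutils uses the VS toolchain")
    return problems
```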
- Go back to the text-generation-webui root folder
cd ..\..
- It’s time to get the model weights
python download-model.py --text-only decapoda-research/llama-7b-hf
- We also need a pre-converted 4-bit model. Download the llama-7b-4bit.pt file and place it into the models folder. Here’s how it should look:
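A quick sanity check that the models folder looks right before launching the server; this helper is hypothetical, but the expected names come from the steps above:

```python
from pathlib import Path

def missing_model_pieces(models_dir: str) -> list:
    """Report which of the two expected pieces are absent from models/."""
    models = Path(models_dir)
    missing = []
    if not (models / "llama-7b-hf").is_dir():
        missing.append("llama-7b-hf/  (run download-model.py first)")
    if not (models / "llama-7b-4bit.pt").is_file():
        missing.append("llama-7b-4bit.pt  (the pre-converted 4-bit weights)")
    return missing
```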
- Run the web UI
python server.py --gptq-bits 4 --model llama-7b-hf
- Open UI in the browser by clicking on URL in the terminal window
Bonus step: run in chat mode
If you prefer a ChatGPT-like style, run the web UI with the --chat or --cai-chat parameter:
python server.py --gptq-bits 4 --model llama-7b-hf --chat
Wrapping up
That’s it! Now you can dive in and explore bigger models and 8-bit models. You can also try AutoGPTQ instead of GPTQ-for-LLaMa. Please see the links in the references section for further reading.
Happy coding!
References
- https://medium.com/@leennewlife/how-to-setup-pytorch-with-cuda-in-windows-11-635dfa56724b
- https://github.com/oobabooga/text-generation-webui
- https://github.com/qwopqwop200/GPTQ-for-LLaMa
- https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model
- https://www.reddit.com/r/LocalLLaMA/comments/11o6o3f/how_to_install_llama_8bit_and_4bit/
- https://huggingface.co/decapoda-research
- https://cocktailpeanut.github.io/dalai
- https://github.com/PanQiWei/AutoGPTQ
- https://replicate.com/blog/run-llama-locally