Running the Llama-2 LLM at home

Bobby Mantoni
2 min read · Jul 29, 2023


This turned out to be surprisingly easy. Hugging Face makes many pre-trained models available, including Llama-2, along with libraries that make them easy to load and run.
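To give a sense of how little code "load and run" takes, here is a minimal sketch using the transformers pipeline API. The model id is only an example (the official Llama-2 repos are gated, so you have to request access and log in with huggingface-cli first); the quantized 13B setup I actually ran is described further down, since a full-precision model would never fit on my card. Getting the GPU visible in the first place is what most of this post is about.

from transformers import pipeline

# Minimal sketch of loading and running a Llama-2 checkpoint from the Hub.
# The model id is an example; official Llama-2 repos are gated, so request
# access and run `huggingface-cli login` first. device_map="auto" (via the
# accelerate package) spreads the weights across GPU and CPU RAM as needed.
generate = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",  # example checkpoint, not the one used below
    device_map="auto",
)
print(generate("The easiest way to run an LLM at home is", max_new_tokens=50)[0]["generated_text"])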

I used WSL so that my graphics card would be available from Linux. I normally code in a Linux VM at home, but a VM would create problems for the graphics driver and CUDA toolkit. Make sure to use WSL2; WSL1 doesn't support GPU compute.

I installed the latest CUDA toolkit, which required me to use the nightly build of PyTorch and to build Auto-GPTQ from source. Using the stable releases led to compatibility issues: torch would see CUDA (torch.cuda.is_available() == True), but Auto-GPTQ/transformers would fail with the error “CUDA extension not installed”. Building Auto-GPTQ from source made the issue clear: “The detected CUDA version (12.2) mismatches the version that was used to compile PyTorch (11.8)”

# 1. Download and install the CUDA toolkit .deb for Ubuntu-WSL (12.2)
# 2. Confirm the GPU is visible and check the CUDA version with:
nvidia-smi
# 3. Install the PyTorch nightly built against CUDA 12.1 and build Auto-GPTQ from source
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121
git clone https://github.com/PanQiWei/AutoGPTQ.git && cd AutoGPTQ
pip install .
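
With the cu121 nightly wheel and a source build of Auto-GPTQ in place, a quick check from the Python side confirms the GPU is visible and shows which CUDA version torch was compiled against; a minimal sketch:

import torch

# Sanity check after the install: the GPU should be visible from WSL, and the
# compiled CUDA version should now be 12.x, close enough to the 12.2 toolkit
# for the Auto-GPTQ extension to build and load cleanly.
print(torch.__version__)          # nightly 2.x build
print(torch.version.cuda)         # CUDA version torch was compiled against, e.g. 12.1
print(torch.cuda.is_available())  # should be True
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. the GTX 1660 Super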

Running inference on a 6.7GB, 13B-parameter model looks like this (see the sketch after the list):

  • It reads the model from disk, pushes it into system RAM and VRAM, and then the GPU starts chewing through it.
  • This is on my GTX 1660 Super, which has 1,408 CUDA cores and 6GB of VRAM. A single inference takes about 200s.
  • It takes 7–11s to load the 6.7GB model, using an M.2 NVMe SSD and 3600MHz system RAM; the SSD can do over 3.5 GB/s sequential reads.
  • The GPU draws 50–65W throughout and uses all 6GB of VRAM plus 16GB of system RAM.
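
For reference, here is a rough sketch of the kind of script behind those numbers, using Auto-GPTQ to load a quantized 13B checkpoint and timing both the load and the generation. The repo id and generation settings are illustrative assumptions, not necessarily the exact checkpoint or parameters I used:

import time
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

# Illustrative GPTQ-quantized Llama-2 13B checkpoint (an assumption, not
# necessarily the exact repo behind the numbers above).
model_id = "TheBloke/Llama-2-13B-chat-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)

t0 = time.time()
# device_map="auto" lets accelerate split the quantized weights between the
# 6GB of VRAM and system RAM, matching the memory behavior described above.
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device_map="auto",
    use_safetensors=True,
)
print(f"load: {time.time() - t0:.1f}s")

prompt = "Explain why quantization lets a 13B model run on a 6GB GPU."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")

t0 = time.time()
output = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
print(f"generate: {time.time() - t0:.1f}s")
print(tokenizer.decode(output[0], skip_special_tokens=True))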

Looks like I have another excuse to upgrade my GPU :)
