Running the Llama-2 LLM at home

Bobby Mantoni
2 min read · Jul 29, 2023


This turned out to be surprisingly easy. Hugging Face makes many pre-trained models available, including Llama-2, along with libraries that make them easy to load and run.
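To give a sense of how little code "load and run" takes, here is a minimal sketch using the transformers pipeline API. The model id is only an example (the official Llama-2 repos are gated, so you have to request access and log in with huggingface-cli first); the quantized 13B setup I actually ran is described further down, since a full-precision model would never fit on my card. Getting the GPU visible in the first place is what most of this post is about.

from transformers import pipeline

# Minimal sketch of loading and running a Llama-2 checkpoint from the Hub.
# The model id is an example; official Llama-2 repos are gated, so request
# access and run `huggingface-cli login` first. device_map="auto" (via the
# accelerate package) spreads the weights across GPU and CPU RAM as needed.
generate = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",  # example checkpoint, not the one used below
    device_map="auto",
)
print(generate("The easiest way to run an LLM at home is", max_new_tokens=50)[0]["generated_text"])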

I used WSL so that my graphics card would be available from Linux. I normally code in a Linux VM at home, but a VM would create problems for the graphics driver and CUDA toolkit. Make sure to use WSL2; WSL1 doesn't support GPU compute.

I installed the latest CUDA toolkit, which required me to use the nightly build of PyTorch and to build Auto-GPTQ from source. Using the stable releases led to compatibility issues: torch would see CUDA (torch.cuda.is_available() == True), but Auto-GPTQ/transformers would fail with the error “CUDA extension not installed”. Building Auto-GPTQ from source made the issue clear: “The detected CUDA version (12.2) mismatches the version that was used to compile PyTorch (11.8)”

# 1. Download and install the CUDA toolkit .deb for Ubuntu-WSL (12.2)
# 2. Confirm the GPU is visible and check the CUDA version with:
nvidia-smi
# 3. Install the PyTorch nightly built against CUDA 12.1 and build Auto-GPTQ from source
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121
git clone https://github.com/PanQiWei/AutoGPTQ.git && cd AutoGPTQ
pip install .
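
With the cu121 nightly wheel and a source build of Auto-GPTQ in place, a quick check from the Python side confirms the GPU is visible and shows which CUDA version torch was compiled against; a minimal sketch:

import torch

# Sanity check after the install: the GPU should be visible from WSL, and the
# compiled CUDA version should now be 12.x, close enough to the 12.2 toolkit
# for the Auto-GPTQ extension to build and load cleanly.
print(torch.__version__)          # nightly 2.x build
print(torch.version.cuda)         # CUDA version torch was compiled against, e.g. 12.1
print(torch.cuda.is_available())  # should be True
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. the GTX 1660 Super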

Running inference on a 6.7GB, 13B-parameter model looks like this (see the sketch after the list):

  • It reads the model from disk, pushes it into system RAM and VRAM, and then the GPU starts chewing through it.
  • This is on my GTX 1660 Super, which has 1,408 CUDA cores and 6GB of VRAM. A single inference takes about 200s.
  • It takes 7–11s to load the 6.7GB model, using an M.2 NVMe SSD and 3600MHz system RAM; the SSD can do over 3.5 GB/s sequential reads.
  • The GPU draws 50–65W throughout and uses all 6GB of VRAM plus 16GB of system RAM.
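
For reference, here is a rough sketch of the kind of script behind those numbers, using Auto-GPTQ to load a quantized 13B checkpoint and timing both the load and the generation. The repo id and generation settings are illustrative assumptions, not necessarily the exact checkpoint or parameters I used:

import time
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

# Illustrative GPTQ-quantized Llama-2 13B checkpoint (an assumption, not
# necessarily the exact repo behind the numbers above).
model_id = "TheBloke/Llama-2-13B-chat-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)

t0 = time.time()
# device_map="auto" lets accelerate split the quantized weights between the
# 6GB of VRAM and system RAM, matching the memory behavior described above.
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device_map="auto",
    use_safetensors=True,
)
print(f"load: {time.time() - t0:.1f}s")

prompt = "Explain why quantization lets a 13B model run on a 6GB GPU."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")

t0 = time.time()
output = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
print(f"generate: {time.time() - t0:.1f}s")
print(tokenizer.decode(output[0], skip_special_tokens=True))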

Looks like I have another excuse to upgrade my GPU :)
