Quantization of Llama 2 with GPTQ for Fast Inference on Your Computer

The power of quantization to run AI on your computer

Benjamin Marie
3 min read · Aug 5, 2023

On Medium, I have mainly discussed QLoRA as a way to run large language models (LLMs) on consumer hardware.

But QLoRA was primarily designed to make fine-tuning more affordable. It is not the best option for inference once your model is already fine-tuned. For that scenario, GPTQ is much more suitable.

GPTQ in a few words

GPTQ (Frantar et al., 2023) is a post-training quantization algorithm for LLMs. You can see it as a way to compress LLMs without retraining them.
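To make this concrete, here is a minimal sketch of what 4-bit GPTQ quantization looks like with the AutoGPTQ library. The model ID, the output directory, and the single-sentence calibration set are placeholders; in practice, GPTQ needs a few hundred representative calibration samples.

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_name = "meta-llama/Llama-2-7b-hf"

# 4-bit weights with a group size of 128 is a common GPTQ setting.
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoGPTQForCausalLM.from_pretrained(model_name, quantize_config)

# GPTQ is a one-shot, post-training method: it quantizes the weights
# layer by layer while minimizing the error on calibration examples.
# A single sentence is for illustration only.
examples = [tokenizer("GPTQ quantizes weights layer by layer using calibration data.")]
model.quantize(examples)

model.save_quantized("llama-2-7b-gptq-4bit")
```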

In 16-bit precision, the 7-billion-parameter version of Llama 2 weighs 13.5 GB. After 4-bit quantization with GPTQ, its size drops to 3.6 GB, i.e., 26.6% of its original size.
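A quick back-of-the-envelope calculation, assuming roughly 6.74 billion parameters for Llama 2 7B, matches these numbers. The 4-bit checkpoint ends up slightly larger than the raw weight count suggests because GPTQ also stores a scale (and zero-point) for each group of weights.

```python
# Rough memory footprint of Llama 2 7B (about 6.74 billion parameters).
n_params = 6.74e9

fp16_gb = n_params * 2 / 1e9    # 2 bytes per weight in float16 -> ~13.5 GB
int4_gb = n_params * 0.5 / 1e9  # 0.5 byte per weight in 4-bit  -> ~3.4 GB

print(f"fp16: {fp16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB")
```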

Loading a 7B-parameter LLM isn't practical on consumer hardware without quantization. Even for CPU-only inference, which typically runs in 32-bit precision, you still need at least 32 GB of RAM. This is more than most standard computers have, and the model doesn't fit on a Google Colab Pro instance either.

But after quantization, we can load the model on most machines, without a significant drop in model quality (as measured by perplexity).
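Here is a minimal sketch of loading and running the quantized model, again with AutoGPTQ; "llama-2-7b-gptq-4bit" is the hypothetical directory saved in the sketch above.

```python
import torch
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Load the 4-bit checkpoint produced earlier onto a single GPU.
model = AutoGPTQForCausalLM.from_quantized("llama-2-7b-gptq-4bit", device="cuda:0")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

inputs = tokenizer("Quantization makes it possible to", return_tensors="pt").to("cuda:0")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```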

But wait: don't we need to load the model into memory before we can quantize it?
