GGUF Quantization for Fast and Memory-Efficient Inference on Your CPU

How to quantize and run GGUF LLMs with llama.cpp — Example with Qwen1.5

Benjamin Marie
5 min read · Mar 4, 2024
[Header image generated by DALL-E]

Quantization of large language models (LLMs) with GPTQ and AWQ yields smaller LLMs while preserving most of their accuracy in downstream tasks. These quantized LLMs can also run fast during inference on a GPU, especially with optimized CUDA kernels and an efficient backend, e.g., ExLlama for GPTQ.
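As an illustration, here is a minimal sketch of GPU inference with a 4-bit GPTQ checkpoint through Hugging Face Transformers, which dispatches to ExLlama kernels when auto-gptq and optimum are installed. The checkpoint name is only an example, not a recommendation.

```python
# Minimal sketch: running a GPTQ-quantized model on a GPU with Transformers.
# Assumes transformers, accelerate, optimum, and auto-gptq are installed.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Llama-2-7B-GPTQ"  # example 4-bit GPTQ checkpoint from the Hub

tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" places the quantized weights on the GPU; ExLlama kernels
# are used by default for 4-bit GPTQ weights.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```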

However, GPTQ and AWQ implementations are not optimized for inference on a CPU. Most implementations can't even offload parts of a GPTQ/AWQ-quantized LLM to CPU RAM when the GPU doesn't have enough VRAM. In other words, if the model is still too large to fit in the GPU's VRAM after quantization, inference will be extremely slow.

A popular alternative is to use llama.cpp (MIT license) to quantize LLMs to the GGUF format. The resulting quantized LLMs can run very fast on a CPU.
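For example, a GGUF file can be loaded and run entirely on the CPU with the llama-cpp-python bindings. The sketch below is only illustrative: the GGUF file name is a placeholder, and the context size and thread count should be adapted to your machine.

```python
# Minimal sketch: CPU inference on a GGUF model with llama-cpp-python
# (pip install llama-cpp-python).
from llama_cpp import Llama

llm = Llama(
    model_path="qwen1.5-7b-chat.Q4_K_M.gguf",  # placeholder GGUF file
    n_ctx=2048,    # context window size
    n_threads=8,   # number of CPU threads used for inference
)

output = llm("Question: What is the GGUF format? Answer:", max_tokens=128)
print(output["choices"][0]["text"])
```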

In this article, we will see how to easily quantize LLMs and convert them to the GGUF format using llama.cpp. This method supports many LLM architectures, such as Mixtral-8x7B, Mistral 7B, Qwen1.5, and Google's Gemma.
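As a preview of the workflow, the sketch below covers the two llama.cpp steps: converting a Hugging Face checkpoint (Qwen1.5 is used as an example) to a 16-bit GGUF file, then quantizing it to 4-bit (Q4_K_M). Script and binary names and flags vary across llama.cpp versions, so treat the exact commands as assumptions to check against the repository's README.

```python
# Minimal sketch of the llama.cpp quantization workflow, run from a clone of
# the llama.cpp repository after compiling it. Names and flags may differ
# depending on the llama.cpp version.
import subprocess

hf_model_dir = "./Qwen1.5-7B-Chat"              # local directory with the downloaded checkpoint
fp16_gguf = "qwen1.5-7b-chat.fp16.gguf"         # intermediate 16-bit GGUF file
quantized_gguf = "qwen1.5-7b-chat.Q4_K_M.gguf"  # final 4-bit GGUF file

# 1. Convert the Hugging Face checkpoint to a 16-bit GGUF file.
subprocess.run(
    ["python", "convert-hf-to-gguf.py", hf_model_dir,
     "--outtype", "f16", "--outfile", fp16_gguf],
    check=True,
)

# 2. Quantize the GGUF file to Q4_K_M with the compiled quantize binary.
subprocess.run(["./quantize", fp16_gguf, quantized_gguf, "Q4_K_M"], check=True)
```

The file produced by the second step is the one to load for CPU inference, for instance with the llama-cpp-python snippet shown earlier.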

