GGUF Quantization for Fast and Memory-Efficient Inference on Your CPU

How to quantize and run GGUF LLMs with llama.cpp — Example with Qwen1.5

Benjamin Marie
5 min read · Mar 4, 2024
[Header image generated by DALL-E]

Quantization of large language models (LLMs) with GPTQ and AWQ yields smaller LLMs while preserving most of their accuracy in downstream tasks. These quantized LLMs can also run fast during inference on a GPU, especially with optimized CUDA kernels and an efficient backend, e.g., ExLlama for GPTQ.
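As an illustration, here is a minimal sketch of GPU inference with a 4-bit GPTQ checkpoint through Hugging Face Transformers, which dispatches to ExLlama kernels when auto-gptq and optimum are installed. The checkpoint name is only an example, not a recommendation.

```python
# Minimal sketch: running a GPTQ-quantized model on a GPU with Transformers.
# Assumes transformers, accelerate, optimum, and auto-gptq are installed.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Llama-2-7B-GPTQ"  # example 4-bit GPTQ checkpoint from the Hub

tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" places the quantized weights on the GPU; ExLlama kernels
# are used by default for 4-bit GPTQ weights.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```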

However, GPTQ and AWQ implementations are not optimized for inference on a CPU. Most implementations can't even offload parts of a GPTQ/AWQ-quantized LLM to CPU RAM when the GPU doesn't have enough VRAM. In other words, if the model is still too large to fit in the GPU's VRAM after quantization, inference will be extremely slow.

A popular alternative is to use llama.cpp (MIT license) to quantize LLMs to the GGUF format. The resulting quantized LLMs can run very fast on a CPU.
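For example, a GGUF file can be loaded and run entirely on the CPU with the llama-cpp-python bindings. The sketch below is only illustrative: the GGUF file name is a placeholder, and the context size and thread count should be adapted to your machine.

```python
# Minimal sketch: CPU inference on a GGUF model with llama-cpp-python
# (pip install llama-cpp-python).
from llama_cpp import Llama

llm = Llama(
    model_path="qwen1.5-7b-chat.Q4_K_M.gguf",  # placeholder GGUF file
    n_ctx=2048,    # context window size
    n_threads=8,   # number of CPU threads used for inference
)

output = llm("Question: What is the GGUF format? Answer:", max_tokens=128)
print(output["choices"][0]["text"])
```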

In this article, we will see how to easily quantize LLMs and convert them to the GGUF format using llama.cpp. This method supports many LLM architectures, such as Mixtral-8x7B, Mistral 7B, Qwen1.5, and Google's Gemma.
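As a preview of the workflow, the sketch below covers the two llama.cpp steps: converting a Hugging Face checkpoint (Qwen1.5 is used as an example) to a 16-bit GGUF file, then quantizing it to 4-bit (Q4_K_M). Script and binary names and flags vary across llama.cpp versions, so treat the exact commands as assumptions to check against the repository's README.

```python
# Minimal sketch of the llama.cpp quantization workflow, run from a clone of
# the llama.cpp repository after compiling it. Names and flags may differ
# depending on the llama.cpp version.
import subprocess

hf_model_dir = "./Qwen1.5-7B-Chat"              # local directory with the downloaded checkpoint
fp16_gguf = "qwen1.5-7b-chat.fp16.gguf"         # intermediate 16-bit GGUF file
quantized_gguf = "qwen1.5-7b-chat.Q4_K_M.gguf"  # final 4-bit GGUF file

# 1. Convert the Hugging Face checkpoint to a 16-bit GGUF file.
subprocess.run(
    ["python", "convert-hf-to-gguf.py", hf_model_dir,
     "--outtype", "f16", "--outfile", fp16_gguf],
    check=True,
)

# 2. Quantize the GGUF file to Q4_K_M with the compiled quantize binary.
subprocess.run(["./quantize", fp16_gguf, quantized_gguf, "Q4_K_M"], check=True)
```

The file produced by the second step is the one to load for CPU inference, for instance with the llama-cpp-python snippet shown earlier.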

