Quantization of LLMs with llama.cpp

Understanding and Implementing n-bit Quantization Techniques for Efficient Inference in LLMs

Ingrid Stevens
11 min read · Mar 15, 2024

Large Language Models (LLMs), especially big ones like Mixtral 8x7b (46.7 billion parameters), can be quite demanding in terms of memory. This memory demand becomes apparent when you’re trying to reduce inference costs, increase inference speed, or do inference on edge devices. A potential solution to this problem is quantization. In this article, we’ll simplify the concept of quantization using easy-to-understand analogies and provide a practical guide to implementing it in your LLMs.

Introduction to Quantization

LLMs, while powerful, consume significant resources due to their large model sizes. This poses challenges for deployment on resource-constrained devices and can hinder inference speed and efficiency. Quantization offers a solution by reducing the precision of model parameters while maintaining performance.

In this article, we’ll explore various quantization techniques, including naive quantization and k-means quantization, and briefly mention a fine-tuning method called QLoRA.

Understanding Quantization via a House Analogy

Imagine each house in a countryside representing a parameter in your LLM. In a dense model, houses are everywhere, like a bustling city. Quantization transforms this city into a more manageable countryside: the most important houses (parameters) are kept, others are replaced with smaller ones (lower-precision representations), and very “unimportant” houses can be removed altogether, leaving open space (zeroes) in between. Since there is now space, we say the model is “more sparse”, or “less dense”. A sparse model, with many zero-valued parameters, is more computationally efficient and faster to process, because the zero values can be skipped or compressed without costly calculations. The open space between the retained parameters also reduces the overall model size and complexity, further enhancing efficiency.

Visualizing Quantization | Image by Author

Quantization offers several benefits:

  • Reduced Memory Footprint: By reducing parameter precision, quantization significantly decreases the model’s memory requirements, crucial for deployment on memory-limited devices.
  • Increased Speed: Lower precision computations are carried out faster, leading to quicker model inference, particularly beneficial for real-time applications.
  • Maintained Performance: Quantization aims to simplify models while preserving their performance, ensuring the countryside still has all the necessary facilities after downsizing.

Types of Quantization

There are several quantization methods; this article briefly covers two of them:

Naive Quantization:

Naive quantization uniformly reduces the precision of all parameters, similar to dividing a countryside into equal square regions without considering the location of houses. This can lead to some regions having many houses and others having none.
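
To make this concrete, here is a minimal NumPy sketch of naive (uniform) 4-bit quantization of a toy weight vector. The values are made up for illustration; this is not llama.cpp's actual kernel.

# Naive (uniform) quantization: one shared scale for every weight
import numpy as np
w = np.array([-0.62, -0.10, 0.03, 0.08, 0.55, 1.90])     # made-up weights
scale = np.abs(w).max() / 7                               # signed 4-bit range is -8..7
q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)   # the stored 4-bit codes
w_hat = q * scale                                         # dequantized approximation
print(q, w_hat)                                           # small weights collapse toward zero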

K-means Quantization:

K-means quantization creates clusters based on the actual locations of data points (houses), resulting in a more accurate and efficient representation. It involves choosing representative points (centroids) and assigning each data point to the nearest centroid.

Some implementations of k-means quantization may include an additional pruning step, where parameters with values below a certain threshold are set to exactly zero (i.e. removing the house). This can further increase the model’s sparsity.
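
And here is the same toy vector quantized with a tiny k-means codebook (a few Lloyd iterations in plain NumPy), followed by the optional pruning step described above. The cluster count and threshold are arbitrary choices for illustration, not what llama.cpp does internally.

# K-means (codebook) quantization with an optional pruning step
import numpy as np
w = np.array([-0.62, -0.10, 0.03, 0.08, 0.55, 1.90])       # same made-up weights
centroids = np.array([-0.6, 0.05, 0.6, 1.9])                # k = 4 "representative houses"
for _ in range(10):                                         # a few Lloyd iterations
    assign = np.abs(w[:, None] - centroids[None, :]).argmin(axis=1)
    for k in range(len(centroids)):
        if np.any(assign == k):
            centroids[k] = w[assign == k].mean()
w_hat = centroids[assign]                                   # each weight -> nearest centroid
w_hat[np.abs(w_hat) < 0.05] = 0.0                           # prune near-zero values -> sparsity
print(assign, w_hat)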

Rough Visualization of the difference between Naïve vs K-Means Quantization | Image by Author

Sparsity & Density

Sparsity refers to a model where only a few parameters (houses) are significant, and the rest (empty space) can be ignored without affecting performance. When precision is reduced through quantization, the unevenness in parameter importance becomes more apparent.

This can be measured through a metric called sparsity. It is the percentage of 0-value parameters in the model.

Sparsity = (Number of zero-valued parameters) / (Total number of parameters)

High sparsity means a high percentage of 0-value parameters (space between houses)

You can also represent a model in terms of its density:

Density = (Number of non-zero parameters) / (Total number of parameters)

High density means a high percentage of non-0-value parameters (houses)
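
In code, both metrics are one-liners. A quick NumPy sketch on a made-up, already-pruned weight vector:

# Measuring sparsity and density of a weight tensor
import numpy as np
w = np.array([0.0, -0.62, 0.0, 0.0, 0.55, 1.9])     # made-up pruned weights
sparsity = np.count_nonzero(w == 0) / w.size         # fraction of zero-valued parameters
density = np.count_nonzero(w) / w.size               # fraction of non-zero parameters
print(sparsity, density)                             # 0.5 0.5 (they always sum to 1)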

Interpreting Model Names: What does the Q#_K_M mean in quantized models?

In the context of llama.cpp, Q4_K_M refers to a specific type of quantization method. The naming convention is as follows:

  • Q stands for Quantization.
  • 4 indicates the number of bits used in the quantization process.
  • K refers to llama.cpp’s “k-quant” quantization scheme (block-wise quantization with shared scales, rather than literal k-means clustering).
  • M indicates the variant of that scheme, which determines the size of the quantized model
    (S = Small, M = Medium, L = Large).

When Is Quantization Useful?

Inference at the Edge

Edge computing means processing data close to where it is generated, instead of sending it to a faraway server for processing. Inference at the edge can be helpful because it can make things faster, keep data more private, and use less bandwidth.

For example, imagine you’re using a smartphone app that uses machine learning to recognize objects in photos. If the app used traditional cloud computing, it would have to send the photo all the way to a server, wait for the server to process it, and then send the result back to your phone. With edge computing, the processing happens right on your phone, so it’s faster, more private, and uses less data.

Corporate Applications of Model Quantization

While quantization is often associated with edge computing, it can also be useful in other contexts. For example, corporations can use quantization to reduce the computational and storage requirements of their machine learning models in corporate data centers. This can lead to significant cost savings in terms of both hardware and electricity usage.

A small quantization model for a corporate data center can be much cheaper to run and maintain than a larger, more precise model. This is because quantization reduces the amount of memory and computational power required to run the model, which can lead to lower hardware costs and reduced energy consumption. Furthermore, quantization can also help to improve the scalability of machine learning models, allowing corporations to handle larger volumes of data and make predictions more quickly.

Extending Quantization to Fine Tuning with QLoRA

In any domain where specific requirements or customer-centric adjustments are needed, it can be useful to fine-tune certain parameters to accurately reflect those needs. With QLoRA (Quantized Low-Rank Adaptation), you can obtain a model that is both quantized and fine-tuned to your specific needs. This method allows the model to be adapted without updating a large number of weights, making it a low-cost and efficient way to tailor large language models to particular domains (e.g., customer service, healthcare, education) or any other field where a more personalized and precise model could enhance performance and user experience.

For more information, see this helpful article by brev.
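
As a rough sketch of what this looks like in practice (outside the llama.cpp workflow, using the Hugging Face transformers, peft, and bitsandbytes libraries; the model name and LoRA hyperparameters below are just illustrative placeholders):

# Sketch: load a base model in 4-bit and attach small LoRA adapters for fine-tuning
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
bnb = BitsAndBytesConfig(load_in_4bit=True,
                         bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    "NousResearch/Nous-Hermes-2-Mistral-7B-DPO",   # the model used later in this article
    quantization_config=bnb, device_map="auto")
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)                 # only the small adapters get trained
model.print_trainable_parameters()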

Quantizing A Model

Ok…enough theory :D Let’s try this with llama.cpp.

Using Llama.cpp to Quantize

This section of the article walks through how to download and build llama.cpp. Then we will download a model from Hugging Face, quantize it, and run some performance tests.

Big thank you to Peter for the helpful guide through llama.cpp

Step 1: Enable Git to Download Large Files

# Enable Git LFS so git can clone very large files, such as the models themselves
brew install git-lfs
git lfs install

Step 2: Clone the llama.cpp project & run make

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

Step 3: Download a Model from HuggingFace

I’ll get one from NousResearch. Note that you’ll want a good internet connection for this, because models can be gigabytes!

#Get model from huggingface, rename it locally to nous-hermes-2-mistral-7B-DPO, and move it to the models directory
#Models generally are in https://huggingface.co/NousResearch
git clone https://huggingface.co/NousResearch/Nous-Hermes-2-Mistral-7B-DPO nous-hermes-2-mistral-7B-DPO

Now, we can use the terminal to move it into the models folder:

mv nous-hermes-2-mistral-7B-DPO models/

Step 4: Convert the Model to a Standard Format (FP16 GGUF)

When I say “standard” I mean the FP16 GGUF format that llama.cpp works with. GGML is a tensor library developed by Georgi Gerganov for machine learning, designed to run large models with high performance on commodity hardware; GGUF is its file format. FP16 is considered “half-precision” (FP32 is full-precision). Precision refers to the number of bits used to store each floating-point weight. For more background, please see this clear explanation.
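
As a rough size check: a 7-billion-parameter model stored in FP32 needs about 7 × 10⁹ × 4 bytes ≈ 28 GB, FP16 halves that to roughly 14 GB, and the 4-bit Q4_K_M quantization we create below comes out at around 4–5 GB.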

# convert the model to FP16 .gguf format 
python3 convert.py models/nous-hermes-2-mistral-7B-DPO/
After running convert.py, you should see this ggml-model-f16.gguf appear in your model directory | Image by Author

Step 5: Quantize the Model to n-bits

Now, we can take our ggml-model-f16.gguf file as a starting point for further quantizations.

4-bit Quantization

# quantize the model to 4-bits (using Q4_K_M method)
./quantize ./models/nous-hermes-2-mistral-7B-DPO/ggml-model-f16.gguf ./models/nous-hermes-2-mistral-7B-DPO/ggml-model-Q4_K_M.gguf Q4_K_M
Creating the Q4_K_M.gguf | Image by Author

What about Other Quantizations?

We can produce a number of different quantizations. If you run ./quantize --help, you will see all the available quantization types (see the screenshot below for details).

Run in terminal: ./quantize --help

3-bit Quantization

# quantize the model to 3-bits (using Q3_K_M method)
./quantize ./models/nous-hermes-2-mistral-7B-DPO/ggml-model-f16.gguf ./models/nous-hermes-2-mistral-7B-DPO/ggml-model-Q3_K_M.gguf Q3_K_M
Creating the Q3_K_M.gguf | Image by Author

5-bit Quantization

# quantize the model to 5-bits (using Q5_K_M method)
./quantize ./models/nous-hermes-2-mistral-7B-DPO/ggml-model-f16.gguf ./models/nous-hermes-2-mistral-7B-DPO/ggml-model-Q5_K_M.gguf Q5_K_M
Creating the Q5_K_M.gguf | Image by Author

2-bit Quantization

# quantize the model to 2-bits (using Q2_K method)
./quantize ./models/nous-hermes-2-mistral-7B-DPO/ggml-model-f16.gguf ./models/nous-hermes-2-mistral-7B-DPO/ggml-model-Q2_K.gguf Q2_K
Creating the Q2_K.gguf | Image by Author

Batched Bench

Run ./batched-bench --help

What is batched bench? Batched bench benchmarks the batched decoding performance of the llama.cpp library.

Run: ./batched-bench --help

Let’s try batched bench on the f16 version:

./batched-bench ./models/nous-hermes-2-mistral-7B-DPO/ggml-model-f16.gguf 2048 0 999 128,256,512 128,256 1,2,4,8,16,32
Batched Bench: ggml-model-f16.gguf

And also for the Q4_K_M Quantization:

./batched-bench ./models/nous-hermes-2-mistral-7B-DPO/ggml-model-Q4_K_M.gguf 2048 0 999 128,256,512 128,256 1,2,4,8,16,32
Batched Bench: ggml-model-Q4_K_M.gguf

How to Decode Batched Bench

Legend: Batched Bench

For our evaluation, we can use T_PP (prompt processing time, i.e. time to first token), S_PP (prompt processing speed) and S_TG (text generation speed). By comparing the two runs, we see (for example) that the prompt processing speed is roughly 50 tokens/second faster with the Q4_K_M model than with f16.
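
If you want to sanity-check the reported speeds yourself, they are just token counts divided by elapsed seconds. A toy calculation (the numbers below are made up, not my benchmark results):

# Sanity-checking the batched-bench columns with made-up numbers
pp_tokens, t_pp_seconds = 512, 0.45     # prompt tokens processed and time taken
tg_tokens, t_tg_seconds = 128, 3.20     # generated tokens and time taken
s_pp = pp_tokens / t_pp_seconds         # prompt processing speed, tokens/second
s_tg = tg_tokens / t_tg_seconds         # text generation speed, tokens/second
print(round(s_pp, 1), round(s_tg, 1))   # about 1137.8 and 40.0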

There are other ways to evaluate the quantized model. One of those is perplexity.

Evaluating Perplexity

Perplexity is a common metric used to evaluate language models. It measures how well the model predicts a sample of data. A lower perplexity score indicates that the language model is better at predicting the next word, while a higher perplexity score suggests that the model is more uncertain or “perplexed” about the next word.
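
As a reminder of what the number means: perplexity is the exponential of the average negative log-likelihood the model assigns to each token. A minimal sketch with made-up per-token probabilities:

# Perplexity = exp(average negative log-likelihood per token)
import math
token_probs = [0.25, 0.10, 0.60, 0.05]      # made-up probabilities of the true next tokens
nll = [-math.log(p) for p in token_probs]   # negative log-likelihood of each token
perplexity = math.exp(sum(nll) / len(nll))  # lower is better
print(round(perplexity, 2))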

Run perplexity on the model:

This will take approximately 1 hour, should you choose to do it.

# Calculate the perplexity of ggml-model-Q2_K.gguf
./perplexity -m ./models/nous-hermes-2-mistral-7B-DPO/ggml-model-Q2_K.gguf -f /Users/ingrid/Downloads/test-00000-of-00001.parquet

Run the quantized model

# start inference on a gguf model
./main -m ./models/nous-hermes-2-mistral-7B-DPO/ggml-model-Q4_K_M.gguf -n 128

What about other (smaller) quantizations?

This is where things get a bit more tricky and time-consuming. For example, if we wanted to do an IQ2_XXS quantization:

# XXS Quantization
./quantize ./models/nous-hermes-2-mistral-7B-DPO/ggml-model-f16.gguf ./models/nous-hermes-2-mistral-7B-DPO/ggml-model-IQ2_XXS.gguf IQ2_XXS

…then we’d need to create an importance matrix (imatrix) as you can see in the message within the screenshot below.

Trying IQ2_XXS | Image by Author

What is an Importance Matrix?

The importance matrix (imatrix) assigns an importance score to each weight or activation in the neural network. This importance score is typically calculated based on the sensitivity of the model’s output to changes in that particular weight or activation. The importance matrix allows for targeted quantization, where the most critical components are preserved at higher precision, while less important ones are quantized to save memory and computational resources. This targeted approach is especially important when quantizing to very low precision (i.e., 2-bit or lower), as it helps maintain the model’s usefulness.
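
As a conceptual sketch (not llama.cpp’s actual implementation), one way to approximate importance is to look at how strongly each input column is activated on some calibration data:

# Conceptual sketch: estimate per-column importance from calibration activations
import numpy as np
rng = np.random.default_rng(0)
acts = rng.normal(size=(1024, 8))        # made-up calibration activations entering a layer
importance = (acts ** 2).mean(axis=0)    # columns with larger activations matter more
print(importance.round(3))
# Weights multiplied by high-importance columns would be kept at higher precision.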

How to Build an Importance Matrix

So, I looked into how to create an imatrix from this README, downloaded the wiki raw dataset, and tried this bash command to create an imatrix (try it out if you have 8 hours, or a more powerful computer than I do):

# see for documentation: https://github.com/ggerganov/llama.cpp/blob/master/examples/imatrix/README.md
./imatrix -m <some_fp_model> -f <some_training_data>

# This is the command I ran to start generating an imatrix (takes ~8 hours)
./imatrix -m ./models/nous-hermes-2-mistral-7B-DPO/ggml-model-f16.gguf -f /Users/ingrid/Downloads/test-00000-of-00001.parquet

It would have taken a long time on my 32 GB M1 Mac (the estimate was about 8 hours for the imatrix), so I didn’t do it.

Time needed to imatrix | Image by Author

Quantization is a powerful technique for reducing the memory footprint and computational demands of large language models (LLMs) without significantly compromising performance. This article explored quantization in depth, including methods like naive quantization, k-means quantization, and QLoRA for fine-tuning quantized models.

Using relatable analogies and examples, we showed how quantization transforms the dense parameter space of LLMs into a more manageable form, enhancing efficiency and speed. We also covered practical applications of quantized LLMs, from edge computing to enterprise use cases.

The hands-on guide demonstrated quantizing LLMs using llama.cpp, walking through downloading models, conversion, quantization techniques, and evaluation metrics like perplexity.

As LLMs continue pushing boundaries, quantization will play a vital role in making these powerful models more accessible and deployable across diverse environments. Whether you’re a researcher, practitioner, or NLP enthusiast, mastering quantization can be a significant step in your journey with large language models.

A Llama at Home | Image by Author

Thanks for reading! If you want to share your thoughts, please leave a comment, or send me a message on LinkedIn.

I look forward to hearing & learning from you!
