A Simple Tutorial on Quantizing Models with llama.cpp: safetensors to GGUF
What is quantization?
According to Hugging Face, quantization is essentially a way to compress weights and activations from a higher-precision representation (more bits) to a lower-precision one, reducing the cost and complexity of inference.
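To build some intuition, here is a toy sketch of symmetric 4-bit quantization: each weight is mapped to an integer in [-8, 7] plus one shared scale factor. This is only an illustration of the idea, not llama.cpp's actual block-wise scheme.

```python
def quantize_q4(weights):
    # 4-bit signed integers cover [-8, 7]; choose a scale so the
    # largest-magnitude weight maps to the edge of that range.
    scale = max(abs(w) for w in weights) / 7.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize_q4(q, scale):
    return [x * scale for x in q]

weights = [0.12, -0.53, 0.07, 0.91, -0.88]
q, scale = quantize_q4(weights)
restored = dequantize_q4(q, scale)
# Each restored value is within half a scale step of the original,
# so some precision is lost, but storage drops from 16 (or 32) bits
# per weight to 4 bits plus the shared scale.
```
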
Requirements:
conda, Python, Linux or WSL, build-essential
Let's begin:
We will use Microsoft's Phi-2 as our target model.
First we need to pull the model into a folder using git.
To do this, the first step is to install git-lfs (Git Large File Storage):
sudo apt-get install git git-lfs
git lfs install
git clone https://huggingface.co/microsoft/phi-2
Let's also create an environment via conda:
conda create --name phi_llm python=3.10
conda activate phi_llm
We will now use llama.cpp to convert the safetensors files to GGUF format. To do this, clone llama.cpp, install its requirements, and build via make:
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp/
pip install -r requirements.txt
make -j8
# if you have CUDA, try building with cuBLAS enabled instead:
# make LLAMA_CUBLAS=1 -j8
Convert the model
Then use the conversion script to convert the model from Hugging Face format to GGUF:
./convert-hf-to-gguf.py ../phi-2/ --outfile ../phi-2_fp16.gguf
cd ../
This will not be instant. The next part is to quantize the model from fp16 to 4-bit.
You can do this with llama.cpp, or with the low-level API from llama-cpp-python:
./llama.cpp/quantize phi-2_fp16.gguf phi-2_Q4_K_M.gguf Q4_K_M
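As a rough sanity check on what quantization buys you, here is a back-of-the-envelope size estimate. It assumes roughly 2.7B parameters for Phi-2 and roughly 4.5 bits per weight for Q4_K_M; both figures are approximations, and real GGUF files also carry metadata.

```python
params = 2.7e9                    # approximate Phi-2 parameter count
fp16_gb = params * 16 / 8 / 1e9   # 16 bits per weight
q4_gb = params * 4.5 / 8 / 1e9    # ~4.5 bits per weight for Q4_K_M
print(f"fp16: ~{fp16_gb:.1f} GB, Q4_K_M: ~{q4_gb:.1f} GB")
# fp16: ~5.4 GB, Q4_K_M: ~1.5 GB
```

So the quantized file should come out at roughly a quarter of the fp16 size.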
You can test it with:
llama.cpp/main --model phi-2_Q4_K_M.gguf --interactive
# if you want to use a GPU, add -ngl to offload layers:
# llama.cpp/main --model phi-2_Q4_K_M.gguf --interactive -ngl <number of layers your GPU can handle (a 3090 can do all layers)>
We can do the same with llama-cpp-python via the low-level API:
pip install llama-cpp-python
# If you want CUDA support, install with cuBLAS enabled:
# CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
We will adapt the following example:
https://github.com/abetlen/llama-cpp-python/blob/main/examples/low_level_api/quantize.py
python3
import llama_cpp
from llama_cpp import llama_model_quantize_params

# Positional fields in this version of the bindings: nthread=0 (auto),
# ftype=3 (Q4_1), allow_requantize=True, quantize_output_tensor=True,
# only_copy=False
params = llama_model_quantize_params(0, 3, True, True, False)
result = llama_cpp.llama_model_quantize(
    "phi-2_fp16.gguf".encode("utf-8"),
    "phi-2_Q4_1_low_level.gguf".encode("utf-8"),
    params,
)
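The ftype value 3 selects Q4_1. Conceptually, Q4_1 stores each block of weights as unsigned 4-bit integers plus a per-block scale d and minimum m, so that w ≈ d * q + m. Here is a toy per-block sketch of that idea; the real ggml implementation packs 32-weight blocks into a binary format.

```python
def quantize_q4_1_block(block):
    # Q4_1-style: map each weight to an unsigned 4-bit integer in [0, 15]
    # using a per-block minimum m and scale d, so that w ~ d * q + m.
    # Assumes the block is not constant (so d > 0).
    m = min(block)
    d = (max(block) - m) / 15.0
    q = [round((w - m) / d) for w in block]
    return q, d, m

def dequantize_q4_1_block(q, d, m):
    return [d * x + m for x in q]

block = [0.2, -0.4, 0.9, 0.05]
q, d, m = quantize_q4_1_block(block)
restored = dequantize_q4_1_block(q, d, m)
# The per-block minimum lets Q4_1 represent asymmetric weight ranges
# more accurately than a scale-only scheme.
```
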
A return value of 0 means the quantization succeeded. You can test the new file with:
llama.cpp/main --model phi-2_Q4_1_low_level.gguf --interactive
# add -ngl as before if you want to offload layers to a GPU
In conclusion, we have shown a straightforward way to convert a model from safetensors to GGUF and two ways to quantize the weights: using the llama.cpp quantize tool, or using the llama-cpp-python low-level API. In this tutorial we took Microsoft's Phi-2 LLM from fp16 precision down to Q4 precision.