A Simple Tutorial on Quantizing Models with llama.cpp: safetensors to GGUF
What is quantization?
According to Hugging Face, quantization is essentially a way to compress weights and activations from a higher-precision representation (more bits) to a lower-precision one, reducing the cost and complexity of inference.
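To build some intuition, here is a toy sketch of symmetric 4-bit quantization: each weight is mapped to an integer in [-8, 7] plus one shared scale factor. This is only an illustration of the idea, not llama.cpp's actual block-wise scheme.

```python
def quantize_q4(weights):
    # 4-bit signed integers cover [-8, 7]; choose a scale so the
    # largest-magnitude weight maps to the edge of that range.
    scale = max(abs(w) for w in weights) / 7.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize_q4(q, scale):
    return [x * scale for x in q]

weights = [0.12, -0.53, 0.07, 0.91, -0.88]
q, scale = quantize_q4(weights)
restored = dequantize_q4(q, scale)
# Each restored value is within half a scale step of the original,
# so some precision is lost, but storage drops from 16 (or 32) bits
# per weight to 4 bits plus the shared scale.
```
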
Requirements:
conda, Python, Linux or WSL, build-essential
Let's begin:
We will use Microsoft's Phi-2 as our target model.
First we need to pull the model into a folder using git.
To do this, the first step is to install git-lfs (Git Large File Storage):
sudo apt-get install git git-lfs
git lfs install
git clone https://huggingface.co/microsoft/phi-2
Let's also create an environment via conda:
conda create --name phi_llm python=3.10
conda activate phi_llm
We will now use llama.cpp to convert the safetensors files to GGUF format. To do this, clone llama.cpp, install its requirements, and build via make:
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp/
pip install -r requirements.txt
make -j8
# if you have CUDA, try building with cuBLAS enabled instead:
# make LLAMA_CUBLAS=1 -j8
Convert the model
Then use the conversion script to convert the model from Hugging Face format to GGUF:
./convert-hf-to-gguf.py ../phi-2/ --outfile ../phi-2_fp16.gguf
cd ../
This will not be instant. The next part is to quantize the model from fp16 to 4-bit.
You can do this with llama.cpp, or with the low-level API from llama-cpp-python:
./llama.cpp/quantize phi-2_fp16.gguf phi-2_Q4_K_M.gguf Q4_K_M
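As a rough sanity check on what quantization buys you, here is a back-of-the-envelope size estimate. It assumes roughly 2.7B parameters for Phi-2 and roughly 4.5 bits per weight for Q4_K_M; both figures are approximations, and real GGUF files also carry metadata.

```python
params = 2.7e9                    # approximate Phi-2 parameter count
fp16_gb = params * 16 / 8 / 1e9   # 16 bits per weight
q4_gb = params * 4.5 / 8 / 1e9    # ~4.5 bits per weight for Q4_K_M
print(f"fp16: ~{fp16_gb:.1f} GB, Q4_K_M: ~{q4_gb:.1f} GB")
# fp16: ~5.4 GB, Q4_K_M: ~1.5 GB
```

So the quantized file should come out at roughly a quarter of the fp16 size.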
You can test it with:
llama.cpp/main --model phi-2_Q4_K_M.gguf --interactive
# if you want to use a GPU, add -ngl to offload layers:
# llama.cpp/main --model phi-2_Q4_K_M.gguf --interactive -ngl <number of layers your GPU can handle (a 3090 can do all layers)>
We can do the same with llama-cpp-python via the low-level API:
pip install llama-cpp-python
# If you want CUDA support, install with cuBLAS enabled:
# CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
We will adapt the following example:
https://github.com/abetlen/llama-cpp-python/blob/main/examples/low_level_api/quantize.py
python3
import llama_cpp
from llama_cpp import llama_model_quantize_params

# Positional fields in this version of the bindings: nthread=0 (auto),
# ftype=3 (Q4_1), allow_requantize=True, quantize_output_tensor=True,
# only_copy=False
params = llama_model_quantize_params(0, 3, True, True, False)
result = llama_cpp.llama_model_quantize(
    "phi-2_fp16.gguf".encode("utf-8"),
    "phi-2_Q4_1_low_level.gguf".encode("utf-8"),
    params,
)
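The ftype value 3 selects Q4_1. Conceptually, Q4_1 stores each block of weights as unsigned 4-bit integers plus a per-block scale d and minimum m, so that w ≈ d * q + m. Here is a toy per-block sketch of that idea; the real ggml implementation packs 32-weight blocks into a binary format.

```python
def quantize_q4_1_block(block):
    # Q4_1-style: map each weight to an unsigned 4-bit integer in [0, 15]
    # using a per-block minimum m and scale d, so that w ~ d * q + m.
    # Assumes the block is not constant (so d > 0).
    m = min(block)
    d = (max(block) - m) / 15.0
    q = [round((w - m) / d) for w in block]
    return q, d, m

def dequantize_q4_1_block(q, d, m):
    return [d * x + m for x in q]

block = [0.2, -0.4, 0.9, 0.05]
q, d, m = quantize_q4_1_block(block)
restored = dequantize_q4_1_block(q, d, m)
# The per-block minimum lets Q4_1 represent asymmetric weight ranges
# more accurately than a scale-only scheme.
```
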
A return value of 0 means the quantization succeeded. You can test the new file with:
llama.cpp/main --model phi-2_Q4_1_low_level.gguf --interactive
# add -ngl as before if you want to offload layers to a GPU
In conclusion, we have shown a straightforward way to convert a model from safetensors to GGUF and two ways to quantize the weights: using the llama.cpp quantize tool, or using the llama-cpp-python low-level API. In this tutorial we took Microsoft's Phi-2 LLM from fp16 precision down to Q4 precision.