The easiest way to convert a model to GGUF and Quantize


Convert PyTorch & Safetensors to GGUF

If you need full-precision F32, F16, or any quantized format, use the llama.cpp Docker container, which is the most convenient option on macOS, Linux, and Windows:

mkdir -p ~/models
huggingface-cli login
huggingface-cli download mistralai/Mistral-7B-Instruct-v0.3 --local-dir ~/models --include "*"

# Convert to GGUF (use $HOME inside quotes; a quoted "~" would not expand)
docker run --rm -v "$HOME/models":/repo ghcr.io/ggerganov/llama.cpp:full --convert "/repo" --outtype f32
ls ~/models | grep .gguf
#> ggml-model-f32.gguf

# Quantize from the F32 GGUF to Q4_K_M
docker run --rm -v "$HOME/models":/repo ghcr.io/ggerganov/llama.cpp:full --quantize "/repo/ggml-model-f32.gguf" "/repo/ggml-model-Q4_K_M.bin" "Q4_K_M"
ls ~/models | grep .bin
#> ggml-model-Q4_K_M.bin

That’s it!
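
To sanity-check the result, you can run the quantized file with the same container. A minimal sketch, assuming the full image still exposes the --run entrypoint (check the image's help output if your tag differs):

# Quick smoke test of the quantized model
docker run --rm -v "$HOME/models":/repo ghcr.io/ggerganov/llama.cpp:full --run -m "/repo/ggml-model-Q4_K_M.bin" -p "Write a haiku about quantization." -n 64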

Alternatively, use the ollama/quantize container:

mkdir -p ~/models
huggingface-cli login
huggingface-cli download mistralai/Mistral-7B-Instruct-v0.3 --local-dir ~/models --include "*"

# q6_K
docker run --rm -v "$HOME/models":/repo ollama/quantize -q q6_K /repo

ls ~/models | grep .bin
#> f16.bin
#> q6_K.bin

# If you wish to add this model to Ollama
echo "FROM $HOME/models/f16.bin" > ~/models/modelfile
ollama create "Mistral-Instruct-v0.3:7b" -f ~/models/modelfile

You can even try importing PyTorch/Safetensors weights into Ollama directly, without creating a temporary GGUF file:

model=mistralai/Mistral-7B-Instruct-v0.3
modelname=Mistral:7b-Instruct-v0.3
modeldir=${PWD}/${model}

huggingface-cli login
huggingface-cli download "${model}" --local-dir "${modeldir}" --include "*"

# Pointing Ollama at a directory of PyTorch/Safetensors files might not always work; try it first
echo "FROM ${modeldir}" > "${modeldir}/modelfile"
ollama create "${modelname}" -f "${modeldir}/modelfile"

# If pointing to the folder doesn't work, it should work with a GGUF:
echo "FROM ${modeldir}/ggml-model-f16.gguf" > "${modeldir}/modelfile"
ollama create "${modelname}" -f "${modeldir}/modelfile"

The resulting models will be in the same folder, with the .bin extension. A half-precision F16 (i.e. 16-bit floating point) file is created automatically.

Note: for Ollama/llama.cpp, F16 is treated as “full precision” on consumer-grade computers.
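
A modelfile can carry more than the FROM line. As a rough sketch (the system prompt and parameter values below are placeholders, not recommendations), you can bake in a system message and sampling defaults before running ollama create:

# Illustrative modelfile with a system prompt and default parameters
cat > "${modeldir}/modelfile" <<EOF
FROM ${modeldir}/ggml-model-f16.gguf
SYSTEM You are a concise, helpful assistant.
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
EOF
ollama create "${modelname}" -f "${modeldir}/modelfile"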

Brew & macOS

On macOS you can use Homebrew, though it can occasionally be unreliable because of dependencies.

brew install llama.cpp
ls /opt/homebrew/bin | grep llama
mkdir -p ~/models
huggingface-cli login
huggingface-cli download "mistralai/Mistral-7B-Instruct-v0.3" --local-dir ~/models --include "*"
# Convert to GGUF
llama-gguf ~/models --outtype f32
# Quantize GGUF
llama-quantize ~/models/ggml-model-f32.gguf ~/models/Q4_K_M.bin Q4_K_M

Hugging Face GGUF Quants Q2-Q8

https://huggingface.co/spaces/ggml-org/gguf-my-repo

The hard way

Here is the hardest way: building from source. It is often tricky because of the various dependencies and options you'll have to adjust manually for your environment, OS, GPU hardware, etc.:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
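
By default, make produces a CPU-only build. Hardware backends are enabled with build flags; as a rough sketch (flag names have changed between llama.cpp versions, so check the build docs in your checkout):

# NVIDIA GPUs: requires the CUDA toolkit
make LLAMA_CUDA=1
# Apple Silicon: the Metal backend is enabled by default on macOS builds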

Security Considerations

Converting a model to GGUF is essential for compatibility with many inference engines such as Ollama or LocalAI. Pre-made GGUF files are often available on platforms like Hugging Face, but their source and integrity can be questionable. They could potentially contain malicious code that, although it might not directly harm your computer, could exploit your GPU for unauthorized purposes.

Dependencies:

  • Docker or Docker Desktop
  • The Hugging Face CLI, installed and logged in: huggingface-cli login (see the snippet below)
  • Hugging Face may require you to accept the model's terms on its model card
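
If the CLI is missing, it ships with the huggingface_hub package; one common way to install it:

# Install the Hugging Face CLI and authenticate
pip install -U "huggingface_hub[cli]"
huggingface-cli login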

Available options

llama.cpp

usage: ./llama-quantize [--help] [--allow-requantize] [--leave-output-tensor] [--pure] [--imatrix] [--include-weights] [--exclude-weights] [--output-tensor-type] [--token-embedding-type] [--override-kv] model-f32.gguf [model-quant.gguf] type [nthreads]

--allow-requantize: Allows requantizing tensors that have already been quantized. Warning: This can severely reduce quality compared to quantizing from 16bit or 32bit
--leave-output-tensor: Will leave output.weight un(re)quantized. Increases model size but may also increase quality, especially when requantizing
--pure: Disable k-quant mixtures and quantize all tensors to the same type
--imatrix file_name: use data in file_name as importance matrix for quant optimizations
--include-weights tensor_name: use importance matrix for this/these tensor(s)
--exclude-weights tensor_name: do not use importance matrix for this/these tensor(s)
--output-tensor-type ggml_type: use this ggml_type for the output.weight tensor
--token-embedding-type ggml_type: use this ggml_type for the token embeddings tensor
--keep-split: will generate quantized model in the same shards as input
--override-kv KEY=TYPE:VALUE: advanced option to override model metadata by key in the quantized model. May be specified multiple times.
Note: --include-weights and --exclude-weights cannot be used together

Allowed quantization types:
2 or Q4_0 : 3.56G, +0.2166 ppl @ LLaMA-v1-7B
3 or Q4_1 : 3.90G, +0.1585 ppl @ LLaMA-v1-7B
8 or Q5_0 : 4.33G, +0.0683 ppl @ LLaMA-v1-7B
9 or Q5_1 : 4.70G, +0.0349 ppl @ LLaMA-v1-7B
19 or IQ2_XXS : 2.06 bpw quantization
20 or IQ2_XS : 2.31 bpw quantization
28 or IQ2_S : 2.5 bpw quantization
29 or IQ2_M : 2.7 bpw quantization
24 or IQ1_S : 1.56 bpw quantization
31 or IQ1_M : 1.75 bpw quantization
10 or Q2_K : 2.63G, +0.6717 ppl @ LLaMA-v1-7B
21 or Q2_K_S : 2.16G, +9.0634 ppl @ LLaMA-v1-7B
23 or IQ3_XXS : 3.06 bpw quantization
26 or IQ3_S : 3.44 bpw quantization
27 or IQ3_M : 3.66 bpw quantization mix
12 or Q3_K : alias for Q3_K_M
22 or IQ3_XS : 3.3 bpw quantization
11 or Q3_K_S : 2.75G, +0.5551 ppl @ LLaMA-v1-7B
12 or Q3_K_M : 3.07G, +0.2496 ppl @ LLaMA-v1-7B
13 or Q3_K_L : 3.35G, +0.1764 ppl @ LLaMA-v1-7B
25 or IQ4_NL : 4.50 bpw non-linear quantization
30 or IQ4_XS : 4.25 bpw non-linear quantization
15 or Q4_K : alias for Q4_K_M
14 or Q4_K_S : 3.59G, +0.0992 ppl @ LLaMA-v1-7B
15 or Q4_K_M : 3.80G, +0.0532 ppl @ LLaMA-v1-7B
17 or Q5_K : alias for Q5_K_M
16 or Q5_K_S : 4.33G, +0.0400 ppl @ LLaMA-v1-7B
17 or Q5_K_M : 4.45G, +0.0122 ppl @ LLaMA-v1-7B
18 or Q6_K : 5.15G, +0.0008 ppl @ LLaMA-v1-7B
7 or Q8_0 : 6.70G, +0.0004 ppl @ LLaMA-v1-7B
1 or F16 : 14.00G, -0.0020 ppl @ Mistral-7B
32 or BF16 : 14.00G, -0.0050 ppl @ Mistral-7B
0 or F32 : 26.00G @ 7B
COPY : only copy tensors, no quantizing
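
Putting the usage line and the flags together, a typical invocation with an importance matrix and four threads might look like this (file names are illustrative):

# Quantize an F16 GGUF to IQ3_M using an importance matrix and 4 threads
./llama-quantize --imatrix imatrix.dat ggml-model-f16.gguf ggml-model-IQ3_M.gguf IQ3_M 4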

Ollama:

ollama/quantize -h
usage: entrypoint.sh [-m yes|no] [-n] [-q QUANT] MODEL

Converts a Pytorch model to GGUF format.

Flags:
-m yes|no - Merge the base model with the projector. Default: yes.
-n - Dry run. Do not actually run any commands.
-q QUANT - Quantization type. One of:
- q4_0 (default)
- q4_1
- q5_0
- q5_1
- q8_0
- q2_K
- q3_K_S
- q3_K_M
- q3_K_L
- q4_K_S
- q4_K_M
- q5_K_S
- q5_K_M
- q6_K
- f16
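
For example, to preview what the container would do and then quantize to q5_K_M:

# Dry run first (-n), then the actual q5_K_M quantization
docker run --rm -v "$HOME/models":/repo ollama/quantize -n -q q5_K_M /repo
docker run --rm -v "$HOME/models":/repo ollama/quantize -q q5_K_M /repo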

Model Architecture

Unfortunately, not all models available on Hugging Face can be converted to GGUF with llama.cpp. If a model's architecture is not supported by llama.cpp, the tool will crash with an error saying the model architecture is not supported, and the conversion is not possible.
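
A quick way to check before converting is to look at the architecture string declared in the model's config.json, which llama.cpp's converter must recognize (the path below is illustrative):

# Print the declared architecture(s) of the downloaded model
cd ~/models
python3 -c "import json; print(json.load(open('config.json')).get('architectures'))"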

BitNet b1.58

It is interesting to see how different forces and ideas shape the architecture of modern LLMs, how a balance is found between them, and how fast things change in the AI industry. At first the conclusion was that more is better: more data, more parameters, more models. Then the models grew to monstrous sizes, and we started shrinking them with quantization. Having pushed compression toward the 1-bit extreme, the next realization was that we could build 1-bit models from the very beginning, training them at that precision right away instead of training floating-point models and compressing them afterwards. In such a model each weight costs about one bit (roughly 1.58 bits for the ternary weights of BitNet b1.58) instead of the 32 bits an F32 weight always occupies, so the small size is built in “by design” rather than obtained through lossy post-training compression, and the model keeps the quality it was trained with. It also turns out that 1-bit models are easier to process, since many of the heavy floating-point operations are no longer needed. In short, newly trained b1.58 models will take up much less memory, which should give a significant boost to AI development: either the same quality using fewer compute resources, or, with the same resources as before, models with more built-in knowledge that are potentially much better.
Example: 1bitLLM/bitnet_b1_58-3B

Enjoyed This Story?

If you like this topic and you want to support me:

  1. Clap 👏 my article 10 times; that will help me out
  2. Follow me to get my latest articles and join our AI Sky Discord server 🫶
  3. Share this article on social media ➡️🌐
  4. Give me feedback in the comments 💬 below. It'll help me understand whether this work was useful; even a simple “thanks” will do. Good or bad, tell me what to improve and how.
  5. Connect with me or follow me on LinkedIn.
