Introduction to AI Model Quantization Formats


When downloading models on HuggingFace, you often come across model names with labels like FP16, GPTQ, GGML, and more. For those unfamiliar with model quantization, these labels can be confusing. This article will introduce some common model quantization formats.

What is Quantization?

Quantization, in the context of AI and deep learning models, typically refers to converting a model’s parameters, such as weights and biases, from floating-point numbers to integers with lower bit widths, for example, from 32-bit floating-point values to 8-bit integers. In simple terms, quantization is like condensing a detailed book written with sophisticated vocabulary into a concise summary or a children’s version of the story. The summary or children’s version takes up less space and is easier to pass around, but it may lose some of the details present in the original book.
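As a rough illustration of the idea (not any particular library’s implementation), the Python sketch below quantizes a small float32 tensor to int8 with a single scale factor and then dequantizes it, showing both the 4x reduction in storage and the small rounding error that quantization introduces.

```python
import numpy as np

# A toy float32 "weight" tensor.
weights = np.random.randn(4, 4).astype(np.float32)

# Symmetric quantization: map the range [-max|w|, +max|w|] onto int8 values [-127, 127].
scale = np.abs(weights).max() / 127.0
q_weights = np.round(weights / scale).astype(np.int8)   # stored as 8-bit integers

# Dequantize to approximate the original values.
deq_weights = q_weights.astype(np.float32) * scale

print("storage:", weights.nbytes, "bytes (fp32) ->", q_weights.nbytes, "bytes (int8)")
print("max rounding error:", np.abs(weights - deq_weights).max())
```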

Why Quantization?

The purpose of quantization mainly includes the following points:

1. Reduced Storage Requirements: Quantized models have significantly smaller sizes, making them easier to deploy on devices with limited storage resources, such as mobile devices or embedded systems.

2. Accelerated Computation: Integer operations are generally faster than floating-point operations, especially on devices without dedicated floating-point hardware support.

3. Reduced Power Consumption: On certain hardware, integer operations consume less energy.

However, quantization has a drawback: it can lead to a reduction in model accuracy. This is because you are representing the original floating-point numbers with lower precision, which may result in some loss of information, meaning the model’s capabilities may decrease.

To mitigate this accuracy loss, researchers have developed various quantization strategies and techniques, such as dynamic quantization and weight sharing, which reduce the required overhead while minimizing the loss in model capability. For example, if the full capability of a model is 100, and the model size and inference memory requirements are also 100, then after quantization its capability might drop to 90 while its size and inference memory requirements drop to 50. This is the purpose of quantization.

FP16/INT8/INT4

On HuggingFace, if a model’s name carries no precision identifier, as with Llama-2-7b or chatglm2-6b, it generally indicates that the model is stored in full precision (FP32), although some such models are in half precision (FP16). However, if the name includes a term like fp16, int8, or int4, as in Llama-2-7B-fp16, chatglm-6b-int8, or chatglm2-6b-int4, the model has been quantized, with fp16, int8, or int4 denoting the precision it was quantized to.

Quantization precision ranges from high to low as follows: fp16 > int8 > int4. Lower quantization precision results in a smaller model and lower GPU memory requirements, but it can also degrade model performance.

Take ChatGLM2-6B as an example. The unquantized half-precision (FP16) release of this model is about 12 GB and requires around 12–13 GB of GPU memory during inference. In contrast, the quantized INT4 version is 3.7 GB and requires about 5 GB of GPU memory during inference. As you can see, quantization dramatically reduces a model’s size and memory requirements.
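These sizes follow almost directly from parameter count times bytes per parameter. Here is a rough back-of-the-envelope sketch in Python (assuming roughly 6.2 billion parameters for ChatGLM2-6B; published files also include metadata and some unquantized tensors, so real sizes come out a bit larger):

```python
def approx_model_size_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage: parameters x bits per parameter, in gigabytes."""
    return num_params * bits_per_param / 8 / 1e9

for label, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{label}: ~{approx_model_size_gb(6.2e9, bits):.1f} GB")
# FP32: ~24.8 GB, FP16: ~12.4 GB, INT8: ~6.2 GB, INT4: ~3.1 GB
```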

Models with FP32 and FP16 precision usually need to run on GPU servers, while models with INT8 and INT4 precision can run on CPUs.

GPTQ

GPTQ is a post-training quantization method that allows language models to be quantized to precision levels like INT8, INT4, INT3, or even INT2 without significant performance loss. If you come across model names on HuggingFace with “GPTQ” in them, such as Llama-2-13B-chat-GPTQ, it means these models have undergone GPTQ quantization. For example, the original half-precision (FP16) version of Llama-2-13B-chat is 26 GB, but after GPTQ quantization to INT4 precision its size drops to 7.26 GB.

If you are using the open-source Llama model, you can use the GPTQ-for-LLaMA[2] library to perform GPTQ quantization, which lets you quantize Llama models down to INT4 precision.

However, the more widely used GPTQ quantization tool today is AutoGPTQ, which can quantize not only Llama but any Transformer model.
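As a rough sketch of what GPTQ quantization with AutoGPTQ looks like (the model name and calibration sentence below are placeholders, and the exact API may differ between AutoGPTQ versions):

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder: any causal LM on HuggingFace
tokenizer = AutoTokenizer.from_pretrained(model_id)

# GPTQ is a post-training method: it needs a few calibration samples
# to measure and minimize the quantization error layer by layer.
examples = [tokenizer("Quantization reduces model size with little loss in quality.")]

quantize_config = BaseQuantizeConfig(bits=4, group_size=128)  # INT4, groupsize 128
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
model.quantize(examples)
model.save_quantized("llama-2-7b-gptq-int4")
```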

GGML

On HuggingFace, if you come across model names with “GGML,” such as Llama-2-13B-chat-GGML, it indicates that these models have undergone GGML quantization.

Some GGML model names include not only “GGML” but also suffixes like “q4,” “q4_0,” “q5,” and so on, such as Llama-2-7b-ggml-q4. Here, “q4” denotes a 4-bit GGML quantization type, and variants such as “q4_0” and “q4_1” are different 4-bit quantization schemes within GGML.
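GGML-quantized files are typically run on the CPU with llama.cpp or one of its bindings. Below is a minimal sketch using the llama-cpp-python binding (the file path is a placeholder, and an older binding version that still reads GGML files is assumed, since newer releases expect the successor GGUF format):

```python
from llama_cpp import Llama

# Load a 4-bit (q4_0) GGML model on the CPU; the path is a placeholder.
llm = Llama(model_path="./llama-2-7b-chat.ggmlv3.q4_0.bin", n_ctx=2048, n_threads=8)

output = llm("Q: What is model quantization? A:", max_tokens=64)
print(output["choices"][0]["text"])
```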

GPTQ vs GGML

GPTQ and GGML are currently the two primary methods for model quantization, but what are the differences between them? And which quantization method should you choose?

Here are some key similarities and differences between the two:

  • GPTQ runs faster on GPUs, while GGML runs faster on CPUs.
  • Models quantized with GGML tend to be slightly larger than those quantized with GPTQ at the same precision level, but their inference performance is generally comparable.
  • Both GPTQ and GGML can be used to quantize Transformer models available on HuggingFace.

Therefore, if your model runs on a GPU, it’s advisable to use GPTQ for quantization. If your model runs on a CPU, GGML is a recommended choice for quantization.
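For the GPU path, an already-quantized GPTQ model can be loaded straight from HuggingFace. Here is a sketch using AutoGPTQ (the repository name follows the Llama-2-13B-chat-GPTQ example above; treat it as a placeholder and adjust it to the repository you actually use):

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

repo = "TheBloke/Llama-2-13B-chat-GPTQ"  # placeholder GPTQ repository on HuggingFace
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoGPTQForCausalLM.from_quantized(repo, device="cuda:0", use_safetensors=True)

inputs = tokenizer("What is quantization?", return_tensors="pt").to("cuda:0")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))
```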

Groupsize

On HuggingFace, you may often come across model names that include terms like “32g” or “128g,” such as “pygmalion-13b-4bit-128g.” What do these terms signify?

The “g” in “128g” stands for “groupsize”: during quantization, the model’s weights are divided into groups of a specific size (here, 128), and a separate set of quantization parameters is computed for each group. Grouping helps improve the accuracy of quantization and preserve the model’s performance.
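A rough numpy sketch of the idea (illustrative only, not any particular library’s implementation): instead of one scale for an entire weight tensor, each group of 128 consecutive weights gets its own scale, which tracks local value ranges more closely.

```python
import numpy as np

def quantize_grouped(weights: np.ndarray, bits: int = 4, groupsize: int = 128):
    """Symmetric group-wise quantization: one scale per group of weights."""
    qmax = 2 ** (bits - 1) - 1                     # e.g. 7 for 4-bit values
    groups = weights.reshape(-1, groupsize)        # assumes length is a multiple of groupsize
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax
    q = np.round(groups / scales).astype(np.int8)  # int8 used as a container for the 4-bit values
    return q, scales

w = np.random.randn(1024).astype(np.float32)
q, scales = quantize_grouped(w)                    # 1024 weights -> 8 groups of 128
print(q.shape, scales.shape)                       # (8, 128) (8, 1)
```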

Summary

This post has given an overview of the common quantization formats for models on HuggingFace. Quantization techniques are essential for AI model deployment because they significantly reduce a model’s size and the GPU memory required for inference. If we want to make large language models accessible to ordinary people and run them on mobile devices, truly achieving “ubiquity,” then quantization technology will undoubtedly be indispensable.
