Emily Yin
Published in USF-Data Science · Jun 7, 2024

Quantization Fundamentals for Model Compression

Why do we need Quantization?

In my past experience working at an AI startup, getting access to hardware with enough memory was a constant challenge. As AI models continue to grow in complexity and size, their hardware requirements grow too. Often, the demands for memory and computing power create a significant gap between what is theoretically possible and what can realistically be run on consumer-grade hardware. This is why model compression methods like quantization are so crucial today: they allow large generative AI models to run on consumer-grade hardware with minimal to no loss in performance.

I found DeepLearning.AI’s new short course, “Quantization Fundamentals,” an incredibly useful resource. This blog shares some of my key takeaways from the course, focusing on an overview of model compression, the data types used for lower-precision storage, a memory-footprint comparison across dtypes, and the basics of linear quantization.

Libraries needed:

# Python Version: Python 3.11.9
!pip install torch==2.1.1
!pip install transformers==4.35.0
!pip install quanto==0.0.11

Please refer to this Google Colab Notebook for the code in this blog, and feel free to leave your questions and thoughts in the comments.

Overview of Model Compression

There are several common techniques used to run large models on accessible accelerators:

  • Pruning involves removing weights or layers of a model that contribute little to its performance.
  • Knowledge distillation trains a smaller student model (the compressed target) using the outputs of a larger teacher model in addition to the main loss term. This method still requires significant computational resources to train the initial teacher model.
  • Quantization stores the parameters (model weights and activation values) in data types that have a lower precision and, therefore, a smaller memory footprint. By default, model parameters are usually stored in 32-bit floating-point (FP32) format. Quantization allows you to store these parameters in a lower-precision format, like 16-bit floating-point (FP16) or even 4-bit integer (INT4), saving considerable memory (we’ll see an example below).
  • Downcasting refers to converting a value from a higher-precision data type to a lower-precision one, for example FP32 to BF16:
tensor_fp32_to_bf16 = tensor_fp32.to(dtype=torch.bfloat16)
  • Mixed precision training performs the computations in a smaller precision like FP16 or BF16 but stores and updates the weights in a higher precision like FP32, balancing memory usage against model accuracy (see the sketch right after this list).
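As a rough illustration of the mixed-precision idea, here is a minimal sketch using PyTorch’s torch.autocast and GradScaler on a made-up toy model (this is not code from the course, and it assumes a CUDA GPU is available):

import torch
from torch import nn

# Hypothetical tiny model, optimizer, and data, just to show the mixed-precision loop
model = nn.Linear(16, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()  # scales the loss to avoid FP16 gradient underflow
x, y = torch.randn(8, 16).cuda(), torch.randn(8, 1).cuda()

for _ in range(10):
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = nn.functional.mse_loss(model(x), y)  # forward pass runs in FP16
    scaler.scale(loss).backward()   # backward pass on the scaled loss
    scaler.step(optimizer)          # weights are still stored and updated in FP32
    scaler.update()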

Different Data Types for Model Compression

Let’s first understand the data types we mentioned above. The choice of data type can affect not only memory usage but also computation speed and model accuracy.

Integers (INT8, 8-bit)

  • Unsigned integer (uint8): represents only non-negative integers, covering [0, 255] with 8 bits.
  • Signed integer (int8): represents both negative and positive integers, covering [-128, 127] with 8 bits.
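You can confirm these ranges directly with torch.iinfo:

import torch

# Value ranges of the 8-bit integer types
print(torch.iinfo(torch.uint8))  # min=0, max=255
print(torch.iinfo(torch.int8))   # min=-128, max=127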

Floating Points

Floating-point numbers offer a broader range of values and higher precision compared to integers. They consist of three parts: sign, exponent (range), and fraction (precision). Here’s a breakdown of common floating-point types:

  • FP32 (32-bit): 1 sign bit, 8 exponent bits, 23 fraction bits.
  • FP16 (16-bit): 1 sign bit, 5 exponent bits, 10 fraction bits.
  • BF16 (16-bit): 1 sign bit, 8 exponent bits (the same range as FP32), 7 fraction bits.

This means the fewer bits we use for the fraction, the less precise our values are.
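PyTorch’s torch.finfo makes this range/precision trade-off easy to see:

import torch

# Representable range (min/max) and relative precision (eps) for each float dtype
for dtype in [torch.float32, torch.float16, torch.bfloat16]:
    info = torch.finfo(dtype)
    print(dtype, "| min:", info.min, "| max:", info.max, "| eps:", info.eps)

Note how BF16 keeps roughly the same range as FP32 (same exponent width) but has a much larger eps, i.e., lower precision.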

Compare Models in Different Dtypes

Let’s compare the memory footprint of open-source models to understand the impact of different data types on a language task. In this blog, we use BertForMaskedLM as an example. If you would like to explore other modalities, the course uses BlipForConditionalGeneration for images.

from copy import deepcopy
from transformers import BertForMaskedLM
import torch

# Load a pre-trained BERT model (FP32 by default)
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Cast a copy to FP16 (deepcopy keeps the original FP32 model intact for comparison)
model_fp16 = deepcopy(model).half()

# Cast a copy to BF16
model_bf16 = deepcopy(model).to(torch.bfloat16)

However, not all transformer-based models are numerically stable in float16 (FP16), because FP16’s narrow exponent range makes overflow more likely. In such cases, BF16 (bfloat16) is a better alternative: it keeps the same exponent range as FP32 while giving up some precision in the fraction.

Quantization error is the difference between the outputs of the original higher-precision model and its lower-precision counterpart. Let’s examine the differences in output between the FP32 and BF16 models to measure this impact:

from transformers import BertTokenizer

# Tokenize an example input sentence (any input works; the reported numbers depend on it)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
input_ids = tokenizer("Paris is the [MASK] of France.", return_tensors="pt").input_ids

# Get logits from the FP32 and BF16 models
with torch.no_grad():
    logits_fp32 = model(input_ids).logits
    logits_bf16 = model_bf16(input_ids).logits

# Calculate mean and max absolute differences (the BF16 logits are promoted to FP32 here)
mean_diff = torch.abs(logits_bf16 - logits_fp32).mean().item()
max_diff = torch.abs(logits_bf16 - logits_fp32).max().item()

# Mean difference between FP32 and BF16: 0.03718937560915947
# Max difference between FP32 and BF16: 0.4945411682128906

To see the benefits of using different data types, let’s compare the memory footprint of the models:

# Memory footprint in bytes for FP32
fp32_mem_footprint = model.get_memory_footprint()

# Memory footprint in bytes for BF16
bf16_mem_footprint = model_bf16.get_memory_footprint()

print("Footprint of the fp32 model in MBs: ",
      fp32_mem_footprint / 1e6)
print("Footprint of the bf16 model in MBs: ",
      bf16_mem_footprint / 1e6)
# Footprint of the fp32 model in MBs:  438.065384
# Footprint of the bf16 model in MBs:  219.036788

relative_diff = bf16_mem_footprint / fp32_mem_footprint
# Relative diff: 0.500009350202389

From our example, we see that loading our BertForMaskedLM model in BF16 cuts the memory footprint of the original FP32 model in half.

In practice, if we want to avoid materializing the model in full precision in the first place, we can set the default data type to a smaller one:

torch.set_default_dtype(torch.bfloat16)
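Any floating-point parameters created after this call default to BF16. A quick check with a plain nn.Linear layer:

import torch
from torch import nn

torch.set_default_dtype(torch.bfloat16)
layer = nn.Linear(4, 4)                  # new parameters are created in bfloat16
print(layer.weight.dtype)                # torch.bfloat16
torch.set_default_dtype(torch.float32)   # restore the default when done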

Linear Quantization

Now that we are familiar with different data types, let’s dive into how post-training quantization (PTQ) works. PTQ transforms high-precision floating-point data (such as FP32) into lower-precision integer data (like INT8) through linear quantization. The linear mapping uses a scale factor and a zero-point offset to maintain the relationship between the original and quantized values. Here’s the formula for this transformation:

r = s × (q − z), or equivalently q = round(r / s) + z (clamped to the INT8 range)

Where:

  • r: the original floating-point value (e.g., FP32).
  • s: the scale factor in FP32; it defines the step size of the quantized grid. A higher s gives a coarser resolution, whereas a lower s gives a finer resolution.
  • q: the quantized representation, stored as an 8-bit integer.
  • z: the zero-point in INT8; it is the quantized value corresponding to a real value of 0 and serves as the reference point for scaling.
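To make the formula concrete, here is a minimal from-scratch sketch of linear (asymmetric) quantization for a single tensor. It only illustrates the math above; it is not the quanto implementation discussed next:

import torch

def linear_quantize(r_fp32: torch.Tensor, q_min: int = -128, q_max: int = 127):
    # Derive the scale s and zero-point z from the tensor's value range
    r_min, r_max = r_fp32.min(), r_fp32.max()
    s = (r_max - r_min) / (q_max - q_min)
    z = int(round(q_min - (r_min / s).item()))
    # q = round(r / s) + z, clamped to the INT8 range
    q = torch.clamp(torch.round(r_fp32 / s) + z, q_min, q_max).to(torch.int8)
    return q, s.item(), z

def linear_dequantize(q: torch.Tensor, s: float, z: int) -> torch.Tensor:
    # r ≈ s * (q - z)
    return s * (q.to(torch.float32) - z)

r = torch.randn(4, 4)
q, s, z = linear_quantize(r)
r_hat = linear_dequantize(q, s, z)
print("max quantization error:", (r - r_hat).abs().max().item())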

Using the `quanto` Library

The `quanto` library provides an easy way to quantize models in PyTorch. It converts a model’s linear layers into their quantized counterparts (Linear -> QLinear).

Here’s an example of how to quantize a linear layer using the `quanto` library:

from quanto import quantize, freeze
import torch

model = BertForMaskedLM.from_pretrained("bert-base-uncased")  # or any PyTorch model
# Quantize the weights to INT8; leave the activations in their original precision
quantize(model, weights=torch.int8, activations=None)
# freeze() replaces the float weights with their quantized INT8 versions
freeze(model)

After quantization, the model parameters are stored in INT8, and you can de-quantize back to FP32 when needed.
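As a quick sanity check (assuming the quantized BERT model from the snippet above), printing one of the attention projection layers shows the Linear -> QLinear swap mentioned earlier:

# The original nn.Linear modules have been replaced by quanto QLinear modules
print(model.bert.encoder.layer[0].attention.self.query)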

In this article, we discussed the key concepts of model compression, focusing on the role of quantization in reducing the memory footprint of AI models. To understand the fundamentals, we explored the data types involved and compared the memory usage of an open-source model across data types to see the benefits of quantization. We also introduced the math behind linear quantization and the “quanto” library that facilitates efficient quantization.

That’s all the basic concepts you need to know about quantization. If you are interested in techniques for fine-tuning quantized models, such as Quantization-Aware Training (QAT) or Quantized Low-Rank Adaptation of Large Language Models (QLoRA), please visit the course for more detailed information and hands-on practice.
