LLM Series - Quantization Overview

Enhancing Efficiency While Maintaining Quality

Abonia Sojasingarayar
7 min read · Sep 6, 2023
Quantization — Illustration

Quantization, a technique at the forefront of deep learning, is revolutionizing the landscape of neural network deployment. In this article, we delve into the concept of quantization, its types, advantages, and the practical steps to achieve optimal results. As we explore this powerful technique, we’ll uncover how quantization strikes the delicate balance between computational efficiency and model accuracy.

Quantization is the process of representing weights, biases, and activations in neural networks using lower-precision data types, such as 8-bit integers (int8), instead of the conventional 32-bit floating-point (float32) representation. By doing so, it significantly reduces the memory footprint and computational demands during inference, enabling deployment on resource-constrained devices.

Interested in Parameter-Efficient Fine-Tuning? Please do visit 👇

Diving into Quantization Types:

Number Representation:

For context on how quantization works, it helps to recall how computers represent numbers:
— Integer Representation: Binary representation for unsigned and signed integers.
— Real Number Representation: Floating-point representation with a sign, an exponent, and a fraction (also called the significand or mantissa).
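To make the floating-point layout concrete, here is a minimal NumPy sketch (the value 3.14 is just an arbitrary example) that pulls apart the sign, exponent, and mantissa bits of a float32:

```python
import numpy as np

# Inspect the IEEE 754 bit layout of a float32 value:
# 1 sign bit, 8 exponent bits, 23 fraction (mantissa) bits.
x = np.array(3.14, dtype=np.float32)
bits = int(x.view(np.uint32))

sign = bits >> 31
exponent = (bits >> 23) & 0xFF
mantissa = bits & 0x7FFFFF

print(f"bits     = {bits:032b}")
print(f"sign     = {sign}")
print(f"exponent = {exponent} (biased; unbiased = {exponent - 127})")
print(f"mantissa = {mantissa:023b}")
```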

1. Float32 to Float16 Quantization:

In this scenario, the transition is from 32-bit floating-point representation to 16-bit floating-point representation. Both data types share the same representation scheme, facilitating a straightforward conversion process. However, compatibility with float16 operations and hardware support is vital for successful implementation.
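A minimal sketch of how direct this cast is (assuming a recent PyTorch install; the tensor here is just a placeholder):

```python
import torch

# float32 -> float16 is a straightforward cast because both are IEEE-style
# floating-point formats; usability depends on hardware/kernel support for fp16.
w_fp32 = torch.randn(4, 4)            # default dtype: float32
w_fp16 = w_fp32.to(torch.float16)     # 16-bit copy, half the memory

print(w_fp32.element_size(), "bytes/elem vs", w_fp16.element_size(), "bytes/elem")
print("max abs rounding error:", (w_fp32 - w_fp16.float()).abs().max().item())

# For a whole model the same idea applies, e.g. model.half(),
# provided the target device can execute float16 kernels.
```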

2. Float32 to bfloat16 Quantization:

Similar to float16 quantization, this involves transitioning from 32-bit to 16-bit floating-point representation, but using the bfloat16 format. Because bfloat16 keeps float32's 8 exponent bits (with a shorter mantissa), it offers a much greater dynamic range than float16.

Comparison of the float32, bfloat16, and float16 numerical formats — Cerebras
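To see this dynamic-range difference in practice, a quick check with torch.finfo (PyTorch assumed available) compares the three formats:

```python
import torch

# Compare dynamic range and precision of the three formats discussed above.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):15s} max={info.max:.3e}  smallest normal={info.tiny:.3e}  eps={info.eps:.3e}")

# bfloat16 keeps float32's 8 exponent bits (similar dynamic range, ~3.4e38)
# but only 7 mantissa bits, so it trades precision for range relative to float16.
```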

3. Float32 to Int8 Quantization:

This quantization type poses more challenges due to the limited range of representable values in int8 compared to float32. The essence here is to carefully project the float32 value range onto the int8 space so that as little precision as possible is lost.

One big challenge for representing weights using lower precision is the smaller numerical range an INT8 can represent, as shown below:

Figure: the numerical range representable in int8 vs. float32 (Source)
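A minimal sketch of this projection, using a per-tensor affine mapping with a scale and zero point derived from the observed min/max (the helper names below are mine, not a library API):

```python
import torch

def quantize_affine(x: torch.Tensor, n_bits: int = 8):
    """Map a float32 tensor onto the int8 grid with a per-tensor affine scheme."""
    qmin, qmax = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1     # -128, 127
    x_min, x_max = x.min().item(), x.max().item()

    scale = (x_max - x_min) / (qmax - qmin)                      # float32 units per int8 step
    zero_point = int(round(qmin - x_min / scale))                # int8 value that maps to 0.0

    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax).to(torch.int8)
    return q, scale, zero_point

def dequantize_affine(q, scale, zero_point):
    return (q.to(torch.float32) - zero_point) * scale

x = torch.randn(4, 4)
q, scale, zp = quantize_affine(x)
x_hat = dequantize_affine(q, scale, zp)
print("max quantization error:", (x - x_hat).abs().max().item())
```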

Types of Quantization Strategies

1. Post-Training Quantization

2. Quantization-Aware Training

QAT — PTQ — arxiv:2103.13630

1. Post-Training Quantization (PTQ)

Post-training quantization quantizes a model after its training phase is complete. By reducing the precision of model parameters, typically from 32-bit floating-point representation to 8-bit integers, PTQ offers alluring benefits such as reduced memory consumption, faster inference times, and improved energy efficiency. However, PTQ often comes at the cost of model accuracy due to the mismatch between the original model and its quantized counterpart.
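As a minimal PTQ sketch, here is PyTorch's dynamic quantization applied to a toy model (the torch.ao.quantization namespace assumes a reasonably recent PyTorch release; older versions expose the same function under torch.quantization):

```python
import torch
import torch.nn as nn

# Post-training quantization of an already-trained model, no retraining involved.
# Dynamic quantization converts Linear weights to int8; activations are quantized
# on the fly at inference time.
model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10))
model.eval()

quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 256)
with torch.no_grad():
    print(quantized(x).shape)
```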

GGML vs GPTQ
GGML and GPTQ are both quantization formats designed to reduce model size and computational requirements by storing lower-precision model weights. Here's a brief comparison of the two approaches:

  • Optimization Targets: GGML models are optimized for CPU performance, making them faster on CPUs, while GPTQ models are tailored for GPUs, delivering faster inference on GPU hardware.
  • Inference Quality: Inference quality is believed to be similar for both GGML and GPTQ models, but some reports suggest GPTQ might perform slightly lower in specific scenarios.
  • Model Size: GGML models tend to be slightly larger than GPTQ models, which is an important consideration for resource requirements.
  • Compatibility: GPTQ models load directly through Hugging Face Transformers (via the AutoGPTQ integration), while GGML models are typically run with llama.cpp-based tooling; both formats are widely available on the Hugging Face Hub, simplifying their integration into NLP tasks.

Choosing the Right Model:

  • If you only have a CPU (no Nvidia GPU), GGML is recommended.
  • If you have an Nvidia GPU (even if it’s not the most powerful), GPTQ is a suitable choice.

GGML vs GPTQ — Source: 1littlecoder
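For illustration, loading a pre-quantized GPTQ checkpoint from the Hugging Face Hub typically looks like the sketch below; this assumes a recent transformers release with auto-gptq, optimum, and accelerate installed, a CUDA-capable GPU, and the repo id is only an example:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example repo id; substitute any GPTQ-quantized checkpoint you trust.
model_id = "TheBloke/Llama-2-7B-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Requires `pip install auto-gptq optimum accelerate` and an Nvidia GPU.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Quantization lets us", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```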

2. Quantization-Aware Training (QAT)

A technique that goes a step beyond PTQ to preserve accuracy after quantization. Unlike PTQ, where quantization is applied as a separate step after training, QAT incorporates quantization into the training process itself. By folding quantization-related operations (scaling, clipping, and rounding) into training, QAT optimizes the model weights to mitigate the accuracy loss that quantization would otherwise introduce.

The remarkable aspect of QAT is that it eliminates the need for separate calibration after the training process. The model undergoes calibration as part of training, allowing it to effectively adapt to the quantization constraints. Consequently, the model becomes ‘quantization-aware,’ ensuring that accuracy is preserved during real-world inference.

PTQ vs QAT — Q/DQ (quantize then de-quantize) — Source

QAT Mechanism

QAT relies on ‘fake’ quantization modules, marked as Q/DQ (quantize then de-quantize), during training. These modules let the model acclimatize to low-precision weights and account for the calculation errors inherent to quantization. The loss function in QAT fine-tunes the model while taking these errors into account, further enhancing the model’s ability to maintain accuracy post-quantization.
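A minimal eager-mode sketch of this mechanism with PyTorch's QAT API (the tiny ToyNet model, the random data, and the three-step loop are placeholders; the fbgemm backend assumes an x86 CPU):

```python
import torch
import torch.nn as nn
from torch.ao.quantization import QuantStub, DeQuantStub, get_default_qat_qconfig, prepare_qat, convert

class ToyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()      # marks where float -> int8 happens at inference
        self.fc1 = nn.Linear(32, 32)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(32, 2)
        self.dequant = DeQuantStub()  # marks where int8 -> float happens

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return self.dequant(x)

model = ToyNet().train()
model.qconfig = get_default_qat_qconfig("fbgemm")
qat_model = prepare_qat(model)        # inserts fake-quant (Q/DQ) modules

# Stand-in training loop: the forward pass now simulates int8 rounding/clipping,
# so the loss "sees" quantization error and the weights adapt to it.
optimizer = torch.optim.SGD(qat_model.parameters(), lr=1e-3)
for _ in range(3):
    x, y = torch.randn(8, 32), torch.randint(0, 2, (8,))
    loss = nn.functional.cross_entropy(qat_model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

qat_model.eval()
int8_model = convert(qat_model)       # swaps fake-quant modules for real int8 kernels
print(int8_model(torch.randn(1, 32)))
```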

1. Naive Quantization

Naive quantization involves applying uniform quantization to all operators, leading to a uniform drop in model accuracy. While easy to implement, this method does not account for varying sensitivities of different layers to quantization errors.

2. Hybrid Quantization

Hybrid quantization strikes a balance by quantizing some operators to INT8 precision while leaving others in higher precision (FP16 or FP32). Achieving this balance requires prior knowledge of the model’s sensitivity to quantization. Despite the challenge of identifying quantization-sensitive layers, hybrid quantization offers better accuracy and latency compared to naive quantization.

3. Selective Quantization

Selective quantization quantizes specific operators to INT8 precision, employing diverse calibration methods and granularities (per channel or per tensor), as sketched below. This approach accommodates layers that thrive in higher precision due to sensitivity, as well as those that excel at INT8 precision. By offering the flexibility to tailor quantization parameters to different parts of the network, selective quantization maximizes accuracy and minimizes latency simultaneously.
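One way to express this kind of selectivity with PyTorch's eager-mode API is to quantize only the submodules you name; a minimal sketch, where the choice of which layers to keep in float is purely illustrative:

```python
import torch
import torch.nn as nn
from torch.ao.quantization import quantize_dynamic, default_dynamic_qconfig

# Toy model: quantize the first two Linear layers to int8, keep the final
# (hypothetically sensitive) head in float32.
model = nn.Sequential(
    nn.Linear(128, 128),   # submodule "0" -> int8
    nn.ReLU(),
    nn.Linear(128, 128),   # submodule "2" -> int8
    nn.ReLU(),
    nn.Linear(128, 10),    # submodule "4" stays float32
)

quantized = quantize_dynamic(
    model,
    qconfig_spec={"0": default_dynamic_qconfig, "2": default_dynamic_qconfig},
)
print(quantized)  # "0" and "2" become dynamically quantized Linear layers, "4" stays Linear
```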

🎯 Advantages of Quantization:

Quantization offers several compelling benefits, making it a cornerstone in neural network optimization:

1. Trimmed Memory Consumption:
Quantized models require significantly less memory storage, a crucial advantage for deployment on devices with restricted memory capacity.

2. Reduced Energy Consumption:
Theoretically, quantized models may consume less energy due to reduced data movement and storage operations, contributing to sustainability.

3. Turbocharged Inference:
Integer arithmetic is generally faster than floating-point arithmetic, leading to speedup in operations like matrix multiplications and boosting computational efficiency.

4. Embedding in Limited Devices:
Many embedded devices only support integer data types. Quantization paves the way for deploying models on such devices that lack native floating-point support.

❗ Navigating Quantization Challenges:

Quantization is not without its challenges:

1. Overflow and Underflow:
Careful scaling and clipping are essential to prevent quantized values from causing overflow or underflow issues.

2. Symmetric vs. Affine Quantization:
The choice between symmetric and affine (asymmetric) quantization affects the arithmetic operations and the precision of the quantized model (see the sketch after this list).

3. Per-Tensor vs. Per-Channel Quantization:
The granularity of quantization parameters can be varied to balance accuracy and memory requirements, adding complexity to the process.
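To make challenges 2 and 3 concrete, here is a small sketch comparing symmetric, affine, and per-channel schemes on a toy tensor (the tensor and the int8 range are the only inputs; nothing here is tied to a specific library API):

```python
import torch

x = torch.randn(64, 32) * 3 + 1.5          # toy weight tensor, deliberately asymmetric
qmin, qmax = -128, 127

# Symmetric (per-tensor): zero point fixed at 0, scale set by the largest magnitude.
scale_sym = x.abs().max() / qmax
q_sym = torch.clamp(torch.round(x / scale_sym), qmin, qmax)

# Affine (per-tensor): scale and zero point derived from the full [min, max] range.
scale_aff = (x.max() - x.min()) / (qmax - qmin)
zp_aff = torch.round(qmin - x.min() / scale_aff)
q_aff = torch.clamp(torch.round(x / scale_aff) + zp_aff, qmin, qmax)

# Per-channel symmetric: one scale per output channel (row), better for skewed channels.
scale_ch = x.abs().amax(dim=1, keepdim=True) / qmax
q_ch = torch.clamp(torch.round(x / scale_ch), qmin, qmax)

for name, q, s, zp in [("symmetric", q_sym, scale_sym, 0),
                       ("affine", q_aff, scale_aff, zp_aff),
                       ("per-channel", q_ch, scale_ch, 0)]:
    x_hat = (q - zp) * s                   # de-quantize and measure reconstruction error
    print(f"{name:12s} max error: {(x - x_hat).abs().max().item():.4f}")
```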

Practical Implementation Steps:

For successful quantization, follow these steps:

1. Select Quantization-Prone Operators:
Identify operators with high computation demands, like matrix multiplications.

2. Dynamic Quantization Trial:
Test dynamic quantization for speed; if satisfactory, stop here.

3. Static Quantization Experimentation:
For improved speed, apply post-training static quantization with observers (see the sketch after this list).

4. Calibration Technique Choice:
Opt for a suitable calibration technique like min-max, moving average min-max, or histogram.

5. Model Conversion:
Remove observers and convert float32 operators to int8 counterparts.

6. Quantized Model Evaluation:
Check if accuracy meets the requirements; if not, consider quantization-aware training.
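Steps 3 to 6 above map quite directly onto PyTorch's eager-mode static quantization flow; a minimal sketch under the assumption of an x86 CPU (fbgemm backend), with a toy model and random calibration data standing in for the real ones:

```python
import torch
import torch.nn as nn
from torch.ao.quantization import QuantStub, DeQuantStub, get_default_qconfig, prepare, convert

class ToyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()
        self.fc1 = nn.Linear(64, 64)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(64, 10)
        self.dequant = DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return self.dequant(x)

model = ToyNet().eval()

# Step 3: attach observers (the fbgemm default qconfig uses a histogram observer
# for activations and a min/max-style observer for weights).
model.qconfig = get_default_qconfig("fbgemm")
prepared = prepare(model)

# Step 4: calibration pass over representative data so the observers record ranges.
with torch.no_grad():
    for _ in range(32):
        prepared(torch.randn(8, 64))

# Step 5: remove observers and swap float32 operators for their int8 counterparts.
int8_model = convert(prepared)

# Step 6: evaluate the quantized model on your validation set as usual.
print(int8_model(torch.randn(1, 64)))
```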

Conclusion:

In the realm of large language models (LLMs), the integration of quantization techniques has emerged as a pivotal enabler. Originally devised to enhance efficiency on constrained devices, quantization has now evolved to make fine-tuned LLMs more accessible to diverse users. It stands out as a potent technique in deep learning, enabling the deployment of sophisticated models on devices with limited resources and making it possible, for example, to efficiently fine-tune an LLM like Llama on a single GPU in a standard Colab environment. By understanding the nuances of quantization types, challenges, and practical implementation steps, we can harness its power to achieve a harmonious balance between computational efficiency and model accuracy.

Stay tuned for another interesting article in the LLM series. To receive updates on future articles, please subscribe, keep learning, and keep rocking!

Connect with me on Linkedin

Find me on Github

Visit my technical channel on Youtube

Support: Buy me a Coffee/Chai

