Exploring quantization in Large Language Models (LLMs): Concepts and techniques

Karthikeyan Dhanakotti
Data Science at Microsoft
10 min read · Aug 20, 2024

Large Language Models (LLMs) such as GPT have transformed natural language processing (NLP), with GPT-3 featuring an impressive 175 billion parameters. The significant computational power required to run these models, often involving multiple GPUs, necessitates methods to reduce these demands while preserving model performance. Technologies like quantization and distillation have been developed to shrink model sizes. Quantization, in particular, is a key technique for enhancing efficiency without greatly compromising performance. In this article, I explore what quantization is, how it works in the context of LLMs, and its practical applications.


What is quantization?

Quantization is a technique used in Machine Learning to optimize neural network models by reducing the precision of their parameters (weights and activations). This involves converting parameters from high-precision formats, like 32-bit floating point (FP32), to lower-precision formats, such as 8-bit integers (INT8). The primary objectives of quantization are to lower the model’s demand for computation and memory and to improve inference speed and efficiency — while also striving to retain the model’s original accuracy as much as possible.

Why is quantization important?

Quantization helps in four main areas:

  1. Efficiency: Quantization reduces the amount of memory required to store the model weights, making it feasible to run large models on devices with limited resources.
  2. Speed: Quantization enables lower precision computations to be performed faster, leading to quicker inference times.
  3. Energy consumption: Quantized models consume less power, which is critical for deploying models on mobile and edge devices.
  4. Cost: Reduced computational and memory requirements can lower the cost of deploying and running models on cloud infrastructure.

Uses and advantages of quantization

Quantization also confers five significant benefits:

  1. Deployment on edge devices: Quantized models can be deployed on devices with limited computational and memory resources, such as smartphones and IoT devices.
  2. Real-time applications: Faster inference times enabled by quantization make it suitable for real-time applications such as speech recognition and language translation.
  3. Cost-effective deployment: Reduced computational and memory requirements translate to lower operational costs in data centers.
  4. Energy efficiency: Lower power consumption makes quantization ideal for battery-powered devices and sustainable computing.
  5. Scalability: Quantization allows for the deployment of large language models in resource-constrained environments, enabling broader accessibility and usage.

Frequently used data types in AI and Machine Learning

The selection of a data type determines the computational resources needed, influencing the model’s speed and efficiency. In Deep Learning, achieving a balance between precision and computational performance is essential because higher precision generally requires more computational resources.

Floating-point numbers are frequently used in Deep Learning because they can represent a wide range of values with high precision. A floating-point number is made up of bits that store a numerical value, divided into three main components:

  1. Sign: This bit indicates if the number is positive or negative, with 0 for positive and 1 for negative.
  2. Exponent: This part represents the power to which the base (usually 2 in binary) is raised, allowing for very large or small values.
  3. Significand/mantissa: The remaining bits store the significant digits of the number. The precision of the number is largely determined by the length of the significand.
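
To make these three components concrete, here is a minimal Python sketch (standard library only) that unpacks the sign, exponent, and mantissa bits of an FP32 value. The helper name fp32_fields is purely illustrative.

```python
import struct

def fp32_fields(x: float):
    """Return the sign, biased exponent, and mantissa bits of an IEEE 754 FP32 value."""
    bits = int.from_bytes(struct.pack(">f", x), "big")  # pack as big-endian 32-bit float
    sign = bits >> 31                # 1 bit: 0 = positive, 1 = negative
    exponent = (bits >> 23) & 0xFF   # 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF       # 23 bits of the significand (implicit leading 1)
    return sign, exponent, mantissa

# -6.25 = -1.5625 x 2^2, so sign = 1 and the biased exponent is 127 + 2 = 129
print(fp32_fields(-6.25))
```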

The size of a Large Language Model (LLM) is influenced by the number of its parameters and their precision, typically represented in formats like float32 (FP32), float16 (FP16), or bfloat16 (BF16).

  • Float32 (FP32): The IEEE (Institute of Electrical and Electronics Engineers) 32-bit floating-point format includes eight bits for the exponent, 23 bits for the mantissa, and one bit for the sign. It provides high precision but has a significant computational and memory footprint.
  • Float16 (FP16): This format has five bits for the exponent and 10 bits for the mantissa, resulting in a narrower range compared to FP32. This can increase the risk of overflow with large numbers and underflow with small numbers.
  • bfloat16 (BF16): This format uses eight bits for the exponent and seven bits for the mantissa, offering a broader range than FP16 and reducing underflow and overflow risks. Despite having fewer significand bits and slightly lower precision, BF16 generally does not significantly impact model performance, making it suitable for Deep Learning tasks.

In Machine Learning, FP32 is referred to as full precision (four bytes), while BF16 and FP16 are known as half precision (two bytes). The int8 (INT8) data type uses an eight-bit representation, allowing for 256 different values, ranging from [0, 255] for unsigned integers or [-128, 127] for signed integers.

Consider a 500-million parameter LLM. Typically, weights are stored in FP32 (32-bit). The memory footprint of this model can be calculated as follows:

  • For FP32: 500 million params × 4 bytes = 2.0 gigabytes
  • For INT8: 500 million params × 1 byte = 0.5 gigabytes

Converting the weights to INT8 reduces the model’s size to a quarter of the original (from 2.0 GB to 0.5 GB), while FP16 would halve it. This reduction decreases memory usage and improves inference speed, though it may slightly affect accuracy. Additionally, some of these more compact models can be run effectively on a CPU (central processing unit).
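
As a quick sanity check, this back-of-the-envelope arithmetic is easy to script. The sketch below counts weight storage only (no activations, optimizer state, or runtime overhead) and uses decimal gigabytes to match the figures above.

```python
# Approximate weight memory: number of parameters x bytes per parameter
PARAMS = 500_000_000  # the 500-million parameter model from the example above

bytes_per_param = {"FP32": 4, "FP16/BF16": 2, "INT8": 1, "INT4": 0.5}

for dtype, nbytes in bytes_per_param.items():
    gb = PARAMS * nbytes / 1e9  # decimal gigabytes
    print(f"{dtype:>9}: {gb:.2f} GB")

# FP32 -> 2.00 GB, FP16/BF16 -> 1.00 GB, INT8 -> 0.50 GB, INT4 -> 0.25 GB
```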

How does quantization work?

Quantization involves converting the continuous (floating-point) values of a neural network’s parameters to discrete (integer) values to reduce the model’s computational and memory footprint. Here’s a step-by-step guide on how to perform quantization, including an example.

Steps to perform quantization

Quantization operates by converting a range of continuous values into a smaller set of discrete values. This process can be shown through the following steps:

  1. Range determination: Determine the range of values for the weights and activations. This can be done using min-max values or statistical methods such as mean and standard deviation.
  2. Scale and zero-point calculation: Calculate a scaling factor and zero point to map the continuous values to the discrete integer values. The scaling factor determines the step size between discrete values, while the zero point aligns the scale with the original data distribution.
  3. Quantization: Apply the scaling factor and zero point to convert the continuous values to discrete integer values.

Here are two methods to perform quantization: symmetric linear quantization with unsigned eight-bit integers and asymmetric linear quantization with unsigned eight-bit integers.

Symmetric linear quantization with unsigned eight-bit integers

In symmetric linear quantization, a single scale factor maps floating-point values to integers with no zero-point offset, so positive and negative values are scaled equally around zero. In the unsigned example below, the range starts at zero, so the scale factor alone defines the mapping.

Suppose we have weights ranging from 0.0 to 1000.00. Let’s examine how these weights are quantized for unsigned eight-bit integers.

1. Quantization parameters:

  • Floating-point range: [0.00, 1000.00]
  • Quantized range: [0, 255] (for unsigned eight-bit integers)

2. Calculate the scale factor:

The scale factor maps the range of floating-point values to the range of quantized integer values. For symmetric linear quantization:

Scale = (Max - Min) / (QMax - QMin)

where:

Max is the maximum value in the floating-point range (1000.00)

Min is the minimum value in the floating-point range (0.00)

QMax is the maximum value in the quantized range (255)

QMin is the minimum value in the quantized range (0)

Compute the scale factor:

Scale = (1000.00 - 0.00) / (255 - 0) ≈ 3.92

3. Quantize a floating-point value:

To convert a floating-point value X to its quantized integer representation Q:

Q = round((X - Min) / Scale)

Here, Min is used to adjust the value so that the range starts from zero.

Example calculations:

For X = 500.00: Q = round((500.00 - 0.00) / 3.92) ≈ 128

For X = 0.00: Q = round((0.00 - 0.00) / 3.92) = 0
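
The arithmetic above can be reproduced in a few lines of Python. This is a minimal sketch of the min-max scheme just described (NumPy assumed), not a production quantizer.

```python
import numpy as np

def quantize_symmetric(x, x_min=0.0, x_max=1000.0, q_min=0, q_max=255):
    """Map floating-point values in [x_min, x_max] to unsigned 8-bit integers."""
    scale = (x_max - x_min) / (q_max - q_min)        # ~3.92 for this example
    q = np.round((np.asarray(x) - x_min) / scale)    # shift so the range starts at zero, then scale
    return np.clip(q, q_min, q_max).astype(np.uint8), scale

q, scale = quantize_symmetric([0.0, 500.0, 1000.0])
print(scale)  # ~3.92
print(q)      # [  0 128 255]: 500 lands roughly in the middle of the quantized range
```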

Asymmetric linear quantization with unsigned eight-bit integers

Unlike the symmetric case, the floating-point range is not centered on zero, so in addition to a scale factor a zero point (offset) is required to align the floating-point range with the quantized range.

Suppose we have weights ranging from -20.0 to 1000.00. Let’s examine how these weights are quantized for unsigned eight-bit integers.

Determine the quantization parameters:

  • Floating-point range: [-20.0, 1000.00]
  • Quantized range: [0, 255] (for unsigned eight-bit integers)

Calculate the scale and zero point:

Scale (S): The scale factor maps the floating-point range to the quantized range:

S = (Max - Min) / (QMax - QMin)

where:

Max is the maximum value in the floating-point range (1000.00)

Min is the minimum value in the floating-point range (-20.00)

QMax is the maximum value in the quantized range (255)

QMin is the minimum value in the quantized range (0)

Substituting these values:

S = (1000.00 - (-20.00)) / (255 - 0) = 1020 / 255 = 4.0

Calculate the zero point:

In asymmetric quantization, the zero point (offset) is computed to align the minimum floating-point value with the minimum quantized value:

Z = round(QMin - Min / S)

Substitute the values:

Z = round(0 - (-20.00) / 4.0) = 5

Quantize a floating-point value:

To convert a floating-point value X to its quantized integer representation Q:

Q = round(X / S) + Z

Example calculations:

For X = 1000.00: Q = round(1000.00 / 4.0) + 5 = 250 + 5 = 255

For X = -20.00: Q = round(-20.00 / 4.0) + 5 = -5 + 5 = 0
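
The same worked example can be scripted end to end. Below is a minimal sketch of asymmetric quantization, plus the matching dequantization, using the scale and zero point derived above.

```python
import numpy as np

def quantize_asymmetric(x, x_min=-20.0, x_max=1000.0, q_min=0, q_max=255):
    """Asymmetric linear quantization to unsigned 8-bit integers."""
    scale = (x_max - x_min) / (q_max - q_min)        # (1000 - (-20)) / 255 = 4.0
    zero_point = int(round(q_min - x_min / scale))   # 0 - (-20 / 4) = 5
    q = np.clip(np.round(np.asarray(x) / scale) + zero_point, q_min, q_max)
    return q.astype(np.uint8), scale, zero_point

def dequantize(q, scale, zero_point):
    """Approximate recovery of the original floating-point values."""
    return (q.astype(np.float32) - zero_point) * scale

q, scale, zp = quantize_asymmetric([-20.0, 0.0, 1000.0])
print(scale, zp)                 # 4.0 5
print(q)                         # [  0   5 255]
print(dequantize(q, scale, zp))  # [-20.   0. 1000.], recovered up to rounding error
```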

Two types of LLM quantization

Two varieties of LLM quantization include post-training quantization (PTQ) and quantization-aware training (QAT).

Post-training quantization (PTQ)

  • Quantization is performed after the model has been fully trained.
  • It involves converting weights and potentially activations from higher precision to lower precision. Common methods include static and dynamic quantization.
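As a concrete illustration, PyTorch ships a post-training dynamic quantization API. The sketch below quantizes the linear layers of a small model to INT8 after training; the toy model is just a placeholder.

```python
import torch
import torch.nn as nn

# Placeholder "trained" model; in practice this would be a trained network or transformer block
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
model.eval()

# Post-training dynamic quantization: weights are stored in INT8,
# activations are quantized on the fly at inference time
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized_model(x).shape)  # same output shape, smaller weight footprint
```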

Quantization-aware training (QAT)

  • The model is trained with quantization considerations from the start.
  • During training, the model simulates lower precision operations, enabling it to adapt to the effects of quantization.
  • This typically offers better performance than PTQ, as the model learns to reduce quantization errors during training.
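A minimal eager-mode QAT sketch in PyTorch looks roughly like the following. The tiny model and the omitted training loop are placeholders; real models also need careful layer fusion and calibration.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()      # marks where FP32 -> INT8 conversion happens
        self.fc = nn.Linear(16, 4)
        self.dequant = torch.quantization.DeQuantStub()  # marks where INT8 -> FP32 conversion happens

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

model = TinyNet()
model.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")
torch.quantization.prepare_qat(model, inplace=True)  # insert fake-quantization observers

# ... run the normal training loop here so the model adapts to quantization noise ...

model.eval()
quantized = torch.quantization.convert(model)  # produce the actual INT8 model
```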

The bitsandbytes library in Python is a tool designed to optimize and enhance the performance of large language models (LLMs). It focuses on quantization and efficient computation techniques, which are crucial for handling the massive computations involved in LLMs. The library is designed primarily to work with PyTorch.
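
For example, loading a Hugging Face model with 8-bit weights via bitsandbytes can be as short as the sketch below. The model name is only an example, and recent transformers versions expect the BitsAndBytesConfig route shown here.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "facebook/opt-1.3b"  # example model; substitute any causal LM

# Ask transformers to load the weights in 8-bit via bitsandbytes
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place layers on available GPUs/CPU automatically
)

inputs = tokenizer("Quantization reduces", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```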

Techniques for LLM quantization

Four LLM quantization techniques include quantized low-rank adaptation (QLoRA), generative pre-trained transformer quantization (GPTQ), Georgi Gerganov Machine Learning / GPT-generated unified format (GGML/GGUF), and sparse quantized representations (SpQR).

QLoRA (quantized low-rank adaptation)

Overview: QLoRA combines low-rank adaptation (LoRA) with quantization. LoRA fine tunes a small set of additional weights (adapters) while freezing the original weights. QLoRA further reduces memory requirements by quantizing these weights to four-bit precision.

Mechanisms:

  • NF4 (NormalFloat four-bit): A four-bit data type that normalizes weights to the range [-1, 1] for improved precision compared to standard four-bit floats.
  • Double quantization (DQ): Applies a second round of quantization to the scaling factors of weight blocks to reduce memory usage further. Scaling factors are quantized from 32-bit to eight-bit, saving significant memory in large models.

Advantages: Significant memory reduction, feasible to run large models on single GPUs.

Disadvantages: Complexity in implementation, potential loss of precision.
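
A hedged QLoRA-style sketch using bitsandbytes together with the peft library is shown below; the model name and LoRA hyperparameters are illustrative only.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Llama-2-7b-hf"  # illustrative; any causal LM works

# 4-bit NF4 base weights with double quantization of the scaling factors
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

# Small trainable LoRA adapters on top of the frozen 4-bit weights
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```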

GPTQ (generative pre-trained transformer quantization)

Overview: GPTQ is designed to reduce model size by applying layer-wise quantization, optimizing quantized weights to minimize output error.

Mechanisms:

  • Layer-wise quantization: Quantizes the model one layer at a time, adjusting weights in batches and minimizing the mean squared error (MSE) between the original and quantized layers.
  • INT4/FP16 mixed precision: Uses four-bit integers for quantized weights while maintaining activations in 16-bit float (FP16) precision. Weights are dequantized during inference for computation in FP16.

Advantages: Efficient for models running on GPUs, maintains high precision during inference.

Disadvantages: May require additional computational resources during the quantization process.
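
With recent transformers releases, GPTQ quantization can be triggered at load time. This sketch assumes the optimum and auto-gptq backends are installed; the model name is just an example.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-1.3b"  # example model
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Layer-wise 4-bit GPTQ quantization, calibrated on a small text dataset
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto",
)
quantized_model.save_pretrained("opt-1.3b-gptq-4bit")  # weights are stored already quantized
```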

GGML/GGUF (Georgi Gerganov Machine Learning / GPT-generated unified format)

Overview: GGML quantizes models to run efficiently on CPUs. GGUF is an updated format that extends GGML’s capabilities to include non-Llama models and is more extensible.

Mechanisms:

  • k-Quant System: Divides model weights into blocks and quantizes them using various bit-width methods depending on importance (e.g., q2_k, q5_0, q8_0).
  • GGUF: Extends GGML to support a broader range of models and is backward-compatible.

Advantages: Optimized for CPU execution, supports a wide range of models.

Disadvantages: Less suited for GPU execution, potentially slower inference compared to GPU-optimized methods.
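
Running a GGUF-quantized model on a CPU typically goes through llama.cpp. With the llama-cpp-python bindings, a minimal sketch looks like this; the file path and quantization variant are placeholders.

```python
from llama_cpp import Llama

# Load a GGUF file quantized with one of the k-quant variants (e.g., Q4_K_M)
llm = Llama(
    model_path="./models/llama-2-7b.Q4_K_M.gguf",  # placeholder path to a quantized model
    n_ctx=2048,     # context window
    n_threads=8,    # CPU threads to use
)

output = llm("Explain quantization in one sentence:", max_tokens=64)
print(output["choices"][0]["text"])
```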

SpQR (sparse quantized representations)

Overview: SpQR combines sparsity with quantization to optimize model size and performance.

Mechanisms:

  • Sparse representation: Introduces sparsity by pruning non-essential weights, reducing the number of weights to be quantized.
  • Quantization: Applies quantization to the sparse weights to further reduce memory and computation.

Advantages: Reduces both the number of weights and their precision, enabling potentially significant memory savings.

Disadvantages: May lead to reduced model performance if too many weights are pruned or quantized too aggressively.

Comparison of quantization techniques

To summarize the four techniques side by side:

  • QLoRA: Four-bit NF4 weights with double-quantized scaling factors; best suited to fine tuning large models on a single GPU; the main trade-offs are implementation complexity and some loss of precision.
  • GPTQ: Four-bit integer weights with FP16 activations, quantized layer by layer; best suited to GPU inference; the quantization process itself requires extra compute.
  • GGML/GGUF: Block-wise k-quant weights in a CPU-friendly file format; best suited to CPU inference across a wide range of models; less suited to GPU execution.
  • SpQR: Pruning plus quantization of the remaining weights; offers potentially the largest memory savings; overly aggressive pruning or quantization can reduce model quality.

Conclusion

Quantization is a powerful technique to optimize Large Language Models, making them more efficient and practical for real-world applications. By reducing model size, increasing inference speed, and lowering power consumption, quantization enables the deployment of LLMs on a wide range of devices and platforms. As LLMs continue to evolve and grow in size, techniques like quantization are likely to play a critical role in ensuring their accessibility and usability.

Karthikeyan Dhanakotti is on LinkedIn.
