What are Quantization and Distillation of Models?

BACKGROUND

LLMs (Large Language Models) are growing rapidly, with ever-increasing numbers of parameters, while deployments still need to meet crucial objectives:

  1. Real-time latency
  2. Low energy consumption
  3. High accuracy

However, this growth in model size and complexity poses significant challenges for infrastructure and cloud hosting.

The two main issues with deploying LLMs are:

  1. Too big for hardware: target edge devices often do not have enough memory to store and run the model.
  2. Expensive data types: by default, models use 32-bit floating-point data types for their computations. These large sizes and intensive compute requirements are substantial obstacles to fast and efficient inference.

Why do we need them?

If we run our models in the cloud, we want to minimize infrastructure costs and the carbon footprint. We can do this by reducing the size and/or complexity of LLMs while maintaining high accuracy.

Quantization and distillation are techniques used in machine learning to compress and optimize models, making them more efficient for deployment on resource-constrained devices or systems. Let’s look at each technique individually:

Quantization

Quantization refers to the process of reducing the precision or bit-width of the numerical values used to represent model parameters, usually from 𝑛 bits to 𝑚 bits, where 𝑛 > 𝑚. In most deep learning models, parameters are stored as 32-bit floating-point numbers. However, using lower-precision data types, such as 8-bit integers, can significantly reduce memory requirements and improve inference speed. This is desirable because integer arithmetic is less complex than floating-point arithmetic and thus faster to compute.
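
To make the memory savings concrete, here is a minimal back-of-the-envelope sketch (the 7-billion-parameter count below is just an illustrative assumption, not any specific model):

```python
# Rough memory needed just to store the parameters of a hypothetical 7B-parameter model.
params = 7_000_000_000
for name, bytes_per_param in [("FP32", 4), ("FP16/BF16", 2), ("INT8", 1)]:
    print(f"{name}: {params * bytes_per_param / 1e9:.0f} GB")
# FP32: 28 GB, FP16/BF16: 14 GB, INT8: 7 GB
```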

Floating-point numbers vs. fixed-point numbers

A) Floating-point numbers

Neural network model weights and biases are typically represented using numeric values, specifically floating-point numbers. The most common format used is the IEEE 754 standard for floating-point representation.

  • 32-bit (single precision), generally used in neural networks
  • 64-bit (double precision)

(Figure: 32-bit and 64-bit floating-point number representation)

The sign bit is 0 for positive numbers and 1 for negative numbers; the mantissa represents the significant digits of a floating-point number, while the exponent determines the scale of the number.

In machine learning jargon, FP32 is called full precision (4 bytes), while BF16 and FP16 are referred to as half precision (2 bytes). On top of that, the INT8 data type is an 8-bit representation that can store 2⁸ = 256 different values (in [0, 255] for unsigned or [−128, 127] for signed integers).

For example, the value −192 equals (−1)¹ × 2⁷ × 1.5: the sign bit is 1, the exponent is 7, and the mantissa is 1.5.
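
You can check this decomposition yourself; the following sketch uses Python's standard struct module to unpack the 32-bit IEEE 754 bit pattern of -192.0 into its sign, exponent, and mantissa fields:

```python
import struct

# Reinterpret -192.0 as its raw 32-bit IEEE 754 bit pattern (big-endian for readability).
bits = struct.unpack(">I", struct.pack(">f", -192.0))[0]

sign     = bits >> 31             # 1 -> negative
exponent = (bits >> 23) & 0xFF    # stored as 134; unbiased exponent = 134 - 127 = 7
fraction = bits & 0x7FFFFF        # fraction bits; the implicit leading 1 gives mantissa 1.5

mantissa = 1 + fraction / 2**23
value = (-1) ** sign * 2 ** (exponent - 127) * mantissa
print(sign, exponent - 127, mantissa, value)   # 1 7 1.5 -192.0
```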

B) Fixed-point numbers

A fixed-point number is also encoded using a sign bit and a mantissa, but it uses a single global, fixed exponent value that is shared across all fixed-point values.

(Figure: 8-bit fixed-point number representation)
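
As a rough illustration (a minimal sketch with an assumed format, not any particular hardware standard), an 8-bit signed fixed-point scheme with 4 fractional bits stores each value as an integer, and every value shares the same global scale of 2⁻⁴:

```python
FRACTIONAL_BITS = 4                  # global exponent shared by all values
SCALE = 2 ** -FRACTIONAL_BITS        # step between representable values: 0.0625

def to_fixed(x: float) -> int:
    """Encode a float as a signed 8-bit fixed-point integer (with saturation)."""
    return max(-128, min(127, round(x / SCALE)))

def from_fixed(q: int) -> float:
    """Decode the stored integer back to a float."""
    return q * SCALE

print(to_fixed(1.75), from_fixed(to_fixed(1.75)))      # 28 1.75   (exactly representable)
print(to_fixed(3.1415), from_fixed(to_fixed(3.1415)))  # 50 3.125  (rounded to the nearest step)
```

Values that are not an exact multiple of the step get rounded, and values outside the representable range get clipped; this is exactly the kind of error budget quantization has to manage.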

int8 quantization has become a popular approach for such optimizations, not only in machine learning frameworks like TensorFlow and PyTorch but also in hardware toolchains like NVIDIA® TensorRT and Xilinx® DNNDK, mainly because int8 uses 8-bit integers instead of floating-point numbers and integer math instead of floating-point math, reducing both memory and compute requirements.

Quantization involves two main steps:

  • Weight Quantization: In this step, the model’s weights, which are the learned parameters, are converted from higher-precision floating-point values (e.g., 32-bit) to lower-precision representations (e.g., 8-bit). This reduces the memory footprint required to store the model.
  • Activation Quantization: Quantizing the activations (intermediate outputs) of the model can further reduce memory and computational requirements. Activation quantization involves quantizing the input and output values of each layer, often using fixed-point representations.

How does quantization lose precision when mapping floating-point values to integers?
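
The sketch below illustrates it with plain NumPy and an affine (scale and zero-point) mapping, which is one common convention rather than the only one: a float32 weight matrix is quantized to int8, dequantized back, and the round-trip error is exactly the precision that was given up.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)      # pretend these are FP32 weights

# Affine (asymmetric) int8 quantization: map [w.min(), w.max()] onto [-128, 127].
qmin, qmax = -128, 127
scale = (w.max() - w.min()) / (qmax - qmin)          # real-valued step per integer level
zero_point = int(np.round(qmin - w.min() / scale))   # integer that represents 0.0

q = np.clip(np.round(w / scale) + zero_point, qmin, qmax).astype(np.int8)

# Dequantize and measure the round-trip error: this is the precision that is lost.
w_hat = (q.astype(np.float32) - zero_point) * scale
print("max abs error:", np.abs(w - w_hat).max())     # bounded by roughly scale / 2
```

The per-value error is bounded by roughly half the scale, so tensors with a narrow value range quantize well, while outliers stretch the scale and make everything else coarser.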

Methods of Quantizing

  1. Quantization-aware training (QAT): in this technique, the conversion of higher-precision floating-point parameters to lower-precision representations is simulated during the training phase. By exposing the model to quantization effects during training, it becomes more robust and better equipped to maintain accuracy even at reduced precision.
  2. Post-training quantization (PTQ): in this technique, after the training phase concludes, the neural network's parameters are frozen and no longer updated. The parameters are then quantized, resulting in a compressed model that is used for inference as-is. PTQ is typically applied as a separate step after the model has been trained with conventional methods (see the sketch after this list).
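
As a concrete PTQ illustration, here is a minimal sketch using PyTorch's dynamic post-training quantization API (torch.quantization.quantize_dynamic); the toy two-layer model is only a stand-in for a real trained network, and the exact API surface may vary across PyTorch versions:

```python
import torch
import torch.nn as nn

# A small toy model standing in for an already-trained network (hypothetical architecture).
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model.eval()  # parameters are frozen; no further training

# Dynamic PTQ: Linear weights are converted to int8 up front,
# activations are quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(quantized(x).shape)  # same interface as before, but with smaller int8 weights
```

Because dynamic quantization needs no retraining and no calibration data, it is a popular drop-in compression step, at the cost of less control over accuracy than QAT.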

Benefits of Quantization

  1. Memory efficiency (model parameters and activations are represented with lower-precision data types, such as 8-bit integers)
  2. Lower inference latency (lower-precision computations require fewer memory accesses and less computational work)
  3. Better power efficiency (lower-precision representations significantly reduce communication bandwidth and power consumption)

By quantizing models, it is possible to achieve a good balance between model size, inference speed, and accuracy. However, there is a trade-off between reduced precision and potential loss of model performance (precision vs. accuracy): lower-precision representations may discard information and reduce accuracy.

Model Distillation

Model distillation, also known as knowledge distillation, is a technique where a smaller model, often referred to as a student model, is trained to mimic the behavior of a larger, more complex model, known as a teacher model. The goal is to transfer the knowledge and performance of the larger model to the smaller one.

The distillation process typically involves the following steps:

  • Teacher Model Training: The teacher model, which is typically a large and accurate model, is trained on a labeled dataset using conventional techniques, such as deep learning.
  • Soft Target Generation: During the training of the teacher model, instead of using hard labels (one-hot encoded vectors), soft targets are generated. Soft targets are probability distributions produced by the teacher model for each input example. These distributions contain more information than simple one-hot labels and provide a measure of the teacher model’s uncertainty.
  • Student Model Training: The student model, which is usually a smaller and more lightweight model, is trained to mimic the behavior of the teacher model using the soft targets as guidance. The student model is trained on the same labeled dataset, but it aims to replicate the output probabilities of the teacher model rather than directly predicting the hard labels.

The distillation process allows the student model to learn from the knowledge encoded in the soft targets, effectively transferring the teacher model’s knowledge. The resulting student model is smaller and faster than the teacher model while often maintaining a comparable level of performance.
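
A minimal sketch of the student's training objective, in the spirit of Hinton et al.'s knowledge distillation (the temperature T and mixing weight alpha are assumed hyperparameters): the student matches the teacher's temperature-softened probabilities while still learning from the hard labels.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: teacher probabilities softened by temperature T.
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    # KL term, scaled by T^2 to keep its gradient magnitude comparable.
    kd = F.kl_div(soft_student, soft_targets, reduction="batchmean") * (T * T)
    # Standard cross-entropy on the hard labels.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```

In a training loop, the teacher runs in eval mode with gradients disabled, and only the student's parameters are updated with this combined loss.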

Final thoughts

Both quantization and model distillation techniques are valuable in optimizing and deploying machine learning models in scenarios where resource constraints, such as limited memory or computational power, are a concern. They enable efficient model deployment on edge devices, embedded systems, or in situations where low-latency inference is critical.

Thank you for reading!🤗I hope that you found this article both informative and enjoyable to read.

For more information like this, follow me on LinkedIn.

References

  1. https://tivadardanka.com/blog/neural-networks-quantization
  2. https://www.researchgate.net/figure/How-to-quantize-a-pre-trained-model-via-either-Quantization-Aware-Training-QAT-or_fig1_357014029
  3. https://docs.nvidia.com/cuda/floating-point/index.html
  4. https://www.mathworks.com/company/newsletters/articles/what-is-int8-quantization-and-why-is-it-popular-for-deep-neural-networks.html


Sweety Tripathi
𝐀𝐈 𝐦𝐨𝐧𝐤𝐬.𝐢𝐨

Data scientist with an interest in NLP/GenAI and a love for dance, exploring the intersection of art and technology.