Quantization vs Distillation in Neural Networks: A Comparison

Aaditya ura
5 min read · Nov 11, 2023


A dive into the techniques of quantizing and distilling deep learning models: What are they and how do they differ?

Deep learning models, especially those with vast numbers of parameters, pose challenges for deployment in resource-constrained environments. Two popular techniques, quantization and distillation, address this issue, aiming to make these models more lightweight without compromising too much on performance. But what do they entail, and how do they compare?

I’ll use the metaphor of a book and its author to simplify the topics.

Quantization: Precision for Efficiency

Quantization is all about numeric precision. By reducing the bit-width of weights and activations in a model, one can shrink the model size, potentially increasing inference speed.

Neural networks have interconnected neurons, each with weights and biases that are tuned during training. These parameter values, along with neuron activations, are typically stored in 32-bit floats, which provide precision but take up a lot of memory. For example, a 50-layer ResNet requires 168MB to store 26 million 32-bit weight values and 16 million 32-bit activation values.

Quantization aims to reduce this memory footprint by using lower bit-widths, such as 8-bit integers, to represent both weights and activations. This introduces quantization error but allows four times as many values to be stored in the same amount of memory. The goal is to balance this tradeoff between precision and memory usage. Advanced techniques like per-channel quantization, stochastic rounding, and re-training can minimize the impact on model accuracy.
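
To make the memory arithmetic concrete, here is a small NumPy sketch; the array size is illustrative, chosen to match the 26 million weights of the ResNet-50 example above:

```python
import numpy as np

# ~26 million weight values, matching the ResNet-50 example above (illustrative).
n_weights = 26_000_000

weights_fp32 = np.zeros(n_weights, dtype=np.float32)
weights_int8 = np.zeros(n_weights, dtype=np.int8)

print(f"float32 weights: {weights_fp32.nbytes / 1e6:.0f} MB")  # ~104 MB
print(f"int8 weights:    {weights_int8.nbytes / 1e6:.0f} MB")  # ~26 MB, a 4x reduction
```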

The two most common quantization cases are float32 -> float16 and float32 -> int8.

[Figure from https://arxiv.org/pdf/2103.13630.pdf]

Math Behind Quantization:

One common form, following the survey linked above, is uniform quantization: a real value r is mapped to an integer Q(r) = round(r / S) − Z, where S is a real-valued scale factor and Z is an integer zero point; dequantization approximately recovers the original value as S · (Q(r) + Z). This formula provides a straightforward and computationally efficient method for converting real numbers into quantized integers, making it a popular choice in many quantization schemes.
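
A minimal NumPy sketch of this idea, using symmetric quantization (zero point Z = 0) for simplicity; the function names are illustrative, not a library API:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric int8 quantization: Q(r) = round(r / S), clipped to the int8 range."""
    scale = np.abs(x).max() / 127.0   # S maps the largest magnitude to 127
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Approximate reconstruction: r is recovered as S * Q(r), up to rounding error."""
    return q.astype(np.float32) * scale

x = np.random.randn(5).astype(np.float32)
q, s = quantize_int8(x)
print(x)
print(dequantize(q, s))  # close to x, up to quantization error
```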

How To Quantize a Machine Learning Model?

Imagine you have two different ways to shrink the size of a book (representing a neural network) so it’s easier to carry around:

  1. Post-training Quantization: This is like writing the entire book using a regular pen and then, after you’re done, going back and rewriting it with a much finer pen to make it smaller. You don’t change anything about the story; you just make the letters smaller after you’ve finished writing. This is easier, but sometimes the smaller writing can be harder to read (meaning the accuracy of the neural network might drop). A short code sketch of this approach follows after this list.
  2. Quantization-Aware Training: This is like writing your book with a fine pen from the start. As you write, you’re aware of how small the letters need to be, so you adjust your writing style as you go. This way, the final small version of the book is easier to read from the beginning because you’ve been planning for it all along (meaning the neural network is trained to work well with the smaller, quantized version from the start).

In both cases, the goal is to make the book (or neural network) smaller and more efficient without losing the essence of the story (or the network’s accuracy).
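
For the first option, here is a minimal sketch of post-training dynamic quantization in PyTorch; the toy model is hypothetical, and any module with nn.Linear layers works the same way. Quantization-aware training needs a fuller workflow (inserting fake quantization during training) and is not shown here.

```python
import torch
import torch.nn as nn

# Toy stand-in model (hypothetical); imagine this is your fully trained network.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Post-training dynamic quantization: Linear weights are stored as int8,
# activations are quantized on the fly at inference time. No retraining needed.
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(quantized_model(x).shape)  # same interface as the original model
```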

Pros

  • Reduced Model Size: Shifting from 32-bit floating points to 8-bit integers, for instance, can reduce the model size fourfold.
  • Speed and Hardware Compatibility: Low-precision arithmetic can run faster on hardware accelerators that support it.
  • Memory Efficiency: Less data means reduced memory bandwidth requirements.

Cons

  • Accuracy Trade-offs: Lower precision can sometimes affect model performance.
  • Implementation Challenges: Quantization, particularly quantization-aware training, can be tricky.

Distillation: From Teacher to Student

Distillation involves training a smaller neural network, called the student, to mimic a larger pre-trained network, the teacher.

[Figure from https://towardsdatascience.com/can-a-neural-network-train-other-networks-cf371be516c6]

How to Distill a Machine Learning Model?

In broad terms, distillation methods fall into three categories:

  1. Offline Distillation: Imagine an aspiring author learning from an already published, successful book. The published book (the teacher model) is complete and fixed. The new writer (the student model) learns from this book, attempting to write their own based on the insights gained. In the context of neural networks, this is like using a fully trained, sophisticated neural network to train a simpler, more efficient network. The student network learns from the established knowledge of the teacher without modifying it.
  2. Online Distillation: Here, envision an aspiring author and a seasoned author writing their books simultaneously. As the seasoned author develops new chapters (updating the teacher model), the new author also writes their chapters (updating the student model), learning from the experienced author as they go. Both books evolve concurrently, with each author’s work informing the other’s. In neural networks, this translates to simultaneously training both the teacher and student models, allowing them to learn and adapt together, enhancing the student model’s learning process.
  3. Self-Distillation: In this scenario, the aspiring author is both a teacher and a student. They start writing a book with their current skill level. As they gain new insights and improve their writing, they revise their earlier chapters. This is self-teaching, where the author constantly refines their work based on their evolving understanding. In neural network terms, this method involves a single network learning and improving upon itself, using its more advanced layers or later stages of training to enhance its earlier layers or initial stages, effectively teaching itself to become more efficient and accurate.
[Figure from https://arxiv.org/abs/2006.05525]

Math Behind Distillation:

The objective in distillation is to minimize the divergence between the teacher’s predictions and the student’s predictions. The most commonly used measure for this divergence is the Kullback-Leibler divergence, KL(p_t || p_s) = Σ_i p_t(i) log(p_t(i) / p_s(i)), computed between the teacher’s output distribution p_t and the student’s p_s. In the standard formulation, both distributions are softened with a temperature T inside the softmax, and the resulting loss is scaled by T².
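
A minimal PyTorch sketch of this loss, following the standard temperature-scaled formulation (the temperature value and the toy logits below are illustrative):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T: float = 2.0):
    """KL divergence between the teacher's and student's softened output distributions."""
    student_log_probs = F.log_softmax(student_logits / T, dim=-1)
    teacher_probs = F.softmax(teacher_logits / T, dim=-1)
    # 'batchmean' matches the mathematical definition of KL divergence;
    # the T**2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (T ** 2)

# Offline distillation step: the teacher is frozen, only the student learns.
teacher_logits = torch.randn(8, 10)                      # e.g. teacher(x) under torch.no_grad()
student_logits = torch.randn(8, 10, requires_grad=True)  # student(x)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
```

In practice, this distillation term is usually combined with the ordinary cross-entropy loss on the ground-truth labels, weighted by a mixing coefficient.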

Pros

  • Size Flexibility: The student model’s architecture or size can be customized, offering a balance between size and performance.
  • Performance Retention: A well-distilled student can achieve performance close to its teacher, despite being more compact.

Cons

  • Training is Required: Unlike post-training quantization, distillation requires training the student model against the teacher’s outputs.
  • Training Overheads: Time and computational resources are needed to train the student model.

In Practice

Quantization often finds its place in hardware-specific deployments, while distillation is sought when one desires a lightweight model with performance close to a larger counterpart. In many scenarios, combining the two, by distilling a model and then quantizing the student, brings the benefits of both worlds. It’s essential to align the choice with the deployment needs, available resources, and acceptable trade-offs in terms of accuracy and efficiency.
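
As a rough sketch of that combined recipe, reusing the pieces above: assume the student has already been trained with a distillation loss, then apply post-training quantization to it for deployment (the model and layer choices are illustrative).

```python
import torch
import torch.nn as nn

# 1. A student network assumed to be already distilled from a larger teacher.
student = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# 2. Quantize the distilled student for deployment.
deployable = torch.quantization.quantize_dynamic(
    student, {nn.Linear}, dtype=torch.qint8
)
```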

Resources

  1. A Survey of Quantization Methods for Efficient Neural Network Inference [https://arxiv.org/pdf/2103.13630.pdf]
  2. Knowledge Distillation: A Survey [https://arxiv.org/pdf/2006.05525.pdf]
