Model Quantization: Who wouldn’t want their models slimmer?

Nawin Raj Kumar S · Published in kgxperience · Feb 14, 2024 · 4 min read

Ever caught yourself wishing weight loss was as effortless as it seems for AI models? Well, while shedding those extra pounds might still be a bit of a struggle for us, our trusty algorithms are here to make it look like a breeze. Just as we yearn for a trimmer lifestyle, developers crave lightweight ML/DL models, primed to gracefully dance on the stages of tiny microcontrollers and edge devices, promising swifter computation and superior performance. Because let’s face it, in the world of artificial intelligence, being light on your digital feet is the new black. And one of the ways to achieve this is “quantization”.

So what’s behind the process of quantization? Obviously, our model is not going to hit the gym and go on a diet. To understand this, let’s first see what weights are all about.

Those numbers you see within the neurons are called weights. Weights are stored as numbers of 64, 32, 16 or 8 bits, and of data types float, uint or int, depending on the model configuration. The data type in which the model stores its weights, together with its bit width, is called the precision of the model. A model can be stored at float64 to preserve maximum numerical accuracy, but the major drawback of this precision is that the model consumes a lot of storage, since each 64-bit parameter takes 8 bytes. Let’s say we have a 3-layer neural network with 5 neurons on the input layer, 3 on the second and 1 on the output layer. For simplicity, assume one weight per neuron, so 9 parameters of 8 bytes (64 bits) each. Therefore, the size of the model will be:

9 parameters × 8 bytes = 72 bytes

This might seem like a tiny amount, but for larger models with millions of parameters it adds up quickly. Let’s quantize the model to 32-bit (4-byte) weights and recalculate the size of the model.

9 parameters × 4 bytes = 36 bytes

Technically, the model size has been reduced by 50%. This is the power of quantization. So it’s a good thing, right? We can push our model to the lowest precision possible and get better performance, right? Well, definitely not.
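To make the arithmetic above concrete, here is a minimal NumPy sketch. The 9-parameter model is just the toy example from above, and the int8 line uses a simple symmetric scaling purely for illustration:

import numpy as np

# Toy "model": the 9 parameters from the example above, stored at float64.
weights_fp64 = np.random.randn(9).astype(np.float64)

# Quantize by re-casting to lower precision.
weights_fp32 = weights_fp64.astype(np.float32)
# Simple symmetric int8 quantization: scale into [-127, 127] and round.
scale = np.abs(weights_fp64).max() / 127.0
weights_int8 = np.round(weights_fp64 / scale).astype(np.int8)

print(weights_fp64.nbytes)  # 72 bytes (9 x 8 bytes)
print(weights_fp32.nbytes)  # 36 bytes (9 x 4 bytes)
print(weights_int8.nbytes)  # 9 bytes  (9 x 1 byte)

Dropping from float64 all the way down to int8 cuts the storage by 8x, which is exactly why quantization matters once you have millions of parameters.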

The more you quantize a model, the more accuracy you’ll lose. However, there are ways to mitigate this loss of accuracy:

  • Using QAT (Quantization-aware training)
  • Using Representative datasets

Quantization-aware training

Quantization-aware training is a process where the model is trained with emulated quantized weights, which makes the model aware that it will run inference using quantized weights. In a nutshell, QAT is like giving your model a crash course in surviving the quantization process.
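As a rough sketch, QAT in TensorFlow is usually done through the TensorFlow Model Optimization toolkit; here `model`, `x_train` and `y_train` are assumed to be an existing Keras model and its training data:

import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Wrap the Keras model with fake-quantization nodes so training
# "sees" the quantized weights and activations.
q_aware_model = tfmot.quantization.keras.quantize_model(model)

# Re-compile and fine-tune as usual.
q_aware_model.compile(optimizer='adam',
                      loss='sparse_categorical_crossentropy',
                      metrics=['accuracy'])
q_aware_model.fit(x_train, y_train, epochs=1)

# Convert the QAT model into an actually quantized TFLite model.
converter = tf.lite.TFLiteConverter.from_keras_model(q_aware_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_qat_model = converter.convert()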

Representative Datasets

A representative dataset is a small subset of your actual dataset that covers the different variants of inputs the model will see. It is used as a calibration reference, so the model can be quantized without losing much of the trained information inside it.
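For instance, a representative dataset can be as simple as a generator over a hundred or so calibration samples; here `train_ds` is assumed to be a tf.data.Dataset yielding single (image, label) examples:

import tensorflow as tf

def representative_dataset():
    # Feed ~100 single samples through the converter for calibration.
    for image, _ in train_ds.batch(1).take(100):
        yield [tf.dtypes.cast(image, tf.float32)]

We will plug exactly this kind of generator into the TF-Lite converter below.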

Tools Used for Quantization

  • In TensorFlow, we have a dedicated framework called TensorFlow Lite where we can deploy lightweight quantized models without losing much accuracy. TF-Lite provides several types of quantization:
  1. Post-training dynamic range quantization
  2. Post-training full integer quantization
  3. Post-training float-16 quantization

This is a sample snippet of post-training dynamic range quantization, which stores the weights at 8-bit integer precision:

import tensorflow as tf

# saved_model_dir is the path to an exported SavedModel.
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
# Optimize.DEFAULT enables post-training quantization of the weights.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quant_model = converter.convert()

Note: Full integer quantization in TensorFlow Lite requires a representative dataset to calibrate the model. So the above code would look like this:

import tensorflow as tf

# num_calibration_steps and input_sample are placeholders for your own calibration loop.
def representative_dataset_gen():
    for _ in range(num_calibration_steps):
        # Get sample input data as a numpy array in a method of your choosing.
        yield [input_sample]

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset_gen
tflite_quant_model = converter.convert()
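For completeness, the third option from the list above, post-training float16 quantization, only needs the target type set on the converter (again assuming saved_model_dir points to an exported model):

import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# Store the weights as 16-bit floats instead of 32-bit floats.
converter.target_spec.supported_types = [tf.float16]
tflite_fp16_model = converter.convert()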

Apart from TF-Lite, we can quantize models within the frameworks in which they were created, such as PyTorch and ONNX Runtime. Beyond these, there are also lower-bit schemes, such as 4-bit and 3-bit precision, used to quantize LLMs.
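As a small illustration on the PyTorch side (a sketch, assuming `model` is an existing torch.nn.Module), dynamic quantization of the Linear layers takes one call:

import torch

# Convert the weights of every Linear layer to 8-bit integers;
# activations are quantized on the fly at inference time.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)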

Thank You
