Introduction to Model Quantization

Sachinsoni
Nov 29, 2023


Reducing the size of deep learning models with 8-bit quantization

image by author

Quantization is a technique used to reduce the size and memory footprint of neural network models. It involves converting the weights and activations of a neural network from high-precision floating-point numbers to lower-precision formats, such as 16-bit floats or 8-bit integers. This can significantly reduce the model size and memory requirements, making it easier to deploy on edge devices with limited compute and memory resources.

A simple image showing a high-level overview of quantization. In reality, dedicated algorithms are used for the quantization process (here, a simple round-off method is used for illustration)
image by author
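Since that illustration is an image, here is a minimal NumPy stand-in showing the same idea; the weight values are made up for illustration, and only the naive round-off method is applied (the proper rescaling schemes are covered in the sections below).

import numpy as np

# Made-up FP32 weights, purely for illustration
weights_fp32 = np.array([[ 1.23, -0.78,  2.51],
                         [-3.04,  0.06,  1.99]], dtype=np.float32)

# Naive "round-off" quantization: round each weight to the nearest integer
weights_int8 = np.round(weights_fp32).astype(np.int8)
print(weights_int8)
# [[ 1 -1  3]
#  [-3  0  2]]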

Floating Point Representation:

Among various data types, floating point numbers are predominantly employed in deep learning due to their ability to represent a wide range of values with high precision. Typically, a floating point number uses n bits to store a numerical value. These n bits are further partitioned into three distinct components:

  1. Sign: The sign bit indicates the positive or negative nature of the number. It uses one bit where 0 indicates a positive number and 1 signals a negative number.
  2. Exponent: The exponent is a segment of bits that represents the power to which the base (usually 2 in binary representation) is raised. The exponent can also be positive or negative, allowing the number to represent very large or very small values.
  3. Significand/Mantissa: The remaining bits are used to store the significand, also referred to as the mantissa. This represents the significant digits of the number. The precision of the number heavily depends on the length of the significand.

To understand this better, let’s delve into some of the most commonly used data types in deep learning: float32 (FP32) and float16 (FP16):

image by Maxime Labonne
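To make this layout concrete, here is a small Python sketch (standard library only) that splits a float32 into its sign, exponent, and mantissa bits; the value 3.14 is just an arbitrary example.

import struct

def fp32_components(x):
    # Pack the float into its raw 32-bit pattern and read it back as an integer
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    sign = bits >> 31                 # 1 sign bit
    exponent = (bits >> 23) & 0xFF    # 8 exponent bits (biased by 127)
    mantissa = bits & 0x7FFFFF        # 23 mantissa bits (implicit leading 1)
    return sign, exponent, mantissa

sign, exponent, mantissa = fp32_components(3.14)
print(sign, exponent - 127, bin(mantissa))
# 0 1 0b10010001111010111000011  (i.e. 3.14 ≈ +1.57 × 2^1)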

FP32 is often termed “full precision” (4 bytes), while FP16 is termed “half-precision” (2 bytes). But could we do even better and store weights using a single byte? The answer is the INT8 data type, which consists of an 8-bit representation capable of storing 2⁸ = 256 different values. In the next section, we’ll see how to convert FP32 weights into an INT8 format.

8-bit Quantization :

In this section, we will implement two quantization techniques: a symmetric one with absolute maximum (absmax) quantization and an asymmetric one with zero-point quantization. In both cases, the goal is to map an FP32 tensor X (original weights) to an INT8 tensor X_quant (quantized weights).

With absmax quantization, the original number is divided by the absolute maximum value of the tensor and multiplied by a scaling factor (127) to map inputs into the range [-127, 127]. To retrieve the original FP32 values, the INT8 number is divided by the quantization factor, acknowledging some loss of precision due to rounding.

For instance, let’s say we have an absolute maximum value of 3.2. A weight of 0.1 would be quantized to round(0.1 × 127/3.2) = 4. If we want to dequantize it, we would get 4 × 3.2/127 = 0.1008, which implies an error of 0.0008.
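A minimal NumPy sketch of absmax quantization, assuming a made-up weight tensor that contains the 0.1 weight and the 3.2 maximum from the example above (this is an illustration, not the exact code from the referenced article):

import numpy as np

def absmax_quantize(X):
    # Scale so that the largest magnitude maps to 127
    scale = 127 / np.max(np.abs(X))
    X_quant = np.round(scale * X).astype(np.int8)
    # Dequantize to recover approximate FP32 values
    X_dequant = X_quant / scale
    return X_quant, X_dequant

X = np.array([0.1, -0.4, 3.2, -1.5], dtype=np.float32)
X_quant, X_dequant = absmax_quantize(X)
print(X_quant)    # [  4  -16 127  -60]
print(X_dequant)  # [ 0.1008 -0.4031  3.2    -1.5118] (approximately)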

With zero-point quantization, we can consider asymmetric input distributions, which is useful when you consider the output of a ReLU function (only positive values), for example. The input values are first scaled by the total range of values (255) divided by the difference between the maximum and minimum values. This distribution is then shifted by the zero-point to map it into the range [-128, 127] (notice the extra value compared to absmax). First, we calculate the scale factor as 255 divided by the range (max - min), and the zero-point as -round(scale × min) - 128.

Then, we can use these variables to quantize our weights, X_quant = round(scale × X + zero_point), or dequantize them, X_dequant = (X_quant - zero_point) / scale.

Let’s take an example: we have a maximum value of 3.2 and a minimum value of -3.0. The scale is 255/(3.2 + 3.0) = 41.13 and the zero-point is -round(41.13 × -3.0) - 128 = 123 - 128 = -5, so our previous weight of 0.1 would be quantized to round(41.13 × 0.1 - 5) = -1. This is very different from the previous value obtained using absmax (4 vs. -1).
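A corresponding NumPy sketch of zero-point quantization, again on a made-up tensor whose maximum (3.2) and minimum (-3.0) match the worked example:

import numpy as np

def zeropoint_quantize(X):
    # Scale maps the full input range onto 255 integer steps
    scale = 255 / (np.max(X) - np.min(X))
    # Zero-point shifts the scaled values into [-128, 127]
    zero_point = -np.round(scale * np.min(X)) - 128
    X_quant = np.clip(np.round(scale * X + zero_point), -128, 127).astype(np.int8)
    # Dequantize to recover approximate FP32 values
    X_dequant = (X_quant - zero_point) / scale
    return X_quant, X_dequant

X = np.array([0.1, -3.0, 3.2, -1.5], dtype=np.float32)
X_quant, X_dequant = zeropoint_quantize(X)
print(X_quant)    # [  -1 -128  127  -67]
print(X_dequant)  # [ 0.0973 -2.9906  3.2094 -1.5075] (approximately)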

image by Maxime Labonne
image by author

In TensorFlow, there are two ways to perform quantization: post-training quantization and quantization-aware training.

I have an age and gender prediction model whose size is 18.1 MB. Now I will apply both methods and compare the resulting model sizes.

Code implementation of Post-Training Quantization :

# Loading the saved model
import tensorflow as tf
from tensorflow.keras.models import load_model
model = load_model('age_gender_detection.h5')

# Performing post-training quantization with the TFLite converter
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quantized_model = converter.convert()

# Saving the quantized model in .tflite format
with open("tflite_quant_model.tflite", "wb") as f:
    f.write(tflite_quantized_model)
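To sanity-check the quantized model, it can be loaded back with the TFLite interpreter. This is a generic sketch: the dummy input below is only a placeholder, and the real preprocessing depends on your own model.

import numpy as np
import tensorflow as tf

# Load the quantized model and allocate its tensors
interpreter = tf.lite.Interpreter(model_path="tflite_quant_model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Dummy input with the model's expected shape (replace with a real preprocessed image)
dummy_input = np.random.rand(*input_details[0]['shape']).astype(input_details[0]['dtype'])
interpreter.set_tensor(input_details[0]['index'], dummy_input)
interpreter.invoke()

# One output per model head (e.g. gender and age)
for detail in output_details:
    print(detail['name'], interpreter.get_tensor(detail['index']))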

Code implementation of Quantization-Aware Training :

# Loading the saved model
import tensorflow as tf
from tensorflow.keras.models import load_model
model = load_model('age_gender_detection.h5')

# Wrapping the model with quantization-aware layers
import tensorflow_model_optimization as tfmot
quantize_model = tfmot.quantization.keras.quantize_model
q_aware_model = quantize_model(model)

# Re-compiling the wrapped model (two outputs: gender and age)
q_aware_model.compile(loss=['binary_crossentropy', 'mae'], optimizer='adam', metrics=['accuracy'])
q_aware_model.summary()

# Fine-tuning the quantization-aware model on the training data
q_aware_model.fit(x=X, y=[y_gender, y_age], batch_size=32, epochs=2, validation_split=0.2)

# Converting the fine-tuned model to a quantized TFLite model
converter = tf.lite.TFLiteConverter.from_keras_model(q_aware_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_qaware_model = converter.convert()

# Saving the model in .tflite format
with open("tflite_qaware_model.tflite", "wb") as f:
    f.write(tflite_qaware_model)

Size comparison after Quantization :

Size Comparison among the models

You can see in the image above that the model size is reduced after quantization. While post-training quantization effectively reduces the model size, quantization-aware training achieves a better balance between model size and performance, resulting in a slightly larger model with superior accuracy.
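The exact numbers depend on your model, but you can reproduce the comparison yourself with a small snippet like this (file names follow the code above):

import os

# Compare the original Keras model with the two quantized TFLite models
for path in ["age_gender_detection.h5",
             "tflite_quant_model.tflite",
             "tflite_qaware_model.tflite"]:
    size_mb = os.path.getsize(path) / (1024 * 1024)
    print(f"{path}: {size_mb:.2f} MB")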

References :

  1. https://towardsdatascience.com/introduction-to-weight-quantization-2494701b9c0c
  2. https://www.tensorflow.org/model_optimization/guide/quantization/training

I hope this article helps you to understand the concept of Quantization in deep learning. Thank you for reading this article!
