Quantization in practice | Machine Learning

Denis Muriungi
7 min read · Dec 21, 2022
[Image: artificial neural network (ANN)]

The use of neural networks in machine learning has been accompanied by huge numbers of parameters that must be stored and computed. These parameters increase hardware costs and pose other challenges, which has led many data scientists to employ compression approaches when designing efficient accelerators. Quantization is one of the most important approaches for compressing deep neural networks.

Data scientists leverage the power of quantization for various optimization reasons, which I will cover in this blog post. Below is an overview of what I will cover.

  1. What is quantization?
  2. Why is quantization important?
  3. Example using Kaggle Shoes data
  4. Data augmentation
  5. Training the model using post-training quantization
  6. Implementing the MobilenetV2 network
  7. Retraining the model with quant-aware training
  8. Result for both models

What is quantization?

Quantization comprises conversion techniques for performing computation and storing tensors at lower bit-widths than floating-point precision. For example, instead of using 32-bit floating-point numbers to represent the parameters of a neural network, we might use 8-bit integers. This can significantly reduce the memory and storage requirements of a model, as well as the amount of computation required to perform inference with it (a small numeric sketch follows the list below). Commonly used forms of quantization are:

  • Post-training quantization: A conversion technique employed to reduce the size of a model, thereby improving CPU and hardware accelerator latency, with little degradation in model accuracy.
  • Quantization-aware training: This form emulates inference-time quantization during training, creating a model that downstream tools can use to produce an actually quantized model.
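
To make the float32-to-int8 mapping concrete, here is a minimal NumPy sketch of affine (asymmetric) quantization; the tensor, scale, and zero point are illustrative and not taken from the post:

import numpy as np

# Example float32 tensor standing in for a layer's weights (illustrative)
weights = np.random.randn(4, 4).astype(np.float32)

# Map the tensor's [min, max] range onto the signed 8-bit range
qmin, qmax = -128, 127
scale = (weights.max() - weights.min()) / (qmax - qmin)
zero_point = int(round(qmin - weights.min() / scale))

# Quantize to int8, then dequantize to see the approximation error
q = np.clip(np.round(weights / scale) + zero_point, qmin, qmax).astype(np.int8)
reconstructed = (q.astype(np.float32) - zero_point) * scale
print('max quantization error:', np.abs(weights - reconstructed).max())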

Why is quantization important?

Below are some of the reasons that have led to the increased implementation of quantization in neural network (NN) models in the last few years.

  • Computation efficiency: Quantization can have a big impact on efficiency, especially when deploying machine learning models on resource-constrained devices such as smartphones or embedded systems. By using fewer bits to represent model parameters, we reduce the memory needed to store the model and the computation required to perform inference with it. This can make it possible to run machine learning models on devices that lack the resources to run models with floating-point parameters.
  • Overcoming sparsity in networks: Quantization can help deal with sparsity in a neural network by reducing the memory and computation required to store and process the model. This makes it easier to deploy the model on resource-constrained devices and improves the efficiency of the model in general.
  • Running models on small devices such as the Raspberry Pi: One way to optimize a machine learning model for deployment on a Raspberry Pi is quantization. By quantizing the model's parameters and activations, we can represent the model using fewer bits, which can significantly reduce its memory and storage requirements. This can make it possible to deploy a model on a Raspberry Pi even if it would otherwise be too large to fit in the device's memory or require too much computation to run.
  • Running models on-device rather than on-cloud: When a machine learning model is deployed on-cloud, it runs on a remote server or cluster of servers with access to large amounts of memory and compute. This is convenient because users can access the model from anywhere with an internet connection, but it can also be expensive and slow because data must be sent back and forth over the internet.

On the other hand, when a machine learning model is deployed on-device, it is run locally on the device itself. This can be faster and more cost-effective than running the model on the cloud because it eliminates the need to send data back and forth over the internet.

Example using Kaggle Shoes data

In this example, we will learn how to classify shoes using the MobileNetV2 model and the Kaggle shoes image dataset with post-training quantization. We will also show how to incorporate data augmentation into the model training process to improve the model's performance. The full notebook for this example can be found on Kaggle, using the link below.

First, we will load the Kaggle shoes image dataset and preprocess the data using the `tf.keras.utils.image_dataset_from_directory` utility, which reads class-labeled folders of images and automatically loads, resizes, and batches them as needed.

# Imports and shared constants (the batch size is an illustrative choice;
# 224x224 matches the target size used later in the post)
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

BATCH_SIZE = 32
image_height = image_width = 224

# Load the Kaggle shoes image dataset
data_dir = '/kaggle/input/shoes-classification-dataset-13k-images/Shoes Dataset/'
train_dir = data_dir + 'Train/'
test_dir = data_dir + 'Test/'
val_dir = data_dir + 'Valid/'

If you want 16-bit activations with 8-bit weights, also remember to specify the experimental TensorFlow Lite op set on the converter (created in the post-training quantization step below), which is achieved by using the code below.

converter.target_spec.supported_ops = [
    tf.lite.OpsSet.EXPERIMENTAL_TFLITE_BUILTINS_ACTIVATIONS_INT16_WEIGHTS_INT8
]

Data augmentation is a technique used to increase the diversity of a dataset by generating new data samples from existing ones. This is typically done by applying various transformations to the existing data, such as rotating an image, adding noise to a sound signal, or translating a text sentence. The goal of data augmentation is to create a larger and more diverse dataset, which can be used to train machine learning models that are more robust and generalize better to unseen data. Data augmentation can be particularly useful when the original dataset is small or lacks diversity, as it can help to reduce overfitting and improve the performance of the model on the test set.

To use data augmentation on our data, you will need to first define the augmentation layers. Here is how you can do it.

# Define data augmentation layers (Keras expects (height, width, channels))
data_augmentation = keras.Sequential(
    [
        layers.RandomFlip("horizontal", input_shape=(image_height, image_width, 3)),
        layers.RandomRotation(0.1),
        layers.RandomZoom(0.1),
    ]
)

Then we will load the training and validation data, to which the augmentation layers defined above will be applied during training.

The train_dataset is created by reading images from the train_dir directory, resizing them to the specified image size (in this case, 224x224 pixels), and grouping them into batches. The augmentation layers defined above can then be applied to these batches during training, for example as the first layers of the model. The resulting dataset yields batches of labeled images that can be used to train a machine learning model.

train_dataset = tf.keras.utils.image_dataset_from_directory(
    train_dir,
    shuffle=True,
    batch_size=BATCH_SIZE,
    image_size=(image_height, image_width))

validation_dataset = tf.keras.utils.image_dataset_from_directory(
    val_dir,
    shuffle=True,
    batch_size=BATCH_SIZE,
    image_size=(image_height, image_width))

Implementing the MobilenetV2 network

MobileNetV2 is a convolutional neural network (CNN) model that was trained on the ImageNet dataset and is capable of performing image classification tasks with high accuracy. The Kaggle shoes image dataset contains a large number of images of shoes in various categories, such as sneakers, boots, and sandals. By using MobileNetV2 and the Kaggle shoes image dataset, we can train a model to classify shoes accurately.

Now, we will load the MobileNetV2 model, which has already been trained on the ImageNet dataset.

# Create the base model from the pre-trained MobileNetV2
IMG_SHAPE = (image_height, image_width) + (3,)
base_model = tf.keras.applications.MobileNetV2(input_shape=IMG_SHAPE,
                                               include_top=False,
                                               weights='imagenet')

# Let us see what the feature extractor does to a sample batch of images
image_batch, label_batch = next(iter(train_dataset))
feature_batch = base_model(image_batch)
print(feature_batch.shape)

Next, we need to assemble the full classification model on top of the MobileNetV2 base before we can compile and train it.
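The post does not show the exact model head from the notebook, so here is a minimal sketch of one common way to build it: the augmentation layers in front, the frozen base in the middle, and a small classification head on top (the head shown here is an assumption, not the notebook's exact code):

# Assemble the model: augmentation -> preprocessing -> frozen base -> head
# (a sketch; the head is an assumption, not the notebook's exact code)
base_model.trainable = False
num_classes = len(train_dataset.class_names)

inputs = tf.keras.Input(shape=IMG_SHAPE)
x = data_augmentation(inputs)
x = tf.keras.applications.mobilenet_v2.preprocess_input(x)
x = base_model(x, training=False)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(num_classes, activation='softmax')(x)
model = tf.keras.Model(inputs, outputs)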

Then compile the model using an optimizer and a loss function:

# Compile the model (image_dataset_from_directory yields integer labels,
# so we use the sparse variant of categorical cross-entropy)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

Train the model using the fit method with the training and validation datasets:

# Train the model
history = model.fit(
    train_dataset,
    epochs=10,
    validation_data=validation_dataset)

One way to make the entire model quantization-aware is the quantize_model function from the TensorFlow Model Optimization toolkit:

# Make the whole model quantization-aware
# (requires the tensorflow-model-optimization package)
import tensorflow_model_optimization as tfmot
model = tfmot.quantization.keras.quantize_model(model)
# Save the quantized model
model.save('quantized_model.h5')
[Figure: training results without quantization]

Training the model using post-training quantization

# Convert to a quantized TensorFlow Lite model
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
quantized_tflite_model = converter.convert()
with open('/kaggle/working/trainedshoes.tflite', 'wb') as f:
    f.write(quantized_tflite_model)

In the code above, we create a TFLiteConverter object from the Keras model using `tf.lite.TFLiteConverter.from_keras_model`. We then set the optimization mode of the converter to DEFAULT, which enables post-training quantization; this is achieved by the line converter.optimizations = [tf.lite.Optimize.DEFAULT]. The model is then converted to a TFLite model and saved. Once we have our quantized TensorFlow Lite model, we can deploy it on a mobile device.
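
To sanity-check the converted file before deploying it, you can run it locally with the TFLite Interpreter. Below is a minimal sketch; the dummy input is illustrative:

# Run the quantized TFLite model with the Python Interpreter
import numpy as np

interpreter = tf.lite.Interpreter(model_path='/kaggle/working/trainedshoes.tflite')
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed one dummy image with the expected shape and dtype (illustrative)
dummy = np.zeros(input_details[0]['shape'], dtype=input_details[0]['dtype'])
interpreter.set_tensor(input_details[0]['index'], dummy)
interpreter.invoke()
predictions = interpreter.get_tensor(output_details[0]['index'])
print(predictions.shape)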

Quantization Aware Training

def apply_quantization_to_dense(layer):
    if isinstance(layer, tf.keras.layers.Dense):
        return tfmot.quantization.keras.quantize_annotate_layer(layer)
    return layer

We use the function above to annotate the dense layers of the Keras model for quantization. The function takes a layer as input and returns an annotated layer if it is a Dense layer, leaving all other layers unchanged.

annotated_model = tf.keras.models.clone_model(
    model,
    clone_function=apply_quantization_to_dense,
)

This code creates a new model with the same architecture as the original model, but with the dense layers annotated for quantization.

quant_aware_model = tfmot.quantization.keras.quantize_apply(annotated_model)

Finally, we use the code above to make the model quantization-aware. This modifies the model's layers and adds quantization-aware training (QAT) operations to the model.
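
The quant-aware model then has to be compiled and fine-tuned like any other Keras model. A short sketch, where the number of epochs is an illustrative choice:

# Recompile and fine-tune the quant-aware model
quant_aware_model.compile(optimizer='adam',
                          loss='sparse_categorical_crossentropy',
                          metrics=['accuracy'])
quant_aware_model.fit(train_dataset,
                      epochs=3,  # illustrative; tune for your data
                      validation_data=validation_dataset)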

After training the model with quantization-aware training, I observed a significant improvement in the model's accuracy: it moved from 0.9777 to 0.9848, an improvement of 0.0071.
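
The quant-aware accuracy can be read off with evaluate; a sketch, assuming the fine-tuned model from the previous step is still in memory:

# Evaluate the quant-aware model on the validation set
_, qat_acc = quant_aware_model.evaluate(validation_dataset)
print(f'Quant-aware validation accuracy: {qat_acc:.4f}')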

[Figure: model accuracy after quantization]

Final Words

Generally, post-training quantization can be a useful technique for reducing the size and complexity of a machine learning model, while QAT can be used to improve the performance of a quantized model. However, the specific technique and parameters that you choose will depend on the characteristics of your dataset and the goals of your model. You may need to experiment with different configurations to find the combination that works best for your specific use case.


Denis Muriungi

Denis Muriungi is an entrepreneur and web developer. He also writes articles on technology and science. Make sure you follow him so that you do not miss his content.