Efficient inference optimization and benchmarking of a model using post-training quantization.
Deep learning is becoming increasingly popular and is used in applications such as classification, segmentation, pose estimation, augmented reality, and self-driving cars. The primary goal in deep learning applications is accuracy, which can be achieved with large models, but these complex models cause several issues in real-time applications. Such applications often run on edge devices with limited memory and computational resources, which reduces model inference performance.
Model inference can be improved using optimization techniques such as pruning, clustering, and quantization. Optimized models use memory efficiently and keep computations simple, which results in the following advantages:
- Memory Usage: Using integer or low-bit representations for inputs, weights, activations, and outputs reduces memory usage
- Power Consumption: Fewer memory accesses and simpler computations reduce power consumption significantly
- Latency: Fewer memory accesses and simpler computations also speed up inference
- Silicon Area: Integer or low-bit arithmetic requires less silicon area for computational hardware than floating-point arithmetic
In this project, TensorFlow Keras is used to develop and train a CNN model for classifying a fire and non-fire dataset. The model is optimized using the TensorFlow Model Optimization Toolkit and then converted into TensorFlow Lite format. The performance of the TensorFlow Lite model is benchmarked on an Android device using the TensorFlow Lite benchmark tools. These tools measure several important performance metrics:
- Initialization time
- Inference time of warmup state
- Inference time of steady state
- Memory usage during initialization time
- Overall memory usage
Optimization: Magnitude-based weight pruning
In magnitude-based weight pruning, model weights whose magnitudes fall below a threshold are set to zero during the training process. This introduces sparsity into the model, which helps with model compression. Zero weights can also be skipped during inference, improving latency. In unstructured pruning, individual weight connections are removed from a network by setting them to zero. In structured pruning, groups of weight connections are removed together, such as entire channels or filters. Unfortunately, structured pruning severely limits the maximum achievable sparsity, which limits both performance and memory improvements.
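As a toy illustration of the idea (plain NumPy, not the TensorFlow Model Optimization API used later in this project), magnitude-based pruning amounts to zeroing every weight whose absolute value falls below a threshold chosen to reach a target sparsity:
import numpy as np
# Toy illustration only: zero out the smallest-magnitude half of the weights.
weights = np.random.randn(4, 4).astype(np.float32)
target_sparsity = 0.5
threshold = np.quantile(np.abs(weights), target_sparsity)
mask = np.abs(weights) >= threshold     # keep only the large-magnitude weights
pruned_weights = weights * mask
print("sparsity:", 1.0 - mask.mean())   # roughly 0.5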
Optimization: Clustering
Clustering, also called weight sharing, makes models more memory-efficient by reducing the number of distinct weight values. In this process, the weights of each layer are grouped into clusters; within each cluster, all weights share the same value, known as the centroid value of the cluster.
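As a toy illustration (plain NumPy with uniformly placed centroids rather than the k-means initialization used later in this project), weight sharing snaps every weight to the nearest of a small set of centroid values:
import numpy as np
# Toy illustration only: share 8 centroid values across all weights of a layer.
weights = np.random.randn(256).astype(np.float32)
n_clusters = 8
centroids = np.linspace(weights.min(), weights.max(), n_clusters)
nearest = np.argmin(np.abs(weights[:, None] - centroids[None, :]), axis=1)
clustered_weights = centroids[nearest]
print("unique values before:", np.unique(weights).size)            # 256
print("unique values after :", np.unique(clustered_weights).size)  # at most 8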
Optimization: Quantization
In quantization, the precision of the model weights, activations, inputs, and outputs is decreased by reducing the number of bits representing numerical values. Using lower precision such as FP16 or INT8 instead of FP32 makes the model memory-efficient and helps it execute faster. In this technique, high-precision values are mapped to lower-precision values using quantization-aware training, post-training quantization, or hybrid quantization, which combines both. Quantization is helpful for deploying models on resource-constrained edge devices, as it reduces computational and memory requirements with acceptable accuracy.
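As a small worked example of this mapping (plain NumPy; the real TFLite converter chooses the scale and zero-point per tensor or per channel itself), affine 8-bit quantization maps a float tensor to int8 and back as follows:
import numpy as np
# Toy illustration of affine (asymmetric) int8 quantization of a float tensor.
x = np.random.randn(5).astype(np.float32)
qmin, qmax = -128, 127                        # int8 range
scale = (x.max() - x.min()) / (qmax - qmin)
zero_point = int(round(qmin - x.min() / scale))
q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
x_dequant = (q.astype(np.float32) - zero_point) * scale
print("original :", x)
print("recovered:", x_dequant)                # close to x, within quantization error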
Methodology
You can find the code for this tutorial in the GitHub repository. The base code for model training and the dataset are provided in the blog post at PyImageSearch by Adrian Rosebrock. This project contributes the following:
- Model training is modified for pruning, clustering, and collaborative optimizations.
- Post-training optimization.
- Benchmarking on an Android device.
Step 1: Loading and preprocessing the dataset
The dataset contains images of two classes (NonFire and Fire). The total size of the dataset is 4008 images, and it is split into a training dataset and a validation dataset. The following hyperparameters are used for regular model training; a sketch of the preprocessing steps follows the table.
+---------------------+----------------------------------------------------+
| Hyperparameters | Value |
+---------------------+----------------------------------------------------+
| Dataset access | Google Drive |
| Classes | 2 (nonFire,Fire) |
| Batch Size | 64 |
| No. of Epochs | 50 |
| Train dataset | (3006,128,128,3) |
| Validation dataset | (1002,128,128,3) |
| Preprocessing | Resize, normalization, one-hot encoding, and split |
+---------------------+----------------------------------------------------+
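The exact loading code is in the linked repository; the following is only a minimal sketch of the preprocessing listed above (resize to 128x128, scale to [0, 1], one-hot encode the two classes, and a 75/25 train/validation split). The variable names and helper function here are placeholders, not the repository's actual code.
import numpy as np
import cv2
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical
# 'image_paths' and 'labels' (0 = NonFire, 1 = Fire) are assumed to have been
# gathered from the dataset directories beforehand.
def preprocess(image_paths, labels):
    data = []
    for path in image_paths:
        image = cv2.imread(path)
        image = cv2.resize(image, (128, 128))       # resize to 128x128
        data.append(image)
    data = np.array(data, dtype="float32") / 255.0  # normalize to [0, 1]
    labels = to_categorical(labels, num_classes=2)  # one-hot encode the labels
    # a 75/25 split reproduces the 3006/1002 train/validation sizes above
    return train_test_split(data, labels, test_size=0.25, random_state=42)
# trainX, testX, trainY, testY = preprocess(image_paths, labels)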
Step 2: Model architecture and training
A deep neural network is used for classification. It contains convolutional, max-pooling, and batch-normalization layers. The Adam optimizer is used for training with a binary cross-entropy loss function, and the model is trained for 50 epochs. Four different models are trained with different optimizations. The model architecture is sketched below.
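The exact architecture is defined in the linked repository; the following is only a hedged sketch of such a network, in which the layer counts, filter sizes, and learning rate are assumptions rather than the project's exact values.
import tensorflow as tf
from tensorflow.keras import layers, models
# Hedged sketch of a small CNN with Conv, BatchNorm, and MaxPooling blocks;
# the project's actual layer counts and filter sizes may differ.
def build_model(width=128, height=128, depth=3, classes=2):
    return models.Sequential([
        layers.Input(shape=(height, width, depth)),
        layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
        layers.BatchNormalization(),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
        layers.BatchNormalization(),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(classes, activation="softmax"),
    ])
model = build_model()
opt = tf.keras.optimizers.Adam(learning_rate=1e-3)
model.compile(loss="binary_crossentropy", optimizer=opt, metrics=["accuracy"])
# model.fit(trainX, trainY, validation_data=(testX, testY), batch_size=64, epochs=50)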
Unoptimized Baseline Model:
In this model, training is done with no optimization; it serves as the baseline for comparison with the optimized versions. The model is converted into TensorFlow Lite format using the following code:
converter1 = tf.lite.TFLiteConverter.from_keras_model(model)
baseline_tflite_model = converter1.convert()
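The converted models are also written to .tflite files so that they can later be pushed to the Android device for benchmarking; a minimal sketch (the file name is arbitrary):
# Write the converted model to disk so it can be pushed to the device later.
with open("baseline_model.tflite", "wb") as f:
    f.write(baseline_tflite_model)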
Dynamic range quantization Model:
In this model, training is done without any optimization, and the model is converted with dynamic range quantization during the TensorFlow Lite conversion process.
converter2 = tf.lite.TFLiteConverter.from_keras_model(model)
converter2.optimizations = [tf.lite.Optimize.DEFAULT]
quantized_tflite_model = converter2.convert()
Float16 quantization Model:
In this model, training is done without any optimization, and the model is converted with float16 quantization during the TensorFlow Lite conversion process.
converter3 = tf.lite.TFLiteConverter.from_keras_model(model)
converter3.optimizations = [tf.lite.Optimize.DEFAULT]
converter3.target_spec.supported_types = [tf.float16]
fquantized_tflite_model = converter3.convert()
Clustered and Quantized Model:
In this model, regular training is done first, and then fine-tuning is done with clustering optimization. In the end, the model is converted with dynamic range quantization during the TensorFlow Lite conversion process.
# Fine-tuning after regular training
cluster_weights = tfmot.clustering.keras.cluster_weights
CentroidInitialization = tfmot.clustering.keras.CentroidInitialization
clustering_params = {
'number_of_clusters': 8,
'cluster_centroids_init': CentroidInitialization.KMEANS_PLUS_PLUS,
'cluster_per_channel': True,
}
clustered_model = cluster_weights(model, **clustering_params)
# Use smaller learning rate for fine-tuning
opt = tf.keras.optimizers.Adam(learning_rate=1e-5)
clustered_model.compile(loss="binary_crossentropy", optimizer=opt,
metrics=["accuracy"])
clustered_model.summary()
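The fine-tuning pass itself is a short model.fit run on the clustered model; a minimal sketch, assuming the same training data as the regular training and a small epoch count (see the repository for the exact values):
# Fine-tune the clustered model for a few epochs with the small learning rate.
# 'trainX', 'trainY', 'testX', 'testY' and the epoch count are assumptions here.
clustered_model.fit(trainX, trainY,
                    validation_data=(testX, testY),
                    batch_size=64,
                    epochs=3)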
After fine-tuning: After fine-tuning, we need to strip the clustering tf.Variables that are needed only during clustering; otherwise they will increase the model size.
stripped_clustered_model = tfmot.clustering.keras.strip_clustering(clustered_model)
After clustering, we will apply post-training quantization to get a quantized model.
converter4 = tf.lite.TFLiteConverter.from_keras_model(stripped_clustered_model)
converter4.optimizations = [tf.lite.Optimize.DEFAULT]
quantized_tflite_model = converter4.convert()
Pruned and Quantized Model:
In this model, regular training is done first, and then fine-tuning is done with pruning optimization. In the end, the model is converted with dynamic range quantization during the TensorFlow Lite conversion process. After training, the fine-tuning stage:
prune_low_magnitude = tfmot.sparsity.keras.prune_low_magnitude
pruning_params = {
'pruning_schedule': tfmot.sparsity.keras.ConstantSparsity(0.5, begin_step=0, frequency=100)
}
callbacks = [
tfmot.sparsity.keras.UpdatePruningStep()
]
pruned_model = prune_low_magnitude(model, **pruning_params)
# Use smaller learning rate for fine-tuning
opt = tf.keras.optimizers.Adam(learning_rate=1e-5)
pruned_model.compile(loss="binary_crossentropy", optimizer=opt,
metrics=["accuracy"])
After fine-tuning, we need to strip the pruning tf.Variables that are needed only during pruning; otherwise they will increase the model size.
stripped_pruned_model = tfmot.sparsity.keras.strip_pruning(pruned_model)
After pruning, we again apply post-training quantization to get a quantized model.
converter5 = tf.lite.TFLiteConverter.from_keras_model(stripped_pruned_model)
converter5.optimizations = [tf.lite.Optimize.DEFAULT]
pqat_tflite_model = converter5.convert()
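Before benchmarking on the device, it is worth sanity-checking that a quantized model still classifies correctly. A minimal sketch using the TFLite Interpreter on a single validation image ('testX' is an assumed variable holding the preprocessed validation images):
import numpy as np
import tensorflow as tf
# Run one validation image through the converted model with the TFLite interpreter.
interpreter = tf.lite.Interpreter(model_content=pqat_tflite_model)
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()[0]
sample = np.expand_dims(testX[0], axis=0).astype(input_details["dtype"])
interpreter.set_tensor(input_details["index"], sample)
interpreter.invoke()
prediction = interpreter.get_tensor(output_details["index"])
print("predicted class:", np.argmax(prediction))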
Step 3: Benchmark on an Android Device
The Android Debug Bridge (ADB) can be installed on Windows and Linux. I have found these videos very useful for setting up ADB (Link) and getting familiar with it (Link). Once ADB is set up, the mobile device can be controlled from the computer with ADB commands over USB or Wi-Fi. Remember that the Android device should be in USB debugging mode or wireless debugging mode for a USB or Wi-Fi connection, respectively. An Android emulator can also be used to benchmark the models; an emulator is available through Android Studio. (Android Studio contains both an emulator and ADB.)
adb devices
will list all Android devices/emulators connected to the computer. If it does not work, or if the device shows as offline, remove the ADB connection using
adb kill-server
and try again with
adb devices
Once the Android device's connection with the computer is established via ADB, the benchmark files and target TFLite models are sent to the Android device using ADB commands. There are two options for using the benchmark tool on Android.
- Native benchmark binary
- Android benchmark app, a better tool to measure how the model would perform in the app.
Both benchmark files are available for android_aarch64 and android_arm at Link. To find your Android device's architecture and system details, you can download a hardware-info app (for example, Droid Hardware Info) or a similar app from the Play Store.
Android benchmark app
Once the Android benchmark app (APK) is downloaded, keep the TFLite models and the APK in the same folder, open a terminal in that folder, and run the following commands:
adb devices
adb install -r -d -g android_aarch64_benchmark_model.apk # for android_aarch64 benchmarking file
adb push mobilenet.tflite /data/local/tmp # for model file
adb shell am start -S -n org.tensorflow.lite.benchmark/.BenchmarkModelActivity --es args '"--graph=/data/local/tmp/mobilenet.tflite --num_threads=4"'
adb logcat | findstr "Inference timings"
Native benchmark binary
Once the native benchmark binary is downloaded, keep the TFLite models and the binary in the same folder, open a terminal in that folder, and run the following commands:
adb push android_aarch64_benchmark_model /data/local/tmp # for android_aarch64 benchmarking file
adb shell chmod +x /data/local/tmp/android_aarch64_benchmark_model
adb push mobilenet.tflite /data/local/tmp # for model file
adb shell /data/local/tmp/android_aarch64_benchmark_model --graph=/data/local/tmp/mobilenet.tflite --num_threads=4
graph is a required parameter.
- graph: string. The path to the TFLite model file.
You can specify more optional parameters for running the benchmark:
- num_threads: int (default=1). The number of threads to use for running the TFLite interpreter. A thread is a virtual component that handles the tasks of a CPU core. Multithreading is the ability of the CPU to divide the work among multiple threads instead of giving it all to a single core, enabling concurrent processing; the threads are processed by different CPU cores in parallel to speed up performance and save time.
- use_gpu: bool (default=false). Enable GPU-accelerated execution of your model using a delegate. Delegates act as hardware drivers for TensorFlow Lite, allowing your model to run on GPU processors.
- use_nnapi: bool (default=false). Use the Android Neural Networks API (NNAPI) delegate. It provides acceleration for TensorFlow Lite models on Android devices with supported hardware accelerators, including the Graphics Processing Unit (GPU), Digital Signal Processor (DSP), and Neural Processing Unit (NPU).
- use_xnnpack: bool (default=false). Use the XNNPACK delegate. XNNPACK is a highly optimized library of neural network inference operators for ARM, x86, and WebAssembly architectures on Android, iOS, Windows, Linux, macOS, and Emscripten environments.
- use_hexagon: bool (default=false). Use the Hexagon delegate. This delegate leverages the Qualcomm Hexagon library to execute quantized kernels on the DSP. Note that the delegate is intended to complement NNAPI functionality, particularly for devices where NNAPI DSP acceleration is unavailable.
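For example, to run the native binary with four threads and the GPU delegate enabled (assuming the binary and model were pushed as shown above):
adb shell /data/local/tmp/android_aarch64_benchmark_model --graph=/data/local/tmp/mobilenet.tflite --num_threads=4 --use_gpu=true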
Result
The Android device used for this project is a Xiaomi Mi A2 with an octa-core processor and an Adreno 512 GPU. During benchmarking, 4 CPU threads are used. Runtime memory and model size are in MB, while inference time is the average time in milliseconds.
+--------------------------+------------+---------------------+---------------------+----------------------+
| Optim. Technique | Size (MB) | InferTime_CPU (ms) | InferTime_GPU (ms) | InferTime_NNAPI (ms) |
+--------------------------+------------+---------------------+---------------------+----------------------+
| Base_Model | 8.5 | 6.72 | 5.17 | 9.38 |
| Dynamic Range Quantized | 2.14 | 8.37 | 5.07 | 8.30 |
| Float16 Quantized | 4.26 | 7.56 | 5.17 | 6.55 |
| Clustered and Quantized | 2.14 | 8.89 | 5.21 | 8.49 |
| Pruned and Quantized | 2.14 | 8.03 | 5.05 | 7.48 |
+--------------------------+------------+---------------------+---------------------+----------------------+
Conclusion
In this project, different optimized models have been compared on an Android device. Dynamic range quantization performs remarkably well among these optimized models. This project can be extended to different datasets, models, and hardware to evaluate the performance of these optimization techniques.
References
- Pyimagesearch Website: Fire and smoke detection with Keras and deep learning link: https://www.pyimagesearch.com/2019/11/18/fire-and-smoke-detection-with-keras-and-deep-learning/
- YouTube Website: tinyML Talks: A Practical guide to neural network quantization link: https://www.youtube.com/watch?v=KASuxB3XoYQ
- Medium Website: Neural Network Compression using Quantization link: https://medium.com/sharechat-techbyte/neural-network-compression-using-quantization-328d22e8855d
- Tensorflow Website: Performance Measurement link: https://www.tensorflow.org/lite/performance/measurement
- Tensorflow Website: Post Training Quantization link: https://www.tensorflow.org/lite/performance/post_training_quantization#dynamic_range_quantization
- Tensorflow Website: Model Optimization Pruning link: https://www.tensorflow.org/model_optimization/guide/pruning
- Tensorflow Website: Weight clustering link: https://blog.tensorflow.org/2020/08/tensorflow-model-optimization-toolkit-weight-clustering-api.html
- Tensorflow Website: pruning link: https://blog.tensorflow.org/2019/05/tf-model-optimization-toolkit-pruning-API.html
- Tensorflow Website: Quantization-aware training optimization link: https://blog.tensorflow.org/2020/04/quantization-aware-training-with-tensorflow-model-optimization-toolkit.html
- Icon reference on website: https://www.stickpng.com/img/bots-and-robots/android-robot-sideview-character