
Model Quantization Using TensorFlow Lite

Deployment of deep learning models on mobile devices.

Battery and memory are the two most important resources on mobile, edge or IoT devices, and they are available in far smaller amounts than in the cloud or on in-house servers. Directly deploying a trained deep learning (DL) model therefore does not work, as such models are resource hungry. Any DL-based inference on those devices has to take the following characteristics of the model into account: (i) it needs to be small to save memory; (ii) it needs to consume as little energy as possible to preserve battery life; and (iii) it needs to have low latency, i.e., fast inference, so that the user feels the model is reacting instantaneously. In short, the model must be efficient in memory, energy and processor usage before it can be deployed on mobile and edge devices.

In this article, we will go through TensorFlow Lite (an open-source DL framework for on-device inference) and discuss one of its main optimization methods, called quantization. Quantization reduces the model size and also makes models able to run on such devices. We will look in detail at one way of doing quantization after the model is trained, called post-training quantization.

What is TensorFlow Lite (TFLite)?

To meet all the criteria mentioned above, Google provides an on-device inference engine called TensorFlow Lite (TFLite). TFLite targets mobile, edge and IoT devices in particular, optimizing for speed, model size and power consumption. It also supports hardware-accelerated inference via delegates, which hand computation over to native acceleration libraries through their APIs. On Android devices, for example, acceleration is available through the GPU delegate or the Android Neural Networks API (NNAPI) delegate, while on iOS the GPU delegate uses the Metal API. Fig1. shows the internal architecture of TFLite, which consists of two main components:

  • Interpreter: TFLite comes with a very light interpreter (less than 300 KB in size) which runs the optimized models on various devices. The interpreter ships with multi-language APIs: on Android, TFLite inference can be performed with either Java or C++ APIs; on iOS there are APIs written in Swift and Objective-C; and on Linux platforms (including Raspberry Pi), TFLite APIs are available in C++ and Python (a minimal Python inference sketch follows Fig1. below).
  • Converter: converts a TensorFlow or Keras model (.pb or .h5) into a TFLite model (.tflite) which can be deployed directly on those devices and then run by the interpreter. The converted model is stored in an efficient file format based on FlatBuffers. In addition to creating the FlatBuffer, the TFLite converter can also apply optimizations to the model which reduce its size, its inference time or both, possibly at the cost of a small reduction in accuracy. This kind of optimization is called quantization, and it is the topic of this article.
Fig1. TensorFlow Lite internal architecture. Image courtesy: https://www.tensorflow.org/lite/convert
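
To make the interpreter's role concrete, here is a minimal Python sketch of running inference with an already converted model. It is only an illustration: the file name model.tflite and the random input are placeholders, not code from the original article.

import numpy as np
import tensorflow as tf

# Load the converted FlatBuffer model and allocate its tensors.
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed a dummy input with the expected shape and dtype.
dummy_input = np.random.random_sample(
    tuple(input_details[0]["shape"])).astype(input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], dummy_input)

# Run inference and read back the output tensor.
interpreter.invoke()
predictions = interpreter.get_tensor(output_details[0]["index"])
print(predictions.shape)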

What is model quantization and why do we need it?

One of the most useful optimizations that TFLite offers is called quantization. Quantization means representing numbers with fewer bits than their original higher-precision representation. For example, a real number stored as a 32-bit (4-byte) floating-point value can be represented by a discrete 8-bit (1-byte) integer. In deep learning, the weights and biases (the parameters of your neural network) are stored as 32-bit floating-point numbers so that high-precision calculations can happen during training. Once the model is trained, these parameters can be reduced to 16-bit floating point (a 2x reduction in size) or to 8-bit integers (a 4x reduction in size). Of course, this may come with a tradeoff in the accuracy of the model’s predictions. However, it has been shown empirically in many situations that a quantized model suffers little or no loss in accuracy, especially when using 16-bit floating point, which already halves the model’s size (see Fig 4). Similarly, we can also quantize the activation values; however, this requires a calibration step to determine scaling parameters from a representative dataset, as explained in the “Full quantization” sub-section. Figures 2, 3 and 4 show the effect of full quantization (weights and activation values to 8-bit INT) on model size, latency and accuracy, respectively. The experiments were run on a Google Pixel 2 (Android) phone.
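
As a rough illustration of the underlying idea (this is not TFLite's internal code, and the example weights are made up), the commonly used affine mapping between floats and signed 8-bit integers can be sketched like this:

import numpy as np

def quantize(values, num_bits=8):
    # Affine scheme: real_value ~ scale * (quantized_value - zero_point).
    qmin, qmax = -2 ** (num_bits - 1), 2 ** (num_bits - 1) - 1  # -128, 127
    min_val, max_val = float(values.min()), float(values.max())
    scale = (max_val - min_val) / (qmax - qmin)
    zero_point = int(round(qmin - min_val / scale))
    quantized = np.clip(np.round(values / scale) + zero_point, qmin, qmax)
    return quantized.astype(np.int8), scale, zero_point

weights = np.array([-0.75, -0.1, 0.0, 0.3, 1.2], dtype=np.float32)
q, scale, zero_point = quantize(weights)
# Dequantizing gives values close to, but not exactly, the originals.
dequantized = scale * (q.astype(np.float32) - zero_point)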

Fig2. Model size comparison. Image courtesy: https://blog.tensorflow.org/2018/09/introducing-model-optimization-toolkit.html
Fig3. Latency time comparison. Image courtesy: https://blog.tensorflow.org/2018/09/introducing-model-optimization-toolkit.html
Fig4. Accuracy comparison. Image courtesy: https://blog.tensorflow.org/2018/09/introducing-model-optimization-toolkit.html

What are the quantization possibilities?

TFLite provides two options for model quantization:

(i) post-training quantization: quantizes the parameters after the model has been trained. This is the main focus of this article and is presented in more detail below.

(ii) quantization-aware training: quantizes the model during training. It requires modifying the network before the initial training (by inserting fake quantization nodes) and learns the 8-bit weights through training rather than converting them afterwards. At the moment it is only available for a subset of CNN architectures. This is not the focus of this article.
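
Although quantization-aware training is out of scope here, for completeness the following is a minimal sketch of how it can be set up with the tensorflow-model-optimization package; the toy model and the commented-out training call are placeholders.

import tensorflow as tf
import tensorflow_model_optimization as tfmot

# A small placeholder Keras model.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Wrap the model with fake-quantization nodes so it learns
# quantization-friendly weights during training.
q_aware_model = tfmot.quantization.keras.quantize_model(model)
q_aware_model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
# q_aware_model.fit(train_images, train_labels, epochs=1)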

Post-training quantization

Post-training quantization does not require any modifications to the network, so you can convert a previously trained network into a quantized model, for example from 32-bit FP to 16-bit FP or 8-bit INT. Post-training quantization can be further divided into three variants: (i) no quantization, which is simply a conversion of the TensorFlow model to the “.tflite” file format; (ii) quantization of the weights only, also called hybrid quantization; and (iii) quantization of the activations along with the weights, called full quantization. For all of the methods below it is important to check the accuracy of the quantized model and ensure that the degradation is acceptable.

Depending on the hardware chosen for inference on the mobile or edge device, there are different kinds of quantization you might want to apply:

  • for running a model on a CPU (e.g., ARM processors), 8-bit INT or no quantization (32-bit FP) works. Models converted to 16-bit FP can still run on the CPU, but the float16 weights are dequantized to float32 before inference.
  • for a GPU (e.g., ARM Mali, Qualcomm Adreno etc.), reduced 16-bit FP is a good choice, because GPUs can compute with both 16-bit and 32-bit FP, which means quantization is not a requirement at all. The TFLite GPU delegate needs to be configured to execute float16 data.
  • for inference on an Edge TPU (e.g., Google Coral), full integer quantization (8-bit INT) of both weights and activations is a requirement. Note that an Edge TPU is not the same as the Cloud TPUs used for model training.
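
A practical way to see which precision a converted model actually ended up with, and therefore which of the hardware targets above it suits, is to inspect its tensor types. This is just a quick sanity-check sketch; model.tflite is a placeholder file name.

import collections
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="model.tflite")

# Count the data types of all tensors in the model; a fully 8-bit
# quantized model should be dominated by int8 (plus int32 biases).
dtype_counts = collections.Counter(
    detail["dtype"].__name__ for detail in interpreter.get_tensor_details())
print(dtype_counts)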

The following figure shows a list of possibilities to consider when dealing with post-training quantization.

Fig5. Deciding which post-training quantization strategy to choose. Image courtesy: https://www.tensorflow.org/lite/performance/post_training_quantization

(i) No quantization: Here the trained model is not quantized at all; the TensorFlow model (.pb or .h5) is simply converted into a TFLite (.tflite) file. Conversion to TFLite is necessary in any case, because only then can the model be deployed on a device and run by the interpreter. Quantization might not be necessary if, for example, you are using a very lightweight custom network or a pre-trained MobileNet, which is relatively small. Remember that quantization may come at the cost of reduced accuracy, so if you cannot tolerate any degradation it is fine to simply convert your model to a TFLite model, which can then be imported and interpreted by the TFLite interpreter running on the device itself. Note that the TFLite model will still be a little smaller than the original TensorFlow model, because TFLite models are stored as FlatBuffers. Check the difference between Protocol Buffers and FlatBuffers here.

Fig6. Conversion of TF.Keras model to TFLite model without quantization
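
As a rough equivalent of what Fig6. illustrates (the exact code from the figure is not reproduced here), a plain conversion of a tf.keras model without any optimization might look like this; model.h5 and model.tflite are placeholder file names.

import tensorflow as tf

# Load a previously trained Keras model (placeholder path).
model = tf.keras.models.load_model("model.h5")

# Plain conversion: no optimizations, the weights stay in 32-bit FP.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)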

(ii) Weights/hybrid quantization: Here only the weights of the trained model are quantized, either to 16-bit FP or to 8-bit INT, reducing the model size 2x or 4x, respectively. In the case of 16-bit FP quantized weights, they are converted back (dequantized) to 32-bit FP when run on a CPU, since current-generation CPUs do not support native 16-bit FP arithmetic. However, 16-bit FP can be a good choice when using a GPU, since it can operate on float16 data directly. In the case of 8-bit INT quantized weights, some operators (called hybrid operators) that can also work with integer data dynamically quantize the activation values to 8-bit INT and perform the computation with 8-bit weights and activations. This provides latencies close to fully fixed-point inference (i.e., full integer quantization). However, the outputs are dequantized (converted back) to float precision, so this gives a speedup over pure floating-point computation while the run-time memory footprint of the activations stays the same. Hybrid operators exist for the most compute-intensive operations in a network, e.g., fully_connected, conv2d, etc.

Fig7: Conversion of weights to 8-bit INT.
Fig8: Conversion of weights to 16-bit FP.
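
In the spirit of Fig7 and Fig8 (again a sketch rather than the article's original code), the two weight-only variants can be expressed roughly as follows; model is assumed to be a loaded tf.keras model.

import tensorflow as tf

# (a) 8-bit INT weights: dynamic-range / hybrid quantization.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_hybrid_model = converter.convert()

# (b) 16-bit FP weights: float16 quantization, a good fit for GPU delegates.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_fp16_model = converter.convert()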

(iii) Full quantization: Here we fully quantize the trained model, i.e., both the weights and the activation values are quantized. Quantizing the weights is quite straightforward, as described above; quantizing the activation values is not. What does quantizing activation values mean? The output of an activation function is mapped to, for example, signed 8-bit INT values in the range [-128, +127]. This conversion of activations from floating point (32-bit FP) to integers requires a calibration step to determine the scaling parameters. These parameters are computed by running several examples from a representative dataset (e.g., a test or validation dataset) through the model: TFLite records the MIN and MAX values of the activations observed on this dataset and uses them to map any floating-point activation to [-128, +127]. The activations are then quantized on the fly at inference time using these fixed ranges.

It is important to note that many mobile or embedded devices do not support floating-point operations, since most of the time they either lack Floating-Point Units (FPUs) or have them disabled to save power. Therefore, full-integer quantization (i.e., both weights and activations converted to 8-bit INT) is usually a requirement. The integers not only require less memory, but integer arithmetic also runs very fast, hence providing low latency. For example, Google’s Coral Edge TPU supports only TFLite models that are fully 8-bit quantized; floating-point operations are not supported, and models containing them are not compatible. Note also that the Edge TPU does not support every operator from TFLite’s list of compatible operators; a list of supported operators can be found here.

On the other hand, in cases where integer-based arithmetic is not supported, the 8-bit quantized model will fall back to 32-bit floating-point execution where possible.

Fig9: Representative data generator
Fig10: Conversion of both weights and activations to 8-bit INT.
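
Along the lines of Fig9 and Fig10, a full-integer conversion with a calibration generator can be sketched as follows; model and calibration_images (a small set of representative inputs) are assumed to already exist.

import tensorflow as tf

def representative_data_gen():
    # Yield a few hundred representative samples, one at a time,
    # so TFLite can calibrate the activation ranges (MIN/MAX).
    for image in calibration_images[:100]:
        yield [tf.expand_dims(tf.cast(image, tf.float32), axis=0)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen

# Enforce full 8-bit integer quantization (required e.g. for the Edge TPU);
# conversion fails if an operator has no integer implementation.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_full_int_model = converter.convert()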

What mobile-friendly deep learning models are available?

The following are some out-of-the-box pre-trained models that are fully supported (i.e., with full-integer quantization) by TFLite. For image classification: MobileNet, MobileNet V2, ResNet-50 and Inception-V3; for object detection: MobileNet V1 or V2 with SSD; and for semantic segmentation: DeepLab V1.

Conclusion

We have gone through TensorFlow Lite (TFLite) and one of its most important model-optimization techniques, quantization, looking at the “post-training quantization” approach in detail.

It is always important to keep the following in mind before performing any kind of quantization: (i) the target device specification (which kind of mobile or edge device you want to run inference on), (ii) which kind of arithmetic (FP32, FP16 or INT8) the hardware (CPU, GPU or TPU) supports and (iii) which operations are supported by TensorFlow Lite.

A GitHub link to the code is provided below; feel free to play around with it.

This article was written for Sclable’s blog on Medium.
If you liked it, give it a clap and share if you ❤️

GitHub link with an example: https://github.com/sanchit88/tf_model_quant
