TensorFlow Model Optimization Toolkit — float16 quantization halves model size

TensorFlow
Aug 5

We are very excited to add post-training float16 quantization to the Model Optimization Toolkit, a suite of tools that also includes hybrid quantization, full integer quantization, and pruning. Check out what else is on the roadmap.

Post-training float16 quantization reduces TensorFlow Lite model sizes (up to 50%), while sacrificing very little accuracy. It quantizes model constants (like weights and bias values) from full precision floating point (32-bit) to a reduced precision floating point data type (IEEE FP16).
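To make the storage change concrete, here is a minimal sketch using only the Python standard library. It is not how the TensorFlow Lite converter is implemented; it only shows what happens to a handful of constants when they are stored as IEEE FP16 instead of 32-bit floats. The weight values and helper names are hypothetical.

```python
import struct

def to_float32_bytes(values):
    """Pack floats as little-endian IEEE 754 single precision (4 bytes each)."""
    values = list(values)
    return struct.pack(f"<{len(values)}f", *values)

def to_float16_bytes(values):
    """Pack floats as little-endian IEEE 754 half precision (2 bytes each)."""
    values = list(values)
    return struct.pack(f"<{len(values)}e", *values)

def from_float16_bytes(buf):
    """Unpack half-precision bytes back into Python floats."""
    return list(struct.unpack(f"<{len(buf) // 2}e", buf))

# Hypothetical weights, chosen only for illustration.
weights = [0.1, -0.25, 0.333333, 1.5, -0.007]

fp32_bytes = to_float32_bytes(weights)
fp16_bytes = to_float16_bytes(weights)
restored = from_float16_bytes(fp16_bytes)

# Largest absolute difference introduced by rounding to half precision.
max_abs_error = max(abs(w - r) for w, r in zip(weights, restored))

print("float32 storage:", len(fp32_bytes), "bytes")
print("float16 storage:", len(fp16_bytes), "bytes")
print("max rounding error:", max_abs_error)
```

The half-precision buffer is exactly half the size of the single-precision one, and the rounding error stays small for values in the typical range of trained weights. Values whose magnitude exceeds the FP16 range (about 65504) cannot be packed this way.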

Post-training float16 quantization is a good place to start quantizing your TensorFlow Lite models because of its minimal impact on accuracy and significant decrease in model size. You can check out our documentation here (including a new flowchart!) to help walk you through the different quantization options and scenarios.

Benefits of reduced precision

There are multiple benefits to reduced precision, especially when deploying to the edge:

  • 2x reduction in model size. All constant values in the model are stored in 16-bit floats instead of 32-bit floats. Since these constant values typically dominate the overall model size, this usually reduces the size of the model by about half.
  • Negligible accuracy loss. Deep learning models are frequently able to produce good results on inference while using fewer bits of precision than they were originally trained with. In our experimentation across several models we found little loss in inference quality. For example, we measured a <0.03% reduction in Top 1 accuracy for MobileNet V2 (see results below).
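The "about half" figure in the first benefit follows from simple arithmetic: only the constants shrink, so the saving depends on how much of the file they occupy. The sketch below is a back-of-the-envelope estimate under the assumption that everything other than the float32 constants keeps its size; the function name and the 16 MB model are hypothetical.

```python
def estimated_fp16_size(total_bytes, constant_fraction):
    """Estimate model size after float32 constants are stored as float16.

    constant_fraction is the share of the file occupied by float32 constants.
    The rest of the file (graph structure, metadata) is assumed unchanged.
    """
    if not 0.0 <= constant_fraction <= 1.0:
        raise ValueError("constant_fraction must be between 0 and 1")
    constant_bytes = total_bytes * constant_fraction
    other_bytes = total_bytes - constant_bytes
    # Each 4-byte constant becomes a 2-byte constant.
    return other_bytes + constant_bytes / 2

# Hypothetical 16 MB model with different shares of constant data.
for fraction in (0.5, 0.9, 0.99):
    new_size = estimated_fp16_size(16_000_000, fraction)
    print(f"constants = {fraction:.0%} of file -> {new_size / 16_000_000:.1%} of original size")
```

The closer the constants come to making up the whole file, the closer the result gets to the 50% upper bound.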

2x reduction in size, negligible accuracy tradeoff

Post-training float16 quantization has minimal impact on accuracy and results in a ~2x reduction in size for deep learning models. For example, here are some results for MobileNet V1 and V2 models and a MobileNet SSD model. The accuracy results for MobileNet V1 and V2 are based on the ImageNet image recognition task. The SSD model was evaluated on the COCO object detection task.

bar graph of model size

Model accuracy

The standard MobileNet float32 models (and fp16 variants) were evaluated on the ILSVRC 2012 image classification task. The MobileNet SSD float32 model and its fp16 variant were evaluated on the COCO object detection task.

Tables of ImageNet accuracy and COCO object detection task

How to enable post-training float16 quantization

You can specify post-training float16 quantization on the TensorFlow Lite converter by taking your trained float32 model, setting the optimization to DEFAULT, and setting the supported types of the target spec to the float16 constant.
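The three steps above can be sketched in Python as follows. This is a minimal sketch rather than the snippet originally published with this post: it assumes a TensorFlow 2.x installation, where the float16 constant is spelled tf.float16, and the function name convert_to_float16 and its parameters are hypothetical.

```python
from pathlib import Path

def convert_to_float16(saved_model_dir, output_path):
    """Convert a float32 SavedModel to a TensorFlow Lite model with float16 constants.

    Returns the size of the converted model in bytes.
    """
    # Imported here so the module can be loaded without TensorFlow installed;
    # the import only happens when a conversion is actually requested.
    import tensorflow as tf

    # Step 1: start from the trained float32 model.
    converter = tf.lite.TFLiteConverter.from_saved_model(str(saved_model_dir))
    # Step 2: set the optimization to DEFAULT.
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    # Step 3: set the supported types of the target spec to float16.
    converter.target_spec.supported_types = [tf.float16]

    tflite_model = converter.convert()  # serialized model as bytes
    Path(output_path).write_bytes(tflite_model)
    return len(tflite_model)
```

Calling convert_to_float16 with the directory of your own SavedModel and an output file name of your choice writes the converted model to disk. Comparing the returned size against a conversion without the two quantization settings is a quick way to check the size reduction on your model.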

Once your model is converted, you can run it directly, just like any other TensorFlow Lite model. By default, the model runs on the CPU by “upsampling” the 16-bit parameters to 32 bits and then performing operations in standard 32-bit floating point arithmetic. Over time, we expect to see more hardware support for accelerated fp16 calculations, allowing us to drop the upsampling to float32 and compute directly in these half-precision values.
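The order of operations on the CPU (expand the stored 16-bit parameters first, then do the arithmetic in a wider type) can be illustrated with a toy example. This is only a conceptual sketch in standard-library Python, not the TensorFlow Lite kernel code; the dense function and its values are hypothetical. Note that Python floats are 64-bit, so the arithmetic here is wider than the float32 that TensorFlow Lite uses.

```python
import struct

def pack_fp16(values):
    """Store floats as little-endian IEEE 754 half precision (2 bytes each)."""
    values = list(values)
    return struct.pack(f"<{len(values)}e", *values)

def dequantize_fp16(buf):
    """Expand stored half-precision values into ordinary Python floats.

    Every float16 value is exactly representable in a wider float type,
    so this step loses no further precision.
    """
    return list(struct.unpack(f"<{len(buf) // 2}e", buf))

def dense(inputs, weights_fp16, bias_fp16):
    """Toy single-output dense layer with parameters stored as float16."""
    # Step 1: upcast the stored parameters.
    weights = dequantize_fp16(weights_fp16)
    (bias,) = dequantize_fp16(bias_fp16)
    # Step 2: compute in the wider type.
    return sum(x * w for x, w in zip(inputs, weights)) + bias

# Hypothetical parameters, all exactly representable in float16.
stored_weights = pack_fp16([0.5, -0.25, 2.0])
stored_bias = pack_fp16([1.0])

result = dense([1.0, 2.0, 3.0], stored_weights, stored_bias)
print(result)  # 0.5 - 0.5 + 6.0 + 1.0 = 7.0
```

Only the stored parameters are half precision in this scheme; inputs, intermediate values, and outputs keep the wider type.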

You can also run your model on the GPU. We’ve enhanced the TensorFlow Lite GPU delegate to take in the reduced precision parameters and run with them directly (instead of converting to float32 as is done on the CPU). In your app, you create the GPU delegate via the TfLiteGpuDelegateCreate function (documentation). When specifying the options for the delegate, be sure to set precision_loss_allowed to 1 to use float16 operations on the GPU.

For an overview of the GPU delegate, see our previous post. Check out a working example of using float16 quantization in this colab tutorial.

We encourage you to give this a try right away and give us your feedback. Share your use case directly or on Twitter as #TFLite and #PoweredByTF.

Acknowledgements:

T.J. Alumbaugh, Andrei Kulik, Juhyun Lee, Jared Duke, Raziel Alvarez, Sachin Joglekar, Jian Li, Yunlu Li, Suharsh Sivakumar, Nupur Garg, Lawrence Chan, Andrew Selle
