Step into On-device Inference with TensorFlow Lite

0xAIT Research Team · Published in the-ai.team · Sep 1, 2021
TensorFlow Lite (https://www.tensorflow.org/lite)

On-device AI is a fast-growing technology that allows machine learning models to run directly on a device rather than in the cloud. This brings low latency, high reliability, and stronger privacy. Apple’s “Hey Siri” and Face ID features, and the “Now Playing” music recognition feature on Google Pixel smartphones, are some of the best-known examples of on-device inference. TensorFlow Lite is an open-source framework designed for on-device inference, and it is the focus of this article.

Low-end smartphones’ nightmare :(

Cloud-based vs On-device

Processing power, available resources, and energy consumption need to be considered when running inference on mobile devices. Recent research shows a notable difference in resource and energy consumption between cloud-based and on-device inference in mobile applications (Guo, 2018). Companies such as Samsung and Qualcomm are pushing hardware performance to meet the demands of on-device AI.

Table 1. Cloud-based and on-device deep inference — Comparison (Guo, 2018)

Process Flow

How can we use TensorFlow Lite? The development workflow is shown in Figure 1.

Figure 1. TensorFlow Lite Development Workflow (self-composed)

First, you have to choose a model. You can either use a custom model you have developed yourself or pick a pre-trained model from the TensorFlow website. The TensorFlow Lite Interpreter is the component that runs your model on the device. To be used with the interpreter, the model needs to be in a suitable format: an optimized FlatBuffer format with the .tflite file extension.
So, how do we convert our model to .tflite? This is where the TensorFlow Lite Converter comes in. It converts a model developed with TensorFlow into a .tflite file. If you are using a pre-trained model from the TensorFlow website, no conversion is needed; those models are already in .tflite format. 😁
After these steps, you can deploy the converted model in your application. Furthermore, if you are not satisfied with your model’s efficiency, or if the converted model is too large, you can use the TensorFlow Model Optimization Toolkit to address those issues.
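To make the workflow concrete, here is a minimal sketch in Python (not an official quick-start). It assumes you already have a trained Keras model in a variable called `model`; the file name `model.tflite` is arbitrary.

```python
import numpy as np
import tensorflow as tf

# Convert a trained Keras model to the TensorFlow Lite FlatBuffer format.
# `model` is assumed to be an existing tf.keras.Model.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)

# Run inference with the TensorFlow Lite Interpreter.
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Dummy input with the shape and dtype the model expects.
input_data = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], input_data)
interpreter.invoke()

output = interpreter.get_tensor(output_details[0]["index"])
print(output.shape)
```

On a mobile device the same .tflite file is loaded by the platform-specific interpreter (Java/Kotlin, Swift, or C++), but the conversion step above stays the same.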

Model Optimization

As discussed previously, on-device inference consumes more device resources than the cloud-based approach, while mobile and edge devices usually come with limited resources. Model optimization helps overcome this issue, although it may cause a small drop in accuracy.

Reduce Model Size

Quantization can be used to reduce the size of a model. Figure 2 shows a comparison, from recent research (Dokic, Martinovic and Mandusic, 2020), of the sizes of the basic and quantized versions of a fully connected neural network with two hidden layers; a code sketch follows the figure. The authors used TensorFlow Lite for Microcontrollers in their study.

Figure 2. Size comparison of basic and quantised model (Dokic, Martinovic and Mandusic, 2020)
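As a rough sketch of how such a size reduction is obtained (this is not the exact setup from the paper), post-training dynamic-range quantization can be enabled on the TensorFlow Lite Converter; `model` is again assumed to be a trained Keras model.

```python
import tensorflow as tf

# Baseline conversion without optimization.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
baseline = converter.convert()

# Post-training dynamic-range quantization: weights are stored as 8-bit
# integers, which typically shrinks the model to roughly a quarter of its size.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
quantized = converter.convert()

print("baseline :", len(baseline), "bytes")
print("quantized:", len(quantized), "bytes")
```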

Reduce Inference Latency

Quantization can also be used to reduce inference latency by simplifying the underlying computations, though this may again cause a minor drop in accuracy.
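For the largest latency gains on CPUs and integer-only accelerators, full-integer quantization can be used. Below is a minimal sketch, assuming a trained Keras model `model` and a placeholder array `representative_images` standing in for a small calibration set.

```python
import numpy as np
import tensorflow as tf

def representative_dataset():
    # Yield a few calibration samples so the converter can estimate
    # activation ranges. `representative_images` is a placeholder for
    # a small slice of your training or validation data.
    for sample in representative_images[:100]:
        yield [np.expand_dims(sample, 0).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Force integer-only ops (and integer input/output) where supported.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_int8_model = converter.convert()
```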

TensorFlow Lite also supports different kinds of hardware accelerators through its delegate mechanism. Using the mobile GPU inference engine (the GPU delegate), developers can leverage mobile GPUs for model inference.
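On Android and iOS the GPU backend is attached through this delegate mechanism (for example, the GPU delegate in the Java and Swift APIs). Purely as an illustrative sketch in Python, a prebuilt delegate shared library can be attached to the interpreter as shown below; the library file name is a placeholder and depends on the delegate and platform you target.

```python
import tensorflow as tf

# Load a hardware-accelerator delegate from a prebuilt shared library.
# "libtensorflowlite_gpu_delegate.so" is a placeholder name; the actual
# binary depends on the platform and delegate you are targeting.
delegate = tf.lite.experimental.load_delegate("libtensorflowlite_gpu_delegate.so")

interpreter = tf.lite.Interpreter(
    model_path="model.tflite",
    experimental_delegates=[delegate],
)
interpreter.allocate_tensors()
```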

You can learn more about quantization in the official TensorFlow Lite documentation.

Mobile GPU Inference Engine in TensorFlow Lite

Lee et al. (2019) discussed the architectural design of TensorFlow Lite GPU (TFLite GPU), which works on both Android and iOS devices. According to the publication, TFLite GPU achieved an average speed-up of 2–9x for different kinds of DNNs when compared with inference on the CPU.

Figure 3. Comparison of average inference latency (in milliseconds) of TFLite GPU and CPU inference on different kinds of neural networks on various smartphones (Lee et al., 2019)
Table 2. Comparison of the average inference latency (in milliseconds) of iOS-supported machine learning frameworks on MobileNet v1 (Lee et al., 2019)
Table 3. Comparison of the average inference latency (in milliseconds) of Android-supported machine learning frameworks on MobileNet v1. Quoted from the paper: “Note that TFLite GPU employs OpenGL and thus has the widest coverage with reasonable performance. MACE and SNPE employ OpenCL and may run faster on devices shipped with OpenCL, but may not run on all devices.” (Footnotes from the paper: Arm Mali GPUs are not compatible with SNPE; Google Pixel devices do not support OpenCL.) (Lee et al., 2019)

TensorFlow Lite offers some interesting libraries too. Using the following library, a developer can run a machine learning model with just a few lines of code.

TFLite Support

TFLite Support is a toolkit for developing and deploying TFLite models on mobile devices. It is available in Java, C++, and Swift. It includes a library for deploying TFLite models to mobile devices, a metadata populator and extractor library, a codegen tool that generates a model wrapper based on the support library and metadata, and easy-to-use APIs for running inference with common model types. Currently, it supports vision, text, and audio tasks. Developers can also build custom APIs; in either case, they only need to call the API with the necessary parameters, such as the model path and inputs, while processes such as data encoding are handled by the library. A short example follows the links below.

GitHub Repository: https://github.com/tensorflow/tflite-support

Documentation: https://www.tensorflow.org/lite/inference_with_metadata/task_library/overview
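As an illustrative sketch (not an official quick-start), the Task Library also exposes Python bindings with the same pre-built APIs. The example below assumes the tflite_support package is installed and that an image-classification model with metadata and a test image exist at the placeholder paths shown.

```python
from tflite_support.task import vision

# Load an image-classification model packaged with TFLite metadata.
# Both file paths below are placeholders.
classifier = vision.ImageClassifier.create_from_file("model.tflite")

# Preprocessing (decoding, resizing, normalization) is driven by the
# model's metadata, so the caller only supplies the image.
image = vision.TensorImage.create_from_file("test_image.jpg")
result = classifier.classify(image)
print(result)
```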

In this article, we covered some interesting aspects of TensorFlow Lite. Even though it is not a tutorial, we hope you gained some insight into the difference between on-device and cloud-based inference, how to use TensorFlow Lite, its mobile GPU inference engine, and the TFLite Support library. Drop a comment below and let us know what you think. :)

References

TensorFlow. 2021. TensorFlow Lite | ML for Mobile and Edge Devices. [online] Available at: https://www.tensorflow.org/lite.

Guo, T., 2018. Cloud-Based or On-Device: An Empirical Study of Mobile Deep Inference. 2018 IEEE International Conference on Cloud Engineering (IC2E).

Dokic, K., Martinovic, M. and Mandusic, D., 2020. Inference speed and quantisation of neural networks with TensorFlow Lite for Microcontrollers framework. 2020 5th South-East Europe Design Automation, Computer Engineering, Computer Networks and Social Media Conference (SEEDA-CECNSM).

Lee, J., Chirkov, N., Ignasheva, E., Pisarchyk, Y., Shieh, M., Riccardi, F., Sarokin, R., Kulik, A. and Grundmann, M., 2019. On-Device Neural Net Inference with Mobile GPUs. arXiv preprint arXiv:1907.01989.

GitHub. 2021. tensorflow/tflite-support: TFLite Support is a toolkit that helps users to develop ML and deploy TFLite models onto mobile / IoT devices. [online] Available at: https://github.com/tensorflow/tflite-support.
