A guest post by the SmileAR Engineering Team at iQIYI
Introduction: SmileAR is a TensorFlow Lite-based mobile AR solution developed by iQIYI. It has been widely deployed across iQIYI's applications, including the flagship iQIYI video app (100+ million DAU), Qibabu (a popular app for children), Gingerbread (a short video app) and more.
In this article, we’d like to give you a short overview of SmileAR and our experience using TensorFlow Lite. If you’re new to us, iQIYI is a leading online entertainment company in China and our vision is to be a great entertainment company powered by technological innovation. Innovation is an essential part of iQIYI’s DNA, and we strive to continuously improve the entertainment experience of our users through technological advancement, including adopting cutting-edge research and development in AI (Artificial Intelligence) and AR (Augmented Reality).
SmileAR is an in-house mobile AR solution developed at iQIYI using TensorFlow Lite. SmileAR implements many key machine learning algorithms, such as human pose estimation, portrait segmentation, gesture recognition, and object detection. Building on top of these underlying algorithms, we have developed higher-level AR features, such as dance scoring, AR scan, beautification filters and so on. At present, SmileAR has been deployed in many products owned by iQIYI, including the flagship iQIYI video app (100m+ DAU), Qibabu (popular app for entertaining children), Gingerbread (a.k.a. Jiangbing, short video app) and iQIYI Live (live streaming app).
Sample use cases in SmileAR
Gesture recognition and tracking
Gesture recognition is an important user interaction method on mobile devices. It brings a fun experience to our short video apps, and our users can now trigger fascinating AR effects with hand gestures. In our SmileAR solution, the Single Shot MultiBox Detector (SSD) model is used for gesture recognition, with MobileNet as the backbone network. We also use TensorFlow Lite quantization-aware training to improve the execution speed. Our gesture recognition module achieves real-time performance on mobile devices. Right now, iQIYI gesture recognition has been deployed in the flagship iQIYI video app, iQIYI Live Broadcast Machine app, Gingerbread short video app, and others.
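An SSD detector emits many overlapping candidate boxes per hand, so a standard post-processing step is greedy non-maximum suppression (NMS). The sketch below illustrates that standard algorithm in plain Python; the thresholds are illustrative placeholders, not SmileAR's actual values.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    if inter <= 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5, score_thresh=0.3):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it."""
    order = sorted((i for i, s in enumerate(scores) if s >= score_thresh),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
    return keep
```

In a real pipeline this runs on the decoded SSD outputs after the TensorFlow Lite interpreter returns its box and score tensors.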
AR Scan
AR Scan is a feature frequently used in iQIYI’s main app: a user can scan a specific logo or object to trigger events, such as opening a webpage or playing an animation. Our AR Scan functionality is built around a classification model (we use MobileNet), and to cut down the inference time, we use a MobileNet depth multiplier of 0.75. By using TensorFlow Lite, we are able to achieve real-time scanning purely on device (~30ms inference time on the Kirin 970 SoC). By applying quantization, we further shrank our model to 1.77 MB while maintaining high accuracy. At present, AR Scan has been deployed in many scenarios such as TV drama interaction and event marketing, and has delivered great results.
Human pose estimation
As part of our iQIYI SmileAR solution, human pose estimation has been used in the Cute Baby Dancing Room feature of our Qibabu app. In the Dancing Room our users (typically children) try to imitate the dancing moves of professional coaches. When the app is on, our algorithm closely monitors user body movements, and calculates the similarity between them and the demo movements to determine how well the user is doing. When the user is following the demo closely, our app will automatically trigger certain special effects (such as falling fruits) on the screen, which encourages the user to learn more and keep improving. Since its launch, this feature has been highly praised by parents.
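One common way to score how closely a user follows a demo is to normalize the detected body keypoints (removing translation and scale) and compare them with the reference pose via cosine similarity. The sketch below shows that general approach in plain Python; the function names and the choice of cosine similarity are our illustration, not necessarily SmileAR's exact scoring method.

```python
import math

def normalize(keypoints):
    """Center keypoints on their centroid and scale to unit norm, so body
    position on screen and camera distance do not affect the comparison."""
    cx = sum(x for x, _ in keypoints) / len(keypoints)
    cy = sum(y for _, y in keypoints) / len(keypoints)
    centered = [(x - cx, y - cy) for x, y in keypoints]
    scale = math.sqrt(sum(x * x + y * y for x, y in centered)) or 1.0
    return [(x / scale, y / scale) for x, y in centered]

def pose_similarity(user_kps, coach_kps):
    """Cosine similarity of two normalized poses: 1.0 means identical."""
    u = [v for pt in normalize(user_kps) for v in pt]
    c = [v for pt in normalize(coach_kps) for v in pt]
    dot = sum(a * b for a, b in zip(u, c))
    nu = math.sqrt(sum(a * a for a in u))
    nc = math.sqrt(sum(a * a for a in c))
    return dot / (nu * nc) if nu and nc else 0.0
```

Because normalization removes translation and uniform scale, a child standing closer to the camera scores the same as one standing farther away; the app would trigger its special effects when the similarity crosses a chosen threshold.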
Our human pose estimation has also been used in human gesture recognition, such as ‘Arm Heart’ gesture (a.k.a. 比心 [bixin], putting your arms above your head so that they form a heart shape), ‘World Peace’ gesture (a handsome gesture proposed by Chinese artists, which signifies the hope for world peace), and ‘Superman’ gesture (the classical gesture of Superman imitated by many people). We use machine learning algorithms to analyze user body movements, normalize their posture and compare them with the reference gesture to determine whether we should trigger special animation effects for the users.
Algorithm optimization for mobile phones
Deep learning inference requires a significant amount of computation. Although modern mobile CPUs and GPUs are becoming more powerful, it remains a challenge to run on-device machine learning in real time. Our SmileAR platform needs to solve a diverse set of problems with minimal latency to serve our users, so we pushed the limit and optimized our algorithms and implementations to meet our business requirements.
In order to improve the speed of inference, we adopt a common model acceleration technique: replacing the complex backbone network with the MobileNet V2 network, which is highly optimized for mobile devices. For use cases with less stringent accuracy requirements, we further reduce the input size of the network and the number of channels in MobileNet V2. These optimization techniques significantly accelerate the execution of our algorithms.
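The savings from shrinking channels and input size can be estimated from the cost formula of a depthwise separable convolution, the building block of MobileNet: roughly, halving the channel count (the MobileNet papers' width multiplier α) and halving the input resolution (resolution multiplier ρ) each cut compute roughly quadratically. A small worked calculation with illustrative layer sizes:

```python
def dws_conv_macs(df, m, n, dk=3):
    """Multiply-accumulates of one depthwise separable conv layer:
    depthwise part (dk*dk*m*df*df) plus pointwise part (m*n*df*df),
    for a df x df feature map with m input and n output channels."""
    return dk * dk * m * df * df + m * n * df * df

def scaled_macs(df, m, n, alpha=1.0, rho=1.0, dk=3):
    """Apply width multiplier alpha (channels) and resolution multiplier rho."""
    return dws_conv_macs(int(rho * df), int(alpha * m), int(alpha * n), dk)

# Illustrative layer: 56x56 feature map, 128 -> 128 channels.
base = dws_conv_macs(df=56, m=128, n=128)
small = scaled_macs(df=56, m=128, n=128, alpha=0.5, rho=0.5)
print(f"compute reduced to {small / base:.2%} of baseline")
```

With α = ρ = 0.5 the layer needs under a tenth of the original multiply-accumulates, which is why these two knobs are the first thing to turn when latency matters more than the last bit of accuracy.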
Quantization-aware training for model acceleration
Because floating-point arithmetic is slower on mobile devices than fixed-point arithmetic, we use quantization-aware training for further speedup. TensorFlow Lite has strong support for quantization, and quantization-aware training can be enabled with just two extra lines of code. We first obtain a converged floating-point model through normal training, and then carry out quantization-aware training to get a fixed-point model. The CPU inference speed gain on popular SoCs, such as Qualcomm Snapdragon, Huawei Kirin and MTK Helio, is shown below. We plan to work closely with the TensorFlow team in the future to further improve these numbers.
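Under the hood, 8-bit quantization maps each floating-point tensor to integers through an affine transform q = round(x / scale) + zero_point; quantization-aware training simulates exactly this rounding in the forward pass so the weights learn to tolerate it. The sketch below shows the standard affine quantize/dequantize math in plain Python (a simplified illustration of the scheme TensorFlow Lite uses, not its actual implementation):

```python
def quant_params(xmin, xmax, qmin=-128, qmax=127):
    """Derive scale and zero point for 8-bit affine quantization of [xmin, xmax]."""
    xmin, xmax = min(xmin, 0.0), max(xmax, 0.0)  # range must contain 0.0
    scale = (xmax - xmin) / (qmax - qmin)
    zero_point = int(round(qmin - xmin / scale))
    return scale, zero_point

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a float to its nearest representable int8 value."""
    q = int(round(x / scale)) + zero_point
    return max(qmin, min(qmax, q))

def dequantize(q, scale, zero_point):
    """Recover the approximate float the int8 value represents."""
    return scale * (q - zero_point)
```

The round trip quantize → dequantize loses at most about half a scale step per value; quantization-aware training exposes the network to exactly this error during training, so accuracy after conversion stays close to the float baseline.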
Escaping from the local optimum
Our human pose estimation implementation also uses a multi-task learning algorithm, and heatmap estimation is one of these tasks. When training our model, some of the predicted heatmaps would occasionally collapse to all zeros: the algorithm gets stuck in a local optimum, because the loss of an all-zero heatmap is already small. To solve this problem, we designed an auxiliary loss function to penalize all-zero heatmaps, and with it our algorithm was able to escape the local optimum.
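One simple way to realize such a penalty is to add a hinge term that fires whenever the predicted heatmap's peak falls below a margin while the ground truth contains a keypoint. The sketch below illustrates the idea on flattened heatmaps in plain Python; the margin, weight, and exact form of the penalty are our illustration, not SmileAR's actual loss.

```python
def heatmap_loss(pred, target, margin=0.5, weight=1.0):
    """MSE on a flattened heatmap plus a hinge penalty that fires when the
    predicted peak is far below `margin` while the target has a keypoint.
    An all-zero prediction can no longer achieve a small total loss."""
    n = len(pred)
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / n
    has_keypoint = max(target) > 0.0
    aux = weight * max(0.0, margin - max(pred)) if has_keypoint else 0.0
    return mse + aux
```

Because the auxiliary term is zero once the predicted peak clears the margin, it only reshapes the loss landscape near the degenerate all-zero solution and leaves well-behaved predictions untouched.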
Deployment on Mobile devices
In order to achieve cross-platform deployment (Android, iOS and Windows), we use the C++ interface of TensorFlow Lite. We compile TensorFlow Lite into static libraries for different platforms, and then encapsulate the platform-dependent modules in Objective-C and Java for iOS and Android respectively. In this way, we maximize the reusability of our native mobile code and minimize the cross-platform development effort. On the Windows side, we do the same encapsulation with C++, and successfully use TensorFlow Lite for various features. Even when the input is challenging 1080p high-resolution video, we can still run our TensorFlow Lite models in real time.
When our SmileAR SDK is delivered internally, it can be packaged in a clean and flexible manner to suit the needs of our various business units. These business units only need to call the corresponding Java or Objective-C APIs, without worrying about our underlying implementations, which makes the SDK hugely convenient to integrate into our mobile products. At the same time, because TensorFlow Lite can run directly on Windows, we also use SmileAR directly in our Windows client for the live streaming feature.
SDK authorization and model protection
Before final delivery, our model files are encrypted by iQIYI’s in-house encryption schemes to guarantee the security of our mobile models. Meanwhile, in order to protect iQIYI’s intellectual property rights, we have also added license authorization verification.
Reducing SDK size
The mobile networking environment is diverse and complex, and app package size is a critical factor for successful downloads and the end user experience. We use link-time optimization to further reduce the size of the TensorFlow Lite library. TensorFlow Lite registers all operators by default, while in practice we only need a subset of them, so all unused operators can be eliminated. When the static library is linked into a dynamic library or an executable, the linker drops the unreferenced code from the final binary. To do this, we inherit from MutableOpResolver and selectively register only the operators our models need. This flexible scheme does not modify the original TensorFlow Lite static library code, yet allows us to add and remove operators as needed.
Summary
Taking SmileAR as an example, we have discussed the adoption of TensorFlow Lite inside iQIYI. TensorFlow Lite, as an efficient inference engine for mobile devices, has great cross-platform support (Android, iOS, and Windows 64-bit for us). By using TensorFlow Lite, we have saved considerable engineering time and resources. Its excellent execution efficiency enables many algorithms to run in real time on mobile devices and powers the deployment of our AI/AR features. Its rich tooling, such as the benchmark tool, helps us quickly identify the bottlenecks in our models and provides powerful assistance in accelerating model inference. With the help of TensorFlow Lite, SmileAR has been successfully deployed in many iQIYI apps.
As an important use case of iQIYI AI technologies, SmileAR plans to add spatiotemporal CNN models, further model accuracy optimization, and GPU/DSP acceleration on different platforms in the future. Supported by the Google TensorFlow ecosystem, SmileAR will continue to provide efficient and robust AR solutions for iQIYI’s various business scenarios and improve iQIYI’s user growth and retention, so that all our users can easily and equally enjoy immersive AR experiences.