Create Artistic Effect by Stylizing Image Background — Part 2: TensorFlow Lite Models
In the previous post, Margaret introduced the project, covering the scenarios where it could be useful along with its technical objectives.
In this post, we will provide more details about the models we used, the key steps in converting them to TensorFlow Lite (TFLite), and the benchmarking results of the models. You can follow along with the materials listed here.
The project includes two types of models, a semantic segmentation model and a stylization model, as mentioned in the intro blog post. For both, there are a number of model variants to choose from in TensorFlow Hub’s model repository (segmentation models and stylization models). In this section, we will discuss the key steps in their conversion process.
Converting Semantic Segmentation Models
All of the code presented in this section is demonstrated in this Colab Notebook. The semantic segmentation models are based on DeepLabV3. To perform the conversion, we will be using the pre-trained checkpoints provided by the DeepLab authors.
Each of the DeepLab model archives ships with the following files -
We’ll only be using frozen_inference_graph.pb, which is a frozen inference graph. DeepLabV3 checkpoints are available for three different datasets: PASCAL VOC 2012, Cityscapes, and ADE20K. You can find more information about all the available ones from here. Let’s start with the mobilenetv2_coco_voctrainval model, which is associated with the PASCAL VOC 2012 dataset.
The model files can be downloaded from this link. After the archive is extracted, the following code listing generates the TFLite model -
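The listing itself is not reproduced in this text. The following is only a minimal sketch of what such a conversion could look like, assuming TensorFlow 2.x and its tf.compat.v1.lite.TFLiteConverter.from_frozen_graph API. The function name, the file paths, and the choice of the default optimization are illustrative assumptions and may differ from the Colab Notebook. The tensor names are the ones discussed in the rest of this section.

```python
INPUT_TENSOR = "sub_7"              # graph node used as the TFLite input
OUTPUT_TENSOR = "ResizeBilinear_2"  # graph node used as the TFLite output


def convert_frozen_graph(graph_def_file, tflite_path):
    """Convert a frozen DeepLab graph to a TFLite flatbuffer and save it.

    Hypothetical helper; returns the size of the converted model in bytes.
    """
    # Imported lazily so this module can be loaded without TensorFlow.
    import tensorflow as tf

    converter = tf.compat.v1.lite.TFLiteConverter.from_frozen_graph(
        graph_def_file=graph_def_file,
        input_arrays=[INPUT_TENSOR],
        output_arrays=[OUTPUT_TENSOR],
    )
    # Assumption: default (dynamic-range) optimization. Remove this line
    # to keep a plain float32 model instead.
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    tflite_model = converter.convert()

    with open(tflite_path, "wb") as f:
        f.write(tflite_model)
    return len(tflite_model)


# Example call (paths are placeholders):
# convert_frozen_graph("path/to/frozen_inference_graph.pb", "deeplab.tflite")
```

Selecting intermediate nodes as input_arrays and output_arrays is what trims the graph: everything before the input node and after the output node is left out of the converted model.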
There are a couple of important things to note here. If we inspect the frozen_inference_graph.pb file in Netron to check the input and the output tensors of the graph, we would see the following -
As we can see, the input (sub_7) and the output (ResizeBilinear_2) tensors that we specified in the code listing above differ from the input and output tensors of the original graph. Why’s that?
The part of the graph before sub_7 performs the preprocessing steps that allow the original model graph to handle images of dynamic shapes. Currently (as of November 2020), TFLite GPU delegates do not support dynamic shapes. So, to speed up the execution of the converted TFLite model, we selected the input tensor this way. This is also why we have to perform these preprocessing steps ourselves before feeding an input image or video frame to the TFLite model. In the following figure, we can see that sub_7 has a fixed output shape -
The same reasoning applies to our choice of the output tensor of the TFLite model. Here too, we need to implement the post-processing steps that the original model graph performs after the ResizeBilinear_2 tensor. Additionally, as we can see in Figure 2, the output tensor of the original model graph passes through an ArgMax operation that is not yet supported by the TFLite GPU delegate. So, in summary, while preparing the TFLite model we made sure of the following:
- Exclude operations that cannot run on TFLite delegates.
- Exclude operations that have dynamic output shapes.
Again, note that the operations we did not include in the TFLite model graph have to be performed separately: the preprocessing on the input images or video frames before feeding them to the TFLite model, and the post-processing on the model’s output.
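To make the excluded steps concrete, here is a minimal NumPy sketch of what the application would do around the TFLite model. It rests on assumptions not stated in this post: that the model expects float32 pixels scaled to [-1, 1], that a nearest-neighbour resize is acceptable, and that the output holds per-class scores in its last dimension. The function names are hypothetical, and the target size should be read from the converted model rather than taken from here.

```python
import numpy as np


def preprocess(frame, size):
    """Resize an HxWx3 uint8 frame to size x size and scale it to [-1, 1]."""
    h, w = frame.shape[:2]
    # Nearest-neighbour resize via index lookup (keeps the sketch small).
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    resized = frame[rows][:, cols]
    # Assumption: the model expects float32 pixels in the range [-1, 1].
    scaled = resized.astype(np.float32) / 127.5 - 1.0
    return scaled[np.newaxis, ...]  # add batch axis -> (1, size, size, 3)


def postprocess(scores):
    """Turn (1, H, W, num_classes) scores into an (H, W) label map.

    This stands in for the ArgMax step left out of the TFLite graph.
    """
    return np.argmax(scores[0], axis=-1).astype(np.uint8)
```

In a mobile application the same two steps would be written in the platform language; the point is only that a fixed-size resize happens before the model and the per-pixel argmax happens after it.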
By now, we should have a TFLite model file that reflects the considerations discussed above. The MobileNet variants of the DeepLab models are about 2.3 MB in size when converted to TFLite, which is perfectly usable in a mobile application.
Converting Image Stylization Models
Stylization actually involves two models -
- Style predict model, which calculates the feature bottlenecks from the style image (the image whose style you would like to extract).
- Style transfer model, which takes a content image and the pre-computed feature bottlenecks from the style image and generates the final stylized image.
TensorFlow Lite primarily offers three post-training quantization strategies for model conversion: dynamic-range, float16, and integer. For the most part, the conversion process is straightforward, except for integer quantization, which requires you to provide a representative dataset so that the TFLiteConverter can calibrate the dynamic ranges of the activations.
You can refer to this Colab Notebook to follow along with the code snippets discussed below.
For the style predict model, this roughly translates to the following code listing -
We fix the input shape in advance to prevent the model from accepting dynamic shapes. This matters because our ultimate objective is to deploy these models to mobile phones, where fixed-shape inputs often result in better performance.
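As a rough illustration of integer quantization with a fixed input shape, here is a hedged sketch. It assumes the style predict model is available as a SavedModel with a serving_default signature, that the style images are already scaled the way the model expects, and that the caller supplies the input shape. Both function names are hypothetical and this is not necessarily how the notebook does it.

```python
import numpy as np


def representative_dataset(style_images):
    """Yield one calibration sample per style image."""
    for image in style_images:
        # Each sample is a list with one float32 array per model input.
        yield [np.asarray(image, dtype=np.float32)[np.newaxis, ...]]


def convert_style_predict_int8(saved_model_dir, style_images, input_shape):
    """Integer post-training quantization with a pinned input shape."""
    import tensorflow as tf  # imported lazily; only needed for conversion

    model = tf.saved_model.load(saved_model_dir)
    # Assumption: the SavedModel exposes a "serving_default" signature.
    concrete_func = model.signatures["serving_default"]
    # Pin the input shape so the converted model has no dynamic dimensions.
    concrete_func.inputs[0].set_shape(input_shape)

    converter = tf.lite.TFLiteConverter.from_concrete_functions([concrete_func])
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    # The converter calls this to collect activation ranges for calibration.
    converter.representative_dataset = lambda: representative_dataset(style_images)
    return converter.convert()
```

For dynamic-range quantization the representative_dataset line would simply be dropped, and for float16 it would be replaced by setting converter.target_spec.supported_types to [tf.float16].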
For the style transfer model, things may seem more complicated since it takes two different inputs. One of these inputs, the style bottleneck, is computed by the style predict model. To perform integer quantization in this case, we need to think about how to generate the representative dataset. In the following figure, we present a brief schematic of representative dataset generation -
In code, this would look like so -
Now that we have the generators in place, we can proceed toward the actual model conversion process -
As we can see, we need to specify the indices for the two different inputs our model deals with. These indices are assigned dynamically when the model is loaded, so it’s important to specify them correctly during the conversion process.
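The idea of a two-input representative dataset can be sketched as follows. This is an illustration under assumptions: the bottlenecks are taken to be precomputed by running the style predict model over the style images, both inputs are taken to be already batched, and the function and parameter names are hypothetical. The two index arguments stand for the input positions that the loaded model reports.

```python
import numpy as np


def transfer_representative_dataset(content_images, style_bottlenecks,
                                    content_index=0, style_index=1):
    """Yield two-element samples ordered by the model's input positions."""
    if {content_index, style_index} != {0, 1}:
        raise ValueError("input indices must be 0 and 1, one for each input")
    for content, bottleneck in zip(content_images, style_bottlenecks):
        sample = [None, None]
        # Place each array at the position the model expects it in.
        sample[content_index] = np.asarray(content, dtype=np.float32)
        sample[style_index] = np.asarray(bottleneck, dtype=np.float32)
        yield sample
```

Passing the positions in explicitly keeps the generator correct even if the loaded model lists the style bottleneck before the content image.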
To run inference in Python with the models discussed using the TFLite Interpreter, you can follow along with the notebooks mentioned here.
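For orientation, a single inference call with the Python interpreter could look like the sketch below. The helper name is hypothetical. It takes any object exposing the tf.lite.Interpreter methods, which in real use would be created with tf.lite.Interpreter(model_path=...), and it assumes the model has a single output.

```python
import numpy as np


def run_tflite(interpreter, inputs):
    """Run one inference; inputs maps input position -> array-like value."""
    # Allocation is repeated on every call for simplicity; in a real
    # application it would be done once after loading the model.
    interpreter.allocate_tensors()
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    for position, value in inputs.items():
        detail = input_details[position]
        # Cast to the dtype the model declares for this input.
        tensor = np.asarray(value, dtype=detail["dtype"])
        interpreter.set_tensor(detail["index"], tensor)

    interpreter.invoke()
    # Assumption: the model has a single output tensor.
    return interpreter.get_tensor(output_details[0]["index"])
```

For the style transfer model, inputs would hold two entries, one for the content image and one for the style bottleneck, keyed by the positions the interpreter reports.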
After converting the models to .tflite, we used the Benchmark tool to get different device-specific statistics about these TensorFlow Lite models such as average inference time, peak memory usage, and so on.
The figure below compares the average inference time (in milliseconds) of the different segmentation models -
From the figure above, we can get a sense of the models’ performance under the different computing capabilities of a particular device. Now, as our final goal is to use these models inside a mobile application, the model sizes matter a lot. Next, we compare their sizes -
Additionally, we benchmarked an official TensorFlow Lite model for semantic segmentation. This model is based on a MobileNetV2 backbone with a depth multiplier of 0.5 but is trained on images at a resolution of 257x257 -
After benchmarking the stylization models, we got -
Float16 models have the clear advantage of running faster on a GPU, although they are almost twice the size of the other variants.
There’s always a trade-off between size, accuracy, and latency when choosing models for your applications. But as our target runtime is primarily mobile phones, where we want to generate fun artistic images quickly, we decided to go with the models that are the fastest -
If you are interested in seeing how we put these models to use, refer to either this blog post discussing the Android implementation, written by George, or this one discussing the iOS implementation (coming soon).