Copista: Training models for TensorFlow Mobile

Andrew G
5 min read · Nov 6, 2017


Tinkering with Deep Learning

2. Tools

For those who missed the first part, Copista: Developing Neural Style Transfer application with TensorFlow Mobile, this blog is a software engineer's take on Machine Learning.

In this part, you will find the tools and tricks I used to train mobile models for a Neural Style Transfer Android application based on TensorFlow Mobile: Copista — Cubism, expressionism AI photo filters.

When I got bitten by the machine learning bug, the first work I came across was Johnson's Perceptual Losses for Real-Time Style Transfer and Super-Resolution.

As an implementation I picked up Chainer fast neural style by Yusuke Tomoto. It is a very cleanly written, straightforward implementation that I managed to start using quite quickly. It took me a couple of days to set it up and start training the models on the cheapest AWS EC2 instance with a GPU.

I started with a g2.2xlarge instance, but after a week I got tired of waiting 24 hours for a single model to train.
So I upgraded from the g2.2xlarge instance to a p2.xlarge instance, and things got faster. After a week of training I had a dozen nicely trained models and faced one big problem I should have anticipated :)

I had trained Chainer models, and I could not figure out how to translate them into TensorFlow models so that I could run them on TensorFlow Mobile.
It was a huge setback, and I was thinking of giving up…

I did a little research, only to find out that there are usually one-way conversions from Caffe models to TensorFlow or to Chainer, but there is no out-of-the-box tool to translate Chainer models to TensorFlow. (I considered writing one myself but decided against it because of the time it would take.)

I kept Chainer and Yusuke Tomoto's code as a reference and started looking for corresponding TensorFlow code. It turned out that there were several implementations, but almost all of them limited the input image to some predefined size (for example 256 × 256). From the beginning I wanted to be able to transform images of any size, so after several days I found code I could start with. I settled on this TensorFlow implementation of fast neural style by Zhiyuan He. The code is very close to the Chainer implementation I used.

I should mention that training in Chainer is a lot faster than in TensorFlow. It's a pity that Chainer does not have export tools for other frameworks. Chainer is also more flexible than TensorFlow and is good for experiments.

Anyway, I started with the code above. Most academic code for TensorFlow stops at the checkpoint file format for the trained models. (See more about saved models here.) The code I used included an export step from the checkpoint (ckpt) model format to the frozen graph (.pb) format.
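
For reference, the checkpoint-to-frozen-graph step boils down to restoring the trained variables and converting them to constants. This is only a minimal sketch, not the repository's exact export code; the paths and the output_image node name are assumptions you would replace with your own.

import tensorflow as tf

# Minimal sketch: turn a trained checkpoint into a single frozen .pb file.
# The paths and the output node name ('output_image') are assumptions.
def freeze(checkpoint_path, output_pb_path, output_node='output_image'):
    with tf.Session() as sess:
        # Rebuild the graph from the .meta file and restore the trained weights.
        saver = tf.train.import_meta_graph(checkpoint_path + '.meta')
        saver.restore(sess, checkpoint_path)
        # Bake the variables into constants so one .pb holds graph + weights.
        frozen = tf.graph_util.convert_variables_to_constants(
            sess, sess.graph_def, [output_node])
        with tf.gfile.GFile(output_pb_path, 'wb') as f:
            f.write(frozen.SerializeToString())

freeze('models/style.ckpt-done', 'model_frozen.pb')  # placeholder paths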

I found this super useful guide, Deploying a TensorFlow model to Android by Yoni Tsafir, which saved me a lot of time! I want to emphasize one idea from it: you should validate your models as soon as you can!

Tip #1: Validate your models on mobile ASAP!

I made the mistake of delaying testing the models on mobile. I concentrated on model optimization, only to find out weeks later (!!) that I could not run the optimized models on mobile. It turned out that the Python environment is a lot more forgiving than C/Java. It took me a couple of days to figure out how to write the model code so that TensorFlow Mobile could load and run it without problems.
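
A cheap complementary check is to reload the frozen .pb in a fresh Python process and run it on a dummy image. It will not catch the C/Java strictness issues described below, but it does catch missing nodes and wrong tensor names before you ever touch the phone. A rough sketch, assuming the input_image / output_image node names used in this post:

import numpy as np
import tensorflow as tf

# Rough sketch: reload the frozen graph and run it once on a dummy image.
# Node names ('input_image', 'output_image') are the ones used in this post.
graph_def = tf.GraphDef()
with tf.gfile.GFile('model_frozen.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())

with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name='')
    with tf.Session(graph=graph) as sess:
        dummy = np.random.randint(0, 255, size=(480, 640, 3), dtype=np.uint8)
        out = sess.run('output_image:0', feed_dict={'input_image:0': dummy})
        print(out.shape, out.dtype)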

Here are some more tips that I hope you will find useful.

Tip #2: Python TensorFlow is more forgiving than C/Java TensorFlow


Validate the input type. I have no explanation for it, but the Python TensorFlow environment processes both int32 and uint8 input types without any difference, while C/Java TensorFlow is very sensitive to it.

with g.as_default():
    with tf.Session() as sess:
        # Building graph.
        image_data = tf.placeholder(tf.int32, name='input_image')

For C/Java TensorFlow to work I had to change tf.int32 to tf.uint8, which is correct anyway: the image data is actually uint8.

image_data = tf.placeholder(tf.uint8, name='input_image')

In order to support various input image sizes you have to declare the shape explicitly, with None for the variable dimensions, so that C/Java TensorFlow can handle it:

image = tf.placeholder(tf.uint8, name='input_image', shape=[None, None, 3])
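
For context, here is roughly how such a placeholder fits into the rest of the graph: the style network itself computes in floats, so the uint8 input gets cast and batched inside the graph, and the result is clipped and cast back to uint8. This is only a sketch; the actual model code may differ.

import tensorflow as tf

# Sketch (not the repo's exact code): cast the uint8 input to float inside
# the graph and add the batch dimension the convolution layers expect.
image = tf.placeholder(tf.uint8, name='input_image', shape=[None, None, 3])
batched = tf.expand_dims(tf.cast(image, tf.float32), 0)  # shape [1, H, W, 3]

styled = batched  # stand-in for the style network's float output

# Clip and cast back to uint8 so the output node is directly usable as an image.
output = tf.cast(tf.clip_by_value(styled[0], 0.0, 255.0), tf.uint8,
                 name='output_image')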

Tip #3: transform_graph is super useful

Do not optimize a model until you first have it working on mobile. It's just common sense: start with something working end to end, then check whether you can optimize it.

I found it convenient to train the models and produce the frozen models (.pb files) on AWS. But for optimization I use my laptop, so I can play with various parameters and push the resulting model to a mobile device without delay.

My favorite tool is transform_graph :)

Here MODEL_SRC and MODEL_O are the directories for the source and optimized models:

# obfuscate model function
function obfuscate {
  re_match=r
  replace=o
  tgt=$(echo $i | sed -e "s/$re_match/$replace/")
  echo $tgt
  bazel-bin/tensorflow/tools/graph_transforms/transform_graph \
    --in_graph=$MODEL_SRC/$i \
    --out_graph=$MODEL_O/$tgt \
    --inputs='input_image' --outputs='output_image' --transforms='
      add_default_attributes
      fold_constants(ignore_errors=true)
      fold_batch_norms
      fold_old_batch_norms
      quantize_weights
      strip_unused_nodes
      obfuscate_names
      sort_by_execution_order'
}

# compile transformation tool in case it's not ready
bazel build tensorflow/tools/graph_transforms:transform_graph
retval=$?
if [ $retval -ne 0 ]; then
  exit $retval
fi

# obfuscate models
for i in $( ls $MODEL_SRC ); do
  obfuscate $i
done

I found the following options particularly useful for model optimization: quantize_weights, strip_unused_nodes, obfuscate_names, sort_by_execution_order.

Tip #4: Neural networks are CPU and memory greedy!

Keep in mind that the computational graph of a neural network is CPU and memory greedy! Some models that work well on the server side, with almost limitless memory and CPU, will have a hard time running on mobile.

The RunStats tool gives you some insight into the CPU and memory usage of a model at inference time.

Here, for example, is the output of the RunStats summary for one of the models. (I kept only some of the columns so it fits on the page.)

================== Summary by node type =========================
[node type] [avg ms] [%] [mem KB] [Name]
Conv2D 403.345 12.96% 9437.184 deconv3/conv/conv
Conv2D 298.677 9.60% 25108.512 conv1/conv/conv
Conv2DBack 229.129 7.36% 81788.928 deconv2/conv_transpose
Conv2D 129.523 4.16% 6291.456 res1/residual/conv_1/conv
RealDiv 125.33 4.03% 0 deconv3/div
Conv2D 121.248 3.90% 6291.456 res2/residual/conv/conv
Conv2D 119.328 3.84% 6291.456 res3/residual/conv_1/conv
Conv2DBack 117.662 3.78% 40894.464 deconv1/conv_transpose
Conv2D 115.908 3.73% 6291.456 res1/residual/conv/conv
Conv2D 109.738 3.53% 6291.456 res2/residual/conv_1/conv
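
If you want similar per-node numbers on the desktop, a session run with RunOptions and RunMetadata gives you the raw step stats, which you can dump as a Chrome trace. Again a sketch, assuming the input_image / output_image node names from above:

import numpy as np
import tensorflow as tf
from tensorflow.python.client import timeline

# Sketch: collect per-node timing/memory stats for one inference run and
# write them out as a Chrome trace (open it in chrome://tracing).
graph_def = tf.GraphDef()
with tf.gfile.GFile('model_frozen.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())

with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name='')
    with tf.Session(graph=graph) as sess:
        opts = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
        meta = tf.RunMetadata()
        image = np.random.randint(0, 255, size=(480, 640, 3), dtype=np.uint8)
        sess.run('output_image:0', feed_dict={'input_image:0': image},
                 options=opts, run_metadata=meta)
        # step_stats holds the per-node data behind summaries like the one above.
        trace = timeline.Timeline(meta.step_stats).generate_chrome_trace_format()
        with open('trace.json', 'w') as f:
            f.write(trace)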

In the next part I will talk about high-resolution image support in Copista.

