# Optimizing any TensorFlow model using the TensorFlow Graph Transform Tool and TensorRT

## Model optimization: reducing precision from FP32 to FP16 for speedup and a smaller graph

## What is this article about? Whom is this meant for?

More than an article, this is a *how-to* on optimizing a TensorFlow model using the TF Graph Transform tools and NVIDIA TensorRT. It is fairly heavy reading and is meant for Data Science Engineers rather than Data Scientists. Besides the how-to, it also records the bugs and unexpected results I ran into along the way.

These results were unexpected to me in the role of a DS Engineer, but they may have a rational explanation once I go deeper into the abstractions of the model or of the TF framework that implements it. I don't have that skill yet, and I leave it to the reader to point out any mistakes I have made. I have raised two bugs, one against the TF Graph Transform tool and one against the NVIDIA TensorRT conversion tool; if they get answered, I will update this post.

https://github.com/tensorflow/tensorflow/issues/28276

## What was my expectation? More CUDA cores, Tensor Cores == (I thought) faster inference

From NVIDIA's Tensor Core product description, I got the idea (or hope) that if I could convert the FP32 object detection models we use in production to FP16 or INT8 models and weights, inference would run two to four times as fast, as advertised.

For our experiments we had one of the best GPUs, an NVIDIA V100 32 GB, which supports all these modes through its Tensor Cores, along with a set of other server-grade and gaming-grade GPUs: P40, GTX 1080, 1070 and so on.

Before we started with these expensive GPUs, we used to run the model on HP desktops with a GTX 1080. The first expectation was that the higher number of CUDA cores (~5k) in the V100 would make our models run faster. I had read blogs like the one quoted below and knew that the 1080 is pretty good and that NVIDIA prices and markets the server-side GPUs higher. (On the 1080, FP16 is throttled, which I guess is no longer the case on the 2080.)

Here is an excerpt from it:

“2080 Ti vs V100 — is the 2080 Ti really that fast?

If you absolutely need 32 GB of memory because your model size won’t fit into 11 GB of memory with a batch size of 1. If you are creating your own model architecture and it simply can’t fit even when you bring the batch size lower, the V100 could make sense. However, this is a pretty rare edge case. Fewer than 5% of our customers are using custom models. Most use something like ResNet, VGG, Inception, SSD, or Yolo.

So. You’re still wondering. Why would anybody buy the V100? It comes down to marketing.”

I was not expecting a drastic speedup from more CUDA cores anyway: even with HD image frames, GPU utilisation never touched 100 per cent, which means that GPU processing alone was not the bottleneck. Things are grey here, and there is a lot more scope for engineering optimisations.

We chose the V100 not because of 'marketing' but because it was the only GPU with that much memory, 32 GB, which would let us batch more image frames in parallel, that is, do real-time analytics of more HD video cameras on a single edge machine.

## Reality regarding more CUDA Cores

The truth was that, other than the advantage of processing more frames in parallel thanks to the larger memory, there was no speedup over the GTX 1080 (~2.5k CUDA cores). We also tested this on a Jetson TX2, which has far fewer (~256) CUDA cores, and on one older gaming GPU, and there it was very slow. So more CUDA cores do help, but beyond some threshold there is not much difference. Maybe this is already well known, and that is why newer NVIDIA models like the T4 have around 2.5k CUDA cores and are available on GCP and other cloud providers for inference at much cheaper rates. The V100 seems to be used mainly for training, where it is not practical, at least for now, to work in lower precision: the maths of gradient descent and back propagation is easier with higher precision.

You can gain more insights from Tim Dettmers's post on GPUs, https://timdettmers.com/2019/04/03/which-gpu-for-deep-learning/, but do keep in mind things like the number of frames you need to process in parallel, rather than just raw inference speed.

## Reality regarding TensorCores, half precision/lower precision FP16, INT8

The field of Data Science Engineering is still nascent, in that there is no clear line where Data Science ends and Engineering takes over. Frameworks like TensorFlow Serving and related tools help the Dev or DS Operations team work with the model, develop generic clients and build useful applications on top of it. But these teams treat the model as a black box, without knowing it in depth. So when they take a model, run an optimisation and get an error like the one below, they don't know what to do other than develop a real inferiority complex and make a mental note to understand things better.

    details = "input_max_range must be larger than input_min_range.
    [[{{node Postprocessor/BatchMultiClassNonMaxSuppression/map/while/MultiClassNonMaxSuppression/ClipToWindow_87/Area/mul_eightbit/Postprocessor/BatchMultiClassNonMaxSuppression/map/while/MultiClassNonMaxSuppression/ClipToWindow_87/Area/sub_1/quantize}}]]
    [[{{node Postprocessor/BatchMultiClassNonMaxSuppression/map/while/MultiClassNonMaxSuppression/zeros_like_83}}]]"

With that prelude, and based on the optimisations that worked (partially) when converting an object detection model, the Single Shot Detector (SSD) from the TF model zoo, here are the results I got on the V100: basically, not much speedup, because many of the layers did not get converted, as the TensorRT log line below shows. Note that even Conv2D appears in the list. I had initially run the experiment on a Keras-converted model and got similar results, but I thought that a model written in TF might fare better, hence the experiments on the SSD model.

    tensorflow/contrib/tensorrt/segment/segment.cc:443] There are 3962 ops of 51 different types in the graph that are not converted to TensorRT: TopKV2, NonMaxSuppressionV2, TensorArrayWriteV3, Const, Squeeze, ResizeBilinear, Maximum, Where, Add, Placeholder, Switch, TensorArrayGatherV3, NextIteration, Greater, TensorArraySizeV3, NoOp, TensorArrayV3, LoopCond, Less, StridedSlice, TensorArrayScatterV3, ExpandDims, Exit, Cast, Identity, Shape, RealDiv, TensorArrayReadV3, Reshape, Merge, Enter, Range, Conv2D, Mul, Equal, Sub, Minimum, Tile, Pack, Split, ZerosLike, ConcatV2, Size, Unpack, Assert, DataFormatVecPermute, Transpose, Gather, Exp, Slice, Fill, (For more information see https://docs.nvidia.com/deeplearning/dgx/integrate-tf-trt/index.html#support-ops).
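
One way to reason about why a partial conversion gives so little is Amdahl's law: if only a fraction of the inference time is spent in ops that actually get faster, the overall gain is capped by everything that stays the same. The sketch below is only an illustration; the fractions in it are hypothetical and were not measured on this model.

```python
def overall_speedup(accelerated_fraction, local_speedup):
    """Amdahl's law: overall speedup when only part of the runtime gets faster."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if local_speedup <= 0:
        raise ValueError("local_speedup must be positive")
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / local_speedup)


# Hypothetical fractions of runtime spent in converted ops (not measured).
for fraction in (0.05, 0.5, 0.95):
    print("fraction=%.2f -> overall speedup %.2fx"
          % (fraction, overall_speedup(fraction, 2.0)))
```

With a local speedup of 2x, the overall number only approaches 2x when nearly all of the runtime sits in the converted part.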

The rest of the article gives more details on how I did this, which you can also follow step by step.

The post that really helped me was this one from the Google team:

[1] https://medium.com/google-cloud/optimizing-tensorflow-models-for-serving-959080e9ddbf

I am writing this post as a more detailed companion to [1], as some parts were not clear to me when I started following its steps.

My Colab/Jupyter notebook for the optimization is given here; you can skip the article and follow the notebook instead, as I have documented the steps in it. The TF Serving and client parts, however, are only in this article.

https://colab.research.google.com/drive/1wQpWoc40kf__WSjfTqDaReMx6fFjUn48

If you have a *frozen* TF graph, you can use the following methods to optimize it before using it for inference.

There are two types of optimization. One makes the graph faster or smaller for inference. The other changes the weights from higher precision to lower precision, usually from FP32 to FP16 or INT8. For the latter, the GPU should be able to run mixed-precision operations (Tensor Cores). NVIDIA's desktop or laptop class GPUs such as the GTX 1080 are usually restricted when running lower-precision operations. NVIDIA's server-class GPUs support this, especially the newer ones such as the V100 and T4, but not all server GPUs do.

The GPU I use is an NVIDIA V100 32 GB, which supports mixed-precision operations. Also, you need to run the optimization on the GPU that you are optimizing for, especially if you are using TensorRT.
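
To get a feel for what going from FP32 to FP16 does to a single weight, here is a minimal sketch using only Python's `struct` module (the `e` format code needs Python 3.6 or later). It is not how TensorFlow or TensorRT convert a model; it only rounds a few made-up values to half precision, which keeps roughly three decimal digits and tops out at 65504.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    # The 'e' format code packs a float into 2 bytes.
    return struct.unpack("<e", struct.pack("<e", value))[0]


def to_fp32(value):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


# Made-up example weights, chosen only for illustration.
for weight in (1.0, 0.1, 0.001234567, 1234.567):
    print("%r -> fp32 %r, fp16 %r" % (weight, to_fp32(weight), to_fp16(weight)))
```

The rounding error per weight is small, which is why the detections from a lower-precision model can stay very close to those of the original.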

## Step 0. The model, and the Docker Containers

The first thing to do is to convert the TensorFlow graph to a frozen graph. If the model is Keras based, it is in the HDF5 format and has to be converted to a TF model first and then to a frozen graph. A frozen graph has the values of the variables embedded in the graph itself. It is in GraphDef/protocol buffer (pb) format like a SavedModel, only it cannot be retrained.

The model that we are using is the SSD model **ssd_resnet_50_fpn_coco** from the TF model zoo: https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md

The Docker container used for the optimization is *tensorflow/tensorflow:1.13.0rc1-gpu-jupyter*:

    docker run --entrypoint=/bin/bash --runtime=nvidia -it --rm -p 8900:8500 -p 8901:8501 -v /usr/alex/:/coding --net=host tensorflow/tensorflow:1.13.0rc1-gpu-jupyter

Once inside:

    cd /coding
    jupyter notebook --allow-root &

Note: I changed the entry point to something more convenient for me than the default, which I believe is tf-notebook.

After optimizing, I run inference from the same Docker image after installing the TF Serving APIs and the headless opencv-python package on it. This is because we will be converting the optimized model to a TF Serving compatible model for inference.

    docker run --entrypoint=/bin/bash --env http_proxy=<my proxy> --env https_proxy=<my proxy> --runtime=nvidia -it --rm -p 8900:8500 -p 8901:8501 -v /usr/alex/:/coding --net=host tensorflow/tensorflow:1.13.0rc1-gpu-jupyter
    pip install tensorflow-serving-api
    pip install opencv-python==3.3.0.9
    cd coding
    python ssd_client_1.py -num_tests=1 -server=127.0.0.1:8500 -batch_size=1 -img_path='../examples/google1.jpg/'

## Step 1. Get the output node names in the TensorFlow graph

Why is this important? We need the output node names of the frozen graph in order to optimize it. Note that the TensorFlow version used here is TF 1.13.

    # To freeze the SavedModel
    # We need to freeze the model to do further optimisation on it
    import os

    import tensorflow as tf
    from tensorflow.python.saved_model import tag_constants
    from tensorflow.python.tools import freeze_graph
    from tensorflow.python import ops
    from tensorflow.tools.graph_transforms import TransformGraph


    def freeze_model(saved_model_dir, output_node_names, output_filename):
        # The frozen graph is written next to the SavedModel
        output_graph_filename = os.path.join(saved_model_dir, output_filename)
        initializer_nodes = ''
        freeze_graph.freeze_graph(
            input_saved_model_dir=saved_model_dir,
            output_graph=output_graph_filename,
            saved_model_tags=tag_constants.SERVING,
            output_node_names=output_node_names,
            initializer_nodes=initializer_nodes,
            input_graph=None,
            input_saver=False,
            input_binary=False,
            input_checkpoint=None,
            restore_op_name=None,
            filename_tensor_name=None,
            clear_devices=True,
            input_meta_graph=False,
        )

To find the output node names, we can either plot the model in TensorBoard and look at the output nodes, or print the nodes and grep for some keywords.

    # Source https://medium.com/google-cloud/optimizing-tensorflow-models-for-serving-959080e9ddbf
    def get_graph_def_from_file(graph_filepath):
        tf.reset_default_graph()
        with ops.Graph().as_default():
            with tf.gfile.GFile(graph_filepath, 'rb') as f:
                graph_def = tf.GraphDef()
                graph_def.ParseFromString(f.read())
                return graph_def

Let us use the above helper to print the input and output nodes. The input nodes come from a for loop:

    graph_def = get_graph_def_from_file('/coding/ssd_inception_v2_coco_2018_01_28/frozen_inference_graph.pb')
    for node in graph_def.node:
        if node.op == 'Placeholder':
            print(node)  # this will be the input node

and the output nodes from writing the graph in a format readable by TensorBoard:

    with tf.Session(graph=tf.Graph()) as session:
        mygraph = tf.import_graph_def(graph_def, name='')
        writer = tf.summary.FileWriter(logdir='/coding/log_tb/1', graph=session.graph)
        writer.flush()

Let us invoke TensorBoard.

    # If TensorBoard runs on a remote machine, forward the port first
    # and then open 127.0.0.1:6006 on your local machine:
    # ssh -L 6006:127.0.0.1:6006 root@<remoteip>
    tensorboard --logdir '/coding/log_tb/1'

From this, I could make out the output nodes. Note that if you are building the graph yourself, you don't need to go through this circus. I am doing it because I am using an open-sourced model with little documentation. Sometimes, for auto-converted or TF-imported graphs, the names are pretty long. You can then print the nodes in a for loop, as I did for Placeholder, and work out from the output shapes which node is which (for detections: class, score, rectangle coordinates).

    # These are the output names. Add an index, usually 0, for graph nodes.
    # You can print the node details by node name.
    output_node_names = ['detection_boxes', 'detection_scores', 'detection_classes', 'num_detections']
    outputs = ['detection_boxes:0', 'detection_scores:0', 'detection_classes:0', 'num_detections:0']
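
Besides TensorBoard, another way to shortlist candidate output nodes is to look for nodes that no other node consumes. The sketch below runs on a toy graph made of stand-in tuples so that it works without TensorFlow; the helper names are my own, and the assumption is that real `graph_def.node` entries can be passed in the same way because they also expose `name` and `input`. On a real graph the list may contain extra nodes that are not real outputs, so treat it as a starting point.

```python
from collections import namedtuple

# Stand-in for a NodeDef; the toy graph below is invented for illustration.
Node = namedtuple("Node", ["name", "op", "input"])


def producer_name(input_name):
    """Strip the control-dependency marker (^) and the output index (:N)."""
    return input_name.lstrip("^").split(":")[0]


def find_output_nodes(nodes):
    """Return names of nodes whose outputs are not consumed by any other node."""
    consumed = set()
    for node in nodes:
        for input_name in node.input:
            consumed.add(producer_name(input_name))
    return [node.name for node in nodes if node.name not in consumed]


toy_graph = [
    Node("image_tensor", "Placeholder", []),
    Node("features", "Conv2D", ["image_tensor:0"]),
    Node("detection_boxes", "Identity", ["features"]),
    Node("detection_scores", "Identity", ["features:1", "^features"]),
]
print(find_output_nodes(toy_graph))
```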

## Step 3. Optimise using TF Graph Transform Tools

The snippet below illustrates how you can optimize a graph after reading it from disk.

    # Source https://medium.com/google-cloud/optimizing-tensorflow-models-for-serving-959080e9ddbf
    # https://gist.github.com/lukmanr
    # Optimizing the graph via the TensorFlow library
    from tensorflow.tools.graph_transforms import TransformGraph


    def optimize_graph(model_dir, graph_filename, transforms, output_names, outname='optimized_model.pb'):
        input_names = ['input_image',]  # change this as per how you have saved the model
        graph_def = get_graph_def_from_file(os.path.join(model_dir, graph_filename))
        optimized_graph_def = TransformGraph(
            graph_def,
            input_names,
            output_names,
            transforms)
        tf.train.write_graph(optimized_graph_def,
                             logdir=model_dir,
                             as_text=False,
                             name=outname)
        print('Graph optimized!')

Let us use the above helper to optimize the graph, first with **quantize_weights** only.

    # Optimization without node quantization (weights only) - reduces the size of the model;
    # speed may actually be slower
    # see https://medium.com/google-cloud/optimizing-tensorflow-models-for-serving-959080e9ddbf
    transforms = ['remove_nodes(op=Identity)',
                  'merge_duplicate_nodes',
                  'strip_unused_nodes',
                  'fold_constants(ignore_errors=true)',
                  'fold_batch_norms',
                  'quantize_weights']  # this reduces the size, but there is no speedup; it actually slows down, see below

    optimize_graph('/coding/ssd_inception_v2_coco_2018_01_28', 'frozen_inference_graph.pb',
                   transforms, output_node_names, outname='optimized_model_small.pb')
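
To make the idea of weight quantization concrete, here is a simplified sketch of linear 8-bit quantization in plain Python. It is not the Graph Transform Tool's implementation; the function names and the weight values are made up. It only shows the principle: each float is stored as a code between 0 and 255 plus a shared minimum and step, and is converted back to a float at inference time.

```python
def quantize_to_uint8(weights):
    """Map floats linearly onto 0..255 using the min/max of the list."""
    low, high = min(weights), max(weights)
    if high <= low:
        raise ValueError("max must be larger than min to define a step size")
    scale = (high - low) / 255.0
    codes = [int(round((w - low) / scale)) for w in weights]
    return codes, low, scale


def dequantize(codes, low, scale):
    """Recover approximate float values from the 8-bit codes."""
    return [low + code * scale for code in codes]


weights = [-0.42, -0.1, 0.0, 0.07, 0.3, 0.58]  # made-up example values
codes, low, scale = quantize_to_uint8(weights)
restored = dequantize(codes, low, scale)
worst_error = max(abs(a - b) for a, b in zip(weights, restored))
print(codes)
print("worst error: %.5f (half a step is %.5f)" % (worst_error, scale / 2))
```

One byte per weight instead of four is where the size reduction comes from, and the extra conversion back to float is work that the original graph did not have to do.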

Let's then convert the optimized model to a TF Serving compatible format.

    # Let's convert this to a TF Serving compatible model
    convert_graph_def_to_saved_model('/coding/ssd_inception_v2_coco_2018_01_28/2',
                                     '/coding/ssd_inception_v2_coco_2018_01_28/optimized_model_small.pb', outputs)

The helper that does this is given below.

    # Source https://medium.com/google-cloud/optimizing-tensorflow-models-for-serving-959080e9ddbf
    # https://gist.github.com/lukmanr
    def convert_graph_def_to_saved_model(export_dir, graph_filepath, outputs):
        graph_def = get_graph_def_from_file(graph_filepath)
        with tf.Session(graph=tf.Graph()) as session:
            tf.import_graph_def(graph_def, name='')
            tf.saved_model.simple_save(
                session,
                export_dir,
                # change input_image to node.name if you know the name
                inputs={'input_image': session.graph.get_tensor_by_name('{}:0'.format(node.name))
                        for node in graph_def.node if node.op == 'Placeholder'},
                outputs={t: session.graph.get_tensor_by_name(t) for t in outputs}
            )
            print('Optimized graph converted to SavedModel!')

And then with **quantize_weights** and **quantize_nodes**.

**This should also convert the calculations to lower precision, but it does not work as of now.**

*"This process converts all the operations in the graph that have eight-bit quantized equivalents and leaves the rest in floating point. Only a subset of ops are supported and on many platforms, the quantized code may actually be slower than the float equivalents, but this is a way of increasing performance substantially when all the circumstances are right."*

*https://github.com/tensorflow/tensorflow/tree/master/tensorflow/tools/graph_transforms#optimizing-for-deployment*

    transforms = ['add_default_attributes',
                  'strip_unused_nodes',
                  'remove_nodes(op=Identity, op=CheckNumerics)',
                  'fold_constants(ignore_errors=true)',
                  'fold_batch_norms',
                  'fold_old_batch_norms',
                  'quantize_weights',
                  'quantize_nodes',
                  'strip_unused_nodes',
                  'sort_by_execution_order']

    optimize_graph('/coding/ssd_inception_v2_coco_2018_01_28', 'frozen_inference_graph.pb',
                   transforms, output_node_names, outname='optimized_model_weight_quant.pb')

**However, this does not work: inference using this optimized model gives the error below.** I had tried with a Keras model earlier and got a different error message. This seems to be a bug, since this model is a pure TensorFlow model and I have not changed anything in it.

    ('Got an error', <_Rendezvous of RPC that terminated with:
        status = StatusCode.INVALID_ARGUMENT
        details = "input_max_range must be larger than input_min_range.
        [[{{node Postprocessor/BatchMultiClassNonMaxSuppression/map/while/MultiClassNonMaxSuppression/ClipToWindow_87/Area/mul_eightbit/Postprocessor/BatchMultiClassNonMaxSuppression/map/while/MultiClassNonMaxSuppression/ClipToWindow_87/Area/sub_1/quantize}}]]
        [[{{node Postprocessor/BatchMultiClassNonMaxSuppression/map/while/MultiClassNonMaxSuppression/zeros_like_83}}]]"
        debug_error_string = "{"created":"@1555723203.356344655","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1036,"grpc_message":"input_max_range must be larger than input_min_range.\n\t [[{{node Postprocessor/BatchMultiClassNonMaxSuppression/map/while/MultiClassNonMaxSuppression/ClipToWindow_87/Area/mul_eightbit/Postprocessor/BatchMultiClassNonMaxSuppression/map/while/MultiClassNonMaxSuppression/ClipToWindow_87/Area/sub_1/quantize}}]]\n\t [[{{node Postprocessor/BatchMultiClassNonMaxSuppression/map/while/MultiClassNonMaxSuppression/zeros_like_83}}]]","grpc_status":3}"
    >)
    Response Received Exiting

## Step 4. Optimise using NVIDIA TensorRT

The base references for this are these two posts:

https://docs.nvidia.com/deeplearning/dgx/integrate-tf-trt/index.html

https://developers.googleblog.com/2018/03/tensorrt-integration-with-tensorflow.html

Inference with the TF-TRT `SavedModel` workflow: we are using the TF Serving model.

    import tensorflow.contrib.tensorrt as trt

    tf.reset_default_graph()
    graph = tf.Graph()

    # Create a TensorRT inference graph from a SavedModel:
    with graph.as_default():
        with tf.Session() as sess:
            trt_graph = trt.create_inference_graph(
                input_graph_def=None,
                outputs=outputs,
                input_saved_model_dir='/coding/ssd_inception_v2_coco_2018_01_28/01',
                input_saved_model_tags=['serve'],
                max_batch_size=1,
                max_workspace_size_bytes=7000000000,
                precision_mode='FP16')
                # precision_mode='FP32')
                # precision_mode='INT8')
            # Import the converted graph; its nodes get the default 'import/' prefix
            output_node = tf.import_graph_def(trt_graph, return_elements=outputs)
            # sess.run(output_node)
            tf.saved_model.simple_save(
                sess,
                "/coding/ssd_inception_v2_coco_2018_01_28/4",
                inputs={'input_image': graph.get_tensor_by_name('{}:0'.format(node.name))
                        for node in graph.as_graph_def().node if node.op == 'Placeholder'},
                outputs={t: graph.get_tensor_by_name('import/' + t) for t in outputs}
            )

**Inference with TF-TRT `Frozen` graph workflow:**

Reference https://medium.com/tensorflow/speed-up-tensorflow-inference-on-gpus-with-tensorrt-13b49f3db3fa

    # Let's load a frozen model, reset the graph and use it
    gdef = get_graph_def_from_file('/coding/ssd_inception_v2_coco_2018_01_28/frozen_inference_graph.pb')
    tf.reset_default_graph()
    graph = tf.Graph()

    # Create a TensorRT inference graph from a frozen GraphDef:
    with graph.as_default():
        with tf.Session() as sess:
            trt_graph = trt.create_inference_graph(
                input_graph_def=gdef,
                outputs=outputs,
                max_batch_size=8,
                max_workspace_size_bytes=7000000000,
                is_dynamic_op=True,
                # precision_mode='FP16')
                # precision_mode='FP32')
                precision_mode='INT8')
            # Import the converted graph; its nodes get the default 'import/' prefix
            output_node = tf.import_graph_def(trt_graph, return_elements=outputs)
            # sess.run(output_node)
            tf.saved_model.simple_save(
                sess,
                "/coding/ssd_inception_v2_coco_2018_01_28/5",
                inputs={'input_image': graph.get_tensor_by_name('{}:0'.format(node.name))
                        for node in graph.as_graph_def().node if node.op == 'Placeholder'},
                outputs={t: graph.get_tensor_by_name('import/' + t) for t in outputs}
            )

## Step 5: Pause and check the models

The sizes of the various models are given below. You can see that the model size shrinks to between a quarter and a third of the original after the TF Graph Transform optimizations, whereas the TensorRT-converted models are larger than the original.

    Original model ('/coding/ssd_inception_v2_coco_2018_01_28/frozen_inference_graph.pb', '')
    Model size: 99591.409 KB
    Variables size: 0.0 KB
    Total Size: 99591.409 KB
    ---------
    Tensorflow Transform Optimised model Weights Quantised ('/coding/ssd_inception_v2_coco_2018_01_28/2/saved_model.pb', '')
    Model size: 26193.27 KB
    Variables size: 0.0 KB
    Total Size: 26193.27 KB
    ---------
    Tensorflow Transform Optimised model Weights and Nodes Quantised ('/coding/ssd_inception_v2_coco_2018_01_28/3/saved_model.pb', '')
    Model size: 29265.284 KB
    Variables size: 0.0 KB
    Total Size: 29265.284 KB
    ---------
    NVIDIA RT Optimised model FP16 ('/coding/ssd_inception_v2_coco_2018_01_28/4/saved_model.pb', '')
    Model size: 178564.229 KB
    Variables size: 0.0 KB
    Total Size: 178564.229 KB
    ---------
    NVIDIA RT Optimised model INT8 ('/coding/ssd_inception_v2_coco_2018_01_28/5/saved_model.pb', '')
    Model size: 178152.834 KB
    Variables size: 0.0 KB
    Total Size: 178152.834 KB
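
Numbers like the ones above can be collected with a few lines of standard-library Python. The sketch below is one possible way to do it and not necessarily the helper used to print the output above; the function names, the `variables` sub-directory convention and the 1 KB = 1000 bytes choice are assumptions. The demo at the end runs on a throwaway directory so that it works without a real model.

```python
import os
import tempfile


def file_size_kb(path):
    """Size of a single file in KB (1 KB = 1000 bytes here)."""
    return os.path.getsize(path) / 1000.0


def describe_model_size(model_dir, model_file="saved_model.pb"):
    """Return (model_kb, variables_kb, total_kb) for a model directory."""
    model_kb = file_size_kb(os.path.join(model_dir, model_file))
    variables_kb = 0.0
    variables_dir = os.path.join(model_dir, "variables")
    if os.path.isdir(variables_dir):
        for name in os.listdir(variables_dir):
            full_path = os.path.join(variables_dir, name)
            if os.path.isfile(full_path):
                variables_kb += file_size_kb(full_path)
    return model_kb, variables_kb, model_kb + variables_kb


# Demo on a temporary directory holding a dummy 2500-byte "model" file.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "saved_model.pb"), "wb") as handle:
    handle.write(b"\0" * 2500)
print(describe_model_size(demo_dir))
```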

## Step 6: Ready the TF Serving container to serve these models

Note the containers we are using here. Client (this is pasted from Step 0):

    docker run --entrypoint=/bin/bash --env http_proxy=<my proxy> --env https_proxy=<my proxy> --runtime=nvidia -it --rm -p 8900:8500 -p 8901:8501 -v /usr/alex/:/coding --net=host tensorflow/tensorflow:1.13.0rc1-gpu-jupyter
    pip install tensorflow-serving-api
    pip install opencv-python==3.3.0.9
    cd coding
    python ssd_client_1.py -num_tests=1 -server=127.0.0.1:8500 -batch_size=1 -img_path='../examples/google1.jpg/'

Server (this runs on the V100 32 GB Linux machine):

    docker run --net=host --runtime=nvidia -it --rm -p 8900:8500 -p 8901:8501 -v /usr/alex/:/models tensorflow/serving:1.13.0-gpu --rest_api_port=0 --enable_batching=true --model_config_file=/models/ssd_inception_v3_coco.json

where the **config JSON** is like below. I have placed the different models in folders under "/models/ssd_inception_v2_coco_2018_01_28/": 01 is the original model, 2 the TF Graph Transform weight-quantized model, 3 the TF Graph Transform weight- and node-quantized model, 4 the TensorRT FP16 model and 5 the TensorRT INT8 model. I just change the version in the file to load a different servable for each test.

    model_config_list {
      config {
        name: "ssd_inception_v2_coco",
        base_path: "/models/ssd_inception_v2_coco_2018_01_28/",
        model_version_policy: {
          specific: {
            versions:[01]
          }
        },
        model_platform:"tensorflow",
      }
    }
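
Since only the version changes between tests, the config file can also be generated instead of edited by hand. This is a small convenience sketch; `render_model_config` and the template are my own names and simply mirror the config shown above.

```python
CONFIG_TEMPLATE = """model_config_list {{
  config {{
    name: "{name}",
    base_path: "{base_path}",
    model_version_policy: {{
      specific: {{
        versions: [{version}]
      }}
    }},
    model_platform: "tensorflow",
  }}
}}
"""


def render_model_config(name, base_path, version):
    """Return the model config text pinned to a single version directory."""
    return CONFIG_TEMPLATE.format(name=name, base_path=base_path, version=int(version))


# Version 4 is the TensorRT FP16 model in the folder layout described above.
config_text = render_model_config(
    "ssd_inception_v2_coco", "/models/ssd_inception_v2_coco_2018_01_28/", 4)
print(config_text)
```

The resulting text would be written to the file that `--model_config_file` points to before restarting the server.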

## Step 7: Write a TF Serving Client for tests

I have written about this in detail in a previous post.

The SavedModel of the SSD looks like below. You can use the SavedModel CLI to view it.

    saved_model_cli show --dir '/coding/ssd_inception_v2_coco_2018_01_28/3' --all

    MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:

    signature_def['serving_default']:
      The given SavedModel SignatureDef contains the following input(s):
        inputs['input_image'] tensor_info:
            dtype: DT_UINT8
            shape: (-1, -1, -1, 3)
            name: image_tensor:0
      The given SavedModel SignatureDef contains the following output(s):
        outputs['detection_boxes:0'] tensor_info:
            dtype: DT_FLOAT
            shape: unknown_rank
            name: detection_boxes:0
        outputs['detection_classes:0'] tensor_info:
            dtype: DT_FLOAT
            shape: unknown_rank
            name: detection_classes:0
        outputs['detection_scores:0'] tensor_info:
            dtype: DT_FLOAT
            shape: unknown_rank
            name: detection_scores:0
        outputs['num_detections:0'] tensor_info:
            dtype: DT_FLOAT
            shape: unknown_rank
            name: num_detections:0
      Method name is: tensorflow/serving/predict

Note that here the input and output names are slightly different from those of the original model, whose input is 'inputs' and whose outputs are 'detection_boxes', 'detection_classes', 'detection_scores' and 'num_detections' (without the :0 part). This is a deficiency in the conversion scripts that I have used, but it can be rectified easily.
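
One possible way to rectify the output names is to use the node name without the `:0` suffix as the signature key, while still looking up the tensor by its full name. The sketch below shows only the name handling in plain Python; `signature_key` is a hypothetical helper, and plugging the resulting keys into the `outputs=` argument of the conversion helper is left as an untested assumption.

```python
def signature_key(tensor_name):
    """'detection_boxes:0' -> 'detection_boxes' (drop the output index)."""
    return tensor_name.split(":")[0]


outputs = ['detection_boxes:0', 'detection_scores:0',
           'detection_classes:0', 'num_detections:0']

# Keys are what a client sees in the signature; values are the tensor names
# that would be looked up in the graph.
key_to_tensor_name = {signature_key(t): t for t in outputs}
print(sorted(key_to_tensor_name))
```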

Original model

    root@ndn-oe:/coding/tfclient# saved_model_cli show --dir /coding/ssd_inception_v2_coco_2018_01_28/01/ --all

    MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:

    signature_def['serving_default']:
      The given SavedModel SignatureDef contains the following input(s):
        inputs['inputs'] tensor_info:
            dtype: DT_UINT8
            shape: (-1, -1, -1, 3)
            name: image_tensor:0
      The given SavedModel SignatureDef contains the following output(s):
        outputs['detection_boxes'] tensor_info:
            dtype: DT_FLOAT
            shape: (-1, 100, 4)
            name: detection_boxes:0
        outputs['detection_classes'] tensor_info:
            dtype: DT_FLOAT
            shape: (-1, 100)
            name: detection_classes:0
        outputs['detection_scores'] tensor_info:
            dtype: DT_FLOAT
            shape: (-1, 100)
            name: detection_scores:0
        outputs['num_detections'] tensor_info:
            dtype: DT_FLOAT
            shape: (-1)
            name: num_detections:0
      Method name is: tensorflow/serving/predict

The TF Serving client is given here: https://gist.github.com/alexcpn/d7c28230af437dafb0d2cc7f50140eed

The rest of the imports are here: https://github.com/alexcpn/tf_serving_clients. The client on the gist is slightly different in the names of the inputs and outputs, which is why it is on a gist.

The image file used for the test is https://github.com/fizyr/keras-retinanet/blob/master/examples/000000008021.jpg
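
A single timed request can be noisy, and the first call to a freshly loaded model may include one-off start-up costs. A small timing harness along the following lines could be wrapped around the client's request; it is a generic sketch with made-up names, and `fake_request` merely stands in for the real predict call.

```python
import time


def time_calls(call, num_tests, warmup=1):
    """Call `call()` repeatedly and return per-call timings in seconds."""
    for _ in range(warmup):
        call()  # warm-up runs are executed but not timed
    timings = []
    for _ in range(num_tests):
        start = time.time()
        call()
        timings.append(time.time() - start)
    return timings


def summarize(timings):
    """Return (min, mean, max) of a non-empty list of timings."""
    return min(timings), sum(timings) / len(timings), max(timings)


def fake_request():
    """Stand-in for a real request; replace with the client's predict call."""
    time.sleep(0.01)


timings = time_calls(fake_request, num_tests=5)
print("min %.4fs mean %.4fs max %.4fs" % summarize(timings))
```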

## Step 8: The Output from various models

Basically, there is hardly any difference in speed between the optimized and non-optimized models. The batch size is one here.

More details below.

Original model. Invocation:

    coding/tfclient# python ssd_client_1.py -num_tests=1 -server=127.0.0.1:8500 -batch_size=1 -img_path='../examples/000000008021.jpg'
    ('Image path', '../examples/000000008021.jpg')
    ('original image shape=', (480, 640, 3))
    ('Input-s shape', (1, 800, 1066, 3))

The last line is the size of the input tensor. Output:

    ('Label', u'person', ' at ', array([412, 171, 740, 624]), ' Score ', 0.9980476)
    ('Label', u'person', ' at ', array([  6, 423, 518, 788]), ' Score ', 0.94931936)
    ('Label', u'person', ' at ', array([ 732,  473, 1065,  793]), ' Score ', 0.88419175)
    ('Label', u'tie', ' at ', array([529, 337, 565, 494]), ' Score ', 0.40442815)
    ('Time for ', 1, ' is ', 0.5993821620941162)

TensorFlow Transform optimised model, weights quantized:

    ('Label', u'person', ' at ', array([409, 174, 741, 626]), ' Score ', 0.99797523)
    ('Label', u'person', ' at ', array([  4, 424, 524, 790]), ' Score ', 0.9549346)
    ('Label', u'person', ' at ', array([ 725,  472, 1064,  793]), ' Score ', 0.8900732)
    ('Label', u'tie', ' at ', array([527, 338, 566, 494]), ' Score ', 0.3943166)
    ('Time for ', 1, ' is ', 0.6182711124420)

The time is higher here: the model size is reduced, but during inference the conversion back to higher precision has to be done.

*You should see that the size of the output graph is about a quarter of the original. The downside to this approach compared to round_weights is that extra decompression ops are inserted to convert the eight-bit values back into floating point, but optimizations in TensorFlow's runtime should ensure these results are cached and so you shouldn't see the graph run any more slowly.* (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/graph_transforms/README.md)

TensorRT FP16 converted model:

    ('Label', u'person', ' at ', array([412, 171, 740, 624]), ' Score ', 0.9980476)
    ('Label', u'person', ' at ', array([  6, 423, 518, 788]), ' Score ', 0.9493193)
    ('Label', u'person', ' at ', array([ 732,  473, 1065,  793]), ' Score ', 0.8841917)
    ('Label', u'tie', ' at ', array([529, 337, 565, 494]), ' Score ', 0.40442812)
    ('Time for ', 1, ' is ', 0.5885560512542725)

I was hoping this would be half the original value, that is, twice as fast. But during optimization TensorRT reported that it could convert only a few of the supported operations. The log says "There are 3962 ops of 51 different types in the graph that are not converted to TensorRT", and the list includes Conv2D, though the Convolution operation is shown as supported here: https://docs.nvidia.com/deeplearning/sdk/tensorrt-support-matrix/index.html. I have raised a bug for this.

`2019-04-14 08:32:31.357592: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:`

2019-04-14 08:32:31.357620: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0

2019-04-14 08:32:31.357645: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N

2019-04-14 08:32:31.358154: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30480 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:b3:00.0, compute capability: 7.0)

2019-04-14 08:32:34.582872: I tensorflow/core/grappler/devices.cc:51] Number of eligible GPUs (core count >= 8): 1

2019-04-14 08:32:34.583019: I tensorflow/core/grappler/clusters/single_machine.cc:359] Starting new session

2019-04-14 08:32:34.583578: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0

2019-04-14 08:32:34.583610: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:

2019-04-14 08:32:34.583636: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0

2019-04-14 08:32:34.583657: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N

2019-04-14 08:32:34.583986: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30480 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:b3:00.0, compute capability: 7.0)

2019-04-14 08:32:36.713848: I tensorflow/contrib/tensorrt/segment/segment.cc:443] **There are 3962 ops of 51 different types in the graph that are not converted to TensorRT:** TopKV2, NonMaxSuppressionV2, TensorArrayWriteV3, Const, Squeeze, ResizeBilinear, Maximum, Where, Add, Placeholder, Switch, TensorArrayGatherV3, NextIteration, Greater, TensorArraySizeV3, NoOp, TensorArrayV3, LoopCond, Less, StridedSlice, TensorArrayScatterV3, ExpandDims, Exit, Cast, Identity, Shape, RealDiv, TensorArrayReadV3, Reshape, Merge, Enter, Range, **Conv2D**, Mul, Equal, Sub, Minimum, Tile, Pack, Split, ZerosLike, ConcatV2, Size, Unpack, Assert, DataFormatVecPermute, Transpose, Gather, Exp, Slice, Fill, (For more information see https://docs.nvidia.com/deeplearning/dgx/integrate-tf-trt/index.html#support-ops).

2019-04-14 08:32:36.848171: I tensorflow/contrib/tensorrt/convert/convert_graph.cc:913] Number of TensorRT candidate segments: 4

2019-04-14 08:32:37.129266: W tensorflow/contrib/tensorrt/convert/convert_nodes.cc:3710] Validation failed for TensorRTInputPH_0 and input slot 0: Input tensor with shape [?,?,?,3] has an unknown non-batch dimension at dim 1

2019-04-14 08:32:37.129330: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:1021] TensorRT node TRTEngineOp_0 added for segment 0 consisting of 707 nodes failed: Invalid argument: Validation failed for TensorRTInputPH_0 and input slot 0: Input tensor with shape [?,?,?,3] has an unknown non-batch dimension at dim 1. Fallback to TF...

2019-04-14 08:32:37.129838: W tensorflow/contrib/tensorrt/convert/convert_nodes.cc:3710] Validation failed for TensorRTInputPH_0 and input slot 0: Input tensor with shape [?,546,?,?] has an unknown non-batch dimension at dim 2

2019-04-14 08:32:37.129859: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:1021] TensorRT node TRTEngineOp_1 added for segment 1 consisting of 3 nodes failed: Invalid argument: Validation failed for TensorRTInputPH_0 and input slot 0: Input tensor with shape [?,546,?,?] has an unknown non-batch dimension at dim 2. Fallback to TF...

2019-04-14 08:32:38.309554: I tensorflow/contrib/tensorrt/convert/convert_graph.cc:1015] TensorRT node TRTEngineOp_2 added for segment 2 consisting of 3 nodes succeeded.

2019-04-14 08:32:38.420585: I tensorflow/contrib/tensorrt/convert/convert_graph.cc:1015] TensorRT node TRTEngineOp_3 added for segment 3 consisting of 4 nodes succeeded.

2019-04-14 08:32:38.644767: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:581] Optimization results for grappler item: tf_graph

2019-04-14 08:32:38.644837: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:583] constant folding: Graph size after: 6411 nodes (-1212), 10503 edges (-1352), time = 848.996ms.

2019-04-14 08:32:38.644858: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:583] layout: Graph size after: 6442 nodes (31), 10535 edges (32), time = 225.361ms.

2019-04-14 08:32:38.644874: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:583] constant folding: Graph size after: 6432 nodes (-10), 10535 edges (0), time = 559.352ms.

2019-04-14 08:32:38.644920: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:583] TensorRTOptimizer: Graph size after: 6427 nodes (-5), 10530 edges (-5), time = 2087.5769ms.
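To put numbers on the log above, here is a minimal sketch in plain Python that tallies how many nodes actually ended up inside a TensorRT engine and how many fell back to TensorFlow. It is not part of TF-TRT. The helper names `summarize_segments` and `first_unknown_non_batch_dim` are made up for illustration, and the shape check only mirrors the rule as the warning message words it, not TensorRT's real validation code.

```python
import re

# Excerpt of the TF-TRT conversion log above (timestamps and file names dropped).
LOG = """\
TensorRT node TRTEngineOp_0 added for segment 0 consisting of 707 nodes failed: Invalid argument: Validation failed for TensorRTInputPH_0 and input slot 0: Input tensor with shape [?,?,?,3] has an unknown non-batch dimension at dim 1. Fallback to TF...
TensorRT node TRTEngineOp_1 added for segment 1 consisting of 3 nodes failed: Invalid argument: Validation failed for TensorRTInputPH_0 and input slot 0: Input tensor with shape [?,546,?,?] has an unknown non-batch dimension at dim 2. Fallback to TF...
TensorRT node TRTEngineOp_2 added for segment 2 consisting of 3 nodes succeeded.
TensorRT node TRTEngineOp_3 added for segment 3 consisting of 4 nodes succeeded.
"""

SEGMENT_RE = re.compile(
    r"TensorRT node (\S+) added for segment (\d+) "
    r"consisting of (\d+) nodes (succeeded|failed)"
)


def summarize_segments(log_text):
    """Return (converted_nodes, fallback_nodes) counted from the log text."""
    converted = fallback = 0
    for _name, _segment, nodes, status in SEGMENT_RE.findall(log_text):
        if status == "succeeded":
            converted += int(nodes)
        else:
            fallback += int(nodes)
    return converted, fallback


def first_unknown_non_batch_dim(shape):
    """Index of the first unknown ('?') dimension after the batch dimension, or None."""
    for index, dim in enumerate(shape[1:], start=1):
        if dim == "?":
            return index
    return None


converted, fallback = summarize_segments(LOG)
print("nodes converted to TensorRT:", converted)  # 7
print("nodes that fell back to TF:", fallback)    # 710
print(first_unknown_non_batch_dim(["?", "?", "?", "3"]))    # 1
print(first_unknown_non_batch_dim(["?", "546", "?", "?"]))  # 2
```

Of the four candidate segments, only the two tiny ones (3 and 4 nodes) were converted. The 707-node segment, which is where most of the work would be, was rejected because its input has unknown height and width.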

**TensorRT INT8 converted model**

One can see from the V100 server logs that the INT8 calibration threads start on the GPU, which I took as a sign of some Tensor Core magic happening:

`2019-04-20 01:30:39.563827: I external/org_tensorflow/tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:574] Starting calibration thread on device 0, Calibration Resource @ 0x7f4c341ac570`

`2019-04-20 01:30:39.563982: I external/org_tensorflow/tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:574] Starting calibration thread on device 0, Calibration Resource @ 0x7f4ce8008e60`

('Label', u'person', ' at ', array([412, 171, 740, 624]), ' Score ', 0.9980476)

('Label', u'person', ' at ', array([ 6, 423, 518, 788]), ' Score ', 0.9493195)

('Label', u'person', ' at ', array([ 732, 473, 1065, 793]), ' Score ', 0.8841919)

('Label', u'tie', ' at ', array([529, 337, 565, 494]), ' Score ', 0.40442798)

('Time for ', 1, ' is ', 0.5967140197753906)

**With batch size 2 the TensorRT engine op fails a tensor size check and the server aborts**

`python ssd_client_1.py -num_tests=1 -server=127.0.0.1:8500 -batch_size=2 -img_path='../examples/000000008021.jpg'`

2019-04-20 01:34:25.042337: F external/org_tensorflow/tensorflow/contrib/tensorrt/kernels/**trt_engine_op.cc:227] Check failed: t.TotalBytes() == device_tensor->TotalBytes()** (788424 vs. 394212)

2019-04-20 01:34:25.042373: F external/org_tensorflow/tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:227] Check failed: t.TotalBytes() == device_tensor->TotalBytes() (34656 vs. 17328)

/usr/bin/tf_serving_entrypoint.sh: line 3: 6 Aborted (core dumped)
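A quick arithmetic check on the byte counts in the two `Check failed` lines is worth doing before blaming memory. The sketch below uses only numbers copied from the log. Reading the result as "the engine expected tensors for batch size 1" is my assumption, not something the log states.

```python
# Byte counts copied from the two "Check failed" lines above: (actual, expected).
MISMATCHES = [(788424, 394212), (34656, 17328)]
REQUESTED_BATCH_SIZE = 2  # the -batch_size=2 passed to the client

for actual, expected in MISMATCHES:
    ratio, remainder = divmod(actual, expected)
    print(actual, "/", expected, "=", ratio, "remainder", remainder)
    # An exact multiple equal to the batch size points at a shape mismatch
    # rather than at the GPU running out of memory.
    assert remainder == 0 and ratio == REQUESTED_BATCH_SIZE
```

Both tensors are exactly twice the size the engine op expected, which matches the requested batch size of 2. That looks more like a batch-size mismatch between the converted engine and the request than like memory exhaustion.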

## Results from other models (and comparison with different GPUs)

Here are some results from other tests and models. The details are in this spreadsheet: https://docs.google.com/spreadsheets/d/1Sl7K6sa96wub1OXcneMk1txthQfh63b0H5mwygyVQlE/edit?usp=sharing

**Model: Resnet_50, FP32 and FP16**

You can see that there is a slight difference: the V100 32 GB takes slightly less time than the consumer-grade GTX 1070 8 GB. As the batch size increases, the larger memory of the V100 stands out, but not its number of CUDA cores. As is noted in other blogs, simply having more CUDA cores does not automatically mean that inference will run faster. It may also depend on memory and on the characteristics of the model.

**Model: Retinanet**

One can see here that there is not much difference. This was actually my first experiment, but it was a Keras model that was converted to a TF frozen model and then optimised. So I thought I might get better results from a model written in pure TF, like SSD, but that did not make much difference either.

**Summary**

One can see that there are no drastic improvements in inference time between the models. Also, TF GraphTransform for model quantization has not worked for me, neither in this model nor in one other model I tried; I will raise a bug for that. TensorRT is better, but it is only able to convert a few layers to lower precision. I have raised a bug/clarification for this, and if that gets resolved, hopefully we can see the models run twice as fast, as advertised for Tensor Cores.

## Main References

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/graph_transforms/README.md

https://medium.com/google-cloud/optimizing-tensorflow-models-for-serving-959080e9ddbf

https://colab.research.google.com/drive/1wQpWoc40kf__WSjfTqDaReMx6fFjUn48

**Other related posts**