Optimizing TensorFlow Serving performance with NVIDIA TensorRT

Posted by Guangda Lai, Gautam Vasudevan, Abhijit Karmarkar, Smit Hinsu

TensorFlow Serving is a flexible, high-performance serving system for machine learning models, and NVIDIA TensorRT is a platform for high-performance deep learning inference. By combining the two, users can get better GPU inference performance in a simple way. The TensorFlow team worked with NVIDIA to add initial support for TensorRT in TensorFlow 1.7, and since then we’ve been working closely together to improve the TensorFlow-TensorRT integration (referred to as TF-TRT). It is now ready in TensorFlow Serving 1.13 and coming soon to TensorFlow 2.0.

In a previous blog post, we showed how to use TensorFlow Serving with Docker, and in this post we’ll show how easy it is to run a TF-TRT converted model the same way. As before, let’s try putting the ResNet model into production. All the examples below run on a workstation with a Titan-V GPU.

Serve ResNet with TensorFlow Serving on GPU

For this exercise, we will simply download a pre-trained ResNet SavedModel:

$ mkdir /tmp/resnet
$ curl -s https://storage.googleapis.com/download.tensorflow.org/models/official/20181001_resnet/savedmodels/resnet_v2_fp32_savedmodel_NHWC_jpg.tar.gz | tar --strip-components=2 -C /tmp/resnet -xvz
$ ls /tmp/resnet
1538687457

In the previous blog post, we demonstrated how to serve the model using a TensorFlow Serving CPU Docker image. Here, let’s run the GPU Docker image (see here for instructions) to serve and test this model on a GPU:

$ docker pull tensorflow/serving:latest-gpu
$ docker run --rm --runtime=nvidia -p 8501:8501 --name tfserving_resnet \
    -v /tmp/resnet:/models/resnet -e MODEL_NAME=resnet -t tensorflow/serving:latest-gpu &

… server.cc:313] Running gRPC ModelServer at 0.0.0.0:8500 …
… server.cc:333] Exporting HTTP/REST API at:localhost:8501 …
$ curl -o /tmp/resnet/resnet_client.py https://raw.githubusercontent.com/tensorflow/serving/master/tensorflow_serving/example/resnet_client.py
$ python /tmp/resnet/resnet_client.py
Prediction class: 286, avg latency: 18.0469 ms

The docker run command launches a TensorFlow Serving server that serves the downloaded SavedModel from /tmp/resnet and exposes REST API port 8501 on the host. The resnet_client.py script sends some images to the server and gets back predictions. Now let’s terminate the TensorFlow Serving container to release the GPU resources:
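For reference, the request the client builds can be sketched in a few lines of Python. This is a minimal sketch, not the actual resnet_client.py: it assumes this ResNet SavedModel accepts JPEG bytes, which TensorFlow Serving’s REST API expects base64-encoded under a "b64" key inside an "instances" list.

```python
import base64
import json

# TensorFlow Serving's REST predict endpoint for a model named "resnet".
SERVER_URL = "http://localhost:8501/v1/models/resnet:predict"

def make_predict_request(jpeg_bytes):
    """Build the JSON body for a predict request carrying one JPEG image."""
    # Binary inputs are passed base64-encoded under a "b64" key.
    return json.dumps({
        "instances": [{"b64": base64.b64encode(jpeg_bytes).decode("utf-8")}]
    })

# Sending it would look like this (requires a running server):
#   import requests
#   response = requests.post(SERVER_URL, data=make_predict_request(image_data))
#   prediction = response.json()["predictions"][0]
```

The same payload works unchanged against the TF-TRT converted model served later in this post, since conversion does not change the model’s serving signature.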

$ docker kill tfserving_resnet

Convert and serve the model with TF-TRT

Now that we have a working model, to get the benefits of TensorRT we need to convert it to a model that runs its operations using TensorRT. We do this by running the conversion command inside the TensorFlow GPU Docker container:

$ docker pull tensorflow/tensorflow:latest-gpu
$ docker run --rm --runtime=nvidia -it \
-v /tmp:/tmp tensorflow/tensorflow:latest-gpu \
/usr/local/bin/saved_model_cli convert \
--dir /tmp/resnet/1538687457 \
--output_dir /tmp/resnet_trt/1538687457 \
--tag_set serve \
tensorrt --precision_mode FP32 --max_batch_size 1 --is_dynamic_op True

Here, we run the saved_model_cli command-line tool, which has built-in support for TF-TRT conversion. The --dir and --output_dir parameters tell it where to find the SavedModel and where to write the converted SavedModel, and --tag_set tells it which graph in the SavedModel to convert. We then explicitly invoke the TF-TRT converter by passing tensorrt on the command line and specifying its configuration:

  • --precision_mode tells the converter which precision to use; currently, it supports only FP32 and FP16.
  • --max_batch_size specifies the maximum batch size of the input. The converter requires that all tensors handled by TensorRT have their first dimension as the batch dimension, and this parameter tells it the maximum value to expect during inference. If the actual maximum batch size during inference is known and matches this value, the converted model will be optimal. Note that the converted model cannot handle inputs with a batch size larger than specified here, but smaller batch sizes are fine.
  • --is_dynamic_op tells it to do the actual conversion when the model runs. This is because TensorRT requires all shapes to be known at conversion time. The tensors of the ResNet model used in this example don’t have fixed shapes, and that’s why we need this parameter.
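To make the batch-size constraint concrete: a client (or serving-side shim) feeding a model converted with --max_batch_size N must never send a batch larger than N. A hypothetical helper that chunks a large request into acceptable sub-batches might look like this (illustrative only, not part of TF-TRT or the client script):

```python
def chunk_batch(instances, max_batch_size):
    """Split a list of inputs into sub-batches the converted model can accept.

    A TF-TRT model converted with --max_batch_size N rejects inputs whose
    first (batch) dimension exceeds N; smaller batches are fine.
    """
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    return [instances[i:i + max_batch_size]
            for i in range(0, len(instances), max_batch_size)]

# With --max_batch_size 1, a 3-image request becomes 3 single-image requests:
# chunk_batch(["img_a", "img_b", "img_c"], 1)
#   -> [["img_a"], ["img_b"], ["img_c"]]
```

In practice you would simply pick a --max_batch_size that matches the largest batch your clients actually send.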

As simple as before, we can now serve the TF-TRT converted model with Docker by just pointing it at the new model directory:

$ docker run --rm --runtime=nvidia -p 8501:8501 \
--name tfserving_resnet \
-v /tmp/resnet_trt:/models/resnet \
-e MODEL_NAME=resnet \
-t tensorflow/serving:latest-gpu &

… server.cc:313] Running gRPC ModelServer at 0.0.0.0:8500 …
… server.cc:333] Exporting HTTP/REST API at:localhost:8501 …

And send requests to it:

$ python /tmp/resnet/resnet_client.py
Prediction class: 286, avg latency: 15.0287 ms
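The two latency figures reported above work out to roughly a 17% reduction in average latency on this particular setup; a quick sanity check of the arithmetic:

```python
# Average latencies reported by resnet_client.py in the two runs above (ms).
baseline_ms = 18.0469   # plain SavedModel on GPU
tftrt_ms = 15.0287      # TF-TRT converted model

reduction = (baseline_ms - tftrt_ms) / baseline_ms
print(f"Latency reduction: {reduction:.1%}")  # prints "Latency reduction: 16.7%"
```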

At last, feel free to kill the container:

$ docker kill tfserving_resnet

As we can see, bringing up a TF-TRT converted model with TensorFlow Serving and Docker is as simple as serving a regular model. Also, the performance numbers shown here apply only to the model we’re using and the machine running this example, but they do demonstrate the performance benefit of using TF-TRT.

TensorFlow 2.0 is coming, and the TensorFlow team and NVIDIA are working together to make sure TF-TRT works smoothly in 2.0. Please refer to the TF-TRT GitHub repository for the most up-to-date information.