
Nvidia GPUs are the most popular hardware for accelerating the training and inference of deep learning models. Almost all deep learning frameworks, including PyTorch, TensorFlow, MXNet, and ONNX, use CUDA to boost matrix computation efficiency on Nvidia GPUs. But what we will talk about here is deployment rather than training. How to deploy trained models easily, together with appealing features like model management and model microservices, is an important aspect that should be carefully considered in industry applications. Unlike TensorFlow, which provided a high-performance serving framework called TensorFlow Serving quite early, PyTorch's official serving framework arrived rather late. On one hand, we can integrate PyTorch with a pure-Python web framework like Flask to get a flexible pure-Python backend service. On the other hand, according to this post, we can use third-party model servers like MLflow, KubeFlow, and RedisAI. But here we will introduce how to use the TensorRT Inference Server as the inference solution on Nvidia GPUs, which supports multiple backend frameworks including TensorRT, TensorFlow, ONNX, PyTorch, and Caffe2.
1. INTRODUCTION
Like TensorFlow Serving, the TensorRT Inference Server also provides several useful features, such as multiple-framework support, concurrent model execution, batching, and multi-GPU support. See here for a detailed explanation of its functions. In this post, we will focus on deploying an ONNX model and running inference via the HTTP endpoint.
2. INSTALLATION
To test this framework, we need to set up the server and client environments. For the server, we can simply pull the inference server Docker container from the NGC container registry.
$ docker pull nvcr.io/nvidia/tensorrtserver:18.09-py3
$ nvidia-docker run -it --rm nvcr.io/nvidia/tensorrtserver:18.09-py3
# find trtserver under /opt/tensorrtserver
$ trtserver -h
NOTE: use nvidia-docker instead of the original docker to make sure you can utilize GPUs. To install nvidia-docker, see this post.
To set up the client environment, the inference server repository provides a Dockerfile to ease the work.
$ git clone https://github.com/NVIDIA/tensorrt-inference-server.git
$ cd tensorrt-inference-server
$ sudo nvidia-docker build --rm -t trt_client:v1 -f Dockerfile.client .
3. MODEL REPOSITORY
The TensorRT server only recognizes inference models organized under a specific directory structure. A typical model repository layout looks like this:
<model-repository-path>/
  model_0/
    config.pbtxt
    output0_labels.txt
    1/
      model.plan
    2/
      model.plan
  model_1/
    config.pbtxt
    output0_labels.txt
    output1_labels.txt
    0/
      model.onnx
    7/
      model.onnx
Any number of models can be specified, and they will be loaded when the server starts if they have no problems. A brief explanation:
- Each subdirectory name (model_0 and model_1) must be unique; it is used in later interactions between the client and server to identify which model to use.
- The config.pbtxt provides the model metadata to trtserver (it's optional for some frameworks like ONNX and TensorRT), and the model name it configures should match the subdirectory name. Besides, each model directory should have at least one numeric subdirectory indicating the model version; a version directory whose name has a zero prefix will be ignored. By default, the larger the number, the newer the model, and trtserver will attempt to load the newest one as the default model for inference if the model version is not specified. Of course, multiple model versions can coexist under a model directory, which allows us to hot-load models or run A/B tests easily (see more on version policy).
- The *_labels.txt files are optional and provide labels for the outputs.
- The model.* file is the model to be loaded and unloaded. The model filename indicates the corresponding framework and should match the platform defined in config.pbtxt:
- model.plan for TensorRT models
- model.graphdef for TensorFlow GraphDef models
- model.savedmodel for TensorFlow SavedModel models
- model.onnx for ONNX Runtime ONNX models
- model.pt for PyTorch TorchScript models
- model.netdef and init_model.netdef for Caffe2 Netdef models
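The layout rules above can be sketched as a small validation script. The following is a minimal sketch with a helper name of my own, based only on the rules described above (it is not part of the server's tooling): it walks a repository directory, treats each subdirectory as a model, and collects its numeric version subdirectories.

```python
import os

def scan_model_repository(repo_path):
    """Return {model_name: sorted version numbers} for a repository laid out
    as described above. Version directories must be numeric; multi-digit names
    with a leading zero (e.g. "03") are skipped, mirroring the zero-prefix rule."""
    models = {}
    for model_name in sorted(os.listdir(repo_path)):
        model_dir = os.path.join(repo_path, model_name)
        if not os.path.isdir(model_dir):
            continue
        versions = []
        for entry in os.listdir(model_dir):
            if not os.path.isdir(os.path.join(model_dir, entry)):
                continue  # config.pbtxt, *_labels.txt, etc.
            if entry.isdigit() and not (len(entry) > 1 and entry[0] == "0"):
                versions.append(int(entry))
        models[model_name] = sorted(versions)
    return models
```

Running this against the layout above would report model_0 with versions [1, 2] and model_1 with versions [0, 7].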
4. MODEL CONFIGURATION
In general, a model configuration must be included in each model directory. But if trtserver is started with the flag --strict-model-config=false, the configuration file can be generated automatically for some frameworks, including ONNX, TensorRT, and TensorFlow SavedModel. Let's use ONNX as an example to illustrate a configuration, even though it could be omitted in this case.
name: "onnx_model"
platform: "onnxruntime_onnx"
max_batch_size: 1
input [
  {
    name: "gpu_0/data"
    data_type: TYPE_FP32
    dims: [ 1, 3, 112, 112 ]
    # format: FORMAT_NCHW
    # dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "gpu_0/softmax"
    data_type: TYPE_FP32
    dims: [ 1, 512 ]
  }
]
version_policy: { all { }}
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
A minimal model configuration must specify name, platform, max_batch_size, input, and output.
- name: defines the name of the model
- platform: defines the backend framework used to run the model
- max_batch_size: defines the maximum batch size allowed for inference. A value of 0 means batching is not allowed for this model.
- input: a list of the inputs required for inference, where name is the input node name, and data_type and dims indicate how to interpret the input tensors. The optional format field describes how to interpret the input: for example, FORMAT_NCHW indicates the input tensor represents an image in CHW format, with only 3 dims. However, in our case, I haven't found a simple way to freeze a model with only 3 dims, discarding the first (batch) axis. So if the input dims contain 4 values [N, C, H, W], the format argument should not be specified; otherwise, this error is reported:
error: failed to poll model repository: INVALID_ARG — model input NHWC/NCHW require 3 dims for onnx_model
- output: similar to input; primarily indicates the output nodes exposed by the trt server.
- version_policy: indicates the model version control policy for the trt server. If not specified, latest is used by default, meaning only the most recent version of the model is available to the inference server. If all is specified, all versions are available.
- instance_group: indicates how many execution instances of a model will be served to handle multiple simultaneous inference requests.
- Other configuration options are described in this doc.
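Since config.pbtxt files are plain protobuf text format, generating one for a new model is simple string templating. Here is a minimal sketch of a helper of my own (the field names come from the example configuration above; the function itself is not part of the server):

```python
def make_config(name, platform, max_batch_size, inputs, outputs):
    """Render a minimal config.pbtxt string. `inputs` and `outputs` are lists
    of (name, data_type, dims) tuples, matching the fields shown above."""
    def tensor_block(kind, tensors):
        entries = []
        for tname, dtype, dims in tensors:
            dims_str = ", ".join(str(d) for d in dims)
            entries.append(
                "  {\n"
                f'    name: "{tname}"\n'
                f"    data_type: {dtype}\n"
                f"    dims: [ {dims_str} ]\n"
                "  }"
            )
        return f"{kind} [\n" + ",\n".join(entries) + "\n]"

    return "\n".join([
        f'name: "{name}"',
        f'platform: "{platform}"',
        f"max_batch_size: {max_batch_size}",
        tensor_block("input", inputs),
        tensor_block("output", outputs),
    ])

# Reproduce the required fields of the example configuration above.
config_text = make_config(
    "onnx_model", "onnxruntime_onnx", 1,
    inputs=[("gpu_0/data", "TYPE_FP32", [1, 3, 112, 112])],
    outputs=[("gpu_0/softmax", "TYPE_FP32", [1, 512])],
)
```

Writing the resulting string to <model-repository-path>/onnx_model/config.pbtxt yields the minimal required configuration; optional sections like version_policy and instance_group would still need to be appended by hand.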
5. RUNNING TENSORRT INFERENCE SERVER
After setting up the model repository and server environment, the trt server should be ready to run and make the models available for inference. Since both our server and client run in Docker containers, we need to create a Docker network to enable communication between the two containers.
$ sudo docker network create trt_serving_network
$ sudo nvidia-docker run --rm -p8000:8000 --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --net trt_serving_network --network-alias=trt_server -v/path/to/serving/models:/models nvcr.io/nvidia/tensorrtserver:19.09-py3 trtserver --model-repository=/models --strict-model-config=false
- -p tells Docker to link host port 8000 to container port 8000; the trt server listens for HTTP requests on port 8000 by default. If you need to change it, use the --http-port flag.
- --shm-size and --ulimit are used to improve the server's performance.
- --net is the bridge-like network name we created before, so that all containers attached to it can communicate on the same network, and --network-alias defines the alias name for this container. A detailed explanation of Docker networking can be found here, and this post briefly introduces how to communicate between containers.
- -v manages the data and models shared between the container and the host. The part before the : is the model repository path on your host, and the part after the : is the model repository path mounted in the container.
- --model-repository points the server at the directory from which to pick up models.
- --strict-model-config, as mentioned before, defines whether the model configuration file is required.
If the models are configured and exported in the right format, then you will see the following log:
===============================
== TensorRT Inference Server ==
===============================

NVIDIA Release 19.09 (build 8086825)

Copyright (c) 2018-2019, NVIDIA CORPORATION. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION. All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying
project or file.

I1029 06:17:31.108835 1 metrics.cc:160] found 1 GPUs supporting NVML metrics
I1029 06:17:31.114313 1 metrics.cc:169] GPU 0: GeForce GTX 1080 Ti
I1029 06:17:31.114546 1 server.cc:110] Initializing TensorRT Inference Server
I1029 06:17:42.388722 1 server_status.cc:83] New status tracking for model 'onnx_model_112'
I1029 06:17:42.388748 1 server_status.cc:83] New status tracking for model 'onnx_model_224'
I1029 06:17:42.388755 1 server_status.cc:83] New status tracking for model 'plan_model'
I1029 06:17:42.388868 1 model_repository_manager.cc:668] loading: onnx_model_224:1
I1029 06:17:42.388975 1 model_repository_manager.cc:668] loading: plan_model:1
I1029 06:17:42.389088 1 model_repository_manager.cc:668] loading: onnx_model_112:1
I1029 06:17:42.549699 1 onnx_backend.cc:188] Creating instance onnx_model_112_0_gpu0 on GPU 0 (6.1) using model.onnx
I1029 06:17:42.728753 1 onnx_backend.cc:188] Creating instance onnx_model_224_0_gpu0 on GPU 0 (6.1) using model.onnx
2019-10-29 06:17:42.865292495 [W:onnxruntime:log, cuda_execution_provider.cc:1086 GetCapability] Fallback to CPU execution provider for Op type: BatchNormalization node name:
2019-10-29 06:17:42.865365409 [W:onnxruntime:log, cuda_execution_provider.cc:1086 GetCapability] Fallback to CPU execution provider for Op type: MaxPool node name:
I1029 06:17:43.004835 1 model_repository_manager.cc:810] successfully loaded 'onnx_model_112' version 1
2019-10-29 06:17:43.265908427 [W:onnxruntime:log, cuda_execution_provider.cc:1086 GetCapability] Fallback to CPU execution provider for Op type: BatchNormalization node name:
2019-10-29 06:17:43.265972446 [W:onnxruntime:log, cuda_execution_provider.cc:1086 GetCapability] Fallback to CPU execution provider for Op type: MaxPool node name:
I1029 06:17:43.294153 1 plan_backend.cc:239] Creating instance plan_model_0_gpu0 on GPU 0 (6.1) using model.plan
I1029 06:17:43.524769 1 model_repository_manager.cc:810] successfully loaded 'onnx_model_224' version 1
I1029 06:17:45.986797 1 plan_backend.cc:401] Created instance plan_model_0_gpu0 on GPU 0 (6.1) with stream priority 0
I1029 06:17:46.010804 1 model_repository_manager.cc:810] successfully loaded 'plan_model' version 1
I1029 06:17:46.010899 1 main.cc:417] Starting endpoints, 'inference:0' listening on
I1029 06:17:46.013894 1 grpc_server.cc:1730] Started GRPCService at 0.0.0.0:8001
I1029 06:17:46.013928 1 http_server.cc:1125] Starting HTTPService at 0.0.0.0:8000
I1029 06:17:46.056275 1 http_server.cc:1139] Starting Metrics Service at 0.0.0.0:8002
6. RUNNING INFERENCE CLIENT
Like the server environment, we initialize the client environment using a Docker container. To simplify the example, we will run Docker in interactive mode and run the example manually.
$ sudo nvidia-docker run --rm -it --net=trt_serving_network trt_client:v1 bash
Before requesting inference from the server, we should make sure the model is ready for inference on the client side.
$ curl trt_server:8000/api/status/onnx_model_224
id: "inference:0"
version: "1.6.0"
uptime_ns: 272655396778921
model_status {
key: "onnx_model_224"
value {
config {
name: "onnx_model_224"
platform: "onnxruntime_onnx"
version_policy {
latest {
num_versions: 1
}
}
input {
name: "input.1"
data_type: TYPE_FP32
dims: 1
dims: 3
dims: 224
dims: 224
}
output {
name: "558"
data_type: TYPE_FP32
dims: 1
dims: 512
}
instance_group {
name: "onnx_model_224"
count: 1
gpus: 0
kind: KIND_GPU
}
default_model_filename: "model.onnx"
}
version_status {
key: 1
value {
ready_state: MODEL_READY
}
}
}
}
ready_state: SERVER_READY
NOTE: trt_server is the network alias we defined for the server container, and onnx_model_224 is the model to use for inference.
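A client can poll this same status endpoint before sending requests. As a rough sketch (the helper below is my own and does a plain string match on the text-format response shown above, not a real protobuf text-format parse):

```python
def model_is_ready(status_text, model_name):
    """Rough readiness check over the text-format status response.

    Only verifies that the model's block is present and that some version
    reports MODEL_READY; a robust client would parse the protobuf properly.
    The status text could be fetched over HTTP, e.g.:
      status_text = requests.get("http://trt_server:8000/api/status/onnx_model_224").text
    """
    if f'key: "{model_name}"' not in status_text:
        return False
    return "ready_state: MODEL_READY" in status_text
```

Applied to the response above, model_is_ready(text, "onnx_model_224") would return True once version 1 is loaded.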
In the directory tensorrt-inference-server/src/client, there are several client examples that help us understand how to use the client to communicate with the server via HTTP. I have modified image_client.py to accept an image and output a feature tensor.
Note that the initialized model can run synchronously or asynchronously. As the names suggest, InferContext's run will block until the result is returned, while async_run returns a request ID immediately and performs the inference in the background. The result can then be retrieved with the get_async_run_results method, using the request ID returned by async_run. I think async_run is most useful when the server needs to process large batches of inputs.
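The run/async_run split is the familiar blocking-call-versus-future pattern. Purely as an illustration of that pattern (this is standard-library concurrent.futures, not the client library's actual API), the two styles look like this:

```python
from concurrent.futures import ThreadPoolExecutor

def infer(batch):
    # Stand-in for a blocking inference call to the server.
    return [x * 2 for x in batch]

executor = ThreadPoolExecutor(max_workers=4)

# Synchronous style: block until the result comes back (like run).
result = infer([1, 2, 3])

# Asynchronous style: get a handle back immediately (like async_run),
# keep doing other work, then collect the result later
# (analogous to get_async_run_results with the request ID).
handle = executor.submit(infer, [4, 5, 6])
async_result = handle.result()
```

With many outstanding requests, the asynchronous style lets the client keep the server's request queue full instead of idling between round trips.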
Then run this script in your client Docker container, and you will get the returned [1, 512]-dim vector.
But, as mentioned in the model configuration section, I haven't fully solved the problem of how to freeze a PyTorch or MXNet model to ONNX with the batch axis discarded, which means the input tensor has to keep the 4-dim NCHW shape rather than 3-dim CHW. As a result, the dynamic batching function of the TensorRT inference server can't be used effectively.
As for performance, with an input batch of 10, my original ONNX model takes 5 ms per run with onnxruntime-gpu. Under this inference server, the inference time drops to 2 ms on average, an acceleration of more than 2x.
7. SUMMARY
So far, we have walked through how to use the TensorRT inference server to deploy an ONNX model as an inference service. But this is just the beginning; there are still many powerful features to explore, such as custom model backends, multiple instances for performance improvement, model ensemble support, and so on. I believe this post and the official docs will give you a good start on using this tool to boost your inference service.