Yolov3 with tensorrt-inference-server

楊亮魯
4 min read · Sep 15, 2019


In this article you will learn how to run the TensorRT Inference Server and a client, using YOLOv3 as the example.

Requirement: nvidia-docker installed. Test your environment first with:

docker run --rm --gpus all nvcr.io/nvidia/tensorrtserver:19.10-py3 nvidia-smi

There are three parts in this article:

  • set up the inference server
  • prepare the yolov3 tensorrt engine
  • prepare the yolov3 inference client

1. set up the inference server first

The architecture of the TensorRT Inference Server is quite awesome: it supports multiple backends (TensorRT, TensorFlow, Caffe2, ONNX Runtime, …) and also implements a scheduling/batching mechanism.

But I haven't dived that deep into it yet, so let's just follow the quick-start instructions first.

Download the example models into model_path:

git clone https://github.com/NVIDIA/tensorrt-inference-server
cd tensorrt-inference-server
git checkout r19.09
cd docs/examples
./fetch_models.sh
# models are stored in tensorrt-inference-server/docs/examples/model_repository

Start the server:

export model_path=$PWD/docs/examples/model_repository
docker run \
--runtime nvidia \
--rm --shm-size=1g \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-p 8000:8000 -p 8001:8001 -p 8002:8002 \
--name trt_serving \
-v $model_path:/models \
nvcr.io/nvidia/tensorrtserver:19.10-py3 \
trtserver --model-store=/models

Now you can use curl to check the model status:

curl localhost:8000/api/status
# which should return something like this:
model_status {
  key: "densenet_onnx"
  value {
    config {
      name: "densenet_onnx"
      platform: "onnxruntime_onnx"
      version_policy {
        latest {
          num_versions: 1
        }
      }
      input {
        name: "data_0"
        data_type: TYPE_FP32
        format: FORMAT_NCHW
        dims: 3
        dims: 224
        dims: 224
        reshape {
          shape: 1
          shape: 3
          shape: 224
          shape: 224
        }
      }
      output {
        name: "fc6_1"
        data_type: TYPE_FP32
        dims: 1000
        label_filename: "densenet_labels.txt"
        reshape {
          shape: 1
          shape: 1000
          shape: 1
          shape: 1
        }
      }
      instance_group {
        name: "densenet_onnx"
        count: 1
        gpus: 0
        kind: KIND_GPU
      }
      default_model_filename: "model.onnx"
    }
    version_status {
      key: 1
      value {
        ready_state: MODEL_READY
      }
    }
  }
}
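
The same check can be done from Python. Below is a minimal sketch (not part of the original client examples) that hits the same /api/status endpoint with the requests library; it assumes the server above is reachable on localhost:8000.

# status_check.py -- minimal sketch, assumes the requests library is available
# (pip install requests) and the server started above listens on localhost:8000
import requests

resp = requests.get("http://localhost:8000/api/status")
resp.raise_for_status()
# the status comes back as text-format protobuf, same as the curl output above;
# look for "ready_state: MODEL_READY" under each model you expect to serve
print(resp.text)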

OK, let's try a resnet50 example with the TensorRT Inference Server client (build the client image from the tensorrt-inference-server repo root):

docker build -t tensorrtserver_client -f Dockerfile.client .
docker run -it --rm --net=host tensorrtserver_client
python src/clients/python/image_client.py -m resnet50_netdef -s INCEPTION images/mug.jpg
# Request 0, batch size 1
# Image 'images/mug.jpg':
#     504 (COFFEE MUG) = 0.777365028858

Seems great, right? We have quickly run a server and client example. Let's stop the server and client containers before moving on.

2. prepare yolov3 tensorrt engine

Follow the guide below to build a YOLOv3 TensorRT engine.

2.1 start tensorrt container

docker run \
--gpus all \
-v $PWD/trt:/workspace/trt \
--name trt \
-ti nvcr.io/nvidia/tensorrt:19.10-py2 /bin/bash

2.2 build yolov3 engine

# inside container trt
export TRT_PATH=/usr/src/tensorrt
cd $TRT_PATH/samples/python/yolov3_onnx/;
pip install wget
pip install onnx==1.5.0
# this will automatically download the model and convert it to ONNX
python yolov3_to_onnx.py;
# build trtexec, then build the engine with it
cd $TRT_PATH/samples/trtexec;
make; cd ../../;
./bin/trtexec --onnx=$TRT_PATH/samples/python/yolov3_onnx/yolov3.onnx --saveEngine=$TRT_PATH/model.plan
# Average over 10 runs is 30.8623 ms (host walltime is 31.4395 ms, 99% percentile time is 31.9949)
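
Before copying the plan into the model repository, you can optionally sanity-check it with the TensorRT Python API available in the same container. This is just a sketch, assuming the TensorRT 6.x bindings shipped with the 19.10 image; the binding names and shapes it prints are exactly what the config.pbtxt in step 2.4 has to match.

# verify_plan.py -- optional sanity check inside the trt container (python2)
from __future__ import print_function
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
with open("/usr/src/tensorrt/model.plan", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

# print each binding: name, shape, and whether it is an input or an output
for i in range(engine.num_bindings):
    kind = "input" if engine.binding_is_input(i) else "output"
    print(engine.get_binding_name(i), engine.get_binding_shape(i), kind)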

2.3 copy the built engine model.plan into the model repository $model_path

# at your host
mkdir -p $model_path/yolov3_608_trt/1
docker cp trt:/usr/src/tensorrt/model.plan $model_path/yolov3_608_trt/1

2.4 write the model config protobuf

The input/output tensor names below come from the ONNX graph generated in step 2.2.

# $model_path/yolov3_608_trt/config.pbtxt
name: "yolov3_608_trt"
platform: "tensorrt_plan"
max_batch_size: 1
dynamic_batching {
  preferred_batch_size: [1]
  max_queue_delay_microseconds: 100
}
input [
  {
    name: "000_net"
    data_type: TYPE_FP32
    format: FORMAT_NCHW
    dims: [ 3, 608, 608 ]
  }
]
output [
  {
    name: "082_convolutional"
    data_type: TYPE_FP32
    dims: [ 255, 19, 19 ]
  },
  {
    name: "094_convolutional"
    data_type: TYPE_FP32
    dims: [ 255, 38, 38 ]
  },
  {
    name: "106_convolutional"
    data_type: TYPE_FP32
    dims: [ 255, 76, 76 ]
  }
]
instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]
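
A quick note on the dims above: 255 is 3 anchors x (80 COCO classes + 5 box attributes), and 19/38/76 are simply 608 divided by the strides 32/16/8 of the three YOLO output scales. The input spec also tells the client exactly what to send: an FP32, NCHW tensor of shape (3, 608, 608). Here is a rough preprocessing sketch (simple resize plus /255 scaling, roughly what the TensorRT yolov3_onnx sample does; PIL and numpy are assumed to be installed):

# preprocess.py -- sketch of building the "000_net" input declared above
import numpy as np
from PIL import Image

def preprocess(image_path, size=608):
    # resize to 608x608, scale to [0, 1], convert HWC uint8 -> CHW float32
    img = Image.open(image_path).convert("RGB").resize((size, size), Image.BICUBIC)
    arr = np.asarray(img, dtype=np.float32) / 255.0
    return np.ascontiguousarray(arr.transpose(2, 0, 1))  # shape (3, 608, 608)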

Restart the TensorRT Inference Server as in step 1, i.e.:

export model_path=$PWD/docs/examples/model_repository
docker run \
--runtime nvidia \
--rm --shm-size=1g \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-p 8000:8000 -p 8001:8001 -p 8002:8002 \
--name trt_serving \
-v $model_path:/models \
nvcr.io/nvidia/tensorrtserver:19.10-py3 \
trtserver --model-store=/models

3. prepare a yolov3 tensorrt inference client

# the following image was built from here

docker run \
--name yolov3_trt \
--gpus all \
--net=host \
-d penolove/tensorrt_yolo_v3:gpu \
tail -f /dev/null;
# access container
docker exec -ti yolov3_trt /bin/bash
## check the service
curl localhost:8000/api/status;
## download the client python library
## https://github.com/NVIDIA/tensorrt-inference-server/releases
wget https://github.com/NVIDIA/tensorrt-inference-server/releases/download/v1.7.0/v1.7.0_ubuntu1604.clients.tar.gz;
tar xvzf v1.7.0_ubuntu1604.clients.tar.gz;
pip3 install --user --upgrade python/tensorrtserver-*.whl;
cd /workspace/yolov3-tensorrt;
git pull;
# yolo_client.py contains an object detector wrapped with eyewitness:
# after getting the response from TRTIS (the TensorRT Inference Server),
# it draws the detected 183 Club image at detected_image/drawn_image.jpg
python3 yolo_client.py -m yolov3_608_trt demo/test_image.jpg
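
yolo_client.py hides the details, but the core call against the server looks roughly like the sketch below. It assumes the v1 tensorrtserver Python API installed above (InferContext, as used by the bundled image_client.py); the tensor names come straight from the config.pbtxt in step 2.4, and the raw outputs still need the usual YOLOv3 decoding (anchors, sigmoid, NMS) on the client side.

# infer_sketch.py -- rough sketch of one request against yolov3_608_trt,
# assuming the v1 tensorrtserver client API installed above
import numpy as np
from tensorrtserver.api import InferContext, ProtocolType

ctx = InferContext("localhost:8000", ProtocolType.from_str("http"),
                   "yolov3_608_trt", None)  # None = latest model version

# a preprocessed FP32 NCHW image, e.g. from the preprocess() sketch in step 2.4
input_tensor = np.random.rand(3, 608, 608).astype(np.float32)

result = ctx.run(
    {"000_net": [input_tensor]},  # list = batch of 1
    {"082_convolutional": InferContext.ResultFormat.RAW,
     "094_convolutional": InferContext.ResultFormat.RAW,
     "106_convolutional": InferContext.ResultFormat.RAW},
    1)  # batch size

# result maps output name -> one array per image in the batch
for name, outputs in result.items():
    print(name, outputs[0].shape)  # e.g. (255, 19, 19)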

To conclude, in this article we demonstrated a basic TensorRT Inference Server example, including:

  • tensorrt inference server
  • tensorrt inference client
  • custom model (yolov3)

There is still a lot of uncertainty about this service (how the different backends cooperate, how GPU RAM is allocated, batch-size settings, …). I will try to integrate this with an NVIDIA Jetson Nano as both client and server.

Also, compared with the YOLOv3 Python TensorRT example:

naive_detector.py : ~6s for 100 times inference
yolo_client.py : ~9s for 100 times inference

yolo_client.py is about 1.5 times slower than the plain Python TensorRT run, which might be because of communication overhead or the queuing mechanism (using async requests or sending more than one concurrent request to the server might reach higher throughput; a rough sketch follows).
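
One cheap way to test that hypothesis is to send requests from several threads, each with its own InferContext, and compare the wall time against the sequential loop; combined with instance_group count > 1 on the server, this should show whether queuing is the bottleneck. A rough sketch under the same v1-API assumption as above:

# concurrency_sketch.py -- probe client-side concurrency, one InferContext per
# thread (same v1 tensorrtserver API assumption as the sketch in section 3)
import time, threading
import numpy as np
from tensorrtserver.api import InferContext, ProtocolType

N_THREADS, N_REQUESTS = 4, 100
input_tensor = np.random.rand(3, 608, 608).astype(np.float32)
outputs = {"082_convolutional": InferContext.ResultFormat.RAW,
           "094_convolutional": InferContext.ResultFormat.RAW,
           "106_convolutional": InferContext.ResultFormat.RAW}

def worker(n):
    # one context per thread, each issuing n synchronous requests
    ctx = InferContext("localhost:8000", ProtocolType.from_str("http"),
                       "yolov3_608_trt", None)
    for _ in range(n):
        ctx.run({"000_net": [input_tensor]}, outputs, 1)

start = time.time()
threads = [threading.Thread(target=worker, args=(N_REQUESTS // N_THREADS,))
           for _ in range(N_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("%d requests in %.2fs" % (N_REQUESTS, time.time() - start))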
