Yolov3 with tensorrt-inference-server

楊亮魯
4 min read · Sep 15, 2019


In this article you will learn how to run the TensorRT Inference Server and a client, using YOLOv3 as the example.

Requirement: nvidia-docker installed. Test your environment first with:

docker run --rm --gpus all nvcr.io/nvidia/tensorrtserver:19.10-py3 nvidia-smi

There are three parts in this article:

  • set up the inference server
  • prepare the yolov3 tensorrt engine
  • prepare the yolov3 inference client

1. set up the inference server first

The architecture of the TensorRT Inference Server is quite awesome: it supports multiple backends (TensorRT, TensorFlow, Caffe2, ONNX Runtime, …) and also implements a scheduling/batching mechanism.

But I haven't dived that deep into it yet, so let's just follow the quick-start instructions first.

Download the example models into model_path:

git clone https://github.com/NVIDIA/tensorrt-inference-server
cd tensorrt-inference-server
git checkout r19.09
cd docs/examples
./fetch_models.sh
# models are stored in tensorrt-inference-server/docs/examples/model_repository

Start the server:

export model_path=$PWD/docs/examples/model_repository
docker run \
--runtime nvidia \
--rm --shm-size=1g \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-p 8000:8000 -p 8001:8001 -p 8002:8002 \
--name trt_serving \
-v $model_path:/models \
nvcr.io/nvidia/tensorrtserver:19.10-py3 \
trtserver --model-store=/models

Now you can use curl to check the model status:

curl localhost:8000/api/status
# which should return something like this:
model_status {
  key: "densenet_onnx"
  value {
    config {
      name: "densenet_onnx"
      platform: "onnxruntime_onnx"
      version_policy {
        latest {
          num_versions: 1
        }
      }
      input {
        name: "data_0"
        data_type: TYPE_FP32
        format: FORMAT_NCHW
        dims: 3
        dims: 224
        dims: 224
        reshape {
          shape: 1
          shape: 3
          shape: 224
          shape: 224
        }
      }
      output {
        name: "fc6_1"
        data_type: TYPE_FP32
        dims: 1000
        label_filename: "densenet_labels.txt"
        reshape {
          shape: 1
          shape: 1000
          shape: 1
          shape: 1
        }
      }
      instance_group {
        name: "densenet_onnx"
        count: 1
        gpus: 0
        kind: KIND_GPU
      }
      default_model_filename: "model.onnx"
    }
    version_status {
      key: 1
      value {
        ready_state: MODEL_READY
      }
    }
  }
}
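
The same check can be done from Python. Below is a minimal sketch (not part of the original client examples) that hits the same /api/status endpoint with the requests library; it assumes the server above is reachable on localhost:8000.

# status_check.py -- minimal sketch, assumes the requests library is available
# (pip install requests) and the server started above listens on localhost:8000
import requests

resp = requests.get("http://localhost:8000/api/status")
resp.raise_for_status()
# the status comes back as text-format protobuf, same as the curl output above;
# look for "ready_state: MODEL_READY" under each model you expect to serve
print(resp.text)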

OK, let's try a resnet50 example with the TensorRT Inference Server client (build the client image from the tensorrt-inference-server repo root):

docker build -t tensorrtserver_client -f Dockerfile.client .
docker run -it --rm --net=host tensorrtserver_client
python src/clients/python/image_client.py -m resnet50_netdef -s INCEPTION images/mug.jpg
# Request 0, batch size 1
# Image 'images/mug.jpg':
#     504 (COFFEE MUG) = 0.777365028858

Seems great, right? We have quickly run a server and client example. Let's stop the server and client containers before moving on.

2. prepare yolov3 tensorrt engine

Follow the guide below to build a YOLOv3 TensorRT engine.

2.1 start tensorrt container

docker run \
--gpus all \
-v $PWD/trt:/workspace/trt \
--name trt \
-ti nvcr.io/nvidia/tensorrt:19.10-py2 /bin/bash

2.2 build yolov3 engine

# inside container trt
export TRT_PATH=/usr/src/tensorrt
cd $TRT_PATH/samples/python/yolov3_onnx/;
pip install wget
pip install onnx==1.5.0
# this will automatically download the model and convert it to ONNX
python yolov3_to_onnx.py;
# build trtexec, then build the engine with it
cd $TRT_PATH/samples/trtexec;
make; cd ../../;
./bin/trtexec --onnx=$TRT_PATH/samples/python/yolov3_onnx/yolov3.onnx --saveEngine=$TRT_PATH/model.plan
# Average over 10 runs is 30.8623 ms (host walltime is 31.4395 ms, 99% percentile time is 31.9949)
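
Before copying the plan into the model repository, you can optionally sanity-check it with the TensorRT Python API available in the same container. This is just a sketch, assuming the TensorRT 6.x bindings shipped with the 19.10 image; the binding names and shapes it prints are exactly what the config.pbtxt in step 2.4 has to match.

# verify_plan.py -- optional sanity check inside the trt container (python2)
from __future__ import print_function
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
with open("/usr/src/tensorrt/model.plan", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

# print each binding: name, shape, and whether it is an input or an output
for i in range(engine.num_bindings):
    kind = "input" if engine.binding_is_input(i) else "output"
    print(engine.get_binding_name(i), engine.get_binding_shape(i), kind)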

2.3 copy the built engine model.plan into the model repository $model_path

# at your host
mkdir -p $model_path/yolov3_608_trt/1
docker cp trt:/usr/src/tensorrt/model.plan $model_path/yolov3_608_trt/1

2.4 write the model config protobuf

The input/output tensor names below come from the ONNX graph generated in step 2.2.

# $model_path/yolov3_608_trt/config.pbtxt
name: "yolov3_608_trt"
platform: "tensorrt_plan"
max_batch_size: 1
dynamic_batching {
  preferred_batch_size: [1]
  max_queue_delay_microseconds: 100
}
input [
  {
    name: "000_net"
    data_type: TYPE_FP32
    format: FORMAT_NCHW
    dims: [ 3, 608, 608 ]
  }
]
output [
  {
    name: "082_convolutional"
    data_type: TYPE_FP32
    dims: [ 255, 19, 19 ]
  },
  {
    name: "094_convolutional"
    data_type: TYPE_FP32
    dims: [ 255, 38, 38 ]
  },
  {
    name: "106_convolutional"
    data_type: TYPE_FP32
    dims: [ 255, 76, 76 ]
  }
]
instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]
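
A quick note on the dims above: 255 is 3 anchors x (80 COCO classes + 5 box attributes), and 19/38/76 are simply 608 divided by the strides 32/16/8 of the three YOLO output scales. The input spec also tells the client exactly what to send: an FP32, NCHW tensor of shape (3, 608, 608). Here is a rough preprocessing sketch (simple resize plus /255 scaling, roughly what the TensorRT yolov3_onnx sample does; PIL and numpy are assumed to be installed):

# preprocess.py -- sketch of building the "000_net" input declared above
import numpy as np
from PIL import Image

def preprocess(image_path, size=608):
    # resize to 608x608, scale to [0, 1], convert HWC uint8 -> CHW float32
    img = Image.open(image_path).convert("RGB").resize((size, size), Image.BICUBIC)
    arr = np.asarray(img, dtype=np.float32) / 255.0
    return np.ascontiguousarray(arr.transpose(2, 0, 1))  # shape (3, 608, 608)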

Restart the TensorRT Inference Server as in step 1, i.e.:

export model_path=$PWD/docs/examples/model_repository
docker run \
--runtime nvidia \
--rm --shm-size=1g \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-p 8000:8000 -p 8001:8001 -p 8002:8002 \
--name trt_serving \
-v $model_path:/models \
nvcr.io/nvidia/tensorrtserver:19.10-py3 \
trtserver --model-store=/models

3. prepare a yolov3 tensorrt inference client

# the following image was built from here

docker run \
--name yolov3_trt \
--gpus all \
--net=host \
-d penolove/tensorrt_yolo_v3:gpu \
tail -f /dev/null;
# access container
docker exec -ti yolov3_trt /bin/bash
## check the service
curl localhost:8000/api/status;
## download the client python library
## https://github.com/NVIDIA/tensorrt-inference-server/releases
wget https://github.com/NVIDIA/tensorrt-inference-server/releases/download/v1.7.0/v1.7.0_ubuntu1604.clients.tar.gz;
tar xvzf v1.7.0_ubuntu1604.clients.tar.gz;
pip3 install --user --upgrade python/tensorrtserver-*.whl;
cd /workspace/yolov3-tensorrt;
git pull;
# yolo_client.py contains an object detector wrapped with eyewitness:
# after getting the response from TRTIS (the TensorRT Inference Server),
# it draws the detected 183 Club image at detected_image/drawn_image.jpg
python3 yolo_client.py -m yolov3_608_trt demo/test_image.jpg
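
yolo_client.py hides the details, but the core call against the server looks roughly like the sketch below. It assumes the v1 tensorrtserver Python API installed above (InferContext, as used by the bundled image_client.py); the tensor names come straight from the config.pbtxt in step 2.4, and the raw outputs still need the usual YOLOv3 decoding (anchors, sigmoid, NMS) on the client side.

# infer_sketch.py -- rough sketch of one request against yolov3_608_trt,
# assuming the v1 tensorrtserver client API installed above
import numpy as np
from tensorrtserver.api import InferContext, ProtocolType

ctx = InferContext("localhost:8000", ProtocolType.from_str("http"),
                   "yolov3_608_trt", None)  # None = latest model version

# a preprocessed FP32 NCHW image, e.g. from the preprocess() sketch in step 2.4
input_tensor = np.random.rand(3, 608, 608).astype(np.float32)

result = ctx.run(
    {"000_net": [input_tensor]},  # list = batch of 1
    {"082_convolutional": InferContext.ResultFormat.RAW,
     "094_convolutional": InferContext.ResultFormat.RAW,
     "106_convolutional": InferContext.ResultFormat.RAW},
    1)  # batch size

# result maps output name -> one array per image in the batch
for name, outputs in result.items():
    print(name, outputs[0].shape)  # e.g. (255, 19, 19)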

To conclude, in this article we demonstrated a basic TensorRT Inference Server example, including:

  • tensorrt inference server
  • tensorrt inference client
  • custom model (yolov3)

There is still a lot of uncertainty about this service (how the different backends cooperate, how GPU RAM is allocated, batch-size settings, …). I will try to integrate this with an NVIDIA Jetson Nano as both client and server.

Also, compared with the YOLOv3 Python TensorRT example:

naive_detector.py : ~6s for 100 times inference
yolo_client.py : ~9s for 100 times inference

yolo_client.py is about 1.5 times slower than the plain Python TensorRT run, which might be because of communication overhead or the queuing mechanism (using async requests or sending more than one concurrent request to the server might reach higher throughput; a rough sketch follows).
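
One cheap way to test that hypothesis is to send requests from several threads, each with its own InferContext, and compare the wall time against the sequential loop; combined with instance_group count > 1 on the server, this should show whether queuing is the bottleneck. A rough sketch under the same v1-API assumption as above:

# concurrency_sketch.py -- probe client-side concurrency, one InferContext per
# thread (same v1 tensorrtserver API assumption as the sketch in section 3)
import time, threading
import numpy as np
from tensorrtserver.api import InferContext, ProtocolType

N_THREADS, N_REQUESTS = 4, 100
input_tensor = np.random.rand(3, 608, 608).astype(np.float32)
outputs = {"082_convolutional": InferContext.ResultFormat.RAW,
           "094_convolutional": InferContext.ResultFormat.RAW,
           "106_convolutional": InferContext.ResultFormat.RAW}

def worker(n):
    # one context per thread, each issuing n synchronous requests
    ctx = InferContext("localhost:8000", ProtocolType.from_str("http"),
                       "yolov3_608_trt", None)
    for _ in range(n):
        ctx.run({"000_net": [input_tensor]}, outputs, 1)

start = time.time()
threads = [threading.Thread(target=worker, args=(N_REQUESTS // N_THREADS,))
           for _ in range(N_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("%d requests in %.2fs" % (N_REQUESTS, time.time() - start))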
