[yolov8] Batch inference implementation using tensorrt #2 — converting to Batch model engine

DeeperAndCheaper
8 min read · Aug 17, 2023


Overview

  • There are prerequisites for batch inference using TensorRT. In this article, we will cover the following topics.
  • How the deep learning model runs in TensorRT
  • How to create a deep learning model capable of batch inference
  • How to check whether a model is capable of batch inference

Introduction

  • In the early days of deep learning, several frameworks such as TensorFlow and PyTorch competed, but these days the role of each framework has largely settled. Training, evaluation, retraining, and experimentation are done in PyTorch, while deployment depends on the target hardware; for NVIDIA GPUs, TensorRT is used.
  • Because different frameworks are used for training and deployment, the model formats also differ: .pth (or .pt) for PyTorch and .engine for TensorRT.
  • However, since the TensorRT .engine format is optimized purely for inference, it exposes little human-readable information to users.
  • For this reason, the .onnx format is used as an intermediate step. It is good at exposing information about a deep learning model to users (input shape, output shape, visualization, custom meta info, etc.).
  • Therefore, the general workflow experienced by deep learning researchers and engineers is training (.pth) -> intermediate representation (.onnx) -> deployment (.engine).
  • This article uses yolov8, which is widely used and easy to work with.

Pytorch to Onnx

Since the yolov8 repo provides a script that can export to onnx, you can get a model in onnx format very easily.

The yolov8 repo is https://github.com/ultralytics/ultralytics, and the yolov8 environment was built using conda.

We do not use the yolov8 model as it is; we modify the head part of yolov8. The reason is that, before modification, export returns a single output y in which the bbox information and confidence information are concatenated, so it is hard to tell what the output contains and in what order it was concatenated unless this is stated explicitly. Therefore, the head code is slightly modified to return three outputs, bbox, conf, and class_id, as follows.

(Yolov8 head code)

Before modification

x_cat = torch.cat([xi.view(shape[0], self.no, -1) for xi in x], 2)
if self.export and self.format in ('saved_model', 'pb', 'tflite', 'edgetpu', 'tfjs'):  # avoid TF FlexSplitV ops
    box = x_cat[:, :self.reg_max * 4]
    cls = x_cat[:, self.reg_max * 4:]
else:
    box, cls = x_cat.split((self.reg_max * 4, self.nc), 1)
dbox = dist2bbox(self.dfl(box), self.anchors.unsqueeze(0), xywh=True, dim=1) * self.strides

y = torch.cat((dbox, cls.sigmoid()), 1)
return y if self.export else (y, x)

After modification

x_cat = torch.cat([xi.view(shape[0], self.no, -1) for xi in x], 2)
if self.export and self.format in ('saved_model', 'pb', 'tflite', 'edgetpu', 'tfjs'):  # avoid TF FlexSplitV ops
    box = x_cat[:, :self.reg_max * 4]
    cls = x_cat[:, self.reg_max * 4:]
else:
    box, cls = x_cat.split((self.reg_max * 4, self.nc), 1)
conf, label = cls.sigmoid().max(1)  # per-anchor max class confidence and its class id
dbox = dist2bbox(self.dfl(box), self.anchors.unsqueeze(0), xywh=True, dim=1) * self.strides
y = torch.cat((dbox, cls.sigmoid()), 1)
dbox = dbox.transpose(1, 2)  # (batch, 4, anchors) -> (batch, anchors, 4)

return (dbox, conf, label) if self.export else (y, x)

If you modify the Head part like this, you are ready to export to onnx.

The script to export yolov8 pytorch model to onnx is as follows. (https://docs.ultralytics.com/modes/export/)

from ultralytics import YOLO
# Load a model
model = YOLO('yolov8n.pt') # load an official model
# Export the model
model.export(format='onnx', dynamic=True, simplify=True)

Yolov8 provides various options such as simplify, dynamic, and opset when exporting onnx.

simplify and opset may be explained later if there is a chance; for now, only the dynamic option is explained.

Batch inference here means running inference with the batch size, i.e., the first dimension of yolov8's input shape (1, 3, 640, 640), set to an integer of 2 or more.

In other words, instead of fixing the batch size (static), the dynamic option lets it be determined at inference time.

A point to note here is that the original yolov8 exporter makes not only the batch size but also the image height and width dynamic. Below is the relevant part of yolov8's exporter.py code.

if dynamic:
    dynamic = {'images': {0: 'batch', 2: 'height', 3: 'width'}}  # shape(1,3,640,640)
    if isinstance(self.model, SegmentationModel):
        dynamic['output0'] = {0: 'batch', 2: 'anchors'}  # shape(1, 116, 8400)
        dynamic['output1'] = {0: 'batch', 2: 'mask_height', 3: 'mask_width'}  # shape(1,32,160,160)
    elif isinstance(self.model, DetectionModel):
        dynamic['output0'] = {0: 'batch', 2: 'anchors'}  # shape(1, 84, 8400)

However, since only the batch size needs to be dynamic, we modify the code as follows.

if dynamic:
    # dynamic = {'images': {0: 'batch', 2: 'height', 3: 'width'}}  # shape(1,3,640,640)
    dynamic = {'images': {0: 'batch'}}  # shape(1,3,640,640)
    if isinstance(self.model, SegmentationModel):
        dynamic['output0'] = {0: 'batch', 1: 'anchors'}  # shape(1,25200,85)
        dynamic['output1'] = {0: 'batch', 2: 'mask_height', 3: 'mask_width'}  # shape(1,32,160,160)
    elif isinstance(self.model, DetectionModel):
        # dynamic['output0'] = {0: 'batch', 1: 'anchors'}  # shape(1, 25200, 85)
        dynamic['output0'] = {0: 'batch'}

Looking at the meaning of the elements inside the braces, 'images' is the name of the input node and 0 is the index of the batch dimension; likewise, 'output0' is the name of the output node and 0 is the batch dimension index of that output.

If you run the export command after making this modification, you get an onnx model with a dynamic batch dimension.

Visualize the onnx model with Netron


So how do we check whether the export worked as intended? There is a tool that shows the shapes of the onnx model's input and output nodes and how the nodes are connected: a visualization app called 'netron'. It can be used both online and offline. (In my case, I use it a lot to check whether a model received from a modeler has any errors.)

If you access the https://netron.app/ site and open the onnx model, you can see the network graph.

The part to look at carefully here is the shape of the inputs and outputs. If their first dimension is 'batch', the export succeeded. (For reference, it appears as the string 'batch' in Netron, but TensorRT recognizes it as -1 and treats that dimension as dynamic.)
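Besides the Netron GUI, you can also check the dynamic axes programmatically with the onnx Python package. The snippet below is a minimal sketch; the file name yolov8n.onnx is an assumption, so replace it with the path of your exported model.

import onnx

# hypothetical path to the exported model; replace with your own
model = onnx.load('yolov8n.onnx')
onnx.checker.check_model(model)

def print_shapes(values, kind):
    for v in values:
        dims = []
        for d in v.type.tensor_type.shape.dim:
            # a dynamic dimension shows its symbolic name (e.g. 'batch'),
            # a static dimension shows its integer size
            dims.append(d.dim_param if d.dim_param else d.dim_value)
        print(kind, v.name, dims)

print_shapes(model.graph.input, 'input :')
print_shapes(model.graph.output, 'output:')

If the first dimension is printed as 'batch' for the input and the outputs, the dynamic axes were exported as intended.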

For reference, the onnx model exported without modifying the head and the export code comes out as follows.

(Not only batch but also height and width become dynamic.)
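Before converting to TensorRT, you can also sanity-check batch inference at the onnx level with onnxruntime. The sketch below is only an illustration under the assumption that the exported file is named yolov8n.onnx; it feeds a random dummy batch of 4 images and prints the output shapes, whose first dimension should match the batch size.

import numpy as np
import onnxruntime as ort

# assumed model path and batch size, for illustration only
session = ort.InferenceSession('yolov8n.onnx', providers=['CPUExecutionProvider'])
dummy_batch = np.random.rand(4, 3, 640, 640).astype(np.float32)

input_name = session.get_inputs()[0].name  # 'images'
outputs = session.run(None, {input_name: dummy_batch})

for meta, out in zip(session.get_outputs(), outputs):
    print(meta.name, out.shape)  # the first dimension should be 4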

Onnx to TensorRT

There are two ways to convert onnx to tensorrt: using trtexec, a tool provided by nvidia, or writing builder code with the tensorrt C++/Python API.

Here, we will create a .engine file with trtexec, which is easy and simple.

The trtexec command line is as follows. (If trtexec is not installed, we recommend building it from the official TensorRT repo (https://github.com/NVIDIA/TensorRT) or using the TensorRT docker image.)

trtexec --onnx=${path/to/onnx} --saveEngine=${path/to/saved/engine} --minShapes='images':1x3x640x640 --optShapes='images':16x3x640x640 --maxShapes='images':16x3x640x640 --explicitBatch --fp16

A brief description of each option is as follows:

onnx: path of onnx model to load

saveEngine: path of engine to save

minShapes: minimum input shape; the format is 'onnx input name':batchsize x channel x height x width

optShapes: optimal input shape; same format as above

maxShapes: maximum input shape, i.e., the largest batch size the engine will accept

fp16: build the engine with FP16 precision

(Optional) explicitBatch: use an explicit batch dimension for dynamic batching (applied automatically for onnx models)

(Tip1: It is best for inference time to match optShapes with the batch size you usually use)

(Tip2: It is convenient to set the path as an absolute path)
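For reference, the same conversion can also be done with the TensorRT Python builder API instead of trtexec, as mentioned above. The code below is only a rough sketch assuming TensorRT 8.x; API details differ between versions, so treat it as a starting point rather than a drop-in replacement for trtexec.

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(onnx_path, engine_path, min_batch=1, opt_batch=16, max_batch=16):
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)

    # parse the onnx model
    with open(onnx_path, 'rb') as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError('failed to parse the onnx model')

    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)  # corresponds to the --fp16 option

    # optimization profile corresponds to --minShapes / --optShapes / --maxShapes
    profile = builder.create_optimization_profile()
    profile.set_shape('images',
                      (min_batch, 3, 640, 640),
                      (opt_batch, 3, 640, 640),
                      (max_batch, 3, 640, 640))
    config.add_optimization_profile(profile)

    # build and save the serialized engine
    engine_bytes = builder.build_serialized_network(network, config)
    with open(engine_path, 'wb') as f:
        f.write(engine_bytes)

build_engine('yolov8n.onnx', 'yolov8n.engine')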

You can see what inference time and throughput this engine model shows for batch input with the following command.

trtexec --loadEngine=${path/to/engine} --shapes='images':${batch_size}x3x640x640 --iterations=100

loadEngine is the path of the converted engine.

shapes specifies the input name and shape used for inference. For batch_size, any value equal to or smaller than the max batch size given at conversion time can be entered.
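You can also confirm that the built engine really has a dynamic batch dimension by loading it with the TensorRT Python API and printing the binding shapes; a dynamic dimension is shown as -1. The snippet below is a minimal sketch assuming a TensorRT 8.x environment (these binding APIs are deprecated in newer TensorRT versions) and a hypothetical engine path.

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# hypothetical engine path; replace with your own
runtime = trt.Runtime(TRT_LOGGER)
with open('yolov8n.engine', 'rb') as f:
    engine = runtime.deserialize_cuda_engine(f.read())

for i in range(engine.num_bindings):
    kind = 'input ' if engine.binding_is_input(i) else 'output'
    # the input shape should appear as (-1, 3, 640, 640)
    print(kind, engine.get_binding_name(i), tuple(engine.get_binding_shape(i)))

# min / opt / max shapes of optimization profile 0 for the 'images' input
print(engine.get_profile_shape(0, 'images'))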

Result

This is the result of testing batch sizes 1 to 16 with the Yolov8-m model in a Jetson AGX Xavier (JetPack 4.6) environment.

From batch size 5 onward, the per-image inference time stopped improving at about 18 ms. The reason is that the GPU resources become saturated. (https://forums.developer.nvidia.com/t/inference-time-is-not-improving-with-the-increase-in-batch-size/215710)

Conclusion

In summary, the yolov8 model was converted from the pytorch format through onnx to the engine format used for inference, and batch inference results were briefly examined.

It was confirmed that batch inference does help inference time, but beyond a certain batch size there was no further improvement. Therefore, it is important to find the optimal batch size for your GPU resources.

Next, we will build batch inputs from actual images, implement batch inference with the tensorrt python api, and measure inference time and GPU utilization.


About Authors

Hello, I’m Deeper&Cheaper.

  • I am a developer and blogger with the goal of integrating AI technology into the lives of everyone, pursuing the mission of “Make More People Use AI.” As the founder of the startup Deeper&Cheaper, operating under the slogan “Go Deeper Make Cheaper,” I am dedicated to exploring AI technology more deeply and presenting ways to use it cost-effectively.
  • The name encapsulates the philosophy that “Cheaper” reflects a focus on affordability to make AI accessible to everyone. However, from my perspective, performance is equally crucial, and thus “Deeper” signifies a passion for delving deep with high performance. Under this philosophy, I have accumulated over three years of experience in various AI fields.
  • With expertise in Computer Vision and Software Development, I possess knowledge and skills in diverse computer vision technologies such as object detection, object tracking, pose estimation, object segmentation, and segment anything. Additionally, I have specialized knowledge in software development and embedded systems.
  • Please don’t hesitate to drop your questions in the comments section.
