[yolov8] Batch inference implementation using TensorRT #3 — batch inference using the TensorRT Python API

DeeperAndCheaper
6 min read · Aug 19, 2023


Introduction

In the previous post, we built an optimized engine that runs on an NVIDIA GPU. The engine can be used for inference through DeepStream or the TensorRT API. Here, we perform batch inference using the TensorRT Python API.
(Note: most of the code shown here is based on examples provided by NVIDIA, with personal modifications.)

Batching your input

There are several ways to batch inputs. A DataLoader from the PyTorch package could be used, but here batching is implemented from scratch with a generator.

import glob
import os

import numpy as np
from PIL import Image

BATCH_SIZE = 5
INPUT_SHAPE_W_BS = (BATCH_SIZE, 3, 640, 640)
ALLOWED_EXTENSIONS = (".jpeg", ".jpg", ".png")

src_files = [
    path for path in glob.iglob(os.path.join(img_path, "**"), recursive=True)
    if os.path.isfile(path) and path.lower().endswith(ALLOWED_EXTENSIONS)
]
if len(src_files) == 0:
    raise Exception(
        "ERROR: src data path [{}] contains no files!".format(img_path)
    )

# Add files to make the list length a multiple of the batch size
# (e.g. 13 files with BATCH_SIZE = 5 -> 2 files are reused to reach 15)
if len(src_files) % BATCH_SIZE != 0:
    src_files += src_files[len(src_files) % BATCH_SIZE : BATCH_SIZE]

# initialize the batch buffer
init_batch = np.zeros(INPUT_SHAPE_W_BS, dtype=np.float32)

# make batches with a generator
def load_batches(batch, src_files, preprocessing_func):
    for i in range(0, len(src_files), BATCH_SIZE):
        for offset in range(BATCH_SIZE):
            img = Image.open(src_files[i + offset])
            batch[offset] = preprocessing_func(img)
        yield batch

def get_batch(batches):
    try:
        batch = next(batches)
        return batch
    except StopIteration:
        return None

As a result, every call to get_batch returns a batched input of shape (BATCH_SIZE, 3, 640, 640).
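For reference, here is a minimal usage sketch of the generator, assuming the src_files and init_batch prepared above and a pre-processing function such as default_preprocessing from the next section; the main function at the end of this article follows the same pattern:

batches = load_batches(init_batch, src_files, default_preprocessing)
num_batches = len(src_files) // BATCH_SIZE  # src_files was padded to a multiple of BATCH_SIZE
for _ in range(num_batches):
    batch = get_batch(batches)  # ndarray of shape (BATCH_SIZE, 3, 640, 640), or None when exhausted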

Pre-processing

The pre-processing here refers to the resizing, normalization, or letterboxing applied during training.

If you look at the load_batches function, you can see that each image is pre-processed before it is placed into the batch. The pre-processing function used here is as follows.

def default_preprocessing(img):
    inp_w, inp_h = INPUT_SHAPE_W_BS[2], INPUT_SHAPE_W_BS[3]
    # Simple resize to the network input size (aspect ratio is not preserved)
    new_image = img.resize((inp_w, inp_h), Image.BICUBIC)

    # Rescaling factor: 1/255 = 0.0039215697906911373
    scaled_image = np.asarray(new_image, dtype=np.float32) * 0.0039215697906911373
    # HWC (numpy array from PIL) -> CWH -> CHW
    whc2cwh = np.swapaxes(scaled_image, 2, 0)
    cwh2chw = np.swapaxes(whc2cwh, 2, 1)

    return cwh2chw

No letterboxing is involved here; the image is simply resized to the network input size.
For letterboxing pre-processing, see:

def letterbox_preprocessing(img):
    scr_w, scr_h = img.size
    inp_w, inp_h = INPUT_SHAPE_W_BS[2], INPUT_SHAPE_W_BS[3]
    # Letterboxing: resize while keeping the aspect ratio, then pad to the input size
    scale_ratio = min(inp_w / scr_w, inp_h / scr_h)
    nw = int(scr_w * scale_ratio)
    nh = int(scr_h * scale_ratio)
    image = img.resize((nw, nh), Image.BICUBIC).copy()
    new_image = Image.new("RGB", (inp_w, inp_h))
    new_image.paste(image, ((inp_w - nw) // 2, (inp_h - nh) // 2))

    scaled_image = (
        np.asarray(new_image, dtype=np.float32) * 0.0039215697906911373
    )  # Rescaling factor: 1/255 = 0.0039215697906911373
    # PIL image pre-processing
    # HWC (numpy array from PIL) -> 1. CWH -> 2. CHW (batched later into NCHW)
    whc2cwh = np.swapaxes(scaled_image, 2, 0)  # 1. move the channel axis to the front
    cwh2chw = np.swapaxes(whc2cwh, 2, 1)       # 2. swap the remaining W and H axes
    img_data = cwh2chw
    return img_data

Inferencing

TensorRT provides four execution methods (execute, execute_async, execute_v2, execute_async_v2), distinguished by synchronous vs. asynchronous execution and fixed vs. dynamic batch. The v2 variants support dynamic batching, while sync/async determines whether the GPU is driven synchronously or asynchronously; you can choose the one that fits your purpose. Here, execute_async_v2, which supports both asynchronous execution and dynamic batching, is used. (https://docs.nvidia.com/deeplearning/tensorrt/api/python_api/infer/Core/ExecutionContext.html)
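For reference, a minimal sketch of the two v2 call styles, assuming an execution context, a bindings list, and a pycuda stream that are all prepared in the code below:

# Synchronous: blocks until inference for the current batch has finished.
context.execute_v2(bindings=bindings)

# Asynchronous: enqueues inference on the CUDA stream and returns immediately;
# stream.synchronize() must be called before reading the outputs.
context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
stream.synchronize()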
execute_async_v2 is called on an execution context and takes the bindings and stream_handle as inputs. Let's go through some of the steps needed to prepare these inputs, and then look at the final inference code.

import pycuda.autoinit  # creates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

def allocate_buffers(engine, batch_size=1):
    inputs = []
    outputs = []
    bindings = []
    stream = cuda.Stream()
    for binding in engine:
        size = trt.volume(engine.get_binding_shape(binding)) * batch_size
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        # Allocate host (page-locked) and device buffers
        host_mem = cuda.pagelocked_empty(-size if size < 0 else size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        # Append the device buffer address to the device bindings.
        bindings.append(int(device_mem))
        # Append to the appropriate list.
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    return inputs, outputs, bindings, stream

class HostDeviceMem(object):
    def __init__(self, host_mem, device_mem):
        """
        Within this context, host_mem means the CPU memory and device_mem means the GPU memory
        """
        self.host = host_mem
        self.device = device_mem

    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

    def __repr__(self):
        return self.__str__()

The allocate_buffers function allocates, for every binding of the engine, a host (CPU) buffer and a device (GPU) buffer with the data type and size that the binding requires, and returns them together with the binding pointers and a CUDA stream.
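As a rough worked example of the sizes involved, assuming the input shape used in this article and float32 data:

num_elements = BATCH_SIZE * 3 * 640 * 640                  # 6,144,000 elements for the input binding
num_bytes = num_elements * np.dtype(np.float32).itemsize   # 24,576,000 bytes (~24.6 MB), allocated both
                                                            # as page-locked host memory and device memory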

def do_inference(context, bindings, inputs, outputs, stream, batch_size=1):
    # Set the actual batch size on the (dynamic) input binding.
    context.set_binding_shape(0, [batch_size, *context.get_binding_shape(0)[1:]])
    # Transfer input data from the CPU to the GPU.
    for inp in inputs:
        device_ptr = inp.device  # device (GPU) buffer
        host_array = inp.host    # host (CPU) input array
        cuda.memcpy_htod_async(device_ptr, host_array, stream)

    # Run inference.
    # context.execute_v2(bindings=bindings)  # synchronous variant (e.g., for profiling)
    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)

    # Transfer predictions back from the GPU.
    for out in outputs:
        cuda.memcpy_dtoh_async(out.host, out.device, stream)

    # Synchronize the stream.
    stream.synchronize()

    # Return only the host outputs.
    host_outputs = []
    for out in outputs:
        host_outputs.append(out.host)
    return host_outputs

The do_inference function copies the inputs to the device memory allocated for the engine, runs inference through the execution context, copies the predictions back, and returns the host outputs.

Post-processing

The inference result comes out as a flat vector (1 x N) and needs to be reshaped according to the batch size and the output shape of the model.
(Here, the original yolov8 has a single output, output0. As described in article 2 of this series, the model can be modified to expose separate conf, class_id, and bbox outputs, which is what is assumed below.)

def reshape_trt_outputs(h_outputs, shape_of_output):
    # Restore the flat output vector to its original (batched) shape
    h_outputs = h_outputs.reshape(*shape_of_output)
    return h_outputs

reshape_trt_outputs restores the original shape of an inference output, since each output comes back as a flat vector.
For example, with a batch size of 5, the bbox output comes out with shape (1, 168000) and is reshaped to (5, 8400, 4).
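As a quick numpy illustration of that reshape, reusing reshape_trt_outputs and the numbers from the example above:

flat_bbox = np.zeros((1, 5 * 8400 * 4), dtype=np.float32)   # flat output: 168,000 values
shaped_bbox = reshape_trt_outputs(flat_bbox, (5, 8400, 4))  # -> (BATCH_SIZE, 8400, 4)
print(shaped_bbox.shape)                                    # (5, 8400, 4)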

def parse(trt_output, trt_output_shape):
    # To remove any dependency on the index order of the outputs, match them by name.
    for i in range(len(trt_output_shape)):
        if trt_output_shape[i]["name"] == 'conf':
            shaped_trt_conf = reshape_trt_outputs(trt_output[i], trt_output_shape[i]["shape"])
        elif trt_output_shape[i]["name"] == 'bbox':
            shaped_trt_bbox = reshape_trt_outputs(trt_output[i], trt_output_shape[i]["shape"])
        elif trt_output_shape[i]["name"] == 'class_id':
            shaped_trt_class = reshape_trt_outputs(trt_output[i], trt_output_shape[i]["shape"])
    return shaped_trt_bbox, shaped_trt_conf, shaped_trt_class

The parse function reshapes each output by name and returns the final outputs. As a result, shaped_trt_bbox has shape (BATCH_SIZE, 8400, 4), while shaped_trt_conf and shaped_trt_class have shape (BATCH_SIZE, 8400).
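As a minimal consumption sketch (not from the original post), the parsed outputs can be filtered per image with a confidence threshold; the threshold value and variable names here are only for illustration, and NMS is still missing at this point:

CONF_THRESHOLD = 0.25  # assumed value, for illustration only

for b in range(BATCH_SIZE):
    keep = shaped_trt_conf[b] > CONF_THRESHOLD   # (8400,) boolean mask
    boxes = shaped_trt_bbox[b][keep]             # (num_kept, 4)
    scores = shaped_trt_conf[b][keep]            # (num_kept,)
    class_ids = shaped_trt_class[b][keep]        # (num_kept,)
    print("image {}: {} raw detections above threshold".format(b, len(boxes)))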

main function

import time

with open(engine_path, 'rb') as f, trt.Runtime(trt.Logger(trt.Logger.WARNING)) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
inputs, outputs, bindings, stream = allocate_buffers(engine, batch_size=BATCH_SIZE)
context = engine.create_execution_context()

init_time = time.time()
batches = load_batches(init_batch, src_files, default_preprocessing)
for i in range(len(src_files) // BATCH_SIZE):
    batch = get_batch(batches)
    # Copy the batch into the page-locked input host buffer
    np.copyto(inputs[0].host, batch.ravel())
    trt_output = do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs,
                              stream=stream, batch_size=BATCH_SIZE)

    # To remove any dependency on the output order, use the engine's binding index
    # to look up each output's name and shape.
    output_shape = []
    for binding_idx in range(1, 4):  # index 0 is the input, indices 1 ~ 3 are the outputs
        output_shape.append({"name": engine.get_binding_name(binding_idx),
                             "shape": (BATCH_SIZE, *engine.get_binding_shape(binding_idx)[1:])})
    shaped_trt_bbox, shaped_trt_conf, shaped_trt_class = parse(trt_output, output_shape)

The main function deserializes the engine, runs batched inference, and obtains the final bbox, conf, and class_id outputs.
An example yolov8 image result is shown below.

The result above is the raw yolov8 output and does not include NMS post-processing. (The NMS post-processing code will be implemented in a later article.)

Conclusion

In this article, batch inference was performed with the yolov8 model using the TensorRT Python API. It was fascinating to see the raw output of the deep learning network after always seeing results refined by post-processing. Next, we will implement NMS post-processing to produce cleaner results.

Reference

TensorRT inference code example — https://github.com/NVIDIA/TensorRT/blob/release/8.6/samples/python

Trending Articles

Hit! [yolov8] converting to Batch model engine

Hit! [Quantization] Go Faster with ReLU!

[Quantization] Achieve Accuracy Drop to Near Zero

[Quantization] How to achieve the best QAT performance

[Yolov8/Jetson/Deepstream] Benchmark test

[yolov8] NMS Post Processing implementation using only Numpy

[yolov8] batch inference using TensorRT python api

About Authors

Hello, I’m Deeper&Cheaper.

  • I am a developer and blogger with the goal of integrating AI technology into the lives of everyone, pursuing the mission of “Make More People Use AI.” As the founder of the startup Deeper&Cheaper, operating under the slogan “Go Deeper Make Cheaper,” I am dedicated to exploring AI technology more deeply and presenting ways to use it cost-effectively.
  • The name encapsulates the philosophy that “Cheaper” reflects a focus on affordability to make AI accessible to everyone. However, from my perspective, performance is equally crucial, and thus “Deeper” signifies a passion for delving deep with high performance. Under this philosophy, I have accumulated over three years of experience in various AI fields.
  • With expertise in Computer Vision and Software Development, I possess knowledge and skills in diverse computer vision technologies such as object detection, object tracking, pose estimation, object segmentation, and segment anything. Additionally, I have specialized knowledge in software development and embedded systems.
  • Please don’t hesitate to drop your questions in the comments section.
