End-to-End Recommender Systems with Merlin: Part 3

Aryan Gupta
6 min read · Jan 25, 2022

--

A tour of the inference implementation using the Triton Inference Server inside Merlin

At a Glance: So far we have gone through the pre-processing pipeline for the Criteo dataset using the NVTabular toolkit inside the Merlin SDK. This ETL stage is covered in Part 1. In Part 2, we walked through the training procedure using HugeCTR, exploring four standard state-of-the-art architectures with the HugeCTR training toolkit inside Merlin.

In this section, we are going to explore the inference implementation, the final deployment step, using the Triton Inference Server.

Merlin At A Glance

Figure 1: Merlin Architecture

NVIDIA's Merlin consists of three crucial components: feature engineering, training, and inference. The feature engineering pipeline is supported by NVTabular, training is handled by HugeCTR, and deployment is taken care of by the Triton Inference Server. Figure 1 explains the architecture of Merlin and its underlying pipeline in depth.

Triton Inference Server

NVIDIA's Triton Inference Server was formerly known as the TensorRT Inference Server. It delivers fast and scalable AI in production and streamlines AI inferencing, enabling teams to deploy trained models from any framework on any GPU- or CPU-based infrastructure. Figure 2 shows the inference server working with a diverse set of frameworks.

Figure 2: Inference Server supports multiple diverse frameworks

Features

Some of the crucial features that make the Triton server stand out are:

  • Support for diverse frameworks: It supports inferencing for a wide set of frameworks, including TensorFlow, TensorRT, PyTorch, MXNet, Python, ONNX Runtime, and RAPIDS FIL.
  • Support for Batching: The server can handle a batch of input requests and corresponding predictions.
  • Multi GPU Support: The inferencing procedures can be distributed across all the GPUs.
  • Concurrent execution of models: Multiple instances of the same models or multiple models can be executed simultaneously on the same GPU.
  • Metrics: Detailed metrics reporting GPU utilization, server throughput, and server latency.

Prerequisites for DLRM model deployment

The trained DLRM model is to be deployed using the Triton Inference Server, and there are several important prerequisites to fulfil first. The initial step is to generate a DLRM deployment configuration file. To achieve this smoothly, it is important to maintain a clean folder hierarchy and place all the related files under the same tree. So, the commands below initialize the model repository folder structure.
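
A minimal sketch of this step, assuming the repository lives at dlrm_infer/model (the same tree referenced later when launching the server) and follows Triton's <model-name>/<version>/ layout:

# Create the Triton model repository using the <model-name>/<version>/ convention
mkdir -p dlrm_infer/model/dlrm/1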

The next step is to copy all the trained DLRM model files into this model repository. It is achieved with the following commands.
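
For example (the file names below are assumptions; use the sparse model folder, dense model file, and inference network JSON produced by your HugeCTR training run in Part 2):

# Copy the HugeCTR training artifacts into the versioned model folder
cp -r ./dlrm0_sparse_20000.model dlrm_infer/model/dlrm/1/   # sparse embedding table (a directory)
cp ./dlrm_dense_20000.model dlrm_infer/model/dlrm/1/        # dense weights
cp ./dlrm.json dlrm_infer/model/dlrm/1/                     # network description used for inference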

Once a clean model repository has been initialised, the next step is to generate the Triton configuration file for deploying DLRM. It will be saved under dlrm_model_repo/config.pbtxt. This file needs the max_batch_size and the instance group, i.e. the number of model instances and the GPU id to deploy on. It is achieved with a configuration file like the following.
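
A sketch of such a config.pbtxt is shown below. The input/output tensor names (DES, CATCOLUMN, ROWINDEX, OUTPUT0) and the parameters block follow the conventions of the HugeCTR Triton backend samples; the batch size, GPU id, and paths are assumptions to adapt to your own setup.

name: "dlrm"
backend: "hugectr"
max_batch_size: 64
input [
  {
    name: "DES"
    data_type: TYPE_FP32
    dims: [ -1 ]
  },
  {
    name: "CATCOLUMN"
    data_type: TYPE_INT64
    dims: [ -1 ]
  },
  {
    name: "ROWINDEX"
    data_type: TYPE_INT32
    dims: [ -1 ]
  }
]
output [
  {
    name: "OUTPUT0"
    data_type: TYPE_FP32
    dims: [ -1 ]
  }
]
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
# HugeCTR-specific parameters; your backend version may expect additional keys
# (embedding cache settings, slot counts, and so on)
parameters [
  {
    key: "config"
    value: { string_value: "/workspace/aryan/dlrm_infer/model/dlrm/1/dlrm.json" }
  }
]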

Once this DLRM configuration file has been set up, the next step is to generate a HugeCTR backend parameter server configuration file for deploying DLRM. It needs the exact paths of the sparse and dense model files; this is where the earlier file organisation pays off, since all the required files sit in one model repository for clean access. The file is saved under model_folder/ps.json and follows the JSON format.
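
A sketch of such a ps.json is given below. The keys follow the HugeCTR backend samples (the exact schema varies between backend versions), and the file names are the hypothetical ones used in the copy step above:

{
  "supportlonglong": true,
  "models": [
    {
      "model": "dlrm",
      "sparse_files": ["/workspace/aryan/dlrm_infer/model/dlrm/1/dlrm0_sparse_20000.model"],
      "dense_file": "/workspace/aryan/dlrm_infer/model/dlrm/1/dlrm_dense_20000.model",
      "network_file": "/workspace/aryan/dlrm_infer/model/dlrm/1/dlrm.json"
    }
  ]
}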

With this, all the prerequisites for our model deployment are in place. The next step is to start the Docker container from NVIDIA NGC that runs the Triton Inference Server. The simplest way to get started with Triton is to pull the container using the command below.
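
# Pull the Merlin inference container from NVIDIA NGC (replace <xx.yy> with a released tag)
docker pull nvcr.io/nvidia/merlin/merlin-inference:<xx.yy>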

In the above command, xx.yy specifies the version of the Merlin inference container; the available tags can be viewed in the NVIDIA NGC catalog. The next step is to run the pulled container. It is done using a script like the one below:
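
A sketch of such a docker run invocation, assuming the host directory holding the model repository is mounted into the container and Triton's default HTTP (8000), gRPC (8001), and metrics (8002) ports are published; the host path and image reference are placeholders:

docker run --gpus=all -it --rm \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v /path/on/host/dlrm_infer:/workspace/aryan/dlrm_infer \
  <image-id-or-tag>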

This is just a sample script to run the container; I have mapped some ports depending on my use case. Also, instead of mentioning nvcr.io/nvidia/merlin/merlin-inference:<xx.yy>, I have directly used the image id of the container, which can easily be found using the command below:
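
# List local images and locate the Merlin inference image id
docker images | grep merlin-inference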

Once the inference container has started, the next step is to launch a Triton server with the model repository, configuration file, and parameter server file that we developed above. A sample script to achieve this is given below:
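
A sketch of the launch command, using the paths listed below; the hugectr-specific --backend-config keys follow the HugeCTR backend samples and may need adjusting for your container version:

tritonserver --model-repository=/workspace/aryan/dlrm_infer/model/ \
    --load-model=dlrm \
    --model-control-mode=explicit \
    --backend-config=hugectr,dlrm=/workspace/aryan/dlrm_infer/model/dlrm/1/dlrm.json \
    --backend-config=hugectr,supportlonglong=true \
    --backend-config=hugectr,ps=/workspace/aryan/dlrm_infer/model/ps.json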

In the above script, several paths have to be specified precisely.

  • The parameter server file is present at: /workspace/aryan/dlrm_infer/model/ps.json
  • All the model files are present inside the model repository at: /workspace/aryan/dlrm_infer/model/
  • The developed model needs to be loaded at: /workspace/aryan/dlrm_infer/model/dlrm/1/dlrm.json

You may tweak these scripts as per your need and folder hierarchies.

The next important step is to check the Triton server's status and confirm that the DLRM model has been deployed successfully. To do this, we'll make use of the curl command.
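
For example, hitting Triton's readiness endpoint:

curl -v localhost:8000/v2/health/ready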

If the deployment is successful, it will give an output similar to the one shown in Figure 3 below.

Figure 3: Triton Deployment Successful

It should return an HTTP 200 OK response. By default, Triton serves HTTP requests on port 8000.

Once the server-side model is deployed, the next step is to generate input data for inferencing. To do so, we utilise the validation split that we kept aside in Part 1. Since the data is already in Parquet format, we read it with the pandas library, take the first 200,000 rows, and convert them to CSV.
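
A minimal sketch of this step; the Parquet path and the output CSV name are placeholders for your own layout:

import pandas as pd

# Read the Criteo validation split produced in Part 1 (placeholder path)
df = pd.read_parquet("/path/to/criteo/valid/part_0.parquet")

# Keep only the first 200,000 rows for building inference requests
df = df.head(200000)
print(df.head())  # quick look at the data (Figure 4)

# Export to CSV for the JSON-conversion step below
df.to_csv("infer_test.csv", sep=",", index=False)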

Printing the head of the validation DataFrame in the script above gives output like the one shown in Figure 4.

Figure 4: Validation Data head Output

The next step is to generate input data in the JSON format that Triton's performance tooling expects. It is achieved with a conversion script along the following lines.
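
The sketch below assumes the DES / CATCOLUMN / ROWINDEX tensor names used by the HugeCTR backend, and that each of the 26 categorical slots holds exactly one key, so the row index is a simple running offset. The convert helper and the keys of the source-format dictionary are my own names, chosen to match the description that follows:

import json
import pandas as pd

def convert(src_csv, src_config, dst_path, batch_size=1, delimiter=","):
    """Turn the first batch_size rows of the CSV into a perf_analyzer input-data JSON file."""
    df = pd.read_csv(src_csv, sep=delimiter)
    dense_cols = src_config["dense_columns"]        # continuous features, e.g. I1..I13
    cat_cols = src_config["categorical_columns"]    # categorical features, e.g. C1..C26

    batch = df.iloc[:batch_size]
    des = batch[dense_cols].to_numpy().astype(float).flatten().tolist()
    catcolumn = batch[cat_cols].to_numpy().astype(int).flatten().tolist()
    # One embedding key per slot -> CSR row offsets are just 0, 1, ..., len(catcolumn)
    rowindex = list(range(len(catcolumn) + 1))

    payload = {"data": [{"DES": des, "CATCOLUMN": catcolumn, "ROWINDEX": rowindex}]}
    with open(dst_path, "w") as f:
        json.dump(payload, f)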

The above script parses the input data according to the specified batch size. The convert function takes the CSV source file, the dlrm_input_format source configuration (which still needs to be defined), the destination path, and the batch size, along with the delimiter. Let's define the DLRM input format for the source configuration, as shown in the chunk of code below.
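
A sketch of that source configuration, assuming the Criteo column naming from Part 1 (13 continuous columns I1..I13 and 26 categorical columns C1..C26); the dictionary keys simply match the convert sketch above:

# Describe which CSV columns feed which DLRM input tensor
dlrm_input_format = {
    "label_column": "label",
    "dense_columns": ["I" + str(i) for i in range(1, 14)],        # 13 continuous features
    "categorical_columns": ["C" + str(i) for i in range(1, 27)],  # 26 categorical features
}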

Now we are all set to generate the input data and save it to a destination file. For convenience, the file is named after the batch size. It is achieved using the script below:
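
Continuing the sketch above (the CSV name and helper are the hypothetical ones introduced earlier):

batch_size = 1
convert("infer_test.csv", dlrm_input_format,
        dst_path=f"{batch_size}.json", batch_size=batch_size, delimiter=",")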

The generated data is stored in a file named after the batch size; since the batch size here is 1, the file is 1.json.

In the end, the final step is to generate the inference benchmark with the Triton performance toolkit, initially for a batch size of 1. We'll be using perf_analyzer, a performance analysis tool: measuring inference performance lets you track the effect of changes as you experiment with different optimization strategies. perf_analyzer, formerly known as perf_client, generates inference requests to the deployed model and measures the latency and throughput of those requests. It does so over a specific time window and repeats the process until the measurements stabilize. It is achieved with a command like the following:
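
A sketch of the perf_analyzer invocation; the tensor names and shape sizes (13 dense values, 26 categorical keys, 27 row offsets for batch size 1) must match your config.pbtxt and the generated JSON, so treat them as assumptions:

perf_analyzer -m dlrm -u localhost:8000 --input-data 1.json \
    --shape DES:13 --shape CATCOLUMN:26 --shape ROWINDEX:27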

The model is already hosted at localhost:8000, which is passed as a parameter along with the input data (1.json in this case) and the shapes of the categorical, dense (descriptive), and row-index input tensors. The output for the above inference request is shown in Figure 5.

Figure 5: Output for Inference Request with BatchSize 1

You can try this out with several other batch sizes. Make sure to change the batch size while generating the input data, and also the max_batch_size limit in the config.pbtxt file.

AUTHORS

Aryan Gupta: He is an Intern at NVIDIA.

Pallab Maji: He is a Senior Solutions Architect at NVIDIA.
