Scalable Deployment Pipeline of Deep Learning-based Recommender Systems with NVIDIA Merlin

Published in NVIDIA Merlin · May 5, 2021

By Ronay Ak, Onur Yilmaz, Benedikt Schifferer and Even Oldridge

When we work on machine learning (ML) or deep learning (DL) models, we tend to focus on building a highly accurate model that performs well on our validation/test data. However, we need to think beyond that if the goal is to put these models into production and derive useful business insights from them. An end-to-end ML/DL pipeline consists of preprocessing and feature engineering (ETL), model training, and model deployment for inference. Model deployment is the critical step of this pipeline: it is the point where we start using our model for practical business decisions, so the model(s) need to be effectively integrated into the production environment.

Why is deployment of deep learning-based recommender systems hard?

Deploying a trained model to production is a significant engineering problem that is often neglected in the literature.

The models can be accessed directly or indirectly by hundreds of millions of users, so our production system needs to provide high throughput. In addition, online services often have latency requirements, such as serving each request to the user within a few hundred milliseconds. Our production models have to be scalable and provide low latency for each request. Even if we fulfill these requirements, we still want our production environment to be optimally utilized, without computational resources sitting idle. And that's not the end; there are still more requirements for a production system.

A less often discussed challenge is how to deploy the preprocessing and feature engineering workflow. Making sure that the data undergoes the same transformations at serving time as at training time takes significant engineering effort. During ETL, the workflow collects statistics, such as the mean/std of numerical features or the mappings of categorical features (e.g. userID to a contiguous integer). If we use different statistics in production than in training, our production predictions would be essentially random. A big challenge is to keep these in sync, in particular if multiple models are deployed at once (e.g. for A/B tests). We should deploy the ETL workflow and the model together as an ensemble so that we can avoid incorrect predictions in production.
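As a rough sketch of what this looks like in NVTabular (the column names and operators below are illustrative placeholders, not taken from a real pipeline), the workflow learns these statistics in fit() and can then be persisted so production applies exactly the same transformations:

```python
import nvtabular as nvt
from nvtabular import ops

# Categorify learns the userID/itemID -> integer mappings;
# Normalize learns the mean/std of the numerical columns.
cat_features = ["userID", "itemID"] >> ops.Categorify()
cont_features = ["price"] >> ops.FillMissing() >> ops.Normalize()

workflow = nvt.Workflow(cat_features + cont_features)

# fit() computes the statistics on the training data ...
workflow.fit(nvt.Dataset("train/*.parquet"))

# ... and save() persists them, so the exact same mappings and
# statistics can be applied to incoming requests in production.
workflow.save("nvt_workflow")
```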

There are many aspects we need to consider when deploying our models to production. Our production systems need to provide high throughput and low latency, and make good use of computational resources. Data transformation in production has to be equivalent to the training setup, and in the case of deep learning recommender systems, the models have to be retrained frequently to capture information about new users and new items. A recent survey shows that ~68% of the companies that responded take more than 8 days to deploy a trained model into production. The results indicate that model deployment is still seen as one of the biggest challenges of ML operations.

2020 State of Enterprise Machine Learning survey results by Algorithmia: The pie chart shows the time it takes an organization to deploy a single ML model. About 68% of the companies surveyed say they spend more than 8 days deploying one model.

Scaling and Speed Matters

To address the challenges above, we've added Triton Inference Server (TIS) support to NVIDIA Merlin. NVIDIA Merlin is an open source library providing end-to-end GPU-accelerated recommender systems. The TIS integration enables the deployment of deep learning recommender models at scale with GPU acceleration. GPUs provide fast computation and high memory bandwidth of up to 1.8 TB/s, which is excellent for meeting high-throughput and low-latency requirements.

TIS is a cloud and edge inferencing solution optimized to deploy machine learning models on both GPUs and CPUs. It supports HTTP/REST and gRPC protocols that allow remote clients to request inference for any model managed by the server. Some of the features of TIS that stand out are as follows:

  • supports all major deep learning frameworks, as well as custom backends
  • runs multiple models concurrently on GPUs to maximize utilization
  • supports low-latency real-time inferencing and batch inferencing to maximize GPU/CPU utilization
  • integrates with Kubernetes (as a Docker container) for orchestration, metrics, and auto-scaling

With the new release, we are able to deploy NVTabular workflows for data transformation (ETL), together with deep learning models trained with either TensorFlow or HugeCTR, to Triton Inference Server. Combining these features, we can deploy a large and complex recommender workflow to production with only a few lines of code.

We deploy the ETL workflow and the deep learning model as an ensemble model in TIS. An ensemble model represents a pipeline of one or more models and the connections of input and output tensors between those models. Using an ensemble model for this purpose avoids the overhead of transferring intermediate tensors and minimizes the number of requests that must be sent to Triton.

In order to serve a model (or an ensemble model), TIS requires the saved model file(s) and a config file that defines the shape and data type of the inputs and outputs. TensorFlow and PyTorch provide functions to save learned model parameters to file(s). Similarly, NVTabular's workflow and its learned parameters (statistics, category mapping tables, etc.) can be saved to file(s).

Figure 1: End-to-end Training and Deployment Workflow

Figure 1 shows the end-to-end workflow with NVTabular and a DL framework. First, raw data is fed into NVTabular and a predefined workflow (i.e. operations such as FillMissing, LogOp, and Normalize) is executed with the fit() function. Depending on the workflow, parameters such as statistics and category mapping tables are calculated from the raw data. Then, the raw data is transformed using these calculated parameters, and NVTabular stores the transformed data in multiple files (e.g. in parquet format). DL frameworks including TensorFlow, PyTorch and HugeCTR can consume the data stored in parquet format to train a DL model. Once DL model training is completed, the whole end-to-end pipeline is deployed using NVTabular's export_tensorflow_ensemble() function for TensorFlow, export_hugectr_ensemble() for HugeCTR, and export_pytorch_ensemble() for PyTorch models. These functions create three folders, one each for the NVTabular workflow, the DL model, and the ensemble model. They save all the calculated variables, the workflow, and the DL model in these folders, and create the required Triton configuration files.
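For the TensorFlow case, the export step might look roughly like the sketch below. The paths, model name, and label column are placeholders, and the exact argument order of export_tensorflow_ensemble() may differ slightly between NVTabular releases:

```python
import nvtabular as nvt
import tensorflow as tf
from nvtabular.inference.triton import export_tensorflow_ensemble

# `workflow` is the fitted NVTabular workflow from the ETL step and
# `model` is the trained tf.keras model, both produced as in Figure 1.
workflow = nvt.Workflow.load("nvt_workflow")
model = tf.keras.models.load_model("tf_model")

# Write three Triton model folders (NVTabular, TF model, ensemble),
# including the config.pbtxt files Triton needs to serve them.
export_tensorflow_ensemble(
    model,       # trained TensorFlow model
    workflow,    # fitted NVTabular workflow (stats, category mappings)
    "recsys",    # name used for the ensemble in Triton (placeholder)
    "/models",   # Triton model repository path (placeholder)
    ["label"],   # label column(s) to exclude from the serving graph
)
```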

Figure 2: End-to-end Inference Workflow with Triton Ensemble Model

By pointing Triton to these three folders, the ensemble model can be loaded to serve inference queries. Triton initializes the models in the given folders and calls the necessary functions to load the NVTabular workflow, the DL model, and their stats and parameters into memory for use in inference queries.

Figure 2 shows the end-to-end inference workflow. The client starts the inference workflow by sending a query via REST or gRPC. Triton takes this request and sends the data to the NVTabular model. After NVTabular transforms the data by applying the steps stored in the workflow file, Triton transfers the output of NVTabular to the DL model. Finally, the DL model's predict() function is applied to the intermediate data and the response is sent back to the client by Triton.
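As an illustration of the client side, a request through the standard tritonclient library could look roughly like this. The model name ("recsys", matching the export sketch above), the input column, the output name, and the port are assumptions; the input and output names must match whatever the exported NVTabular workflow and DL model actually expose:

```python
import numpy as np
import tritonclient.grpc as grpcclient

# Connect to Triton's gRPC endpoint (default port 8001).
client = grpcclient.InferenceServerClient(url="localhost:8001")

# Send the raw features exactly as they looked before ETL; the NVTabular
# step of the ensemble applies the stored transformations server-side.
user_ids = np.array([[12345], [67890]], dtype=np.int64)
inputs = [grpcclient.InferInput("userID", list(user_ids.shape), "INT64")]
inputs[0].set_data_from_numpy(user_ids)

outputs = [grpcclient.InferRequestedOutput("output")]

# Triton routes the request through NVTabular and then the DL model,
# and returns the predictions to the client.
response = client.infer("recsys", inputs=inputs, outputs=outputs)
print(response.as_numpy("output"))
```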

That all sounds great, but you may still be thinking "Wait a minute, the embedding tables of my deep learning recommender systems are too large to fit on a GPU". Don't worry, we have a solution for these cases as well. Triton Inference Server with HugeCTR provides an embedding cache, which stores the full embedding table on a parameter server (CPU memory or disk) and caches the most frequently used embedding vectors on the GPU. Access to embedding tables often follows a power-law distribution, and only a small fraction of the vectors is required to serve 95%-99% of the requests. Stay tuned for our follow-up blog post or read the HugeCTR user guide.
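As a purely illustrative back-of-the-envelope check of that power-law intuition (the distribution parameters below are made up, not measured from any real workload), a few lines of NumPy show how small a share of items can cover most of the traffic:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate 1M lookups over 100k embedding rows with Zipf-like popularity.
num_items, num_requests = 100_000, 1_000_000
items = rng.zipf(a=1.3, size=num_requests) % num_items

# Count how often each item is requested and sort by popularity.
counts = np.bincount(items, minlength=num_items)
sorted_counts = np.sort(counts)[::-1]

# Fraction of items (i.e. GPU cache size) needed to cover 95% of requests.
cumulative = np.cumsum(sorted_counts) / num_requests
items_needed = np.searchsorted(cumulative, 0.95) + 1
print(f"{items_needed / num_items:.1%} of items cover 95% of requests")
```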

Try out NVIDIA Merlin with Triton Inference Server Integration yourself!

In the latest NVTabular release, we provide examples of end-to-end recommendation pipelines, from ETL to training to inference, with TensorFlow and HugeCTR. Please check these examples for more details. If you are interested in scaling your ETL or training very large recommender systems, check out our scaling examples for Criteo. Our documentation contains information about accelerated training with TensorFlow, PyTorch and HugeCTR as well. We would love to hear your feedback; you can reach us through our GitHub or by leaving a comment here. We look forward to hearing from you!
