Exploring Production Ready Recommender Systems with Merlin

Radek Osmulski
Jul 5, 2022


by Radek Osmulski and Benedikt Schifferer

It has become really easy to start a machine learning project. Many libraries, such as XGBoost, TensorFlow, or PyTorch, make it easy to train a model. Likewise, there are many solutions for serving models in production. For either of these tasks done in isolation, you pick an off-the-shelf package and are off to the races.

Problems arise when you want to train an even slightly customized model, tailored to your use case and your data, and then deploy that model to production. Now the situation becomes much more complex. How do you package the trained model in a way that can be served? How do you make sure that the data your model sees in production is transformed in the same way as the data it was exposed to during training?

And to bring these considerations even closer to real life, what about the research stage? It is quite common to use different tools for research, then switch to another set of tools for training and yet another set of tools for deployment.

Instead of solving the actual problem, which is delivering good recommendations to your customers, your time is spent on juggling all these pieces, doing your best to integrate them correctly, each with its own set of quirks.

The situation is so dire that, according to Algorithmia's 2021 Enterprise Trends in Machine Learning report, only 36% of companies are able to deploy a model within a month, with nearly a third of companies needing more than 6 months!

Easing Recommender Model Deployment

At NVIDIA, on the Merlin team, we understand exactly how that feels. We are practitioners at heart and have cut our teeth solving this particular problem at some of the largest organizations in the world. We are now developing a set of tools that provides an integrated pipeline: from ideation, through experimentation and training on arbitrarily large data, all the way to deployment.

The overarching idea is to capture and reuse information from the earlier stages (research and training) to streamline deployment and reduce the chance of introducing subtle bugs. Having the various pieces of your pipeline talk to each other, instead of you translating between disparate dialects, already puts you in a good place.

But even for the final stage, deployment, the Merlin Framework has a lot of custom functionality to offer. We don't develop tools where recommender models are just one model type among many, a mere backdrop to CV and NLP models. We put recommender models center stage and provide all the functionality they might require.

Let’s take a look at what deployment with Merlin Systems might look like. First, let us consider how we can reuse the information we provided while preprocessing data and training our model to streamline model deployment.

Building on top of our earlier work

To preprocess data at scale, and to do so in a timely manner, we might have leveraged NVTabular, a data preprocessing library with a collection of industry-best-practice operators. It executes on the GPU and can scale to arbitrarily large data by leveraging dask_cudf.

In the dataset preprocessing stage we have described our dataset and the preprocessing steps to be taken and have collected all this information into a Workflow.
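As a rough sketch of what that preprocessing stage might look like (the column names and file paths here are hypothetical placeholders, not from the original example):

```python
import nvtabular as nvt
from nvtabular import ops

# Hypothetical columns for a recommender dataset: categorical ids are
# mapped to contiguous integers, continuous features are normalized.
cat_features = ["user_id", "item_id"] >> ops.Categorify()
cont_features = ["price"] >> ops.Normalize()

# Collect the preprocessing graph into a Workflow, fit it on the
# training data, and save it for reuse at serving time.
workflow = nvt.Workflow(cat_features + cont_features)
workflow.fit(nvt.Dataset("train.parquet"))
workflow.save("workflow")
```

The key point is that the saved Workflow carries both the graph of operations and the statistics fitted on the training data, so the exact same transformations can be replayed at serving time.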

Let us now load it along with a model that we trained.
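Assuming a TensorFlow model and placeholder paths for wherever the artifacts were saved, loading them back might look like this:

```python
import tensorflow as tf
from nvtabular.workflow import Workflow

# Placeholder paths; point these at your exported artifacts.
workflow = Workflow.load("workflow")
model = tf.keras.models.load_model("dlrm")
```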

Looking good!

We can now create a pipeline that will leverage the preprocessing Workflow along with the model that we trained and combine it with an operator for making predictions.
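A sketch of that step, assuming the `workflow` and `model` objects loaded above and a TensorFlow model (Merlin Systems also ships operators for other backends):

```python
from merlin.systems.dag.ops.tensorflow import PredictTensorflow
from merlin.systems.dag.ops.workflow import TransformWorkflow

# Chain the preprocessing Workflow with the trained model: incoming
# request columns are transformed exactly as during training, then
# passed to the model for scoring.
serving_operators = (
    workflow.input_schema.column_names
    >> TransformWorkflow(workflow)
    >> PredictTensorflow(model)
)
```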

And we are all set!

The only step that remains now is to export our ensemble and reference the directory to run an instance of the Triton Inference Server to host it. The Triton Inference Server is a high-throughput, highly scalable inference solution. It puts advanced deployment techniques at your fingertips (such as dynamic batching with latency guarantees) and integrates seamlessly with the Merlin Framework. To learn more about the capabilities of the Triton Inference Server, please consult this blog post.
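The export itself can be as short as the sketch below, where `serving_operators` is the pipeline defined earlier and `/export_path/` is a placeholder directory:

```python
from merlin.systems.dag.ensemble import Ensemble

# Wrap the serving graph in an Ensemble and export it in the layout
# the Triton Inference Server expects to find in its model repository.
ensemble = Ensemble(serving_operators, workflow.input_schema)
ensemble.export("/export_path/")
```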

Now that we have exported our ensemble, all we need to do is run the server by executing tritonserver --model-repository=/export_path/ and our model is deployed!

This was a short overview of the functionality of Merlin Systems. If you would like to follow along in code, we prepared an example notebook you can access here.

This was a good start but we know that often the functioning of a recommender system is much more complex. In one of our other blog posts, Recommender Systems, Not Just Recommender Models, we go deeper into outlining the 4 distinct stages of serving recommendations along with their unique challenges.

Let us now see how we could address such a more complex scenario with just a few lines of code.

Defining complex recommender system workflows

We begin by loading the components we have prepared in data preprocessing and training. This time we will also use a feature store.
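With Feast as the feature store, that loading step might look like the following sketch (all paths and model names here are hypothetical):

```python
import feast
import tensorflow as tf
from nvtabular.workflow import Workflow

# Placeholder paths for the artifacts produced during preprocessing
# and training, plus the Feast repository holding our features.
workflow = Workflow.load("workflow")
retrieval_model = tf.keras.models.load_model("query_tower")
ranking_model = tf.keras.models.load_model("ranking")
feature_store = feast.FeatureStore(repo_path="feature_repo")
```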

Defining the fields we expect to see in requests and fetching user features from the feature store can be achieved very succinctly as depicted below.
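A sketch of those two steps, assuming a Feast feature store and a hypothetical feature view named "user_features":

```python
import feast
import numpy as np
from merlin.schema import ColumnSchema, Schema
from merlin.systems.dag.ops.feast import QueryFeast

feature_store = feast.FeatureStore(repo_path="feature_repo")  # placeholder path

# The request only needs to carry a user id; every other user feature
# is fetched from the feature store at serving time.
request_schema = Schema([ColumnSchema("user_id", dtype=np.int32)])

user_features = ["user_id"] >> QueryFeast.from_feature_view(
    store=feature_store, view="user_features", column="user_id"
)
```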

We are now ready to run our models.

We begin by defining the candidate retrieval operation, followed by filtering, and ultimately we rank candidates based on the interaction between user and item features.
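A condensed sketch of those three stages, continuing with hypothetical names (`user_features` for the feature-store lookup node, `retrieval_model` and `ranking_model` for the loaded models, and placeholder column names such as `item_ids_seen`); the exact operator arguments depend on your schema:

```python
from merlin.systems.dag.ops.faiss import QueryFaiss
from merlin.systems.dag.ops.session_filter import FilterCandidates
from merlin.systems.dag.ops.tensorflow import PredictTensorflow
from merlin.systems.dag.ops.unroll_features import UnrollFeatures

# Retrieval: embed the user with the query tower, then pull the
# nearest candidate items from a prebuilt FAISS index.
retrieval = (
    user_features
    >> PredictTensorflow(retrieval_model)
    >> QueryFaiss("faiss_index", topk=100)
)

# Filtering: drop candidates the user has already interacted with.
filtering = retrieval["candidate_ids"] >> FilterCandidates(
    filter_out=user_features["item_ids_seen"]
)

# Ranking: pair each candidate item with the user's features and
# score the pairs with the ranking model.
combined_features = filtering >> UnrollFeatures("item_id", user_features)
ranking = combined_features >> PredictTensorflow(ranking_model)
```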

All that we have to do now is sort our predictions in order of how relevant they may be. By adjusting the temperature parameter of softmax sampling, we control the explore-exploit trade-off in our recommendations.
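Merlin Systems exposes this as a SoftmaxSampling operator; the underlying idea can be illustrated with a small self-contained NumPy sketch (the function names here are ours, not Merlin's):

```python
import numpy as np

def softmax_probs(scores, temperature):
    """Turn relevance scores into sampling probabilities.

    Low temperature concentrates probability mass on the top-scoring
    item (exploit); high temperature flattens the distribution toward
    uniform (explore).
    """
    logits = np.asarray(scores, dtype=float) / temperature
    exps = np.exp(logits - logits.max())  # subtract max for numerical stability
    return exps / exps.sum()

def sample_ranking(scores, temperature, rng):
    """Sample an ordering of all items, biased toward higher scores."""
    probs = softmax_probs(scores, temperature)
    return rng.choice(len(probs), size=len(probs), replace=False, p=probs)
```

With a near-zero temperature this reduces to plain sorting by score; raising the temperature lets lower-scored items occasionally surface, which is exactly the exploration knob the paragraph above describes.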

The only thing that remains now is to output the pipeline.
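Assuming the final node of the graph is bound to a variable such as `ordering` and the request schema defined earlier, exporting the whole pipeline is again a couple of lines:

```python
from merlin.systems.dag.ensemble import Ensemble

# Export the full multi-stage graph to the directory that Triton
# will be pointed at.
ensemble = Ensemble(ordering, request_schema)
ensemble.export("./ensemble")
```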

We now have everything that we need and can get our more complex recommender system pipeline served just by pointing the Triton Inference Server to the ./ensemble directory.

If you would like to devise a workflow specific to your data and efficiently bring models into production, please take a look at our example notebooks for further exploration.

Where to go next

If you found this way of developing recommender systems appealing, please check out our product page or the libraries we developed to learn more. You can also find the code for the examples we discussed above in our repositories here and here. If you have any comments or questions, we would love to hear from you. The best way to get in touch is via GitHub issues, where we would be delighted to answer any questions you might have.

Also, on July 28, 2022, we will host NVIDIA's RecSys Summit 2022 online. Many guest speakers from industry will talk about the challenges of applying recommender systems in the real world. Sign up and join for free!

Thank you very much for reading. Talk soon!


