NVIDIA Merlin meets the MLOps ecosystem: building a production-ready RecSys pipeline on cloud

Published in NVIDIA Merlin · Feb 22, 2023

by Jacopo Tagliabue, Hugo Bowne-Anderson, Ronay Ak, Gabriel Moreira, and Sara Rabhi

Natural Language Processing has experienced phenomenal growth in the past few years. Among other factors, the interplay between research and the open-source ecosystem played a central role in this story: libraries like Hugging Face provide easy-to-use, widely adopted APIs, which shorten feedback loops and incentivize new model releases to follow a common, paved path. A plethora of cloud services quickly sprang up to productionize these models, taking advantage of the open nature of the APIs and the standardization of the interface.

The Hugging Face moment for Recommender Systems

Recommender Systems (RecSys) have resisted a similar standardization story: partly because recommendations are, by nature, more specific than language, and partly because flexible deep learning architectures took longer to spread in the RecSys community. As deep learning for recommendations has become more popular and effective, the field is getting closer to its own “Hugging Face moment”: the mission of NVIDIA’s open-source Merlin framework is to provide easy-to-use APIs for creating production-ready recommender system pipelines at scale. Since high-performing machine learning systems are about healthy pipelines, from raw data to serving, it is natural to ask how Merlin plays into the broader theme of the “Cambrian explosion” of DataOps and MLOps tools.

A tale of two tailwinds

This post builds on both trends and presents a (fully open-sourced) production-ready pipeline for deep learning recommendations. By moving away from research use cases based on static files and notebooks to constantly updating data pipelines, we show that NVIDIA Merlin can be easily embedded into an existing open-source toolchain:

  • we obtain an end-to-end system that a small team of engineers can understand and maintain: the key insights are abstracting away modeling through Merlin, and infrastructure through Metaflow;
  • we develop a modular system for iterative development: existing Metaflow pipelines can easily accommodate Merlin Models, and Merlin components can be added/changed/refined in an iterative fashion leaving the rest untouched.

Clone the repo and tag along. We will also be doing a live coding session going through the entire production-ready pipeline. If this is of interest, you can sign up here!

End-to-end fashion recommendations

We use the publicly available H&M dataset to reproduce a classic user-item, offline training-offline serving recommendation use case, that is:

  • User-item: at test time, our model will receive a user identifier as the query, and will be tasked with suggesting k (fashion) items that the user is likely to buy. Since we have ground truth for past behavior, we can use previous purchases to test our model;
  • offline training-offline serving: we assume our user base to be fairly constant from one day to the next, and the item catalog to change slowly. This allows us to run our pipeline nightly and prepare predictions in batch, so that recommendations can be served with low latency at massive scale. While not all use cases can be solved this way (for example, session-based personalization), this pattern is used by some of the most popular recommenders in industry, such as movie recommendations at Netflix.com.
Offline training — offline serving MLOps pattern as time unfolds.

At a first glance, our project is split into four major parts:

The four main steps in the pipeline — DataOps, Training, Testing and Serving — together with the tools and libraries involved.
  1. DataOps (jump to the code): the raw data in the warehouse gets ready for feature engineering: dbt builds a DAG of transformations from SQL queries, allowing us to cheaply run aggregations over millions of rows by leveraging the compute of modern data warehouses (in our example, Snowflake); we separate general joins and aggregations (where dbt and SQL are more natural) from the preparation of feature-ready data points (where Python and NVTabular are more natural);
  2. Training loop (jump to the code): NVTabular prepares user and item features (a minimal sketch of such a workflow follows this list), and then we train the Two-Tower models with the Merlin Models library. Metaflow allows us to scale both “up” and “out”: “up”, as the batch decorator makes GPUs available on demand through the connection with AWS Batch; “out”, as the “foreach” syntax enables training multiple models in parallel for hyperparameter tuning. Once we pick the best configuration, everything is automatically saved in the Metaflow data store. Note that it is also easy to connect services providing experiment tracking dashboards (in our example, Comet);
  3. Testing (jump to the code): the best model gets loaded back into the pipeline, and runs on the hold-out set to produce a final measure of generalization;
  4. Serving (jump to the code): the best model is asked to make k predictions for each shopper — predictions are versioned, so that we know which dataset and weights generated which suggestions. Predictions are also stored in a PaaS key-value store (e.g. DynamoDB) that can be easily queried from a Python micro-service (e.g. AWS Lambda).
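As a taste of the feature-engineering step, here is a minimal NVTabular sketch; it is not the exact workflow in the repo, and the column names (drawn from the H&M dataset) are only illustrative. Categorify encodes categorical values as contiguous integer ids, and the tagging ops mark columns as user or item features, so that the resulting schema can drive the model definition downstream:

import nvtabular as nvt

# hypothetical workflow: column names are illustrative, not the repo's exact ones
user_id = ["customer_id"] >> nvt.ops.Categorify() >> nvt.ops.TagAsUserID()
user_feats = ["club_member_status", "fashion_news_frequency"] >> nvt.ops.Categorify() >> nvt.ops.TagAsUserFeatures()
item_id = ["article_id"] >> nvt.ops.Categorify() >> nvt.ops.TagAsItemID()
item_feats = ["product_type_name", "colour_group_name"] >> nvt.ops.Categorify() >> nvt.ops.TagAsItemFeatures()

workflow = nvt.Workflow(user_id + user_feats + item_id + item_feats)
train = nvt.Dataset("train.parquet")  # the output of the dbt / Snowflake step
workflow.fit_transform(train).to_parquet("train_transformed/")
# the transformed parquet files ship with a schema recording dtypes,
# cardinalities and tags: this is the schema the training step selects from

Because the schema travels with the data, the downstream training code never needs to hard-code feature lists: it simply selects columns by tag.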

Aside from Merlin, the main open-source tools in the project are:

  • Metaflow: a user-friendly API that allows data scientists and machine learning engineers to develop and deploy machine learning applications autonomously, while having easy access to other layers of the stack, such as compute and orchestration (a minimal flow sketch follows this list).
  • dbt core: a command line tool that helps analysts and engineers transform data in their warehouse more effectively.
  • serverless: a framework that simplifies the deployment of serverless functions on AWS Lambda.
  • RecList: an open-source library that helps scale slice-based and behavioral testing for recommender systems. As the RecList beta gets released this quarter, the project will soon include behavioral tests for a more rounded evaluation of model performance.
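To give a flavor of the Metaflow abstraction, here is a minimal, hypothetical flow (deliberately simplified, and not the actual flow in the repo) showing how the step and foreach constructs express the kind of hyperparameter fan-out used in the training loop:

from metaflow import FlowSpec, step

class MinimalTrainFlow(FlowSpec):
    """A toy flow: each @step is a node in the DAG, and Metaflow versions all artifacts."""

    @step
    def start(self):
        # hypothetical hyper-parameter grid for the fan-out
        self.hyper_grid = [{'LEARNING_RATE': 0.05}, {'LEARNING_RATE': 0.01}]
        self.next(self.train_model, foreach='hyper_grid')

    @step
    def train_model(self):
        self.hypers = self.input  # the current hyper-param dict in this branch
        self.metrics = {'recall_at_10': 0.0}  # placeholder for real validation KPIs
        self.next(self.join_runs)

    @step
    def join_runs(self, inputs):
        # pick the best branch according to its validation metric
        self.best_hypers = max(inputs, key=lambda i: i.metrics['recall_at_10']).hypers
        self.next(self.end)

    @step
    def end(self):
        print('Best configuration:', self.best_hypers)

if __name__ == '__main__':
    MinimalTrainFlow()

Running the file with the run command executes the DAG locally; the very same file can be executed on cloud resources without code changes.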

Now that the overall structure is clear, let’s dive deep into a few parts of the project.

Show me the code!

Building a full production-ready pipeline that can automatically scale to millions of users in ~700 lines of code is possible only through thoughtful abstractions for modeling and infrastructure. Let’s tackle modeling first.

Model Building and Training

Our recommender model is the Two-Tower architecture, a popular and scalable retrieval model, built using the Merlin Models library in a few lines of code. The Merlin Models library provides high-level APIs for classic and state-of-the-art deep learning architectures for recommender models, with an emphasis on high-quality implementations. TwoTowerModelV2 is one of the high-level building blocks that lets users build a Two-Tower architecture with just a couple of lines of code.

The model is composed of a query (user) tower and an item (candidate) tower, each an encoder block. During training, the model produces scores by computing a dot product between the output of the query tower and the output of the item tower, for the positive item and for sampled negative items. It uses a contrastive loss, where negative samples come from other examples within the batch (a.k.a. in-batch sampling).

The Encoder blocks are created using InputBlockV2, one of the reusable building blocks found in the Models library. It is used as the entry block of the model to process input features from a schema object: the input block creates the continuous and embedding layers based on the metadata stored in the schema file, which is automatically generated from an NVTabular workflow (see a more detailed example). InputBlockV2 decouples the creation of the embedding tables (by default it uses the Keras embedding layer), giving users the flexibility to define their own embedding layers if needed, and it supports both list/multi-hot and non-sequential features.

# train the model and evaluate it on validation set
user_schema = train.schema.select_by_tag(Tags.USER)
user_inputs = mm.InputBlockV2(user_schema)
query = mm.Encoder(user_inputs, mm.MLPBlock([128, 64]))
item_schema = train.schema.select_by_tag(Tags.ITEM)
item_inputs = mm.InputBlockV2(item_schema)
candidate = mm.Encoder(item_inputs, mm.MLPBlock([128, 64]))
model = mm.TwoTowerModelV2(query, candidate)
opt = tf.keras.optimizers.Adagrad(learning_rate=self.hypers['LEARNING_RATE'])
model.compile(
    optimizer=opt,
    run_eagerly=False,
    metrics=[mm.RecallAt(int(self.TOP_K)), mm.NDCGAt(int(self.TOP_K))],
)
model.fit(
    train,
    validation_data=valid,
    batch_size=self.hypers['BATCH_SIZE'],
    epochs=int(self.N_EPOCHS)
)
self.metrics = model.evaluate(valid, batch_size=1024, return_dict=True)
print("\n\n====> Eval results: {}\n\n".format(self.metrics))
# save the model
model_hash = str(hashlib.md5(self.hyper_string.encode('utf-8')).hexdigest())
self.model_path = 'merlin/model{}/'.format(model_hash)
model.save(self.model_path)
print("Model saved!")
self.next(self.join_runs)

Once our Two-Tower retrieval model is trained, we can easily generate top-k recommendations (see the code) for offline batch prediction. For that, we extract a topk_model via the to_top_k_encoder() method of the Merlin Models RetrievalModelV2 class. The topk_model includes the query tower and the cached output of the item tower for all items. In practice, the method applies the candidate_encoder to the provided candidate_features dataset to build the top-k index of the topk_model. In other words, we first generate the query embeddings by applying the query encoder to the raw inputs, and then we compute the dot product between the resulting query embedding and each cached item embedding stored in the top-k index. We then sort these dot-product scores and return the item ids with the highest scores.

def get_items_topk_recommender_model(
    self,
    train_dataset,
    model,
    k: int
):
    from merlin.models.utils.dataset import unique_rows_by_features
    from merlin.schema.tags import Tags
    candidate_features = unique_rows_by_features(train_dataset, Tags.ITEM, Tags.ITEM_ID)
    topk_model = model.to_top_k_encoder(candidate_features, k=k, batch_size=128)
    topk_model.compile(run_eagerly=False)

    return topk_model

On the serving side, we leverage the offline serving pattern: fashion recommendations for our test users are computed from the best Merlin model and stored in DynamoDB without leaving the Python pipeline (jump to the code). At inference time, any endpoint (in our case, an AWS Lambda function) can perform a look-up based on the user id and serve the recommendations we computed at scale (jump to the code); a minimal sketch of the store-and-lookup pattern follows.
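The following is a rough, hypothetical illustration of that pattern (not the exact code in the repo; the table name, field names, and payload shape are assumptions): a batch writer pushes the precomputed top-k lists into DynamoDB, and the Lambda handler only performs a key-value look-up:

import json
import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('userItemTable')  # hypothetical table name

def store_predictions(predictions: dict, model_version: str):
    """Write the precomputed top-k items for each user, versioned by model hash."""
    with table.batch_writer() as writer:
        for user_id, items in predictions.items():
            writer.put_item(Item={
                'userId': str(user_id),
                'recs': [str(i) for i in items],  # item ids stored as strings
                'modelVersion': model_version
            })

def handler(event, context):
    """AWS Lambda handler: serving boils down to a single key-value look-up."""
    user_id = event['queryStringParameters']['userId']
    row = table.get_item(Key={'userId': user_id}).get('Item', {})
    return {
        'statusCode': 200,
        'body': json.dumps({'userId': user_id, 'items': row.get('recs', [])})
    }

Because the heavy lifting happened at training time, the endpoint stays tiny, cheap, and fast, regardless of how sophisticated the model behind the predictions is.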

Infrastructure

Let’s now see how infrastructure is handled by the pipeline. Metaflow takes a declarative approach to defining the “steps” in the pipeline: a step is a node in the pipeline DAG, specifying how a particular task needs to happen (training, testing, etc.). By decoupling the modeling code from the execution environment and from the hardware, Metaflow allows us to iterate on the same code locally or in the cloud, and to move quickly from experiments to full-scale pipelines.

@enable_decorator(batch(
        gpu=1,
        memory=24000,
        image='public.ecr.aws/outerbounds/merlin-reasonable-scale:22.11-latest'),
    flag=os.getenv('EN_BATCH'))
# NOTE: updating requests will just suppress annoying warnings
@pip(libraries={'requests': '2.28.1', 'comet-ml': '3.26.0'})
@magicdir
@step
def train_model(self):
    """
    Train models in parallel and store artifacts and validation KPIs for downstream consumption.
    """
    import hashlib
    from comet_ml import Experiment
    import merlin.models.tf as mm
    from merlin.io.dataset import Dataset
    from merlin.schema.tags import Tags
    import tensorflow as tf
    # this is the CURRENT hyper-param JSON in the fan-out

In our training step, the gpu and memory parameters request appropriate cloud resources, and the image parameter, together with the pip decorator, specifies the Python environment that should run the code. Note that Merlin code works both on CPU and GPU, which allows for quick local testing on inexpensive hardware followed by full-scale GPU training in the cloud with no code changes (the code was tested on AWS Batch with both hardware configurations using the merlin-tensorflow:22.11 image).

What happens when we are happy and want to run this in production, on a schedule, at scale? Metaflow provides us with a one-liner to take our pipeline and run it entirely through cloud resources using Step Functions.
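For instance, assuming the flow lives in a file called my_merlin_flow.py (the name here is just for illustration), deploying the whole DAG to AWS Step Functions amounts to running python my_merlin_flow.py step-functions create; the resulting state machine can then be triggered on a schedule without any change to the pipeline code.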

In other words, we automated away significant development, deployment, and maintenance work. While large corporations can afford devoted teams and custom systems, a considerable number of ML use cases happen — and will happen more and more — at the Reasonable Scale, i.e. companies in the middle of the market that could benefit from machine learning, but would not develop a platform from scratch.

The good news is, they don’t have to: open source frameworks like Merlin and Metaflow — together with the power of cloud PaaS (Platform as a Service) services — can provide the backbone of sophisticated, scalable, yet understandable recommender pipelines: after all, it turns out that they don’t need a bigger boat.

What’s next?

In a recent survey of the state of ML deployments, Shankar et al. identify Velocity, Validation and Versioning as the three main ingredients for the success of an ML project. While of course each project has its own specificities, and more work needs to be done to cover other use cases (e.g. offline training, online serving), this project is a first attempt to provide principled answers to these challenges:

  • Velocity: Merlin provides convenient abstractions to quickly iterate on modeling;
  • Versioning: Metaflow versions each pipeline execution, including data, models and documentation;
  • Validation: by adding RecList, we will catch more bugs and mistakes early on, and help prevent bad experiences for our users.

If you found this valuable, note that we will also be doing a live coding session going through the entire production-ready pipeline. If this is of interest, you can sign up here!

Acknowledgements

This project was started by Jacopo Tagliabue (at Coveo when the project started) for the NVIDIA Rec Summit 2022, and has evolved considerably since. Many people helped us make it better: in particular, we wish to thank Hamel Husain (at Outerbounds when the project started), Valay Dave from Outerbounds, Dhruv Nair from Comet, and Even Oldridge from NVIDIA Merlin.
