Building ranking models powered by multi-task learning with Merlin and TensorFlow

Gabriel Moreira
Published in NVIDIA Merlin
Mar 13, 2023

Large online services like social media, streaming, e-commerce, and news provide very broad catalogs of items and leverage recommender systems to help users find relevant items. Those companies typically deploy recommender system pipelines with multiple stages, in particular retrieval and ranking. The retrieval stage selects a few hundred or thousand items from a large catalog; it can use a heuristic approach (like most recent items) or a scalable model like Matrix Factorization, the Two-Tower architecture, or YouTubeDNN. The ranking stage then scores the relevance of the candidate items provided by the previous stage for a given user and context.

It is common to find scenarios where you need to score the likelihood of different user-item events, e.g., clicking, liking, sharing, commenting, following the author, etc. Multi-Task Learning (MTL) techniques have been popular in deep learning to train a single model that is able to predict multiple targets at the same time.

By using MTL, it is typically possible to improve accuracy for somewhat correlated tasks, in particular for sparser targets, for which less training data is available. And instead of spending computational resources to train and deploy a different model for each task, you can train and deploy a single MTL model that predicts multiple targets.

Many deep learning architectures have been designed for multi-task learning on tabular data, as seen in Figure 1.

Figure 1. Image adapted from: Progressive Layered Extraction (PLE): A Novel Multi-Task Learning (MTL) Model for Personalized Recommendations

In this blog post, we present how to build and train a multi-task learning ranking model using Merlin Models library based on this new example notebook.

Multi-task learning architectures

A popular MTL approach is a simple model, e.g. based on a multi-layer perceptron (MLP), with shared-bottom layers (Figure 1a) and task-specific towers, each with its own head and loss. This approach tends to be straightforward, but it limits the model to a single representation of the inputs for all tasks.

The Multi-gate Mixture-of-Experts (MMoE) architecture (Figure 1b) was introduced by Google in 2018 and is one of the most popular models for multi-task learning on tabular data. It allows parameters to be automatically allocated to capture either shared task information or task-specific information. The core components of MMoE are experts and gates. Instead of using a shared bottom for all tasks, it has multiple expert sub-networks processing input features independently from each other. Each task has an independent gate, which, based on the inputs, dynamically selects how much the task should leverage the output of each expert. The gate is typically just a small MLP sub-network that outputs softmax scores over the number of experts given the inputs. Those scores are used as weights for computing a weighted average of the experts’ outputs, forming an independent representation for each task.

The CGC and PLE architectures were introduced in 2020. The authors observed that architectures like MMoE presented a seesaw or negative transfer phenomenon, where improving the accuracy of one task hurts the accuracy of other tasks compared to single-task learning models. So instead of having all tasks share all the experts, they proposed splitting experts into task-specific and shared ones, in an architecture they named the Customized Gate Control (CGC) model (Figure 1c).

Furthermore, the paper’s authors proposed stacking multiple CGC models on top of each other to form a multi-level MTL model, so that the model can progressively combine shared and task-specific experts. They named this approach Progressive Layered Extraction (PLE), shown in Figure 1d. Their experiments showed accuracy improvements from PLE compared to CGC.

Let’s practice

You might be thinking that building those advanced models would be a hard task. Fortunately, the Merlin Models API provides low-level and high-level building blocks (Keras custom layers) that make building such RecSys models much easier! These snippets are based on an example notebook where we present all the detailed steps to generate synthetic data and build and train an MTL ranking model.

Preparing data and schema

Let’s start with an example (Figure 2) of what the input dataset is expected to look like for a ranking model with multiple targets, based on the columns provided in the public TenRec dataset.

A dataset for ranking typically includes item features and user features (static or contextual), as well as a target column, which can be either a binary target, e.g. whether the item is relevant for the user, or a continuous target for regression, e.g. the watch time of a video or read time of an article. In our case, multiple target columns are available.

You might choose to have a separate model for each target, or use multi-task learning to train a single model for all of them. This latter approach typically leads to better accuracy and reduced engineering and resources overhead to train and deploy multiple models.

Figure 2. Sample (synthetic data) based on the columns from the TenRec dataset.

We have created a preprocessing workflow for the TenRec dataset with Merlin NVTabular. It outputs the preprocessed data in parquet format, accompanied by a schema file. The schema is an important concept in Merlin, as it contains metadata about the available features in the dataset, including their type (categorical or continuous), whether they are scalar or list/multi-hot features, and which columns are targets, among other details. You can also generate the schema programmatically with the Merlin API if you use another library for preprocessing.
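As a rough illustration of building a schema programmatically, here is a minimal sketch using the `merlin.schema` API; the column names and the choice of tags are hypothetical, and in practice categorical columns would also carry properties such as cardinality so that embedding tables can be sized:

```python
from merlin.schema import ColumnSchema, Schema, Tags

# Hypothetical columns: names and tags are illustrative only.
# Tags tell Merlin Models how to treat each column (inputs vs. targets).
schema = Schema(
    [
        ColumnSchema("user_id", tags=[Tags.CATEGORICAL, Tags.USER_ID]),
        ColumnSchema("item_id", tags=[Tags.CATEGORICAL, Tags.ITEM_ID]),
        # Multiple binary targets enable multi-task learning downstream
        ColumnSchema("click", tags=[Tags.TARGET, Tags.BINARY_CLASSIFICATION]),
        ColumnSchema("like", tags=[Tags.TARGET, Tags.BINARY_CLASSIFICATION]),
    ]
)
```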

For our example, we use synthetic data generated based on the schema we obtained when preprocessing the TenRec dataset with a NVTabular workflow.

Modeling

Let’s move to building our first multi-task learning model with the Merlin Models library, which is built on top of TensorFlow Keras.

We start by implementing a simple shared-bottom MLP architecture (Code 1). It can be done by just connecting three building blocks: InputBlockV2, MLPBlock, and OutputBlock.

The InputBlockV2 automatically creates the necessary layers for representing the input features based on the schema we introduced before. For example, the embedding tables of categorical features are automatically built based on the feature cardinality and on an embedding size inferred from the cardinality. You can also define the embedding size for each categorical feature by setting the dim argument. The InputBlockV2 is also responsible for combining (e.g. concatenating) the input features. The MLPBlock is just a simple MLP.

The OutputBlock creates a head and a loss function for each task based on the schema. In our example, as multiple targets are available, it creates multiple heads and losses (binary cross entropy for binary classification or mean-squared error for regression) for joint training (multi-task learning).

All those building blocks are configurable, here we just use the default options for simplicity. You can learn in this example notebook more about advanced options of the multi-task learning API, like how to set the loss weights and specify task-specific class/sample weights.

Code 1. Building and training a shared-bottom MLP model
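A sketch of the shared-bottom model described above, assuming the `merlin.models.tf` API; the MLP layer sizes are illustrative choices, and `schema` is the schema loaded from the preprocessed dataset:

```python
import merlin.models.tf as mm

# Input layers (embeddings + continuous features) driven by the schema
inputs = mm.InputBlockV2(schema)

# Shared-bottom MLP: a single representation shared by all tasks
mlp = mm.MLPBlock([128, 64])  # layer sizes are illustrative

# One head and one loss per target found in the schema
outputs = mm.OutputBlock(schema)

model = mm.Model(inputs, mlp, outputs)
```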

As we described earlier, the MMoE architecture creates multiple experts (sub-networks) to process the inputs, and each task gate can compute a different weighted combination of the experts’ outputs. For that, we replace the MLPBlock of the previous shared-bottom model with an MMOEBlock. It is configured with the number of desired experts and the type of block to use for experts and gates. In this case we use MLPBlock, just as originally proposed in the MMoE paper.

Code 2. Building an MMoE model
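A hedged sketch of the MMoE variant, assuming the `merlin.models.tf` API; argument values (number of experts, layer sizes) are illustrative:

```python
import merlin.models.tf as mm

inputs = mm.InputBlockV2(schema)
output_block = mm.OutputBlock(schema)

# MMoE: several expert MLPs process the inputs; each task's gate
# learns its own softmax weighting over the experts' outputs
mmoe = mm.MMOEBlock(
    output_block,                    # tasks are derived from the output block
    expert_block=mm.MLPBlock([64]),  # block replicated for each expert
    num_experts=4,                   # illustrative
)

model = mm.Model(inputs, mmoe, output_block)
```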

Unlike MMoE, the CGC model allows reserving some experts to be task-specific. So you will notice that CGCBlock has separate arguments for num_task_experts and num_shared_experts.

Code 3. Building a CGC model
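A sketch of the CGC variant under the same assumptions; the expert counts are illustrative:

```python
import merlin.models.tf as mm

inputs = mm.InputBlockV2(schema)
output_block = mm.OutputBlock(schema)

# CGC: each task keeps private experts, plus a pool shared by all tasks
cgc = mm.CGCBlock(
    output_block,
    expert_block=mm.MLPBlock([64]),
    num_task_experts=1,    # experts reserved per task
    num_shared_experts=2,  # experts shared across all tasks
)

model = mm.Model(inputs, cgc, output_block)
```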

Finally, PLE stacks multiple CGC blocks on top of each other, forming a multi-level MTL model. The PLEBlock has an additional num_layers argument, which controls the number of levels.

Code 4. Building a PLE model
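A sketch of the PLE variant, again assuming the `merlin.models.tf` API with illustrative argument values; note the extra `num_layers` argument controlling how many CGC levels are stacked:

```python
import merlin.models.tf as mm

inputs = mm.InputBlockV2(schema)
output_block = mm.OutputBlock(schema)

# PLE: stacked CGC levels that progressively combine
# shared and task-specific expert representations
ple = mm.PLEBlock(
    num_layers=2,  # number of stacked CGC levels (illustrative)
    outputs=output_block,
    expert_block=mm.MLPBlock([64]),
    num_task_experts=1,
    num_shared_experts=2,
)

model = mm.Model(inputs, ple, output_block)
```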

As the model we built inherits from the Keras Model class, we compile it and call model.fit() for training, just as you do with regular tf.keras.

In this example, we use the Merlin dataloader with a Dataset (train_ds) generated with synthetic data as input.

Code 5. Compiling and training a model with Keras
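The training step can be sketched as below, assuming `train_ds` is a Merlin Dataset as described above; the optimizer, batch size, and epoch count are illustrative defaults:

```python
# Standard Keras workflow: Merlin Models handles the multiple
# heads/losses defined by OutputBlock during joint training
model.compile(optimizer="adam")

model.fit(train_ds, epochs=3, batch_size=1024)

# Evaluation reports per-task metrics (illustrative call)
metrics = model.evaluate(train_ds, batch_size=1024, return_dict=True)
```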

Conclusion

In this post you have learned how to easily build and train state-of-the-art multi-task learning ranking models with the Merlin Models library.

You can use our latest (23.02) Merlin TensorFlow image on NGC to run this example notebook and try it with your own dataset.

You can learn more about other options and model types (e.g. retrieval models, sequential models) available in examples of Merlin Models.

You might also check this example on how to use NVTabular to build preprocessing workflows, and how the schema object allows their seamless integration with the Models library.

In a follow-up blog post, we will introduce new resources (template scripts and documentation) that make it easier to train ranking models on your own data, along with empirical results on public datasets to help you choose a model and hypertune it. Stay tuned.

Acknowledgements

The multi-task learning API of the Merlin Models library is based on a preliminary implementation by Mark Romeyn and on the internship project of Gaoge Liu, who did a great job researching, prototyping, and experimenting with MTL models on the Merlin team. Also a special thanks to Sara Rabhi and Ronay Ak, who contributed to the discussions on experiment results with MTL models.


Gabriel Moreira holds a PhD and is a Senior Applied Researcher at NVIDIA working on LLMs and Recommender Systems, and has been a Google Developer Expert for ML since 2019.