Setup of NVIDIA Merlin and Tensorflow for Recommendation Models

Rubens Zimbres
Google Developer Experts
10 min read · Apr 10, 2023

Lately I’ve been working with the Two Towers model shared by Google and deployed on Google Cloud Matching Engine. Basically, the Two Towers model relates two entities (candidates and vacancies, users and items) to make recommendations. It can also be used for content search, and its deployment basically consists of four stages: retrieval, filtering, scoring and ordering (NVIDIA blog).

The Google Cloud Two Towers model uses a siamese neural network whose two input towers map users and items into a shared embedding space, trained with a FactorizedTopK metric for a retrieval task via semantic similarity. An index is then built with ScaNN (Scalable Nearest Neighbors) that returns the Top-N candidates (retrieval) and their distances (ranking). This index can be deployed on Vertex AI or on Vertex AI Matching Engine, enabling high-scale, high-queries-per-second (QPS), low-latency querying over indexes with more than a billion embedding vectors. The model uses two parallel units (user and item) plus embedding layers.
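As a rough illustration of the retrieval step, the sketch below builds a Top-N index with TensorFlow Recommenders’ ScaNN layer; user_model, item_model, items_ds and user_features are hypothetical placeholders for a trained query tower, a trained candidate tower, an item dataset and a batch of query features:

# Minimal sketch: indexing candidate embeddings with ScaNN and querying the Top 10
import tensorflow_recommenders as tfrs

scann = tfrs.layers.factorized_top_k.ScaNN(query_model=user_model, k=10)
scann.index_from_dataset(
    items_ds.batch(1024).map(lambda item: (item["item_id"], item_model(item)))
)
scores, ids = scann(user_features)  # Top-10 candidate ids and their similarity scores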

This week I read the article Building ranking models powered by multi-task learning with Merlin and TensorFlow, from my fellow GDE Gabriel Moreira. Basically, the idea is that you can use one single model with different outputs and loss functions to perform classifications and regressions, by adding gates and experts (sub-networks) according to the specific task being learned. The article explains the evolution of MTL (multi-task learning) models, shares Jupyter notebooks for implementing these models with Merlin (available here and here), and presents different uses and configurations for the Merlin models (MMoE, CGC, PLE).

Then I noticed the difference between Two Towers models for multi-task learning and Two Towers models for retrieval. While both have two MLP towers as inputs (for user features and item features), only the multi-task model adds MLP towers as outputs, one per task, each with a loss specific to that task (regression or classification).

This article may overlap with some of the concepts in Gabriel’s article, but it was important for me to write down my understanding of these recommender systems and the differences between them.

Two Towers Model for retrieval

This model is composed of two towers: a query tower that contains the user features, and a candidate tower that contains the item features. Instead of a regression or classification, the goal here is the retrieval of the Top-N best candidates, according to embedding similarity in the multidimensional space. To evaluate this we use recall, which verifies whether the relevant item is among the Top-N retrieved items. After training, the trained embeddings of the candidate tower are deployed as an index, and subsequent queries will search this index for the best Top-N matches.

A full tutorial for this model using Merlin can be found here, and in Tensorflow here.

Two Towers Model. Source: https://nvidia-merlin.github.io/models/main/examples/05-Retrieval-Model.html
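Just to give a flavor of what the retrieval model looks like in code, here is a minimal sketch based on the merlin.models.tf API used in the Merlin tutorial above; the parquet paths are placeholders and the exact arguments may vary between versions:

import merlin.models.tf as mm
from merlin.io import Dataset

train = Dataset("train/*.parquet")
valid = Dataset("valid/*.parquet")

# Query (user) and candidate (item) towers are inferred from the schema tags
model = mm.TwoTowerModel(
    train.schema,
    query_tower=mm.MLPBlock([128, 64], no_activation_last_layer=True),
    samplers=[mm.InBatchSampler()],  # in-batch negatives for the retrieval task
)
model.compile(optimizer="adam", run_eagerly=False)
model.fit(train, validation_data=valid, batch_size=4096, epochs=3)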

Multi-gate Mixture-of-Experts Model

Considering multi-task learning, there is a known problem that comes from parameter sharing between the tasks being learned. If the neural network achieves higher accuracy on one task, this often comes at the expense of lower accuracy on the other task learned in the same model, given different data distributions. This is called the seesaw problem. Tang et al. (2020) argue that the literature addresses the negative transfer problem (Torrey & Shavlik, 2010), but neglects the seesaw phenomenon in multi-task learning (MTL) models.

Ma et al. (2018) approach this conflict with expert networks and gating networks that allow each task to use the experts in a different way, trying to overcome task-specific data distributions. The authors take advantage of modulation and gating mechanisms to improve the trainability of non-convex deep neural networks. They run multiple experiments with synthetic regression datasets in TensorFlow to validate MMoE, using ReLU activations. The authors achieve state-of-the-art results, although by a small margin.

Multi-gate MoE model. Adapted from: https://dl.acm.org/doi/pdf/10.1145/3219819.3220007

The total number of parameters of this model is roughly the product of the input dimension, the number of experts, the size of each expert’s hidden layers and the number of towers. Ma et al. (2018) found that the loss landscape generated by MMoE eases the trainability of the model, producing lower and more stable losses.
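To make the gating mechanism concrete, here is a small plain-Keras sketch (illustrative only, not the Merlin API): each task has its own softmax gate that mixes the shared experts before feeding its task-specific tower:

import tensorflow as tf
from tensorflow.keras import layers

num_experts, num_tasks, input_dim = 4, 2, 32
inputs = tf.keras.Input(shape=(input_dim,))

# Shared experts: each expert is a small MLP over the same input
experts = tf.stack(
    [layers.Dense(64, activation="relu")(inputs) for _ in range(num_experts)], axis=1
)  # shape: (batch, num_experts, 64)

task_outputs = []
for task in range(num_tasks):
    # One gate per task: softmax weights over the experts, conditioned on the input
    gate = layers.Dense(num_experts, activation="softmax")(inputs)
    mixed = tf.reduce_sum(experts * tf.expand_dims(gate, -1), axis=1)  # weighted sum of experts
    tower = layers.Dense(32, activation="relu")(mixed)  # task-specific tower
    task_outputs.append(layers.Dense(1, name=f"task_{task}")(tower))

model = tf.keras.Model(inputs, task_outputs)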

Customized Gate Control Model

Two years later, Tang et al. (2020) developed a new solution that explicitly separates shared experts from task-specific experts: the Customized Gate Control (CGC) model, whose layers can also be stacked. This setup seems to overcome the need for correlated tasks in MTL, given that parameter sharing is limited. They also argue that uncorrelated tasks are quite common in real-world problems.

This model is composed of expert modules at the bottom and task-specific tower networks at the top. Each expert module is a collection of sub-networks, whose number is a hyperparameter that can be tuned. Each tower network gets information from its own experts and also from the parameters of the shared experts.

In this paper, each gating network takes the outputs of the experts and computes their weighted sum, with the weights produced from the input through a softmax activation. The CGC model removes the connection between a task-specific tower and the experts of other tasks. So, for task k, the prediction is tower k applied to the output of gating network k. This solution better handles task conflicts and sample-dependent correlations.

Customized Gate Control (CGC) Model. Adapted from: https://doi.org/10.1145/3383313.3412236
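In Merlin, this maps to a CGCBlock. The snippet below is only a sketch, assuming mm.CGCBlock accepts the same keyword arguments as the PLEBlock example shown later in this article (mm, schema, inputs and output_block come from that example):

# Sketch: one expert per task plus two shared experts, with task-specific gates
cgc = mm.CGCBlock(
    expert_block=mm.MLPBlock([64]),
    num_task_experts=1,
    num_shared_experts=2,
    outputs=output_block,
)
model = mm.Model(inputs, cgc, output_block)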

Progressive Layered Extraction (PLE) Model

Progressive Layered Extraction (PLE), also from Tang et al. (2020), uses multi-level task-specific experts like CGC, but also introduces a gating network in the shared experts block (blue in the figure below). This architecture combines the semantic representations of each task to improve generalization, separating task-specific parameters progressively along the depth of the network. The PLE loss is the sum of the losses of each task with respect to its ground truth, computed over that task’s parameters and the shared parameters.

Progressive Layered Extraction (PLE) Model. Adapted from: https://doi.org/10.1145/3383313.3412236
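In plain Keras terms, the PLE loss described above is simply the (optionally weighted) sum of one loss per head; the head names below are hypothetical, just to illustrate a binary classification head plus a regression head:

# Illustrative only: Keras sums the per-task losses (scaled by loss_weights) into one training loss
model.compile(
    optimizer="adam",
    loss={"click": "binary_crossentropy", "watch_time": "mse"},
    loss_weights={"click": 1.0, "watch_time": 0.5},
)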

Merlin is able to handle not only the MMoE architecture but also CGC and PLE in a very simple manner. In this other article, the NVIDIA RecSys team explains the advantages of Merlin, which also implements the Two Towers model, scaling faster with much less code.

Conceptual Pipeline with Merlin. Source: https://developer.nvidia.com/nvidia-merlin

Here, my greatest challenge was setting up the TensorFlow environment to run the Merlin model on-premises, given that I had neither RAPIDS nor Merlin installed on my computer. So, in this article, I will share the steps that worked for me.

On-prem Installation

For on-prem installation, there are two ways of running Merlin: using a Docker image from NVIDIA, or installing all packages from scratch in an Anaconda environment.

  1. Docker image

Here, I started by installing Docker. However, if you already have the correct CUDA driver and CUDA toolkit versions installed on a workstation with GPUs, you can pull the container and get started without needing to install Docker, the nvidia-container-toolkit and the nvidia-container-runtime.

After you have docker set up:

sudo docker pull nvcr.io/nvidia/merlin/merlin-tensorflow:23.02
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo apt-get install nvidia-container-runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo mkdir -p /etc/systemd/system/docker.service.d
sudo tee /etc/systemd/system/docker.service.d/override.conf <<EOF
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd --host=fd:// --add-runtime=nvidia=/usr/bin/nvidia-container-runtime
EOF
sudo tee /etc/docker/daemon.json <<EOF
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
sudo pkill -SIGHUP dockerd

Finally:

sudo systemctl daemon-reload
sudo systemctl restart docker
docker run --runtime=nvidia --rm -it -p 8882:8882 -p 8797:8787 -p 8796:8786 --ipc=host --cap-add SYS_NICE nvcr.io/nvidia/merlin/merlin-tensorflow:23.02 /bin/bash

Get the example notebooks located at /models/examples and start the Jupyter server:

cd / ; jupyter notebook --ip 0.0.0.0 --no-browser --allow-root --port 8882

In VS Code, add the Docker extension to see the running container.

Then choose Select Kernel -> Existing Jupyter Server, point it to the server running in the container, and run the example notebook.

  2. Anaconda

First, create a brand new Anaconda environment with Python 3.9:

conda create -n merlin888 python=3.9

I usually avoid installing Anaconda in the root folder, due to permission issues.

Activate the environment:

source /home/anaconda3/bin/activate merlin888

Now we will install TensorFlow and TensorRT and create symlinks to libnvinfer.so.7 and libnvinfer_plugin.so.7 so that TensorFlow can find them:

pip install tensorflow==2.10.0
pip install tensorrt
cd /home/anaconda3/envs/merlin888/lib/python3.9/site-packages/tensorrt
ln -s libnvinfer_plugin.so.8 libnvinfer_plugin.so.7
ln -s libnvinfer.so.8 libnvinfer.so.7

Then we install cudatoolkit and cudnn via conda-forge:

conda install -c conda-forge cudatoolkit=11.2.2 cudnn=8.1.0

After that, we add to ~/.bashrc:

sudo vi ~/.bashrc

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/anaconda3/lib/

Press ESC, then type :wq! to save and exit.

Then, as in the TensorFlow GPU installation instructions, install nvcc and configure the environment variables:

conda install -c nvidia cuda-nvcc=11.3.58
mkdir -p $CONDA_PREFIX/etc/conda/activate.d
printf 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/\nexport XLA_FLAGS=--xla_gpu_cuda_data_dir=$CONDA_PREFIX/lib/\n' > $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
source $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
mkdir -p $CONDA_PREFIX/lib/nvvm/libdevice
cp $CONDA_PREFIX/lib/libdevice.10.bc $CONDA_PREFIX/lib/nvvm/libdevice/

Now we install additional packages, like NVTabular and RAPIDS, for feature engineering (https://docs.rapids.ai/install). Although the pip installation of RAPIDS is still experimental, it worked for me. Also, downgrade protobuf for TensorFlow compatibility.

pip install scipy
pip install nvtabular
pip install setuptools --upgrade
pip install cudf-cu11 dask-cudf-cu11 --extra-index-url=https://pypi.nvidia.com  # experimental RAPIDS wheels: https://docs.rapids.ai/install
pip install cuml-cu11 --extra-index-url=https://pypi.nvidia.com
pip install cugraph-cu11 --extra-index-url=https://pypi.nvidia.com
pip install protobuf==3.20.0

Finally, we install Merlin Models:

pip install merlin-models

Now, in a Python notebook, test whether TensorFlow can see the GPU:

import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

Num GPUs Available: 1
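Optionally, also confirm that the RAPIDS and Merlin packages import correctly (a quick sanity check; the printed versions will depend on your installation):

import cudf
import nvtabular as nvt
import merlin.models.tf as mm

print(cudf.__version__, nvt.__version__)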

Now we can successfully run the Merlin notebook in Python code.

Installation in Vertex AI

You can also deploy the Merlin image directly to Vertex AI via a Docker container, which is the most straightforward method. Access the merlin-tensorflow container page on the NVIDIA NGC catalog:

On the top right of the page, click Deploy on Vertex AI:

Create your account on NVIDIA and click Deploy:

This will create a Managed Notebooks instance on Vertex AI. Then you just have to install the NVIDIA GPU drivers and add a T4 GPU, which is enough for the job:

It will take a couple of minutes to deploy the image and install libraries. Open Jupyterlab:

Merlin

For a comprehensive explanation of Merlin features, get the notebook from NVIDIA GitHub or from the container folder /models/examples.

You will see that the Merlin model basically consists of:

  • InputBlock, which infers the input features from the schema, creates embeddings for the categorical features and concatenates them.
  • OutputBlock, which infers the targets from the schema and lets you define whether each output is a regression or a classification, supporting multiple outputs.
  • Customized blocks, like MMoE, CGC and PLE.
  • MLPBlock with dense layers.

In the example notebook, you define the inputs from a schema in the dataset, given by tags. Note that with the TenRec dataset you can do binary classification (task 1) and regression (task 2) in the same model. If you need to do some feature engineering, check this notebook.
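As a quick illustration of this tags-driven setup, the sketch below loads the preprocessed parquet files (the paths are placeholders) and lists the target columns by tag:

from merlin.io import Dataset
from merlin.schema.tags import Tags

train_ds = Dataset("train/*.parquet")
schema = train_ds.schema
print(schema.select_by_tag(Tags.TARGET).column_names)  # the classification and regression targets defined in the schema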

The output block holds the towers for each of the tasks; in this case, each tower is an MLPBlock of dimension 32. As we have 2 tasks in the example notebook, we have 2 task-specific experts (one for each task) plus 3 shared experts, each expert being an MLPBlock of dimension 64. Between inputs and outputs, we have the experts.

The whole model is about 10 lines of code and the training is like any Keras training:

PLE Architecture

import merlin.models.tf as mm

# Input and output blocks are inferred from the dataset schema
inputs = mm.InputBlockV2(schema)
output_block = mm.OutputBlock(schema, task_blocks=mm.MLPBlock([32]))

# PLE: two extraction layers, 2 task-specific experts and 3 shared experts
ple = mm.PLEBlock(
    num_layers=2,
    outputs=output_block,
    expert_block=mm.MLPBlock([64]),
    num_task_experts=2,
    num_shared_experts=3,
)
model = mm.Model(inputs, ple, output_block)

model.compile(optimizer="adam", run_eagerly=False)
model.fit(train_ds, batch_size=BATCH_SIZE)

Acknowledgements

I would like to express my sincere gratitude to Gabriel Moreira and NVIDIA engineers for the guidance provided on the multi-task learning theory and the feedback on Merlin installation. Their expertise and willingness to help were vital in the improvement of this article.

References

Hongyan Tang, Junning Liu, Ming Zhao, and Xudong Gong. 2020. Progressive Layered Extraction (PLE): A Novel Multi-Task Learning (MTL) Model for Personalized Recommendations. In Fourteenth ACM Conference on Recommender Systems (RecSys ’20), September 22–26, 2020, Virtual Event, Brazil. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3383313.3412236

Jiaqi Ma, Zhe Zhao, Xinyang Yi, Jilin Chen, Lichan Hong, and Ed H. Chi. 2018. Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts. In Proceedings of The 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ’18). ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3219819.3220007

Lisa Torrey and Jude Shavlik. 2010. Transfer learning. In Handbook of research on machine learning applications and trends: algorithms, methods, and techniques. IGI Global, 242–264. https://ftp.cs.wisc.edu/machine-learning/shavlik-group/torrey.handbook09.pdf

Rubens Zimbres is a Senior Data Scientist at Intellimetri (a Brazilian startup), PhD, and Google Developer Expert in Machine Learning and Google Cloud.
