Solving the Cold-Start Problem using Two-Tower Neural Networks for NVIDIA’s E-Mail Recommender Systems

Benedikt Schifferer
Published in NVIDIA Merlin
Jan 11, 2023 · 7 min read

There are many great resources and examples for machine learning, but they are often based on public (and sometimes even toy) datasets. The real challenge is to build a model that works for your project. We want to share our learnings and insights from developing an in-house recommender system for E-Mail campaigns at NVIDIA. In the previous blog posts, we provided an overview of the challenges and project goals and a detailed view of the data preprocessing. In this blog post, we focus on training a two-tower neural network architecture. We explain how we started modeling, how we addressed the cold-start problem, how we improved the model, and the obstacles we encountered along the way.

Two-Tower Architecture to Solve the Cold-Start Problem

Image Adapted from: Off-policy Learning in Two-stage Recommender Systems

First, we evaluated which models could be a good fit for our problem. Traditional recommender models are trained on all item and user features. The item and user IDs are unique identifiers and are often modeled with an embedding layer; they represent specific behavior and characteristics. In our project, we have an extreme cold-start problem, which we described in our first blog post: all items are unknown and a significant number of users are unknown in the prediction period. Therefore, we need to select the features and the model for our problem carefully.

We quickly realized that the concept of a two-tower architecture fits our problem well: all user features are processed by one MLP tower (the user tower), creating the user representation (user embedding), and all item features are processed by another MLP tower (the item tower), creating the item representation (item embedding). The output is the dot product of the user and item embeddings, a single number representing the score that the user will interact with the item. The trick for our project is that we do NOT use the item and user IDs and instead select only features that are available in any period.
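To make the scoring mechanics concrete, here is a from-scratch NumPy sketch (the feature dimensions and random weights are made up for illustration; this is not our production model):

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_tower(x, w1, w2):
    # Two-layer MLP: ReLU hidden layer, linear output (the embedding).
    return np.maximum(x @ w1, 0.0) @ w2

# Hypothetical dimensions: 8 user features, 6 item features, 4-dim embeddings.
user_feats = rng.normal(size=(3, 8))   # batch of 3 users
item_feats = rng.normal(size=(5, 6))   # catalog of 5 items

user_emb = mlp_tower(user_feats, rng.normal(size=(8, 16)), rng.normal(size=(16, 4)))
item_emb = mlp_tower(item_feats, rng.normal(size=(6, 16)), rng.normal(size=(16, 4)))

# The score of every (user, item) pair is the dot product of the two embeddings.
scores = user_emb @ item_emb.T         # shape (3, 5)
print(scores.shape)                    # (3, 5)
```

Because the towers never see an ID, a brand-new item only needs its features to be scored, which is exactly what the cold-start setting requires.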

Data Science is a Science: Start Small and Iterate Fast

Science is defined in the Oxford dictionary as “the systematic study of the structure and behavior” of the world. We recommend starting small and iterating fast. This approach enables us to build an initial, simple, working pipeline. Further, it helps us understand which features or modeling techniques improve quality.

First, we started with only a few selected features to build our training and evaluation pipeline. Initially, we used industry, job role and interests as user features, and industry segment, audience level and primary topic as item features. These features have a meaningful connection: for example, users with a specific interest should attend sessions with a related primary topic.

To process the input data and easily build a Two-Tower model, we use NVIDIA Merlin. Merlin is NVIDIA’s open-source framework to develop and deploy recommender systems. It provides a high-level API to define end-to-end pipelines from feature engineering to training and deploying a model. We use NVTabular to preprocess the dataset: Categorifying the categorical columns and adding tags to each column (e.g. whether it is a user or an item column).

You can learn more about NVTabular’s API in the documentation.

Merlin Models provides high-quality implementations of standard recommender models. We use its Two-Tower model with the TensorFlow backend. The code below demonstrates how easy it is to train the architecture.

NVTabular and Merlin Models are connected via schema files, which provide metadata such as the column types or the cardinalities for the embedding tables. We provide more information in an example on GitHub. You might notice that we only have positive interactions (a user attended a session) but no negative ones. Therefore, we need to leverage a negative sampling technique. Merlin Models provides an in-batch negative sampling strategy to train a two-tower architecture. We keep ITEM_ID in the dataset for in-batch negative sampling. Because we tag ITEM_ID only with the “ITEM_ID” tag and not as “ITEM”, Merlin Models does not use it as an input feature. Without NVIDIA Merlin, it would have taken us much more time to develop all components from scratch.
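To build intuition for in-batch negative sampling, here is a from-scratch NumPy sketch of the idea (not Merlin’s implementation): every other positive item in the batch serves as a negative for a given user, and the loss is a softmax cross-entropy with the true item on the diagonal.

```python
import numpy as np

rng = np.random.default_rng(1)
batch = 4
user_emb = rng.normal(size=(batch, 8))
item_emb = rng.normal(size=(batch, 8))   # item i is the positive for user i

# Score every user in the batch against every in-batch item.
logits = user_emb @ item_emb.T                                   # (batch, batch)
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
loss = -np.mean(np.diag(log_probs))      # positives sit on the diagonal
print(loss > 0.0)                        # True
```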

Learnings from the Initial Pipeline

So far, we have outlined our initial training pipeline. NVIDIA Merlin helped us develop it quickly, and starting simple was essential. We will share some of our learnings, bugs and issues from this first phase.

1. Perfect Evaluation Score: You might notice that we apply NVTabular’s .fit function on the full dataset. In our first version, we fit the statistics only on the training dataset. The Categorify op maps unknown values to the same category (ID=0). As every GTC has mutually exclusive items, all validation item IDs are new, so Categorify maps them all to ID=0. The validation process then scores perfectly, as every item has ID=0. We fixed it by fitting on the full dataset. In our case, the data processing pipeline does not introduce data leakage, but be careful if you apply this strategy.
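A simplified stand-in for the Categorify behavior illustrates the bug (the function names are ours, not NVTabular’s):

```python
# Simplified stand-in for NVTabular's Categorify: unseen values map to ID 0.
def fit_categorify(values):
    return {v: i for i, v in enumerate(sorted(set(values)), start=1)}

def transform(mapping, values):
    return [mapping.get(v, 0) for v in values]

train_items = ["talk_a", "talk_b", "talk_c"]
valid_items = ["talk_x", "talk_y"]       # a new GTC: all items unseen

mapping = fit_categorify(train_items)
print(transform(mapping, valid_items))   # → [0, 0]: every validation item collapses to 0
```

With every validation item collapsing to the same ID, the model trivially “ranks” the right item first, which is why the score looked perfect.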

2. Evaluating the Full Item Catalog: We experienced that our evaluation metrics were not stable. We identified that using in-batch negative sampling for evaluation is not equivalent to what happens in the prediction period. Negative sampling samples popular items more often, but we have no information about the popularity of items in the test period. Therefore, we need to score all items during validation. We provide an example of scoring the full item catalog here.
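Scoring the full catalog is a single matrix multiplication of the user embeddings against every item embedding; a NumPy sketch with made-up sizes:

```python
import numpy as np

rng = np.random.default_rng(2)
n_users, n_items, dim, k = 100, 50, 16, 10
user_emb = rng.normal(size=(n_users, dim))
item_emb = rng.normal(size=(n_items, dim))
true_item = rng.integers(0, n_items, size=n_users)

# Score every user against the ENTIRE catalog, not a sampled subset.
scores = user_emb @ item_emb.T                       # (n_users, n_items)
top_k = np.argsort(-scores, axis=1)[:, :k]
recall_at_k = np.mean([true_item[u] in top_k[u] for u in range(n_users)])
print(0.0 <= recall_at_k <= 1.0)                     # True
```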

3. Initial Hyperparameters: Neural networks are sensitive to hyperparameters. The initial pipeline can be used to find good values for the learning rate, number of layers and hidden units. One important hyperparameter is removing the activation function before the dot product by setting no_activation_last_layer=True.
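A quick NumPy check suggests why a final ReLU hurts here: ReLU makes both embeddings non-negative, so their dot products can never be negative and the model loses half of its scoring range.

```python
import numpy as np

rng = np.random.default_rng(3)
user_emb = rng.normal(size=(32, 8))
item_emb = rng.normal(size=(32, 8))

relu = lambda x: np.maximum(x, 0.0)
scores_with_relu = relu(user_emb) @ relu(item_emb).T
scores_linear = user_emb @ item_emb.T

print((scores_with_relu < 0).any())  # False: a ReLU output can never score negatively
print((scores_linear < 0).any())     # True for these random embeddings
```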

We established a baseline model and accuracy with only 6 features. Adding more features or another architecture at this stage would have made the process slower and the issues more difficult to debug.

Improving the Baseline Model

Now that we have an initial pipeline, we can iterate on it to improve the model. We share some of our key improvements:

1. Adding all features: A quick win is to add all remaining features to our model architecture. NVTabular, Merlin Models and the schema file make it easy to update the code: we just add the features to the NVTabular workflow.

2. Adding Text Embeddings for title and abstract: Although all items are new during the prediction period, each item is described by a title and an abstract. We can use a pre-trained language model (e.g. BERT) to extract text embeddings and initialize an embedding table in the two-tower architecture with them, similar to our GitHub example. It is important to freeze the embeddings; otherwise, the embeddings are updated only during the training period and no longer fit the distribution in the prediction period.
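A toy NumPy sketch of the freezing logic (the helper function and sizes are ours, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical pre-trained text embeddings (e.g. BERT vectors of title + abstract).
pretrained = rng.normal(size=(5, 8))     # 5 training items, 8-dim vectors
embedding_table = pretrained.copy()      # item tower looks vectors up by item index

def sgd_step(table, grad, trainable, lr=0.01):
    # A frozen table ignores the gradient, so it stays aligned with the
    # language model that will embed the unseen items at prediction time.
    return table - lr * grad if trainable else table

embedding_table = sgd_step(embedding_table, rng.normal(size=(5, 8)), trainable=False)
print(np.allclose(embedding_table, pretrained))  # True: embeddings unchanged
```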

3. Adding Watched User History: Although a significant number of users are new for each event, we have a significant number of recurring attendees. We developed a feature that captures the primary topics a user watched in events prior to the current example. It is important that there is no data leakage: for every example, only historical information is used in the feature. The output is a list of IDs, and the model combines their embeddings via averaging.
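The averaging step boils down to a lookup plus a mean; a NumPy sketch with made-up topic IDs:

```python
import numpy as np

rng = np.random.default_rng(5)
n_topics, dim = 10, 4
topic_table = rng.normal(size=(n_topics, dim))   # embedding table for primary topics

# Topics the user watched at PRIOR events only (no leakage from the current one).
watched_history = [2, 7, 7, 3]

# Look up each topic's embedding and average them into one fixed-size feature.
history_feature = topic_table[watched_history].mean(axis=0)
print(history_feature.shape)   # (4,)
```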

4. Limiting Negative Samples: In-batch negative sampling selects the negatives from the current batch. In the default behavior with shuffling, the negatives can come from another event. For example, the positive datapoint is from GTC Fall 2021 and the negative datapoint is from GTC Spring 2022. Both items could be relevant to the user, but the user simply did not attend GTC Spring 2022, so such negatives are misleading. False negatives cannot be avoided entirely, but we can limit their effect. We modified the pipeline so that negatives are drawn only from the same GTC event as the positive: if the positive datapoint is from GTC Fall 2021, then its negatives are also from GTC Fall 2021. We implemented this strategy by partitioning the dataset by GTC event and not shuffling.
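The partitioning idea can be sketched in plain Python (the rows and batch size are made up; this is the batching logic, not our production code):

```python
from collections import defaultdict

# Hypothetical interactions: (user, item, event)
rows = [
    ("u1", "i1", "GTC Fall 2021"),
    ("u2", "i2", "GTC Fall 2021"),
    ("u3", "i3", "GTC Spring 2022"),
    ("u4", "i4", "GTC Spring 2022"),
    ("u5", "i5", "GTC Spring 2022"),
]

# Partition by event and batch within each partition, so in-batch negatives
# always come from the same GTC as the positive.
partitions = defaultdict(list)
for row in rows:
    partitions[row[2]].append(row)

batch_size = 2
batches = [
    part[i : i + batch_size]
    for part in partitions.values()
    for i in range(0, len(part), batch_size)
]

for batch in batches:
    assert len({event for _, _, event in batch}) == 1  # every batch is single-event
print(len(batches))  # → 3
```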

Summary

In this blog post, we provided some code snippets to train a two-tower architecture and walked through our experimentation process. We shared our key learnings from developing the initial training pipeline and the major iterations that improved the model. It is important to start simple in order to understand the improvement from each new feature.

If you are interested in the two-tower architecture, check out our full example on GitHub. Another relevant example shows how to use pre-trained embeddings with Merlin Models. We are working on the next version of our internal recommender systems; follow our blog to stay informed.

You can find more information about the full project in our prior blog posts: an overview of the challenges and project goals, and a detailed view of the data preprocessing.

Team

Thanks to the great team developing the in-house use-case: Angel Martinez, Pavel Klemenkov, Benedikt Schifferer


Benedikt Schifferer is a Deep Learning Engineer at NVIDIA working on recommender systems. Prior to that, he graduated with an MSc in Data Science from Columbia University.