Scale faster with less code using Two Tower with Merlin

Radek Osmulski
Published in NVIDIA Merlin · 8 min read · Jun 21, 2022


by Radek Osmulski, Benedikt Schifferer, Ronay Ak and Gabriel Moreira

Overview of the Merlin framework

Building recommender systems can be quite challenging. When we talk about recommender systems we often focus on providing the most relevant recommendations to the users. Think of YouTube recommending you the next video to watch or Amazon suggesting related products you might like.

But recommender systems in the real world have two other major tasks to accomplish, and they can be quite demanding. They might be required to deliver a recommendation in milliseconds to ensure a good user experience. That, as we will see, might require a significant amount of creativity and engineering. And the second consideration is that we want to minimize infrastructure costs while solving the latency issue, which is yet another obstacle to overcome!

Part of the problem is that there are often millions, or even billions, of items from which we can select recommendations. For example, imagine how many songs a music streaming service has. At such an item catalog scale, scoring every item with a complex ranking model is not possible given latency constraints. We first need to reduce the number of candidates to several hundred or thousand items. Only then does the task of calculating a score for each item using a more complex model become feasible.

To accomplish this, recommender systems are often a pipeline of multiple stages, each specialized in solving a specific problem: retrieving candidates, filtering, scoring (ranking) and ordering. If you want to learn more about each stage, you can read our other blog post “Recommender Systems, Not Just Recommender Models”.

In this blog post, we will focus on the candidate retrieval step. Common choices for retrieval models are Matrix Factorization and Two-Tower architecture. We will take a closer look at Two-Tower, which leverages user and item features for retrieving relevant candidates with high performance during inference.

Without further ado, let’s get started.

The Two Tower architecture

The Two Tower architecture consists of two major components: the item tower and the user/query tower.

Image Adapted from Off-policy Learning in Two-stage Recommender Systems

User features and item features are fed into the model through the corresponding towers. They interact with each other only before the last step in the computation, i.e., scoring. As we will see, this characteristic of the architecture is very important for better performance during inference.

Once our features pass through each of the towers, that is where the magic begins to happen.

At the top of each tower, we obtain multidimensional embedding vectors representing the user and item features. We subsequently compute the dot product between the user and item embedding vectors to produce recommendation scores. In other words, we compute scores both for positive items (i.e., items the users interacted with) and for sampled negative items (read more about negative sampling here and here), whose embeddings are likewise multiplied with the user embedding. Then a loss function (e.g., categorical cross-entropy, BPR) is applied on top of the positive and negative scores.
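To make the scoring and loss step concrete, here is a minimal NumPy sketch (not the Merlin implementation) of a popular variant, in-batch negative sampling with categorical cross-entropy: each user's positive item serves as a negative for every other user in the batch.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, dim = 4, 8

# Embeddings produced by the user and item towers for one training batch.
user_emb = rng.normal(size=(batch, dim))
item_emb = rng.normal(size=(batch, dim))  # the positive item for each user

# Dot-product scores: the diagonal holds positive pairs, while the
# off-diagonal entries act as in-batch sampled negatives.
scores = user_emb @ item_emb.T            # shape (batch, batch)

# Categorical cross-entropy where the correct "class" for user i
# is item i (the item that user actually interacted with).
log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
loss = -log_probs[np.arange(batch), np.arange(batch)].mean()
print(loss)
```

Minimizing this loss pushes the score of each positive pair up relative to the negatives in the batch.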

And that’s it!

While the Two Tower architecture continues to be an active area of research (especially in terms of how the loss is calculated and how negative examples are presented to the model), we have now discussed the major components that comprise it.

The appeal of the model however doesn’t lie in how it’s put together, but rather in the applications that it allows.

This is what we will discuss next.

Why the Two Tower architecture and not something else?

Two Tower is an architecture that elegantly solves the challenges associated with candidate retrieval. It does so by deferring the interaction between the item and user/query features until the very end.

If user and item features interacted early on in our architecture, we would need to perform the full set of calculations for every user-item pair, either in the request-response cycle or semi-regularly offline. The former might not be possible to complete in a timely manner, and the latter might still be very expensive (or outright infeasible). Think of millions of items in the inventory along with millions of users. This is a Cartesian problem, with users along one dimension and items along the other, where we have to compute a score via expensive calculations for each pairing. An approach where we pass an embedding for each user-item pair through a complex neural network architecture simply will not scale.

But because user and item features don’t interact in the Two Tower architecture before the final dot product, we can do something else.

We can precompute and cache item representations, either as items become available in the catalog or for all items at once, as they depend only on item features. In a request-response cycle we then only need to compute a single embedding: that of the user, to use as our query. Once we have the embeddings, calculated independently of each other, the only remaining operation is calculating the score.

A naive approach here would be to multiply the user/query representation by all cached item representations, which is linear in the item catalog size. But using approximate nearest neighbor (ANN) search we can do even better than that. ANN engines maintain indices optimized for vector similarity search, and the operation becomes sublinear! This is, for instance, the approach taken in the Deep Neural Networks for YouTube Recommendations paper.
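The serving-time flow can be sketched in a few lines of NumPy. This is the naive exhaustive version; in production an ANN index (e.g., Faiss or an HNSW-based engine) would replace the brute-force scoring step. The catalog size and dimensions below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, dim, k = 10_000, 64, 5

# Item embeddings precomputed offline by the item tower and cached.
item_emb = rng.normal(size=(n_items, dim))

# At request time, only the user/query embedding must be computed.
user_emb = rng.normal(size=(dim,))

# Exhaustive dot-product scoring: linear in catalog size.
# An ANN index makes this retrieval step sublinear in practice.
scores = item_emb @ user_emb
top_k = np.argpartition(-scores, k)[:k]          # unordered top-k candidates
top_k = top_k[np.argsort(-scores[top_k])]        # order candidates by score
print(top_k)
```

The `top_k` indices are the candidate items that would then be passed on to the more expensive ranking stage.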

There is another very notable aspect of the Two Tower architecture. It allows us to alleviate the cold-start problem for new users and new items, a common weakness of collaborative filtering approaches and models like Matrix Factorization that rely only on the user id and item id.

But we have talked about the intricacies of the architecture for quite some time already. We will dive deeper into how the cold-start problem can be addressed another time. For now, let us look at how you can train the architecture on your own data and move it all the way to serving.

Training the Two Tower architecture

The Merlin Models library allows you to train the Two Tower architecture in as few as four lines of code!

The input features for this model can be either numerical or categorical, the latter usually represented by embeddings. The towers’ architecture can be arbitrarily simple or complex, depending on the size and complexity of our data. Two or three Multilayer Perceptron (MLP) layers with a performant non-linear activation function (ReLU is a popular choice) are a good place to start. Some practitioners prefer to omit the activation function in the last layer, as it turns negative values into zeros (this is, for instance, the approach taken in Mixed Negative Sampling for Learning Two-tower Neural Networks in Recommendations). The output dimensions of the last layers of the two towers must match, so that the dot product can be computed. A common choice is applying L2-normalization to the user and item representations, so that the dot product becomes the cosine similarity. Merlin Models supports all of those options.
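To illustrate the tower structure described above, here is a minimal NumPy sketch, not the Merlin implementation: two small MLP towers with ReLU hidden layers, a linear final layer of matching output dimension, and L2-normalization so that the dot product yields cosine similarity. All sizes and weights are toy placeholders.

```python
import numpy as np

def mlp_tower(x, weights):
    """A small MLP tower: ReLU on hidden layers, linear output layer."""
    for w in weights[:-1]:
        x = np.maximum(x @ w, 0.0)      # ReLU hidden layers
    return x @ weights[-1]              # no activation on the last layer

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
user_feats = rng.normal(size=(3, 16))   # toy user feature vectors
item_feats = rng.normal(size=(3, 24))   # toy item feature vectors

# The two towers can have different input sizes, but their final
# output dimension (here 8) must match for the dot product.
user_w = [rng.normal(size=(16, 32)) * 0.1, rng.normal(size=(32, 8)) * 0.1]
item_w = [rng.normal(size=(24, 32)) * 0.1, rng.normal(size=(32, 8)) * 0.1]

user_emb = l2_normalize(mlp_tower(user_feats, user_w))
item_emb = l2_normalize(mlp_tower(item_feats, item_w))

# With L2-normalized embeddings, the dot product is cosine similarity.
scores = user_emb @ item_emb.T
print(scores.shape)
```

In a trained model the weights would of course be learned rather than random; the sketch only shows how the two towers stay independent until the final dot product.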

To paint the bigger picture, the tooling the Merlin framework provides integrates all the steps from data processing, through model training, all the way to model deployment (read more about model deployment here). This allows you to describe many aspects of your data only once (say, how to process a given column), and the processing pipeline, model training, and deployment can all leverage that information. There is no need to restate the same piece of information (“this is a categorical column that I want embedded with a dimensionality of 64 and l2 regularization”) at every step of the process.

Not only is this convenient, it also helps you deliver scalable solutions with fewer bugs!

Recommender systems in the wild

Two Tower has received widespread adoption across the industry. Let us look at some of the most challenging environments, those of Internet-scale enterprises, and consider its performance.

Twitter is leveraging the Two Tower architecture to combine earlier heuristics under a single umbrella model. This exploits the flexibility the Two Tower offers, where data of various types and origins can be fed into the respective towers. Having transitioned to Two Tower, they report better relevance of results and the ability to retire older heuristics that were costly to maintain, and they see the Two Tower architecture as a stepping stone toward faster iteration and the exploration of new signals.

Pinterest begins their post by exploiting a characteristic of the Two Tower architecture, in-batch negative sampling, to generate a high number of negative examples, which in turn leads to better performance on several key metrics. They further describe using the architectural properties of Two Tower that we covered above to cut infrastructure costs. One might imagine that cutting serving costs and simplifying training should lead to decreased performance, but something else happens instead: Pinterest is seeing a 2-3% improvement across key engagement metrics.

The list of industry use cases could go on and on. Among other large enterprises, the Two Tower architecture is actively used and researched by Amazon, eBay, Meituan, and YouTube.

Try it out yourself and learn more!

You can try out the Two Tower model in the examples: How to train a Two Tower model, or the end-to-end example of training and deploying a Two Tower model for candidate generation with Facebook’s DLRM architecture for scoring the candidates. We used the Merlin Models library to train the Two Tower architecture with only ~4 lines of code, but the library provides much more functionality. Recommender systems are complex and diverse. Merlin Models provides implementations of common candidate generation and scoring models, different negative sampling strategies, different loss functions, and much more. The library is one component of the open source framework Merlin, which covers other steps such as feature engineering and deployment. If you are interested, you can learn more on the Merlin product page.

On July 28, 2022, we will host NVIDIA’s RecSys Summit 2022 online. Many guest speakers from industry will talk about the challenges of applying recommender systems in the real world. Sign up and join for free!

If you found this article interesting, stay on the lookout, as we plan to publish many more blog posts!

Thank you for reading and see you next time!

