Video Recommendations at Joyn: Two Tower or Not to Tower, That Was Never a Question

Integrate context into your recommender system to make your users and hence yourself happy.

Simon Hawe
ProSiebenSat.1 Tech Blog
12 min read · Dec 12, 2023


Photo by Koushik Pal on Unsplash

When you use a video streaming platform, the most important question for you is probably “How do I find something cool to watch?”. Most likely, you will watch something you already know or that is currently popular. However, how can you go beyond that? How can you discover awesome content that fits your taste and mood from a huge catalog without endless searching or browsing? This is where recommender systems enter the game.

A recommender system aims to simplify content discovery using algorithms and data about your behavior, preferences, and viewing patterns. By doing this, the system can offer highly personalized suggestions, ultimately enhancing your experience and ensuring that everyone easily finds and enjoys content that aligns with their unique tastes and interests.

In this article, Victoria (Associate Data Scientist), Nikita (Senior Data Scientist), and Simon (Vice President Engineering) give you the Why, What, and a glimpse of the How of the recommender system we built for our streaming platform Joyn. On a high level, it is a context-aware neural network-based online prediction system based on Nvidia’s bleeding edge open-source recommender system framework Merlin. Sounds fancy, doesn’t it? Are you curious? Our recommendation for you is to read on!

Recommender System Foundations

Photo by Ben Allan on Unsplash

Modern recommender systems can be roughly clustered into three categories:

  1. Content-based filtering systems: They suggest items based on metadata of videos you have shown interest in or liked in the past.
  2. Collaborative filtering systems: They predict a user’s preferences by leveraging the preferences and behaviors of similar users. Preferences can be expressed either explicitly via ratings or implicitly via users interacting with items. The latter is most commonly used in practice, as ratings are rarely given.
  3. Hybrid systems: They combine both metadata and user interaction data.

At Joyn, we historically employed a collaborative filtering approach based on implicit user feedback and Matrix Factorization. This brought us quite far in serving high-quality personalized video recommendations. However, this approach has three major limitations:

  1. It doesn’t allow you to easily integrate context information when coming up with recommendations. By context, we mean information like the device you are currently using, the time of day when you request a recommendation, or your most recent video views. Why is that information valuable? It allows one to adapt recommendations to that context. In the morning on your mobile phone, you most likely want to watch something different than in the evening on your big-screen device.
  2. Directly adding metadata into model training is not feasible; interaction data is the sole input. Other data can only be included as post-processing in an ad hoc way. Yet, with abundant content metadata and the rise of Large Language Models (LLMs) and text embeddings, including this information can significantly enhance recommendation quality.
  3. It is almost impossible to get recommendations for users not present during training, which is known as the user cold-start problem.

Now, what should we do? How can we address the above-mentioned shortcomings and serve the most suitable recommendations to all our users? This is where neural network-based recommender architectures enter the game.

How do neural network-based systems help? Roughly speaking, you input features such as interaction data, context information, or metadata, and the network determines the optimal combination through its learning mechanisms. Adding new features or context consequently only means providing more data without redesigning your entire architecture. That’s cool, isn’t it?

Photo by Collin on Unsplash

Several neural network-based recommender architectures exist, which you can read about here. However, we cannot freely choose an architecture, as we have one constraint: context information about our users is only available at request time. That means we have to compute recommendations in real time without introducing large latencies. Hence, we need a very efficient system.

Luckily, there is one architecture that caters to all of that: Two-Tower Recommender systems.

What is that, you may ask yourself? That’s what we cover in the next section.

Two-Tower Recommender System

The idea behind the two-tower model architecture is that there are two separate neural networks, called towers. You can interpret these two towers as separate models: one representing the users, known as the query tower, and one representing the items, known as the candidate tower. During training, each tower learns to transform an arbitrary set of input features into vectors known as embeddings. The dimension of these embeddings must be the same for users and items, as their similarity is ultimately measured using the dot product.

Two-Tower Architecture (Picture by the authors drawn with excalidraw)

As you can see from the picture above, user and item information interact only in the final stage of the system. This characteristic enables effective model training and makes two-tower networks a perfect candidate for real-time inference.

But why is the architecture efficient when it comes to inference? During inference, we can make use of the fact that the item embeddings are fixed. Hence, we can precompute them after training and load them at inference time, which means we don’t have to execute any neural net for items during inference.

User embeddings, however, depend on context. Therefore, we have to compute them on the fly, so a user gets different predictions depending on, e.g., the time of day. However, we only have to do this for one user at a time, which is again fast to compute.

Last, we calculate the dot product between the loaded item embeddings and the computed user embedding and sort the results in descending order by score. Again, a very efficient and fast operation.
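To make that scoring step concrete, here is a minimal sketch in plain NumPy. The file names and shapes are made up for illustration; the point is that, per request, only a matrix-vector product and a sort are needed.

```python
import numpy as np

# Precomputed once after training: one embedding per catalog item (shape: n_items x dim)
# plus the item IDs aligned with those rows. File names are hypothetical.
item_embeddings = np.load("item_embeddings.npy")
item_ids = np.load("item_ids.npy")

def recommend(user_embedding: np.ndarray, top_k: int = 20) -> list:
    """Score all items against one user embedding and return the top-k item IDs."""
    scores = item_embeddings @ user_embedding   # one dot product per item
    top_idx = np.argsort(-scores)[:top_k]       # highest scores first
    return item_ids[top_idx].tolist()
```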

Photo by Shiro hatori on Unsplash

So far, we have only talked about this at a very high level. Next, we dive into the Merlin framework, which we use to bring all of that to action.

Using the Nvidia Merlin Framework

Photo by Elisabeth Pieringer on Unsplash

The most important part when building a recommender system, or any Machine Learning system, is having the necessary data and being able to test how your data, features, and model architecture work in practice. Having standardized building blocks greatly streamlines this process, relieving you from implementing them and enabling you to focus on delivering value. This is the core offering of NVIDIA’s open-source Merlin framework, which furnishes user-friendly APIs for developing scalable recommender systems ready for production.

Concretely, Merlin is a set of high-level Python libraries built around PyTorch and TensorFlow that run on both CPU and GPU. The main libraries we use are Merlin NVTabular and Merlin Models.

NVTabular is a library to build scalable feature engineering pipelines, also known as workflows. One core benefit of it is that it ensures that feature engineering is done consistently between training and prediction.

Merlin Models is a library that provides standard components for recommender systems, including classic matrix-factorization-based ones but also neural-net-based ones like the two-tower recommender. You can use all components out of the box, building a training pipeline with about 10 lines of code. But you also have the flexibility to add your own components or extend existing ones. That’s all pretty nice.

Looking at the downsides, as always with new frameworks, you have to consider that finding help when running into issues is difficult due to a small community and documentation that leaves room for improvement. Hence, when you find bugs, and be sure you will, you have to be able to help yourself. When asking ChatGPT about Merlin, you will probably end up reading things about wizards. Still, we think this framework is the right choice and has a bright future as more people adopt it.

Data, Training, and Serving

Training and Serving architecture (Picture by the authors drawn with excalidraw)

The Dataset

The key information we feed into the model is user-item interactions. As we have many shows and series on our platform, we have multiple interactions, one per episode, with the same content. We do not aggregate these on a user/series level but keep each interaction as one event. This approach has proven effective for us, possibly because the number of interactions reflects a user’s genuine interest in the content.

We enrich the interaction data with both user and item metadata. User features include contextual information such as device, time of day, and day of the week, as well as demographics.

For items, we add categorical information such as genres or type (movie or series). But an item is best characterized by its textual description; hence, it is beneficial to use that information too. How can we do that? We can do that by creating text embeddings and adding them to the model as untrainable weights, helping the model to better understand item similarity. We create the text embeddings via a Hugging Face LLM, but you are free to choose which model to use.
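As an illustrative sketch, such embeddings could be produced with an off-the-shelf sentence-embedding model from the Hugging Face ecosystem. The model name and the item descriptions below are placeholders, not necessarily what we use in production.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Example model; any Hugging Face sentence-embedding model with a fixed output size works.
encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

item_descriptions = {
    "item_1": "A detective returns to her hometown to solve a decades-old case.",
    "item_2": "A comedy panel show where celebrities guess each other's secrets.",
}

# One fixed-size vector per item, stored alongside the item IDs and later injected
# into the candidate tower as untrainable (frozen) embedding weights.
vectors = encoder.encode(list(item_descriptions.values()), normalize_embeddings=True)
np.save("item_text_embeddings.npy", vectors)
```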

Another thing to mention is the so-called “user-take” features. Those involve incorporating a user’s n most recent interactions as additional model features. NVTabular can encode these jointly with the main item-ID feature, creating a connection for the model. This approach has multiple benefits.

First, it provides additional context about a user’s current interests and steers the recommendations toward that.

Second, this helps address the cold start problem, offering recommendations for new users right after their first interaction without having to retrain the model. For that to work, we take a share of users’ data and remove their user-ID. This allows our model to learn to make predictions solely based on user takes, and hence mitigate the cold start problem.

To combine and execute all these steps into a single processing pipeline, we use NVTabular. It also takes care of handling missing values or encoding categorical features with a minimal amount of code. This is exactly what we love about Merlin — it hides a lot of boilerplate from you when it comes to setting up and exporting a machine-learning pipeline. Hence, it lets you focus on actual feature and model development.
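To give a flavor of what such a pipeline looks like, here is a condensed sketch of an NVTabular workflow. Column names and file paths are illustrative, not our production schema, and the joint encoding of the recent-item IDs with the main item ID is omitted for brevity.

```python
import nvtabular as nvt
from merlin.io import Dataset

# Each categorical feature gets encoded and tagged, so Merlin later knows
# which inputs belong to the query (user) tower and which to the candidate (item) tower.
item_id = ["item_id"] >> nvt.ops.Categorify() >> nvt.ops.TagAsItemID()
user_id = ["user_id"] >> nvt.ops.Categorify() >> nvt.ops.TagAsUserID()
context = (
    ["device", "time_of_day", "day_of_week"]
    >> nvt.ops.FillMissing()
    >> nvt.ops.Categorify()
    >> nvt.ops.TagAsUserFeatures()
)
item_meta = ["genre", "content_type"] >> nvt.ops.Categorify() >> nvt.ops.TagAsItemFeatures()

workflow = nvt.Workflow(item_id + user_id + context + item_meta)
train_ds = Dataset("interactions.parquet")                    # hypothetical input path
workflow.fit(train_ds)
workflow.transform(train_ds).to_parquet("train_processed/")   # the same workflow is reused at serving time
```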

Let’s have a look at that in the next section.

Model Training

Photo by John Cameron on Unsplash

Before we train anything on our dataset, we need to build a model. This includes taking all our user and item features, splitting them between the query and item towers accordingly, creating input embedding layers of the correct size for every feature, adding a couple of dense layers, and defining a loss function. Quite a lot, isn’t it?

Luckily, Merlin abstracts all of that away from us. When we apply our NVTabular preprocessing pipeline to our training events, the underlying schema object of the NVTabular dataset holds all the information necessary for Merlin to create the correct inputs for your model. All that is left is to choose the model architecture and the loss function from the Merlin Models package.
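A rough sketch of what that looks like with Merlin Models (TensorFlow flavor; tower sizes and paths are placeholders):

```python
import merlin.models.tf as mm
from merlin.io import Dataset

train = Dataset("train_processed/")   # output of the NVTabular workflow from above
schema = train.schema                 # carries feature types, tags, and cardinalities

# Both towers are built automatically from the schema; we only choose the layer sizes,
# so query and candidate towers end in embeddings of the same dimension.
model = mm.TwoTowerModel(schema, query_tower=mm.MLPBlock([128, 64]))
model.compile(optimizer="adam")
```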

But there is one more thing to consider. Our dataset contains interactions between users and items, which we all treat as positive samples, i.e., something a user likes. However, we also have to show the model negative samples, i.e., items a user does not like. Why? Otherwise, the model would just learn to always spit out a perfect positive score independent of its input. This would be pretty useless, wouldn’t it?

One approach to get negative training examples is negative sampling. Here, for each positive sample, we select n items that a user has not interacted with as negative examples. Various methods exist for drawing these samples from your data. The most efficient one that also works well in practice is In-Batch negative sampling, where you select samples from the current training batch and not from the entire training set. For this to work, your training set must be randomly shuffled and not sorted by user ID or item ID.

Regarding the implementation of sampling, Merlin offers an interface to add any sampler of your choice. It also comes with a set of predefined samplers, including In-Batch negative sampling.
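In the sketch from above, adding the predefined sampler is a single extra argument (to the best of our knowledge of the API; the keyword may differ between Merlin versions):

```python
import merlin.models.tf as mm

# Negatives for one user are simply the positive items of the other users
# in the same (randomly shuffled) training batch.
model = mm.TwoTowerModel(
    schema,
    query_tower=mm.MLPBlock([128, 64]),
    samplers=[mm.InBatchSampler()],
)
```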

With this, we have everything ready to train our two-tower recommender. Exciting!

For fast training and maximum flexibility, we use Google Cloud Vertex AI Custom Jobs. Here, you can run your customized Docker containers based on prebuilt Vertex AI base images, which gives you flexibility, on a GPU-powered machine of your choice, which makes it fast. We store the resulting model artifacts in a Google Cloud Storage (GCS) bucket. One important detail: for the query tower, i.e., the user model, we store the network weights. For the items, however, we compute the output of the candidate tower, i.e., the item embeddings, and store only those on GCS.
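As a sketch, launching such a job with the Vertex AI Python SDK looks roughly like this. Project, region, image URI, machine type, and bucket paths are placeholders.

```python
from google.cloud import aiplatform

aiplatform.init(
    project="my-project",
    location="europe-west4",
    staging_bucket="gs://my-model-bucket",
)

job = aiplatform.CustomJob(
    display_name="two-tower-training",
    worker_pool_specs=[{
        "machine_spec": {
            "machine_type": "n1-standard-8",
            "accelerator_type": "NVIDIA_TESLA_T4",
            "accelerator_count": 1,
        },
        "replica_count": 1,
        "container_spec": {
            # Custom training image based on a prebuilt Vertex AI / Merlin base image.
            "image_uri": "europe-docker.pkg.dev/my-project/recsys/two-tower-train:latest",
            "args": ["--output-dir", "gs://my-model-bucket/models/"],
        },
    }],
)
job.run()
```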

We retrain the model daily to incorporate information about new users and new items. Training Done!

Now, what do we do with those stored artifacts? That’s what comes in the last section.

Serving Recommendations to Our Users

With our trained model, we want to compute recommendations in real time. Sounds simple, doesn’t it? Theoretically, it is. However, to turn this into a scalable, production-grade solution serving millions of requests per day, we had to invest quite some brain and coding power, which deserves its own article, which you find here. Nevertheless, here we give you a high-level overview of our solution.

Serving details (Picture by the authors drawn with excalidraw)

We have a FastAPI-based HTTP API that accepts requests for recommendations. Those requests contain information about the user. From these user features, we compute the user embedding with the trained query tower. To do that, we use Nvidia’s Triton Inference Server, an optimized application for executing model inference with low latency. We run it as a second process next to our API application within the same Docker container. Crazy stuff, isn’t it?

In the FastAPI app, we load the item embeddings on startup. After calculating the user embedding, we determine the final recommendations by computing the dot product with the item embeddings and sorting the scores in descending order.
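A heavily simplified sketch of that request path is shown below. Model, input, and output names are placeholders for the exported query tower, and the features are assumed to arrive already encoded as integers.

```python
import numpy as np
import tritonclient.http as triton_http
from fastapi import FastAPI

app = FastAPI()
triton = triton_http.InferenceServerClient(url="localhost:8000")  # Triton runs next to the API
item_embeddings = np.load("item_embeddings.npy")                  # precomputed by the candidate tower
item_ids = np.load("item_ids.npy")

@app.post("/recommendations")
def recommendations(user_features: dict[str, int], top_k: int = 20) -> list:
    # Wrap each (already encoded) user/context feature as a Triton input tensor.
    inputs = []
    for name, value in user_features.items():
        tensor = np.asarray([[value]], dtype=np.int64)
        infer_input = triton_http.InferInput(name, list(tensor.shape), "INT64")
        infer_input.set_data_from_numpy(tensor)
        inputs.append(infer_input)

    # The query tower served by Triton turns the context features into a user embedding.
    response = triton.infer(model_name="query_tower", inputs=inputs)
    user_embedding = response.as_numpy("user_embedding")[0]

    # Rank the catalog: one dot product per item, then sort descending.
    scores = item_embeddings @ user_embedding
    return item_ids[np.argsort(-scores)[:top_k]].tolist()
```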

The entire application is hosted on Google Cloud Run. As loading times are rather long due to loading the model and initializing Triton, Cloud Run is not the best choice considering the cost/benefit ratio. Hence, we will move to bare-metal Google Compute Engine (GCE), but this is just an implementation detail.

As pointed out earlier, we periodically retrain our model. This means we have to update the model artifacts used in the container to make use of a fresh model. Doing this without incurring service downtime or high latency is a non-trivial task, and we will cover it in depth in the serving article. In a nutshell, a background thread observes file changes on GCS and is responsible for reloading the item embeddings and the Triton instance so that the two stay in sync.
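A simplified sketch of such a watcher is given below. Bucket name, blob path, and polling interval are placeholders, and the real implementation additionally has to swap the Triton model and the embeddings atomically.

```python
import threading
import time
from google.cloud import storage

def watch_for_new_artifacts(bucket_name: str, blob_name: str, reload_fn, poll_seconds: int = 60):
    """Poll a GCS blob and trigger a reload whenever its update timestamp changes."""
    client = storage.Client()
    last_updated = None
    while True:
        blob = client.bucket(bucket_name).get_blob(blob_name)
        if blob is not None and blob.updated != last_updated:
            last_updated = blob.updated
            reload_fn()  # reload item embeddings and point Triton to the fresh query tower
        time.sleep(poll_seconds)

# Started once at application startup, next to the FastAPI app.
threading.Thread(
    target=watch_for_new_artifacts,
    args=("my-model-bucket", "models/latest/item_embeddings.npy", lambda: None),
    daemon=True,
).start()
```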

And that is the whole story of how we built a context-aware, real-time recommender system for our streaming platform Joyn using Nvidia Merlin. Feels like magic!

Photo by Dollar Gill on Unsplash

Wrap Up

Matching recommendations to a user’s context is crucial to offer them the most suitable choices at any given point in time. In this article, we at Joyn have given you a high-level overview of why we chose the Two-Tower recommender architecture for that purpose, how we build and train it using NVIDIA Merlin, and how we serve contextualized recommendations in real time using Triton and Merlin. Stay tuned for our follow-up article on the details of efficient model serving.

Thank you for following this post. As always, feel free to contact or follow us with questions, comments, or suggestions, either here on Medium or via LinkedIn: Victoria, Nikita, and Simon. 🙌
