Building modern recommender systems: when deep learning meets product principles

Jun 25, 2019 · 7 min read

Dailymotion is a video platform hosting hundreds of millions of videos, watched every day by millions of users. Given the size of our catalog, automatically selecting the best videos to recommend to our users is crucial to engaging them and driving our growth. One of the most important surfaces where recommendation plays a role is the “video page”, where the main video is followed by a selection of related videos. We call this algorithm “video to video”.

In the past, we relied heavily on external providers for our entire “video to video” stack. Although they produced fairly relevant recommendations, we felt we could do better on two specific points:

So how did we build our new in-house algorithm and improve both traditional ranking metrics and our main product KPIs?

State of the art algorithms

In the past couple of years, the recommendation domain has seen a big shift from traditional matrix factorization algorithms (cf. the Netflix Prize in 2009) to state-of-the-art deep-learning-based methods.

Here are some of the reasons that made us choose deep learning techniques for our recommendation stack:

Let’s see how these ideas are connected:

[Figure: model architecture]

This algorithm is designed to predict the next watch a user will make based on their recent watch history (video id + metadata) and the watching context (e.g. device and country). At the same time, it learns the probability of each video being displayed given the user’s recent history and their context. This gives us an accurate estimate of the propensity (probability of being displayed) of the video we are trying to predict.

Using propensity re-weighting, the loss for an element of the left branch of the neural network (which tries to predict the next video) is defined as:

$$\mathcal{L} = -\frac{1}{p_v}\,\log\left(\frac{e^{s_v}}{\sum_{v' \in V} e^{s_{v'}}}\right)$$

where $p_v$ is the propensity of the observed next video $v$ and $s_v$ its score,

which is simply a weighted (by the inverse-propensity) cross-entropy loss over a softmax.
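As a rough sketch, this inverse-propensity-weighted loss can be written in a few lines of numpy (the function name and argument shapes here are illustrative assumptions, not Dailymotion’s production code):

```python
import numpy as np

def ips_cross_entropy(scores, next_video, propensity):
    """Inverse-propensity-weighted cross-entropy over a softmax.

    scores:      unnormalized scores for every candidate video
    next_video:  index of the video the user actually watched next
    propensity:  estimated probability that this video was displayed
    """
    # Numerically stable log-softmax.
    shifted = scores - scores.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    # Rarely displayed videos get up-weighted by 1 / propensity.
    return -log_probs[next_video] / propensity
```

Dividing by the propensity corrects for the exposure bias of the logged data: videos that were rarely displayed contribute more to the loss when they are watched.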

At serving time, in order to increase the diversity of the recommended videos (and collect feedback from our users on more videos), we use a stochastic version of the softmax. Instead of taking a greedy approach (selecting the top-k videos with the highest scores), we select the smallest set of highest-probability videos whose cumulative probability mass exceeds a threshold (0.9 for example). We then form a new distribution over these videos by rescaling their probabilities, and sample from it. This is called nucleus sampling [6] and lets us control the amount of exploration in our algorithm: when the model is very confident, the nucleus distribution contains very few videos, while it has a long tail when the model is uncertain about its outputs.

[Figure: Nucleus sampling]
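For illustration, nucleus (top-p) sampling over a softmax output can be sketched like this (a simplified numpy version, not our serving code):

```python
import numpy as np

def nucleus_sample(probs, p=0.9, rng=None):
    """Sample a video index from the smallest set of top-probability
    videos whose cumulative mass exceeds p."""
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]            # most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1
    nucleus = order[:cutoff]                   # the "nucleus" of videos
    renormalized = probs[nucleus] / probs[nucleus].sum()
    return rng.choice(nucleus, p=renormalized)
```

With a confident model most of the mass sits on a few videos and the nucleus stays small; with a flat distribution the nucleus, and thus the amount of exploration, grows.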

Meeting the product principles

Quite often, machine learning practitioners think of recommender systems (or any machine learning algorithm integrated into a product) only in terms of statistical performance on traditional metrics (recall@k, map@k, NDCG@k, …). Meeting the product requirements often comes second and is achieved only by applying hard rules to the output of the machine-learning-based recommender. This is the worst situation and exactly what should be avoided, as it creates a mismatch between what the model learns and what it outputs after the hard rules are applied.

Here are two examples of the product rules we have integrated into our algorithm:

Rule 1: Promote partners who upload premium content

At Dailymotion, most of our partners deliver high-quality content which qualifies them as Premium. As our strategy is now focused on premium content and because we noticed that great videos receive more attention, it is better to favor them in the recommendation algorithm.

As mentioned before, our algorithm is trained to predict the next video the users will want to watch given everything we know they have watched before.

Using a bandit/RL formalism, we can introduce a reward for each observation and customize it given the type of views the user has made. The loss we introduced before then becomes:

$$\mathcal{L} = -\frac{R}{p_v}\,\log\left(\frac{e^{s_v}}{\sum_{v' \in V} e^{s_{v'}}}\right)$$

Setting the reward (R) to a higher value for views of high-quality videos favors them at training time while deteriorating the relevance (recall@N) of the recommendations. This is shown in the following graph from an offline experiment:

[Figure: offline tradeoff between recall@N and the share of premium videos in recommendations]

The goal here was to find a tradeoff between relevancy (recall@N) and the percentage of premium videos surfaced in the recommendation lists.
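Concretely, the reward simply scales the per-example loss. A hypothetical numpy sketch (the function name and the constant reward are illustrative assumptions):

```python
import numpy as np

def reward_weighted_loss(scores, next_video, propensity, reward):
    """Cross-entropy weighted by reward / propensity: views of premium
    videos (higher reward) pull the model harder toward them."""
    # Numerically stable log-softmax.
    shifted = scores - scores.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -reward * log_probs[next_video] / propensity
```

Sweeping the reward assigned to premium views then traces out the recall-vs-premium-share tradeoff curve described above.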

Rule 2: Consistency between the input video and the recommended ones

Consistency is one of the criteria that we would also like to achieve. For instance, if a user starts watching a video in English we would like to surface more videos in the same language. If a user starts to engage with a video from a given category (football for example), the following videos should follow the lead.

Using a purely collaborative approach can provide consistency as people generally stick to the same language or category. However, for videos that have few or no collaborative signals, it gets trickier. This means popular items might start to surface in the recommendation list. By tying the video embedding with the output softmax matrix (as in [7]) we can jointly regularize them and therefore ensure better consistency. We also use a multi-task learning approach for regularization as shown in [8]. By predicting the category of the video the user has watched, we constrain videos from the same category to have close representations in the softmax embedding matrix and also in the video embedding (as those two are tied).
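A toy sketch of the tying idea (the shapes and the mean-pooled user state are illustrative assumptions, not the actual architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
num_videos, dim, num_categories = 1000, 64, 20

# One embedding matrix shared by the input lookup and the output
# softmax projection -- "weight tying" as in [7].
E = rng.normal(scale=0.1, size=(num_videos, dim))
W_category = rng.normal(scale=0.1, size=(dim, num_categories))

def forward(history_ids):
    # Toy user state: mean of the watched videos' embeddings.
    state = E[history_ids].mean(axis=0)
    video_logits = E @ state              # tied output projection
    category_logits = state @ W_category  # auxiliary multi-task head
    return video_logits, category_logits
```

Because E appears on both sides, gradients from the auxiliary category head regularize the same matrix used to score the next video, which is what pushes same-category videos toward close representations.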

Using the described video recommendation approach has enabled us to improve both our statistical and product KPIs twofold. However, this is just the beginning of our journey to fully satisfy our users. We still have a lot of issues to tackle, including:

[1] Session-based Recommendations with Recurrent Neural Networks
[2] Contextual Sequence Modeling for Recommendation with Recurrent Neural Networks
[3] Latent Cross: Making Use of Context in Recurrent Recommender Systems
[4] Deep Learning From Logged Bandit Feedback
[5] Top-K Off-Policy Correction for a REINFORCE Recommender System
[6] The Curious Case of Neural Text Degeneration
[7] Using the Output Embedding to Improve Language Models
[8] Meta-Prod2Vec: Product Embeddings Using Side-Information for Recommendation
[9] Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
[10] Axiomatic Attribution for Deep Networks

