Dailymotion is a video platform hosting hundreds of millions of videos, watched every day by millions of users. Given the size of our catalog, automatically selecting the best videos to recommend is crucial to engaging our users and driving our growth. One of the most important places where recommendation plays a role is the “video page”, where the main video is followed by a selection of related videos. We call this algorithm “video to video”.
In the past, we relied heavily on external providers for our entire “video to video” stack. Although they produced fairly relevant recommendations, we felt that we could do better on two specific points:
- Improve the recommendation algorithm (and the traditional ranking metrics we monitor, such as recall@N) by using state-of-the-art approaches that can better integrate new features and better understand the signals we get from our users.
- Better integrate our product vision and our KPIs within the algorithm (especially for the learning/optimization phase) in order to be perfectly aligned with the product principles defined by our product team.
So how did we build our new in-house algorithm and improve traditional ranking metrics while also boosting our main product KPIs?
State of the art algorithms
In the past couple of years, the recommendation domain has shifted from traditional matrix factorization algorithms (cf. the Netflix Prize in 2009) to state-of-the-art deep learning based methods.
Here are some of the reasons that made us choose deep learning techniques for our recommendation stack:
- The signals we get from our users (such as views) are not independently distributed observations but can be represented as sequences of actions. Modeling these sequences efficiently using recurrent neural networks (RNN) was key to improving the accuracy of our recommender system (as in [1]).
- Our videos are often characterized by features (category of the video, name of the channel) which can be used to derive similarities between videos. Moreover, the context of the watch (device, country, …) is crucial in order to tailor the recommendation. Using them as features in a deep learning model has enabled faster convergence and has also helped in the cold-start regime, when no user signal is available for a given video (as shown in [2] or [3]).
- We only observe feedback (watches in our case) on a given video when it has been shown to the users (bandit feedback). As a consequence, we do not know what would have happened if we had selected other videos for a given user (counterfactual reasoning). Learning in this type of setting requires special paradigms such as off-policy learning or counterfactual learning, which are widely used in reinforcement learning for example. Recently, several works have studied deep learning based models in these settings. They have especially focused on propensity-based methods to remove the biases in the training dataset (as in [4]). We started with a simple frequentist approach to propensity estimation and then improved it using the formalism described in [5]. As in that work, we used a multi-task approach to jointly learn the recommendation model and a propensity estimator.
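As a rough illustration of the simple frequentist propensity estimate mentioned in the last point, the sketch below counts how often each video was displayed in a given context and divides by the total number of displays in that context. The log format, the `estimate_propensities` name and the `floor` parameter are assumptions made for this example, not a description of the production pipeline.

```python
from collections import Counter, defaultdict


def estimate_propensities(display_log, floor=1e-3):
    """Estimate P(video displayed | context) by simple counting.

    display_log: iterable of (context, video_id) pairs, one per impression
                 (hypothetical log format).
    floor: hypothetical lower bound so that inverse weights stay bounded.
    """
    per_context = defaultdict(Counter)
    for context, video_id in display_log:
        per_context[context][video_id] += 1

    propensities = {}
    for context, counts in per_context.items():
        total = sum(counts.values())
        for video_id, n in counts.items():
            # Share of this context's impressions that went to this video.
            propensities[(context, video_id)] = max(n / total, floor)
    return propensities


# Toy impression log: (context, video_id)
log = [("mobile", "v1"), ("mobile", "v1"), ("mobile", "v2"), ("desktop", "v3")]
props = estimate_propensities(log)
```

A counting estimate like this cannot generalize to contexts or videos it has rarely seen, which is one motivation for replacing it with a learned propensity estimator.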
Let’s see how these ideas are connected:
This algorithm is designed to predict the next watch the user will make based on their recent watch history (video id + metadata) and the watching context (e.g. device and country). At the same time, it learns the probability of each video being displayed given the user’s recent history and the watching context. This gives us an accurate estimate of the propensity (probability of being displayed) of the video we are trying to predict.
Using propensity re-weighting, the loss for an element of the left branch of the neural network (the branch that tries to predict the next video) is simply a cross-entropy loss over a softmax, weighted by the inverse propensity.
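To make the inverse-propensity weighting concrete, here is a minimal sketch of such a loss for a single training example. It assumes the loss has the form -(1 / propensity) * log(softmax(logits)[target]); the function names and the `max_weight` cap are illustrative assumptions and do not come from the model described here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def ips_cross_entropy(logits, target, propensity, max_weight=100.0):
    """Cross-entropy of the watched video, weighted by 1 / propensity.

    max_weight is a hypothetical cap that limits the variance caused by
    very small propensities.
    """
    weight = min(1.0 / propensity, max_weight)
    prob = softmax(logits)[target]
    return -weight * math.log(prob)


# Same prediction, but the watched video was rarely vs. often displayed.
loss_rare = ips_cross_entropy([2.0, 1.0, 0.1], target=0, propensity=0.1)
loss_common = ips_cross_entropy([2.0, 1.0, 0.1], target=0, propensity=0.9)
```

A watch on a rarely displayed video contributes more to the loss than a watch on a frequently displayed one, which is how the re-weighting compensates for display bias in the logged data.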
At serving time, in order to increase the diversity of the recommended videos (and collect feedback from our users on more videos), we use a stochastic version of the softmax. Instead of taking a greedy approach (selecting the top_k videos with the highest scores), we keep the highest-probability videos whose cumulative probability mass exceeds a threshold (0.9 for example). We then form a new distribution over these videos by rescaling their probabilities and sample from it. This is called nucleus sampling [6], and it lets us control the amount of exploration in our algorithm: when the model is very confident, the nucleus contains very few videos, whereas it has a long tail when the model is unsure of its outputs.
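The procedure above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the serving code: the `nucleus_sample` name is hypothetical, the threshold is treated as inclusive, and videos are drawn without replacement so that a recommendation list has no duplicates.

```python
import random


def nucleus_sample(probs, k, threshold=0.9, rng=random):
    """Sample up to k distinct video indices from the nucleus of `probs`."""
    # Rank videos from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Keep the smallest prefix whose cumulative mass reaches the threshold.
    nucleus, mass = [], 0.0
    for i in ranked:
        nucleus.append(i)
        mass += probs[i]
        if mass >= threshold:
            break

    # random.choices normalizes the weights, which rescales the nucleus
    # probabilities into a proper distribution before each draw.
    weights = [probs[i] for i in nucleus]
    picked = []
    while nucleus and len(picked) < k:
        j = rng.choices(range(len(nucleus)), weights=weights)[0]
        picked.append(nucleus.pop(j))
        weights.pop(j)
    return picked


confident = nucleus_sample([0.97, 0.01, 0.01, 0.01], k=3)
uncertain = nucleus_sample([0.25, 0.25, 0.25, 0.25], k=4)
```

With the confident distribution the nucleus contains a single video, so only one index is returned even though three were requested; with the flat distribution all four videos are in the nucleus and their order is random.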
Meeting the product principles
Quite often, machine learning practitioners think of recommender systems (or any machine learning algorithm integrated into a product environment) only in terms of statistical performance, using traditional metrics (recall@k, map@k, NDCG@k, …). Meeting the product requirements often comes second and is only achieved by applying hard rules to the output of the machine learning based recommender system. This is the worst situation and exactly what should be avoided, as it creates a mismatch between what the model learns and what it outputs after the hard rules are applied.
Here are two examples of the product rules we have integrated into our algorithm:
Rule 1: Promote partners which upload premium content
At Dailymotion, most of our partners deliver high-quality content which qualifies them as Premium. As our strategy is now focused on premium content and because we noticed that great videos receive more attention, it is better to favor them in the recommendation algorithm.
As mentioned before, our algorithm is trained to predict the next video the users will want to watch given everything we know they have watched before.
Using a bandit/RL formalism, we can introduce a reward for each observation and customize it according to the type of view the user has made. The loss introduced above is then additionally weighted by this reward (R).
Setting the reward (R) to a higher value for views made on high-quality videos will result in favoring them at training time while deteriorating the relevance (recall@N) of the recommendation. This is what is shown in the following graph representing an offline experiment:
The goal here was to find a tradeoff between relevancy (recall@N) and the percentage of premium videos surfaced in the recommendation lists.
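The reward mechanism of Rule 1 can be illustrated with a small sketch. It assumes the reward simply multiplies the inverse-propensity-weighted negative log-likelihood; that form, the function name and the reward values (1.0 for a regular view, 2.0 for a premium view) are hypothetical choices for this example.

```python
import math


def reward_weighted_loss(target_prob, propensity, is_premium,
                         premium_reward=2.0, base_reward=1.0):
    """Negative log-likelihood weighted by reward / propensity.

    target_prob: model probability of the video that was actually watched.
    premium_reward, base_reward: hypothetical reward values (R).
    """
    reward = premium_reward if is_premium else base_reward
    return -(reward / propensity) * math.log(target_prob)


# Identical prediction and propensity; only the type of view differs.
loss_premium = reward_weighted_loss(0.2, propensity=0.5, is_premium=True)
loss_regular = reward_weighted_loss(0.2, propensity=0.5, is_premium=False)
```

Because premium views weigh more in the loss, the model is pushed harder to rank premium videos correctly. Raising `premium_reward` moves along the tradeoff described above: more premium videos surfaced, at some cost in recall@N.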
Rule 2: Consistency between the input video and the recommended ones
Consistency is one of the criteria that we would also like to achieve. For instance, if a user starts watching a video in English we would like to surface more videos in the same language. If a user starts to engage with a video from a given category (football for example), the following videos should follow the lead.
A purely collaborative approach can provide consistency, as people generally stick to the same language or category. However, it gets trickier for videos that have few or no collaborative signals: popular items may then start to surface in the recommendation list. By tying the video embedding to the output softmax matrix (as in [7]), we can jointly regularize them and therefore ensure better consistency. We also use a multi-task learning approach for regularization, as shown in [8]. By predicting the category of the video the user has watched, we constrain videos from the same category to have close representations in the softmax embedding matrix and also in the video embedding (as the two are tied).
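The toy sketch below shows what tying and the auxiliary category task mean in practice. All values, names and the `alpha` weighting are made up for illustration; a real model would learn the embeddings and use a deep learning framework rather than plain lists.

```python
def dot(u, v):
    """Dot product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


# Toy shared table (values are made up): row i is both the input embedding
# of video i and its weight vector in the output softmax layer.
video_embeddings = [
    [1.0, 0.0],  # video 0, category "football"
    [0.9, 0.1],  # video 1, category "football"
    [0.0, 1.0],  # video 2, category "music"
]


def next_video_logits(hidden_state):
    """Main task: score every video against the sequence representation."""
    return [dot(row, hidden_state) for row in video_embeddings]


def category_logits(video_id, category_weights):
    """Auxiliary task: predict the category from the same shared embedding."""
    return [dot(w, video_embeddings[video_id]) for w in category_weights]


def multitask_loss(next_video_loss, category_loss, alpha=0.1):
    """Combine both losses; alpha is a hypothetical auxiliary weight."""
    return next_video_loss + alpha * category_loss


# Use video 0's embedding as a stand-in for the sequence representation.
logits = next_video_logits(video_embeddings[0])
```

Because the same rows are used on the input and output sides, a gradient from the category task that pulls two football videos together also brings their softmax scores closer, so video 1 outranks video 2 here without any collaborative signal.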
The video recommendation approach described above has enabled us to improve both our statistical and product KPIs twofold. However, this is just the beginning of our journey to fully satisfy our users. We still have a lot of issues to tackle, including:
- Not always recommending videos that give the highest immediate reward as they often do not provide long term user satisfaction (clickbait videos for example). Working on notions such as “incrementality” or modeling long term rewards will be key to achieve this.
- Modeling the sequence of user actions, and especially long-term dependencies. Using more efficient models such as Transformers, which are now state of the art in language modeling [9], can be beneficial.
- Explaining the recommendations produced by the algorithm (why these specific videos have been chosen for a given user). Combining powerful models such as the ones introduced above with the ability to accurately explain their predictions is still a research challenge. Methods such as integrated gradients [10] are a potential answer.
[1] Session-based Recommendations with Recurrent Neural Networks
[2] Contextual Sequence Modeling for Recommendation with Recurrent Neural Networks
[3] Latent Cross: Making Use of Context in Recurrent Recommender Systems
[4] Deep Learning From Logged Bandit Feedback
[5] Top-K Off-Policy Correction for a REINFORCE Recommender System
[6] The Curious Case of Neural Text Degeneration
[7] Using the Output Embedding to Improve Language Models
[8] Meta-Prod2Vec: Product Embeddings Using Side-Information for Recommendation
[9] Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
[10] Axiomatic Attribution for Deep Networks