Takeaways from Netflix’s Personalization Workshop 2018

For the third time, Netflix organized its Personalization, Recommendation and Search Workshop. It was awesome to be invited to this event during my tech holiday in the San Francisco Bay Area. The experienced data scientists from all over Silicon Valley and beyond made it a knowledge-rich day. In detailed presentations, speakers from Google, Microsoft, Netflix, Spotify and the University of Minnesota, among others, shared how to understand and serve your users better. There was one subject that all speakers agreed on: classic matrix factorization (collaborative filtering) has reached its expiration date. This blog captures my takeaways on their different approaches to finding its successor. These include the challenges of multi-armed bandits, an implicit feedback approach, top-N ranking techniques, the tyranny of the majority and algorithmic bias.

Personalization at Netflix

At Netflix almost your whole homepage is personalized: the banner, carousels, order, artwork, text and search. That is why they state that a good recommender system considers: what, how, when and where a title is recommended. Their goal for this personalization is: “Help members to find content to watch and enjoy to maximize satisfaction and retention.” Expressed as:

Personalization = Maximize enjoyment + Minimize search time
Netflix’s personalized homepage (Kawale & Amat, 2018)

Multi-armed bandits vs. classic matrix factorization

Within this domain Jaya Kawale and Fernando Amat (Netflix) shared two case studies: artwork optimization and billboard selection. The artwork optimization was described extensively earlier on their tech blog. The research on the billboard recommendation was new. Both aim to determine an incremental effect within an unknown reward distribution. This requires a multi-armed bandit solution, as traditional machine learning can’t model this effect. Jaya Kawale argued that there are five aspects that classic matrix factorization is unable to handle: time sensitivity, scarce feedback, a dynamic catalogue, a non-stationary member base and country availability. That is why she emphasized that new methods, like multi-armed bandits, must enable continuous and fast learning. Recap: a one-armed bandit is a slot machine that takes your money. A multi-armed bandit problem is a collection of choices (slot machines) that compete for limited resources (money), where the goal is to maximize the cumulative reward over a series of requests without fully knowing the properties of these choices at the start.
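To make the recap concrete, here is a minimal sketch of a multi-armed bandit simulation. It uses an epsilon-greedy agent, which is my own choice for illustration and not something the speakers presented. The function name, the three play rates, the number of rounds and the value of epsilon are all made-up assumptions.

```python
import random


def run_epsilon_greedy(true_rates, n_rounds=10_000, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy agent on arms with 0/1 rewards.

    true_rates are hidden from the agent; it only observes the reward
    of the arm it pulled in each round.
    """
    rng = random.Random(seed)
    n_arms = len(true_rates)
    pulls = [0] * n_arms    # how often each arm was chosen
    means = [0.0] * n_arms  # running average reward per arm
    total_reward = 0
    for _ in range(n_rounds):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: pick a random arm
        else:
            arm = max(range(n_arms), key=lambda a: means[a])  # exploit
        reward = 1 if rng.random() < true_rates[arm] else 0
        pulls[arm] += 1
        means[arm] += (reward - means[arm]) / pulls[arm]  # incremental mean
        total_reward += reward
    return pulls, means, total_reward


# Hypothetical play rates for three images of the same title.
pulls, means, total_reward = run_epsilon_greedy([0.02, 0.05, 0.10])
```

The agent never sees the true rates. It has to spend some requests on exploring before it can exploit the arm that looks best, which is the tradeoff the speakers kept coming back to.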

Greedy exploit & Incrementality based policies

Netflix started optimizing their artwork by creating multiple images for each title, with the goal: “Recommend a personalized artwork or imagery for a title to help members decide if they will enjoy the title or not.” A good image has 4 characteristics: representative (no clickbait), engaging, informative and differential. Multi-armed bandits only pay off if a title has multiple images. To enable Netflix’s designers to create these volumes of images, a different team came up with an algorithm that provides suggestions. The ambition is to fully automate this process. The objective of the artwork optimization is to determine the incremental effect of the image in the recommendation carousel. Experimenting with the artwork is not without challenges, as changing images can be confusing for members. This was taken into account in the exploration-exploitation tradeoff. They chose the Greedy exploit policy with a regular and a contextual bandit. The offline test with Replay confirmed that both bandits performed better than the random baseline. The contextual bandit showed the highest uplift, which indicates that context matters here. Their online test on 125 million users showed that the artwork optimization is most beneficial for lesser-known titles.

Contextual themes in artwork (Kawale & Amat, 2018)
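The offline test with Replay can be sketched in a few lines. This is my own minimal version, assuming the logged images were assigned uniformly at random; it is not Netflix’s implementation. The event format, the image names and the two policies are hypothetical.

```python
def replay_estimate(logged_events, policy):
    """Estimate a policy's average reward from randomly assigned logs.

    logged_events: iterable of (context, shown_arm, reward) tuples that
    were collected under uniform random assignment (an assumption).
    Only events where the policy agrees with the logged arm are kept.
    """
    matched = 0
    reward_sum = 0.0
    for context, shown_arm, reward in logged_events:
        if policy(context) == shown_arm:
            matched += 1
            reward_sum += reward
    value = reward_sum / matched if matched else float("nan")
    return value, matched


# Hypothetical log: which image was shown and whether the title was played.
logs = [
    ({"likes_comedy": True}, "comedy_art", 1),
    ({"likes_comedy": True}, "drama_art", 0),
    ({"likes_comedy": False}, "drama_art", 1),
    ({"likes_comedy": False}, "comedy_art", 0),
    ({"likes_comedy": True}, "comedy_art", 0),
    ({"likes_comedy": False}, "drama_art", 1),
]


def contextual_policy(context):
    return "comedy_art" if context["likes_comedy"] else "drama_art"


def always_drama_policy(context):
    return "drama_art"


contextual_value, contextual_matches = replay_estimate(logs, contextual_policy)
drama_value, drama_matches = replay_estimate(logs, always_drama_policy)
```

On this toy log the contextual policy scores higher than the context-free one. Note that Replay discards every event where the policy disagrees with the log, so it needs a lot of logged data to give a stable estimate.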

The billboard on the Netflix homepage is the most prominent position to promote a title. The goal of this case study is clear and simple: “Successfully introduce content to the right member.” The solution is more complex, for two reasons. First, the Netflix homepage is considered expensive ‘real estate’. The billboard can boost a title, but with many alternatives and unknown benefits the opportunity costs are high. Second, titles can be shown in multiple positions. For example, popular titles are also shown in the ‘Trending now’ carousel. For these reasons, a new Incrementality based policy was tested next to the Greedy exploit policy. The objective of this policy is not simply the probability of play. The model should also consider whether the user would have played the title anyway, which gives the incremental effect (∆Pplay). This probability of play is determined based on the user, the candidate pool of titles/images and its features. The aim is to recommend the title that has the largest additional benefit from the billboard. The offline test with Replay showed that the incrementality based policy had a lower lift than the greedy policy, but the difference is minimal. The online test shows that with this policy Netflix is able to shift user engagement from popular to lesser-known titles by promoting them on the billboard. The scatter plot visualizes that title A benefits more from the billboard than title C. Accordingly, from a probability of play perspective the precious billboard is better utilized with this bandit policy.

Incremental vs. baseline probability of play (Kawale & Amat, 2018)
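The difference between the two policies is easy to show with a toy example. The sketch below is my own simplification: I assume that for every candidate title we already have an estimated probability of play with and without the billboard. The titles echo the A and C of the scatter plot, but all numbers are invented.

```python
# Hypothetical estimates per title:
# (P(play | shown on billboard), P(play | not shown on billboard))
candidates = {
    "A": (0.30, 0.10),  # lesser-known title, large benefit from the billboard
    "B": (0.25, 0.15),
    "C": (0.60, 0.55),  # popular title that would be played anyway
}


def greedy_choice(estimates):
    """Pick the title with the highest probability of play on the billboard."""
    return max(estimates, key=lambda title: estimates[title][0])


def incremental_choice(estimates):
    """Pick the title with the largest incremental probability of play."""
    return max(estimates, key=lambda title: estimates[title][0] - estimates[title][1])


greedy_title = greedy_choice(candidates)
incremental_title = incremental_choice(candidates)
```

With these numbers the greedy policy picks title C, while the incrementality based policy picks title A, because C would have been played anyway. In practice the hard part is estimating the baseline probability, which this sketch simply takes as given.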

One of the future research directions for multi-armed bandits at Netflix is to use user retention (enjoyment) as the objective instead of click-through rate (CTR). In the long term their ambition is to scale this approach by adding more and more choices to the multi-armed bandit problem, with the objective of creating a website or app that is fully personalized, both in content and design. Jaya Kawale and Fernando Amat shared impressive work in their presentation (Slides). RTL Netherlands is also researching multi-armed bandits for the personalization of, among others, our RTL News and Videoland platforms. A generic framework is essential for successfully testing several types of multi-armed bandits and policies. Netflix’s ‘plugin framework’ with its closed-loop system was only briefly discussed. This closed loop ensures that the data on the provided recommendations and the corresponding user behavior is captured to further improve the quality of these online recommendations. In this interesting related presentation Elliot Chow tells more about it.

Implicit feedback perspective by Microsoft

Jaya Kawale mentioned the necessity of unbiased training data for multi-armed bandits. Adith Swaminathan (Microsoft) worked on an interesting enhancement of batch learning from bandit feedback to improve this. Together with Thorsten Joachims and Maarten de Rijke he created a new output layer for deep neural networks that makes it possible to train on logged contextual bandit feedback. Collecting this valuable feedback, for example recommender system logs, is easy compared to collecting supervised data. However, this data often contains a selection bias. To illustrate the risk of this bias, Adith Swaminathan shared the example of survivorship bias with World War II planes. The damage to returning planes was studied by the Center for Naval Analyses to minimize plane losses during subsequent missions. The most obvious approach was adding extra armor to the areas with the most bullet holes. Statistician Abraham Wald (1943) disagreed and argued for the opposite approach. He pointed out that these returning planes had survived their missions, which meant that the bullet holes were in the non-critical areas. Successfully strengthening the planes required a focus on the areas with little damage. This survivorship bias became a well-known example of the risk of ignoring absent information.

Survivorship bias (Wikipedia)

Through their actions on a platform, users also share confounding signals, for example when they are just browsing or when they follow recommendations or advertising incentives. Because this feedback data is biased, both Adith Swaminathan (Microsoft) and Anoop Deoras and Dawen Liang (Netflix) emphasized that models should no longer focus on minimizing the Root Mean Squared Error (RMSE) when generating predictions. It is misleading. Yes, it is easy, but it neglects the value of the open spaces. Matrix factorization throws away the fact that a user didn’t watch an item for a reason. Data is not missing at random. It can happen that a title is not available in a region, but this doesn’t mean it is not popular. Adith Swaminathan argued that the focus should shift to an implicit feedback perspective.

To include this implicit feedback, the new objective should be counterfactual risk minimization. This means that the standard variance-optimal estimator needs to be replaced by an empirical risk estimator (with variance regularization). Adith chose a self-normalized inverse propensity scoring estimator. This estimator is decomposed and reformulated to enable stochastic gradient descent training. This combination, dubbed BanditNet, enables effective training of deep neural networks with unprecedented data volumes. The BanditNet approach was tested using the ResNet20 architecture. In their visual object recognition experiment they show that, with enough feedback, BanditNet has a lower error rate than conventional full-information training. Accordingly, with his research Adith further supports the growth of multi-armed bandit algorithms, enabling the selection of the right action at the right time (Slides, Paper, Resources).

Learning curve of BanditNet (Joachims, Swaminathan & de Rijke, 2018)
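To get a feeling for the self-normalized inverse propensity scoring (SNIPS) estimator, I wrote down a minimal sketch of the estimator itself. It does not reproduce the BanditNet training procedure. The function names and the four logged examples are my own assumptions; each example has a loss, the probability the logging policy gave to the logged action, and the probability the new policy gives to that same action.

```python
def ips_estimate(losses, new_probs, logging_probs):
    """Plain inverse propensity scoring: average of importance-weighted losses."""
    weights = [p / q for p, q in zip(new_probs, logging_probs)]
    return sum(w * loss for w, loss in zip(weights, losses)) / len(losses)


def snips_estimate(losses, new_probs, logging_probs):
    """Self-normalized IPS: divide by the sum of the weights instead of n."""
    weights = [p / q for p, q in zip(new_probs, logging_probs)]
    return sum(w * loss for w, loss in zip(weights, losses)) / sum(weights)


# Hypothetical logged bandit feedback (loss 0 = good, 1 = bad).
losses = [0.0, 1.0, 1.0, 0.0]
logging_probs = [0.5, 0.5, 0.25, 0.25]
new_probs = [0.9, 0.1, 0.1, 0.8]

ips_value = ips_estimate(losses, new_probs, logging_probs)
snips_value = snips_estimate(losses, new_probs, logging_probs)

# Adding a constant to every loss should shift a sensible estimate by
# exactly that constant. SNIPS does this, plain IPS does not.
shifted_losses = [loss + 1.0 for loss in losses]
shifted_ips = ips_estimate(shifted_losses, new_probs, logging_probs)
shifted_snips = snips_estimate(shifted_losses, new_probs, logging_probs)
```

The last lines show one reason to prefer the self-normalized version: shifting all losses by 1 shifts the SNIPS estimate by exactly 1, while the plain IPS estimate moves by the average importance weight instead.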

Two top-N ranking techniques

Two presentations shared solutions to improve the quality of top-N recommender systems. A top-N recommender system generates a ranked list of items that a user is likely to be interested in. The central problem for this type of recommendation is: how to rank the relevant items higher? The challenge is to improve the quality while remaining scalable. Evangelia Christakopoulou (University of Minnesota) chose a linear method with a special focus on similar users’ behavior. Anoop Deoras and Dawen Liang (Netflix) studied the opportunities of deep latent models with variational autoencoders.

Global-local approach by the University of Minnesota

Evangelia Christakopoulou (University of Minnesota) captured the fine-grained differences between user groups with a dual approach. She prefers methods that use user-item implicit feedback data. Next to a global item-item model with aspects shared by all users, she created local item-item models for each user subset. Users are allowed to switch subsets. She experimented with 3 variants:

  • Pure Singular-value decomposition (SVD)
  • Global Local SVD with varying Subsets (sGLSVD)
  • Global Local SVD with varying Ranks (rGLSVD)

These variants were tested on four datasets: transactions of a grocery store, the MovieLens 10M dataset, a subset of the Flixster dataset and a subset of the Netflix Prize dataset. The performance is evaluated on the hit-rate (HR). This is the number of users whose test item is present in the size-N recommendation list, divided by the total number of users. The global-local approach outperformed the classic SVD approach for all datasets (Slides, Resources & Related paper).

Results from global-local model (Christakopoulou & Karypis, 2018)
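The hit-rate described above is simple to compute. Here is a minimal sketch, assuming one held-out test item per user; the function name, the user ids and the items are made up.

```python
def hit_rate(recommendations, held_out, n=10):
    """Fraction of users whose held-out item appears in their top-n list.

    recommendations: dict of user -> ranked list of recommended items
    held_out: dict of user -> the single test item for that user
    """
    hits = sum(
        1
        for user, item in held_out.items()
        if item in recommendations.get(user, [])[:n]
    )
    return hits / len(held_out)


# Hypothetical ranked recommendations and held-out test items.
recommendations = {
    "u1": ["a", "b", "c"],
    "u2": ["d", "e", "f"],
    "u3": ["g", "h", "i"],
    "u4": ["j", "k", "l"],
}
held_out = {"u1": "b", "u2": "x", "u3": "i", "u4": "j"}

hr_at_2 = hit_rate(recommendations, held_out, n=2)
hr_at_3 = hit_rate(recommendations, held_out, n=3)
```

The metric only checks whether the test item made the list, not where it ended up, so a hit at position 1 counts the same as a hit at position N.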

Next to our online activities, RTL Netherlands runs eight television channels. Each of these channels broadcasts content for different, partly overlapping target audiences, while our digital platforms provide a mainly uniform experience to these same audiences. These target audiences can be seen as subsets with shared and specific interests. Applying this global-local method to, for example, our digital platforms can provide a more aligned distribution strategy, before diversifying even further.

Variational autoencoders by Netflix

Anoop Deoras and Dawen Liang of Netflix also believe that there should be more attention to detail. They warned that recommendations aren’t a big data problem, but a small data problem. Users only interact with a small proportion of the titles. Hence, the models should focus on efficiently understanding the sparse signals that a user shares. They presented the evolution of latent models at Netflix by describing how they moved from shallow to deep latent models. Their aim was to take into account both the observed and the missing entries in a user-item matrix. This implicit feedback is too valuable to neglect. However, the limited modelling capacity of shallow models like matrix factorization and Latent Dirichlet Allocation results in inferior prediction power. I was impressed by Dawen Liang’s research into deep neural networks with variational autoencoders (VAEs).

Dawen Liang pointed out that multinomial likelihood with latent-factor models seems to be little studied for collaborative filtering, while the nonlinearity aspect can enable richer recommendations (Paper, Related talk). This approach was applied in the Next Play case study at Netflix. Next Play is the title that is recommended to watch next at the end of a watched title. The goal of this personalization element is: “Maximizing the likelihood of a user playing the next play directly.” He extends VAEs to collaborative filtering for implicit feedback. A regularization parameter was applied to the learning objective to ensure better performance. Their research shows that encoding rich nonlinear user-item interactions can indeed result in superior prediction power. A principled Bayesian approach can perform even better. As a top-N recommender system is a small data problem, it is very suitable for Bayesian inference (Slides). While I first have to study his research further, I believe that this can be a promising method to replace some of our current collaborative filtering algorithms, especially on our rapidly growing video-on-demand platform, where users also interact with only a small number of titles. Evidently, it can be valuable to enhance our recommendations by including their implicit feedback with an improved latent model. This would support our users in discovering the diverse range of domestic and international content available.
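To study the learning objective, I sketched it for a single user. This is a minimal version under my own assumptions: the regularization parameter is treated as a weight beta on the KL term, the prior is a standard normal, and the encoder and decoder outputs are simply given as lists instead of coming from a neural network. All names and numbers are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log of the softmax probabilities."""
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(v - top) for v in logits))
    return [v - log_norm for v in logits]


def vae_objective(clicks, logits, mu, logvar, beta):
    """Multinomial log-likelihood minus a beta-weighted KL term (to maximize).

    clicks: implicit feedback per item (1 = interacted, 0 = missing)
    logits: decoder scores per item
    mu, logvar: encoder outputs describing the user's latent Gaussian
    """
    log_probs = log_softmax(logits)
    # Multinomial log-likelihood, up to a constant that ignores the logits.
    log_likelihood = sum(c * lp for c, lp in zip(clicks, log_probs))
    # KL divergence between N(mu, exp(logvar)) and the standard normal prior.
    kl = 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))
    return log_likelihood - beta * kl, log_likelihood, kl


# Hypothetical user who interacted with 2 out of 4 titles.
clicks = [1, 0, 1, 0]
objective, log_likelihood, kl = vae_objective(
    clicks, logits=[0.0, 0.0, 0.0, 0.0], mu=[1.0, 0.0], logvar=[0.0, 0.0], beta=0.2
)
```

Because the softmax probabilities have to sum to one, the items compete for probability mass, which matches the top-N setting where only the ranking matters. A beta below 1 weakens the pull towards the prior.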

As a follow-up Yves Raimond (organizer & Netflix) referred to his RecSys presentation about “The Importance of Time and Causality in Recommender Systems”. Indeed worth reading on this subject.

Tyranny of the majority by Google

Google’s Ed Chi emphasized the need to go beyond being accurate. He agreed with Evangelia Christakopoulou that globally optimal models cannot serve the diversity of all users. Such models enforce the tyranny of the majority. He warned that the behavior of frequent users is an important threat to the quality of recommender systems. The activity of these users influences some models too much. That is why his goal is: “A model that predicts well for all users and all items.” He shared two approaches:

  • Focused learning for the long-tail. This method is a combination of hyperparameter optimization and a customized matrix factorization objective (Paper).
  • Adversarial training for fairer models. He describes the impact that removing data has on the resulting model and he concludes: “a small amount of data is needed to train these adversarial models, and the data distribution empirically drives the adversary’s notion of fairness” (Paper).

Spotify’s 3 Pragmatic lessons to prevent algorithmic bias

Henriette Cramer of Spotify shared 3 pragmatic lessons learnt while teaching machines:

  1. Human decisions do affect machine learning outcomes, aka algorithmic bias.
  2. Translate complex research areas into minimal viable steps.
  3. Complex models are not always the solution.

During a conversation Henriette Cramer, also Dutch, and I observed that there has been limited attention for algorithmic bias in the Netherlands. Not an entirely odd situation, as algorithmic bias is a subject that is still gaining attention within the industry. Ed Chi agreed with Henriette Cramer on the importance of addressing algorithmic bias and related risks. He pointed out that Sundar Pichai had just published Google’s AI principles.

The three lessons were illustrated with an example of Spotify’s voice interface. At Spotify they discovered that their voice interface had difficulties in identifying and correcting inaccessible content. The interface was unable to comprehend all requests. Especially abbreviations, non-English words and code-switching led to wrong song suggestions. For example, the very popular track Prblms by 6LACK means double trouble, as “Prblms” is pronounced as “problems” and “6LACK” as “black”. Research shows that hip-hop and country have more specific linguistic practices than other genres, resulting in more anomalous tracks. These anomalous tracks were inaccessible content for the voice interface. A clear example of algorithmic bias.

Genre representation in full and anomalous track sets (Springer & Cramer, 2018)

As there was no standard tool available to prevent this algorithmic bias, Spotify designed a new solution consisting of 3 steps. First, a generalizable method to identify content that is underserved by the voice interface. Second, a typology of the linguistic practices of underserved content. Third, annotating underserved content with CrowdFlower to improve the accessibility. A test showed that this approach, with aliases for the artists and titles, significantly improves the accessibility of this underserved content (Paper & Related talk).

Henriette Cramer stated that music is emotion. Very recognizable, because at RTL Netherlands I observe the same for television. She visualised how music tastes differ through the seasons with this graph, which also shows that users’ interests are not static, but dynamic. A similar observation was made by Ed Chi when he mentioned that recommender systems are not a static, but a dynamic accuracy problem.

A year in music consumption (Park, Thom, Cramer, Mennicken & Macy)

Final thoughts

Finally, a big shout-out to Netflix for organizing this terrific workshop at their headquarters. Bringing together this bright community to share the stage and conversations was simply addictive. If these notes made you curious about the details, I encourage you to read their comprehensive papers, check the available slides or explore the other resources that I link to. The number and the extent of these commercial research projects prove that personalization is not a commodity. It is a core competency that enables companies to serve each user a tailor-made experience. Users desire personalized services that maximize their enjoyment and minimize annoying search time. The workshop also reaffirmed that my colleagues and I are on the right data path. We use the same technology and are researching similar algorithms. However, we aren’t as experienced with testing these successors of matrix factorization in production. Thanks to all the presentations and discussions I returned to the Netherlands with new ideas to pursue. And after receiving many questions about where RTL’s office is in the Valley, I have to discuss this with my CEO for sure.