SlateQ: A scalable algorithm for slate recommendation problems

Reinforcement Learning in Recommender Systems

Shishir Kumar
Analytics Vidhya
7 min read · Jul 31, 2020


I was recently introduced to the wonderful world of Reinforcement Learning (RL) and wanted to explore its applications in recommender systems. Craig Boutilier’s talk on the challenges of RL in recommender systems piqued my interest. In this article, I explore Google’s SlateQ algorithm and use it to train an agent in a simulated environment built with Google’s RecSim library [Github].

Key challenges in slate recommender problems

Slate recommender problems are common in online services like YouTube, news, and e-commerce, where a user chooses from a ranked list of recommended items. These recommended items are selected from an often large collection and ranked based on user behaviour and interests. The objective is to maximise user engagement by offering personalised content that best aligns with a user’s interests.

A recommender agent, in this case, would select an ordered slate of size k from a corpus of items I such that user engagement is maximised. This problem poses three key challenges:

  1. For |I| = 20 and k = 10, an agent has C(20, 10) × 10! ≈ 670 billion possible slates to choose from (i.e. the agent’s action space); see the quick computation after this list. Since the action space grows factorially in k, this is referred to as the problem of combinatorial action space.
  2. An agent that maximises user engagement in the long run (e.g. total session time, long-term value) is desirable. Myopic agents that optimise only for the short term can end up hurting a user’s long-term engagement.
  3. We require an efficient and scalable algorithm that enables real-time recommendation for a large number of users.
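
Here is that quick computation of the action space size from point 1:

    from math import comb, factorial

    num_items, slate_size = 20, 10
    # Number of ordered slates: choose 10 of the 20 items, then order them
    num_slates = comb(num_items, slate_size) * factorial(slate_size)
    print(f"{num_slates:,}")  # 670,442,572,800, i.e. roughly 670 billion slates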

Algorithms like collaborative filtering and matrix factorisation are widely used in practical recommender systems. However, they fail to address one or more of the above challenges. SlateQ uses reinforcement learning to overcome all three.

MDPs to the rescue

MDP formulation in SlateQ

SlateQ models the slate recommendation problem as a Markov Decision Process (MDP) where:

  1. the state S represents the state of the user. It captures information like the user’s history, demographics, and context (e.g. time of day).
  2. the action space A is simply the set of all possible recommendation slates.
  3. the transition probability P(s’|s, A) reflects the probability that the user transitions to state s’ when slate A is presented to the user in state s.
  4. the reward R(s, A) measures the expected degree of user engagement with items on the slate A.

Assumptions that simplify the problem

  1. For a given slate of items, SlateQ assumes that a user selects at most one item i at a time. This is a reasonable assumption in settings like YouTube. Note that this assumption also allows the user to choose no item at all.
  2. Returning to the slate for a second item is modelled and logged as a separate event. The user’s state is updated with each consumed item, which in turn is used to recommend a new slate to the returning user.
  3. State Transitions and Rewards depend only on the selected item, i.e. they are independent of other items in the slate recommended to the user. This is a reasonable assumption since non-consumed items have significantly less impact than consumed items on the user’s behaviour.
  4. The user choice model is known, i.e. P(i|s, A), the probability that a user in state s selects item i from slate A, is known. Conditional choice models like the multinomial logit can be used to model user choice.

These assumptions enable us to break down the slate-level rewards and transition probabilities in terms of their component items.

#1: Single Choice Assumption
#3: Reward/transition dependence on selection (RTDS) assumption
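
Spelling these out (my reconstruction of the decomposition equations from the SlateQ paper, corresponding to the two captions above): under the single choice assumption the slate-level reward and transition probability become expectations over the user’s choice, and under RTDS each term depends only on the chosen item i (a fictitious “null” item can represent the case where nothing is selected):

    R(s, A) = \sum_{i \in A} P(i \mid s, A) \, R(s, i)

    P(s' \mid s, A) = \sum_{i \in A} P(i \mid s, A) \, P(s' \mid s, i)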

Q-learning breakdown

The Q in SlateQ refers to Q-learning, an RL algorithm that finds the optimal slate by assigning values (called Q-values) to state and action pairs. In SlateQ, the item-wise Q-value represents the Long Term Value (LTV) of a user consuming an item i from slate A.

SlateQ breaks down the LTV of a slate as the expected sum of the LTVs of its component items, Q_bar(s, i), weighted by the choice probabilities P(i|s, A) from the user choice model discussed earlier.

Item-wise breakdown of Q-values for a given slate A
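
In equation form, this decomposition (my rendering of the equation from the SlateQ paper) reads:

    Q(s, A) = \sum_{i \in A} P(i \mid s, A) \, \bar{Q}(s, i)

where P(i | s, A) comes from the user choice model and \bar{Q}(s, i) is the item-wise LTV.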

Temporal Difference (TD) learning is then used to learn the item-wise LTVs Q_bar(s, i). Given a consumed item i at state s with observed reward r, the agent transitions to a new state s’ and recommends a new slate A’ such that the user’s LTV is maximised. The equation below lays down the update to the Q-values. The discount factor gamma controls how much weight is placed on future rewards, while the learning rate alpha controls how quickly the estimates are updated.

TD update for item-wise Q-values
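
Rendered as an equation (my reconstruction of the item-wise TD update from the SlateQ paper, where A' is the slate recommended at the next state s'):

    \bar{Q}(s, i) \leftarrow \bar{Q}(s, i) + \alpha \Big[ r + \gamma \sum_{j \in A'} P(j \mid s', A') \, \bar{Q}(s', j) - \bar{Q}(s, i) \Big]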

SlateQ uses a Deep Q Network (DQN) with experience replay to learn these Q-values.

Slate Optimisation

After learning the item-wise LTVs, the algorithm selects a slate A of k items from corpus I such that the total expected LTV of the user is maximised. More formally, this can be written as:

LTV Optimisation Problem
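
Written out (my reconstruction following the SlateQ paper, assuming a conditional choice model in which each item has a score v(s, i) and v(s, ⊥) is the score of selecting nothing):

    \max_{A \subseteq I,\; |A| = k} \; \sum_{i \in A} P(i \mid s, A) \, \bar{Q}(s, i)
    \quad \text{where} \quad
    P(i \mid s, A) = \frac{v(s, i)}{v(s, \perp) + \sum_{j \in A} v(s, j)}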

SlateQ uses mixed-integer programming methods to find the exact solution of the above optimisation problem. Exact optimisation takes polynomial time in the number of items in corpus I and can be used for offline training. When serving real-time recommendations, less computationally expensive approximation methods are used to find a near-optimal slate.
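
To give a feel for the cheaper serving-time alternative, here is a minimal sketch of the top-k heuristic described in the SlateQ paper: score every candidate by v(s, i) * Q_bar(s, i) and keep the k highest-scoring items. The function and variable names (topk_slate, item_scores, item_ltvs) are my own for illustration, not part of RecSim:

    import numpy as np

    def topk_slate(item_scores, item_ltvs, k):
        """Pick the k items with the highest v(s, i) * Q_bar(s, i)."""
        combined = np.asarray(item_scores) * np.asarray(item_ltvs)
        # argsort is ascending; take the last k indices, reverse for descending order
        return list(np.argsort(combined)[-k:][::-1])

    # Example: 10 candidate items, slate of size 2
    scores = np.random.rand(10)  # choice-model scores v(s, i)
    ltvs = np.random.rand(10)    # learned item-wise LTVs Q_bar(s, i)
    print(topk_slate(scores, ltvs, k=2))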

Now that we have a good understanding of the SlateQ algorithm, let’s see how well it works by running a small experiment. The RecSim library provides an implementation of the algorithm along with various simulated environments to train agents and compare algorithmic performance.

RecSim’s Interest Evolution Environment

The Interest Evolution environment is a collection of models that simulate sequential user interaction in a slate recommender setting. It has three key components:

  1. The document model samples items from a prior over document features, which incorporates latent features such as document quality, and observable features such as document topic.
  2. The user model samples users from a prior distribution over user features. These features include latent features such as interests and behavioural features such as time budget.
  3. For a recommended slate, the user choice model records the user response by selecting at most one item. The user’s choice depends on observable document features (e.g., topic, perceived appeal) and all user features (e.g., interests).

Each user starts with a fixed time budget, and each consumed item reduces this budget by a fixed amount. In addition, the time budget is partially replenished depending on how appealing the consumed item is to the user. This configuration places importance on the long-term behaviour of the user and hence enables us to train non-myopic agents.

Single User Control flow in RecSim

Check out this notebook for more details on environment and generated data.
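
For a quick feel of the interface, here is a minimal sketch of how the environment can be instantiated, loosely following the RecSim tutorial; the configuration keys and the Gym-style reset/step interface reflect the version I used and may differ across RecSim releases:

    from recsim.environments import interest_evolution

    # 10 candidate documents per step, slates of size 2
    env_config = {
        'num_candidates': 10,
        'slate_size': 2,
        'resample_documents': True,  # sample a fresh candidate set each step
        'seed': 0,
    }

    env = interest_evolution.create_environment(env_config)

    observation = env.reset()
    # Recommend the slate consisting of candidate documents 0 and 1
    observation, reward, done, _ = env.step([0, 1])
    print(reward, done)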

Experiment Setup and Results

In our experiment, the agent chooses 2 documents to recommend from a corpus of 10 candidate documents. The candidate documents are sampled by the document model from 20 different topics. The slate size is intentionally kept small for faster training and evaluation.

RecSim comes with a host of algorithms to train an agent. The slate_decomp_q_agent implements the SlateQ algorithm. As the baseline, I use the full_slate_q_agent, a DQN over whole slates without SlateQ’s decomposition. Check out this notebook for more details.
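
Below is a minimal sketch of the training setup, adapted from the RecSim tutorial. It trains the baseline full_slate_q_agent for brevity; swapping in slate_decomp_q_agent’s agent-creation function gives the SlateQ agent. Exact signatures may differ slightly between RecSim versions:

    from recsim.agents import full_slate_q_agent
    from recsim.environments import interest_evolution
    from recsim.simulator import runner_lib

    def create_agent(sess, environment, eval_mode, summary_writer=None):
        # Baseline: a DQN over whole slates, without SlateQ's decomposition
        return full_slate_q_agent.FullSlateQAgent(
            sess,
            observation_space=environment.observation_space,
            action_space=environment.action_space,
            eval_mode=eval_mode,
            summary_writer=summary_writer)

    env_config = {'num_candidates': 10, 'slate_size': 2,
                  'resample_documents': True, 'seed': 0}

    runner = runner_lib.TrainRunner(
        base_dir='/tmp/recsim_slateq',
        create_agent_fn=create_agent,
        env=interest_evolution.create_environment(env_config),
        episode_log_file='',
        max_training_steps=50,
        num_iterations=10)
    runner.run_experiment()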

Click-Through Rates (CTR) at different time steps

We see that the Click-Through Rate (CTR) is 0.5 for the SlateQ agent vs 0.48 for the Full Slate Q-learning agent. A similar trend is seen when comparing the average rewards (i.e. clicks) per episode. This shows that the SlateQ agent outperforms the baseline agent in our setup.

Average Rewards per Episode at different time steps

Note: Please use the latest version of the RecSim library. I encountered this issue, which has recently been fixed by the maintainers. You can also use this requirements file to create an environment with the required dependencies.

Takeaways

SlateQ breaks down the LTV of a slate into the LTVs of its component items and maximises long-term user engagement. Such a decomposition uses items rather than slates as action inputs, making it more generalisable and data-efficient. SlateQ also enables real-time recommendations by using approximation methods to solve the slate optimisation problem.

Moreover, SlateQ can be implemented on top of an existing myopic recommender system already in place. YouTube ran a live experiment comparing the performance of SlateQ against its highly optimised production recommender system (i.e. the control). The chart below shows that SlateQ consistently and significantly outperformed the existing system by improving user engagement.

Increase in user engagement for SlateQ over baseline algorithm

Next Steps

  1. Explore slate optimisation methods in more detail.
  2. Explore the implementation details of the DQN in the SlateQ algorithm.

Code and other relevant materials can be found in this Github repo. A recording of our class presentation can be found here.

Special shoutout to my USF MSDS classmate Collin Prather for collaborating with me on this project. I also thank Prof. Brian Spiering for his awesome introductory course on RL at USF.
