Building a Multi-Armed Bandit System from the Ground Up: A Recommendations and Ranking Case Study, Part I

Austin Wang
Published in Udemy Tech Blog
13 min read · Nov 16, 2022

Introduction

In recent years, multi-armed bandits (MAB) have been enjoying a surge in popularity as explore-exploit balancing approaches continue to show success in a wide variety of applications. One particularly successful application has been the use of multi-armed bandits for recommendations. Introducing exploration into the recommendation approach helps address common recommendations problems, such as feedback loop bias and the cold-start issue. While the success of bandit approaches in industry is clearly very attractive, building a fully working, scalable MAB system from scratch presents a number of significant engineering, operational, and data science challenges.

In this two-part blog post, we will share our experience creating a successful, scalable, near real-time MAB system for ranking recommendations at our company, Udemy.

In this part (Part I), we will discuss what bandits are, why to use them for recommendations, and how to translate the ranking problem into a bandit problem. We’ll also walk through the data science challenges that may arise during a practical application of bandits.

Then in Part II, we will provide an overview of our production architecture and development environment, and will discuss the engineering challenges teams may face when deploying a similar system.

What are Multi-Armed Bandits and Why Use Them?

Problem Setting

Multi-armed bandits are a class of explore-exploit algorithms that are used to make decisions under uncertainty. To help conceptually understand what we mean by an “explore-exploit algorithm,” it is best to consider the following problem setting, from which multi-armed bandits received their name.

Image Sources: https://hackintoshrao.com/2017/12/12/the-exploration-exploitation-dilemma-in-multiarm-bandit-problem/; https://i.stack.imgur.com/04Ggq.jpg

The problem setting begins in a casino. You walk in and are presented with a number of slot machines to choose from. Each of these slot machines has an arm (lever) that you can pull, and when you do so, you receive some random reward (payout) sampled from that slot machine’s predetermined reward probability distribution. Note that each slot machine has a different reward probability distribution, and these distributions are unknown to you when you first begin playing.

Your goal is to maximize your winnings from playing these slot machines. Being a smart gambler, you came to the casino with a fixed amount of money you were willing to spend, so you only have a limited number of arm pulls to use. If you knew the true reward probability distribution for each slot machine, then it would be very clear how to maximize your winnings in the long run: simply keep playing the arm of the slot machine whose reward distribution has the highest expected value.

However, you do not know anything about these reward distributions at first, so you must use some of your limited arm pulls to explore and learn more about each arm’s expected reward, while also balancing exploitation of the arms that have looked best so far. With only exploration, you would gain high confidence in each slot machine’s reward distribution, but would run out of arm pulls before you could profit from knowing the best arm to pull. With only exploitation, you might continuously play a slot machine that is suboptimal and fail to reap the benefits of the better machines.

Clearly, we face the classic explore-exploit dilemma and need to achieve some type of balance between the two. Multi-armed bandit algorithms are designed for this purpose: to optimally balance exploration and exploitation and help maximize your cumulative reward.

Definition

The standard multi-armed bandit problem setup can be defined more formally as follows:

  • Bandits are a class of explore-exploit algorithms used to make decisions under uncertainty. They can be thought of as a simplified reinforcement learning algorithm.
  • Given a set of k arms (options) A = {a₁, a₂, …, aₖ}, the goal is to determine the best arm(s) to play to maximize the cumulative reward gained from playing these arms.
  • When an arm aᵢ is played at time t, we observe a reward rₜ sampled from that arm’s reward probability distribution — no other rewards are observed for unplayed arms. We record this as an observation, the arm-reward pair (aᵢ, rₜ).
  • We devise a strategy to optimally decide which arm to select next in order to learn more about the true expected rewards for each arm, without showing a suboptimal arm too often. A strategy takes into account the history of arm-reward observations up until time t-1 and uses that history to decide what to pick at time t.
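
To make this setup concrete, here is a minimal sketch in Python of the interaction loop described above, using made-up arm names and reward probabilities (the pull and run helpers are purely illustrative): a strategy sees only the history of past observations, and only the arm it plays reveals a reward.

```python
import random

# Hypothetical arms and Bernoulli reward probabilities (unknown to the strategy).
TRUE_REWARD_PROBS = {"arm_a": 0.05, "arm_b": 0.12, "arm_c": 0.09}

def pull(arm: str) -> int:
    """Sample a reward from the chosen arm's hidden reward distribution."""
    return 1 if random.random() < TRUE_REWARD_PROBS[arm] else 0

def run(strategy, horizon: int = 1000) -> int:
    """Explore-exploit loop: at each step the strategy sees only the history of
    (arm, reward) observations, and only the played arm's reward is revealed."""
    history, total_reward = [], 0
    for _ in range(horizon):
        arm = strategy(history, list(TRUE_REWARD_PROBS))
        reward = pull(arm)
        history.append((arm, reward))
        total_reward += reward
    return total_reward
```

A pure-exploration baseline is then run(lambda history, arms: random.choice(arms)); the strategies discussed next simply make smarter choices from the same inputs.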

There are many different types of multi-armed bandit strategies, each with their own pros and cons in terms of implementation and regret bounds. For the sake of brevity, in this post we will not go into the mathematical details of the strategies, but simply share some popular strategies at a high conceptual level:

  1. Epsilon-Greedy
    - Choose the arm with the best empirical reward with probability 1-ε
    - Choose an arm uniformly at random with probability ε
  2. Decayed Epsilon-Greedy
    - Perform the Epsilon-Greedy strategy, but let ε shrink over time to reduce exploration once we are more confident in our expected reward estimates
  3. Upper Confidence Bound (UCB1)
    - Construct a confidence interval for each arm’s expected reward
    - Choose the arm whose upper confidence bound is the largest
  4. Thompson Sampling
    - Take a Bayesian approach and set a prior for each arm’s reward distribution
    - With each new reward observation, update the posterior distributions
    - Sample from each arm’s posterior, and choose the arm with the largest sample

As a side note, the applications that we have built at Udemy using our MAB system have primarily utilized the Thompson Sampling strategy.
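
As a rough sketch of the idea (and not our production implementation), a Thompson Sampling strategy for binary rewards with Beta(1, 1) priors can be written in a few lines; it plugs directly into the hypothetical run loop shown earlier.

```python
import random
from collections import defaultdict

def thompson_sampling(history, arms):
    """Beta-Bernoulli Thompson Sampling: sample from each arm's posterior, play the max."""
    successes, failures = defaultdict(int), defaultdict(int)
    for arm, reward in history:
        if reward:
            successes[arm] += 1
        else:
            failures[arm] += 1
    # Beta(1 + successes, 1 + failures) posterior per arm; choose the largest sample.
    samples = {a: random.betavariate(1 + successes[a], 1 + failures[a]) for a in arms}
    return max(samples, key=samples.get)
```

Because each arm’s posterior tightens as observations accumulate, the strategy naturally shifts from exploration toward exploitation over time.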

Using Multi-Armed Bandits for Recommendations

The recommendations domain contains perfect applications for multi-armed bandits. In a general recommendations problem, our goal is to recommend the best items to our users, where “best” is defined according to some type of user feedback. Then the recommendations problem can be framed as a bandit problem as follows:

  • The set of candidate items to recommend is the set of arms to play (each item is an arm)
  • Displaying a particular recommended item to a user for a given amount of time is equivalent to “playing that arm”
  • The user’s feedback when shown the recommended item, such as clicks, thumbs up or enrollments (in the case of recommended courses at Udemy), is the observed reward (i.e., the reward that is sampled from the item’s unknown reward distribution)
  • The goal is to determine the items with the highest reward payouts as quickly as possible by dynamically recommending various items to the user in a way that balances exploration with exploitation

Multi-armed bandits have many attractive qualities that make them excellent for recommendations applications. Three primary reasons stand out:

Recommendations have opportunity costs.
Every time a recommendation is made to a user, there is an associated opportunity cost; there may have been something better to show. Bandits shine when there is a cost associated with playing a suboptimal arm. They are designed to minimize opportunity costs.

Bandits help overcome the recommendations feedback loop bias problem.

The recommendations feedback loop bias problem arises because items that are already being recommended have more opportunities for exposure, and therefore for positive feedback from users. Many traditional supervised machine learning algorithms that use this feedback during training are biased toward recommending these items further, continuing the cycle. Exploration in bandits inherently breaks this cycle: less frequently shown items still need to be tested in the eyes of the bandit, so we are less prone to getting stuck showing the same items over and over unless they truly are optimal.

Bandits can naturally address the item cold start problem.
Highly related to the feedback loop bias problem is the cold start problem: new items with very little feedback are less likely to be recommended, since they have few or no positive signals associated with them. Bandits give every candidate arm a fair shot by naturally exploring a newer item until they are confident enough in its expected reward, at which point they will either exploit the item if it has proven successful or stop showing it if it has not.

How to Frame a Ranking Problem as a Bandit Problem

There are several resources available online detailing how to set up a traditional bandit problem in industry, where a single arm is presented to the user at a time and the goal is to find the best arm for a single position (e.g., banner optimization, thumbnail optimization, button color optimization, and so on). However, few of these resources provide details on how to frame a recommendations ranking problem as a bandit problem. The key difference is that in a ranking problem there are multiple positions to consider, and therefore we can collect rewards for multiple arms at a time. In this section, we share a case study from Udemy that shows how to set up the ranking problem as a bandit problem.

One common application of ranking at Udemy is that of recommendation unit ranking. A recommendation unit is a horizontal carousel of courses belonging to some type of group (e.g., “Because you Viewed”, “Recommended for You”, “Popular for ____”, etc.). The order in which these recommendation units appear on a user’s logged-in homepage can be very important from both a user perspective and a business perspective.

We frame the recommendation unit ranking problem as follows:

  1. Each recommendation unit type is a candidate arm
  2. Playing an arm means displaying a recommendation unit to a user in a given position for a fixed amount of time (15 minutes in our case)
    - Each user will generate an observation (reward) based on the feedback provided on the shown unit. Note that an observation only begins once a unit impression is made
  3. The reward is the metric we wish to use to compare various units’ performance in a given position
    - We use a combination of user clicks and enrollments for courses in the unit, in the fixed amount of time

All of this seems simple enough. However, we have yet to elaborate on how to treat feedback in the multiple positions of the ranking. In the traditional bandit setup, only a single arm is shown. To take advantage of the fact that multiple arms can be pulled at once, we need to modify the traditional bandit setting. For this, we consider two possible frameworks: the per position framework and the slate bandit framework.

Per Position Framework

In the per position framework, we consider using a separate bandit instance for each position of the ranking. The goal of each bandit is to find the optimal arm to play in its assigned position (e.g., Bandit₂ only considers an arm “played” if a unit is shown to a user in position 2, and only considers rewards generated by clicks and enrollments in position 2).

There are both advantages and disadvantages to this approach. On the advantages side, we are able to achieve a much more granular, specific ranking optimization because the bandits are learning the best units to display at an exact location. This is more in line with what we think of when tasked with finding an optimal overall ranking.

However, there are some practical challenges that arise when using the per position framework:

  • There are inherent interdependencies between each of the bandit instances in each position. For example, Bandit₂ might wish to show the “Because you Viewed” unit at a particular time, but if Bandit₁ has already chosen that same unit, then Bandit₂ must choose something else to avoid duplication on the user side. You can imagine that these interdependencies become more and more complicated as we continue down the ranking and extend this to a greater number of positions.
  • You must maintain several different bandit instances at a time. These are streaming applications with dependencies on each other; failure of one bandit instance could cause issues for the other bandit instances.
  • Reward feedback cannot be shared between bandits easily. Because the definition of a reward for the “Because you Viewed” unit in the eyes of Bandit₂ is clicks and enrollments that occur when that unit is shown in position 2, user clicks and enrollments that occur on the “Because you Viewed” unit in position 3 cannot be shared with Bandit₂. This increases the time for the bandit to learn and for the ranking to converge (we will talk more about convergence later!).

Because of the multiple practical challenges that arise when using the per position framework, we instead chose to use the slate bandit framework.

Slate Bandit Framework

In the slate bandit framework, we forgo using a separate bandit instance per position and instead use a single bandit that tests multiple arms at one time. In this setup, we define “playing an arm” to be displaying a unit in any of the top k positions. The selection of the value for k is important; feedback for arms shown in one of the top k positions is all treated equally, so a click or enrollment in the first position is no different than a click or enrollment in the kᵗʰ position in the eyes of the bandit. Generally speaking, the value chosen for k is a design decision, and will vary depending on the use case and page layout. For our particular use case, we chose k = 3 because a user’s logged-in homepage usually shows three full units in the window when the screen is maximized, so an argument can be made that feedback for units in these positions is generally comparable.

Note that under this framework, the bandit is no longer trying to find the exact best units in each position for the ranking, but is now answering the less specific question, “What are the best units to show towards the top of the page?”

Disadvantages of this reduced granularity include the following:

  • The ranking within the top k positions is not optimized — the bandit treats the best arms as an unordered set since feedback among the top k is indistinguishable
  • The ranking outside the top k positions is induced and not directly learned — the unit that ends up in position k+1 is placed there because it is what the bandit believes is the next best unit to show somewhere in the top k positions if it could, not because the bandit believes position k+1 specifically is optimal

However, the slate bandit framework comes with numerous advantages as well:

  • Easier implementation and maintenance — just a single bandit with no complications stemming from interdependencies between multiple bandits
  • More reward feedback per bandit instance — rewards for all of the arms shown in the top k positions are used by the bandit, so we receive more feedback per user for the bandit than in the per position framework. This enables the bandit to learn and converge much more quickly

We implemented the slate bandit framework for our unit ranking application, and observed strong revenue and engagement lifts during our A/B test against our existing non-bandit unit ranking algorithm. This suggests that the simpler slate bandit framework can still be highly effective, while simultaneously reducing complexity.
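
To illustrate the mechanics (with hypothetical function names, and assuming a Thompson Sampling strategy with Beta posteriors), a slate bandit ranking step might look like the following sketch: rank all candidate units by their posterior samples, display the full ranking, and credit feedback only for the units impressed in the top k positions.

```python
import random

K = 3  # slate size: feedback in any of the top-K positions is treated equally

def rank_units(posteriors):
    """posteriors maps each unit to its (successes, failures) counts.
    Sample from each unit's Beta posterior and return the full ranking, best first."""
    samples = {
        unit: random.betavariate(1 + s, 1 + f) for unit, (s, f) in posteriors.items()
    }
    return sorted(samples, key=samples.get, reverse=True)

def record_feedback(posteriors, ranking, rewards):
    """Credit feedback only for units impressed in the top-K slate.
    rewards maps a unit to 1 if it earned a click/enrollment in the window, else 0."""
    for unit in ranking[:K]:
        s, f = posteriors[unit]
        r = rewards.get(unit, 0)
        posteriors[unit] = (s + r, f + (1 - r))
```

Units below position K still appear in the induced ranking, but only impressions in the top K update the posteriors, which matches the slate framing described above.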

Data Science Challenges

Achieving / Measuring Convergence

One common data science challenge when running a multi-armed bandit system is that of achieving convergence. In the stationary bandit setting (where the true reward distributions per arm do not change drastically over time), we expect our bandit system to converge onto the optimal arm(s) (i.e., reduce exploration and mostly perform exploitation). In general, the speed at which a multi-armed bandit algorithm can converge is dependent on several factors. These are summarized in the table below.

Defining and measuring convergence can be done in many different ways. For our slate bandit ranking system, we developed a metric that looks at the consecutive rankings outputted by the bandit and tracks the average proportion of arm set changes in the top k positions for a window of time. Conceptually, this metric can be thought of as a “rate of change” metric.

If our bandit is successfully converging, then the rate of change should gradually decrease towards zero (with some natural fluctuations due to the residual exploration that most bandit strategies retain). Having graphs that track convergence is vital for ensuring the bandit system is working as expected. An example of this can be seen below:

The convergence graph above uses a window size of 60 and a step size of 60, and each update occurs every minute. This means that the graph shows the average rate of arm set changes in consecutive rankings for each hour of the day.
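
As an illustration of how such a rate-of-change metric could be computed (a sketch only, assuming one ranking is logged per minute and using the window and step sizes above):

```python
def top_k_change_rate(rankings, k=3, window=60, step=60):
    """Average proportion of arm-set changes in the top k between consecutive rankings,
    aggregated over windows (hourly when one ranking is logged per minute)."""
    changes = [
        len(set(curr[:k]) - set(prev[:k])) / k  # fraction of new arms entering the top k
        for prev, curr in zip(rankings, rankings[1:])
    ]
    return [
        sum(changes[i : i + window]) / len(changes[i : i + window])
        for i in range(0, len(changes), step)
        if changes[i : i + window]
    ]
```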

We have generally seen that successful A/B test results with our bandit system correlate highly with successful, quick convergence. When choosing your application and deciding whether bandits are a good fit, keep in mind all of the factors that can affect convergence.

Offline Evaluation

Another data science challenge that arises when dealing with bandit systems is performing offline evaluation. Offline evaluation is useful for determining which explore-exploit strategy to test online in an A/B test, and for shortening the time needed to iterate and improve.

Multi-armed bandits are notoriously difficult to evaluate offline because they are inherently online, interactive algorithms. The sequence of decisions selected by the bandit that we wish to evaluate is dependent on the real user reward feedback obtained. Using previously collected historical data is often not sufficient, as the arms shown in the past might not match up with what the bandit strategy selected.

To address this problem, one can use the replay method. In the replay method, we perform a short random data collection phase, where candidate arms are shown to the user at random and the associated rewards are collected. This creates a stream of unbiased historical arm-reward pairs. Then, to evaluate a given bandit strategy, we simply sequentially step through the stream of historical arm-reward pairs and keep the pairs that match up with what the bandit wanted to select at that time. The cumulative reward from these matched pairs gives us a measure of how successful a particular bandit strategy is. Please see the figure below for a visualization of how this is performed.
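
Here is a minimal sketch of the replay method, assuming a logged stream of (arm, reward) pairs collected under uniform-random exposure and a strategy function with the same signature as the earlier sketches:

```python
def replay_evaluate(logged_stream, strategy, arms):
    """Replay evaluation: walk through randomly-logged (arm, reward) pairs and keep only
    the events where the strategy would have chosen the same arm at that point."""
    history, total_reward, matches = [], 0, 0
    for logged_arm, logged_reward in logged_stream:
        if strategy(history, arms) == logged_arm:
            history.append((logged_arm, logged_reward))
            total_reward += logged_reward
            matches += 1
    # Average reward over matched events estimates the strategy's online performance.
    return total_reward / matches if matches else 0.0
```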

Conclusion

In this part, we discussed what bandits are, how to use them in a recommendations ranking application and the types of common data science challenges that arise when successfully using bandits in industry. In the next part, we will dive into our system architecture that makes this all possible in near real-time with millions of users!

Austin Wang
Senior Staff Data Scientist at Udemy; Stanford: MS Computational and Mathematical Engineering; Yale: BS Applied Mathematics