# Multi-Armed Bandits 101

## by Sowmi Chakravarthi

*The *Opex 101* series is an introduction to the tools, trends, and techniques that will help you make the most of your organization’s data. Intended for business leaders and practitioners alike, these posts can help guide your analytics journey by demystifying essential topics in data science and operations research.*

No matter how much math you took in school, you probably heard that famous statistics adage that “correlation does not imply causation.” This phrase is often used to to temper the expectations of enthusiastic data scientists, eager to make far-reaching conclusions after seeing preliminary results of a newly created machine learning model. But controlled experiments are one of the few techniques available to data scientists that can be used to imply causation. Whether evaluating the effectiveness of a machine learning model or simply testing a new website layout, experiments provide the basis for meaningful, statistically supported causal inference, which is critical for data scientists on the hunt for insights.

In this blog, I’ll talk about the nature of experiments and a relatively recently embraced method for conducting them: multi-armed bandits.

**Introduction**

**Experiments**

While many stock examples of experiments have to do with web design, data scientists can take advantage of experiments for other purposes. A prime use case for experiments in data science involves using them to to test machine learning models.

Why, you ask? Well, deploying machine learning models at scale is resource-intensive and time-consuming. For example, if you have a model that’s designed to present an online shopper with recommended products based on their purchase history (supplanting a less data-driven methodology, perhaps), it might be wise to first test its effectiveness on a small portion of the user base via an experiment. This trial run can help assess how well the model actually performs when fed to a small (but representative) subset of the entire population. If things go poorly, you can go back to the drawing board and hold off on the infrastructure and DevOps effort, then make any necessary adjustments before wide release.

**A/B Tests**

The classic experimental approach used in industry is the A/B test. An A/B test is a randomized experiment with two variants (the titular A and B) of something, and the goal is to determine the better option.

A practical example of an A/B test might take two variants of an existing product, like a website or an app, and show each variant to half of the user traffic. After some period of time, statistical methods (e.g., t-tests, chi-square tests) are used to determine if one variant outperformed the other in click-through rate (or a similar success metric) with sufficient significance.

A/B testing is often used in web design, but can be valuable in other fields as well. For example, in retail, they can be used to determine which of two coupon promotions proves more effective in generating customer engagement.

**Multi-Armed Bandits: Fundamentals**

The A/B test isn’t the only way to conduct an experiment. The *multi-armed bandit* is an alternative to the traditional A/B testing approach, which uses adaptive learning to choose the best variant among *many* options (as opposed to merely two).

**A Primer**

The term “multi-armed bandit” is often introduced with a gambling example: a bettor must decide between multiple single-armed slot machines, each with an unknown, predetermined payout ratio. The bettor approaches the problem rigorously, playing every machine, tracking his return on investment for each. After a while, he sees that one machine has paid out a little more than the others, but he’s still not sure if it’s *really* the best of the bunch. How and when can the gambler be confident that he’s identified the most profitable machine?

Before we start getting into the details, it’s important to understand two main concepts, both vital to multi-armed bandits and reinforcement learning in general: *exploration* and *exploitation*.

**Exploration vs. Exploitation**

In simple terms, exploration is when the gambler tests the machines (often just referred to as “arms”) that aren’t currently paying out most to see if they’ll pay out more; by contrast, exploitation is when the gambler pulls the arm that’s most profitable so far in hopes of maximizing his total return. Unlike in an A/B testing scenario (which is a purely exploratory design), the multi-armed bandit has both exploration *and* exploitation phases, which alternate in an adaptive manner based on the arms’ ongoing performance.

The goal of multi-armed bandit algorithms is to determine the best choice among a set of options and then exploit it. To accomplish this, the figurative bettor must spend some of his time exploring the options, but also intermittently exploit the best choice to reduce the amount of revenue lost by playing suboptimal machines. It is this struggle to balance exploration and exploitation that defines multi-armed bandits.

So, how does the bandit decide to explore or exploit? This decision depends on the type of bandit algorithm being utilized.

**Multi-Armed Bandits: Variants**

**Epsilon-greedy**

One of the simplest and most frequently used versions of the multi-armed bandit is the *epsilon-greedy* approach.

Thinking back to the concepts we just discussed, you can think of the words “epsilon” and “greedy” as related to exploration and exploitation, respectively. The epsilon in question is a constant between 0 and 1, set by the experimenter before the test begins, that dictates the approximate fraction of iterations in which the bandit chooses to explore rather than exploit.

For example, say that a data scientist running an epsilon-greedy experiment selects an ε value of 0.2. Each iteration of the epsilon-greedy algorithm begins with a random number generator producing a value between 0 and 1. If it’s less than the stated epsilon value, then the bandit uses that trial to explore; if not, it exploits the current-best variant. This algorithm offers great flexibility, letting the experimenter customize his/her tolerance for exploration with ease.

Expanding on this, you can think of an A/B test as a purely exploratory epsilon-greedy experiment (i.e., with an ε value of 1). Once the best option becomes clear via the A/B test, the test (exploration) stops and then the best variant becomes the status quo (exploitation).

**Epsilon-decreasing**

The *epsilon-decreasing* strategy is a clever extension of the epsilon-greedy strategy. Instead of selecting a fixed value of ε like in epsilon-greedy, the epsilon-decreasing strategy lets us choose an initial ε value that gradually decreases over time (typically with an exponential decay). This ensures that substantial exploration occurs in the experiment’s early stages, but as time progresses and the best choice becomes more certain, fewer iterations are spent on suboptimal choices.

**Bayesian Bandits with Thompson Sampling**

However, not all bandits are epsilon-based. The Bayesian bandit uses a more sophisticated method of dealing with the exploration-exploitation tradeoff. (It’s not necessary, but if you’re interested in a quick refresher, check out our *Bayes’ Theorem 101* post). The algorithm starts by picking a distribution for each arm’s payout rate, known in statistics as a prior. Though experimenters can choose any type of prior distribution, the most commonly used choice is the beta distribution, often used to model random variables. The beta distribution uses two parameters, each of which is incrementally updated based on observed results. As the bandit conducts more and more trials, each arm’s distribution becomes taller and narrower as it converges on its true success rate. The figure below is an animated simulation of how Bayesian updating works, comparing the observed click-through rates of three ads (with actual click through rates of 10%, 50%, and 80%) over time.

Imagine that a website is testing out some new layouts to improve click-through rate. Every time a user accesses the site, a bandit automatically samples one possible click-through rate from each layout’s distribution. The site then displays the variant with the highest sampled rate. Whether or not the user clicks through is recorded, and then that layout’s distribution parameters are updated accordingly.

As more data points are observed, the experimenter becomes increasingly confident about the *sampled* success rate being close to the *true* success rate. The randomness involved in selecting which arm to pull ensures that exploration occurs consistently, and with greater frequency at the beginning of the experiment. Exploitation increases over time, as sampled rates get closer and closer to true rates, improving overall performance.

**Regret and Reward**

A critical step in each of these three bandit algorithms is determining the current best arm in order to exploit it. That’s where *reward* comes into play. Think of reward as the positive outcome associated with pulling an arm; for example, whether or not a user clicked on a web page layout, or interacted with a tile in an app when it’s presented to them. In short, it’s whatever the measure of success is in the problem that the bandit is trying to solve. The goal of the multi-armed bandit is to maximize total cumulative reward.

*Regret*, on the other hand, is the difference between the reward *actually* collected by the experiment and the reward that *would* have been received if the final best arm been pulled at every step. A bandit’s main evaluation metric, regret is calculated only at the end of an experiment, as the arm that was ultimately chosen must serve as the measuring stick for success. Think of regret like an R² value, something that tells you how effective the bandit was overall.

**Comparison**

While all three of these bandit types have their place, they each have their own strengths and weaknesses.

The two epsilon methods tend to outperform Thompson sampling during the early stages of an experiment, when results are based more on randomness. But Thompson sampling bandits catch up as more data points become available, and eventually accrue less regret and more reward over time. This is the result of a crucial design distinction: epsilon methods give all suboptimal choices an equal chance of being shown during exploration, while Thompson sampling chooses the best among sampled rates at each iteration.

Thompson sampling converges on the optimal choice more quickly than the two epsilon methods. However, it’s also the most computationally complex, meaning it could be more difficult to put into production (in certain situations, it can be nearly impossible to calculate the posterior distribution required for Thompson sampling).

While the following may not *always* be the case, the general tendencies of the three algorithms can be summarized below.

**Performance (Reducing Overall Regret)**Thompson sampling

**>**Epsilon-decreasing

**>**Epsilon-greedy

**Speed of Convergence to Optimal Arm**Thompson sampling

**>**Epsilon-decreasing

**>**Epsilon-greedy

**Ease of Implementation**Epsilon-greedy

**>**Epsilon-decreasing

**>**Thompson sampling

**Conclusion**

Developers often joke that A/B testing stands for “Always Be Testing,” and though A/B tests aren’t always the best way to conduct experiments, testing is a great way to help understand user behavior. Multi-armed bandits are a very powerful and versatile tool for running tests, enabling quick and efficient experimentation with ease.

If you enjoyed this *Opex 101* entry, check out our other posts on random forests, vehicle routing, and scheduling problems.

*If there’s a topic you’d like us to cover as part of *Opex 101*, let us know in the comments below!*

_________________________________________________________________

*If you liked this blog post, check out more of **our work**, follow us on social media (**Twitter**,** LinkedIn**, and** Facebook**), or join us for our** free monthly Academy webinars**.*