Multi-Armed Bandits: An Overview of Classical RL Algorithms

GridflowAI
4 min read · Oct 17, 2023

Introduction

In recent years, Reinforcement Learning (RL) has carved a significant niche within the vast realm of machine learning applications. It’s the underlying force behind numerous advancements, from playing intricate board games at a superhuman level to real-world robotic controls. The most recent stride in the domain of generative AI, especially in the development of Large Language Models (LLMs), owes much to the principles of RL.

As we take a deep dive into the foundations of RL, the concept of “Multi-Armed Bandits” (MAB) prominently stands out. This blog series, drawing insights from Sutton and Barto’s seminal textbook on RL, will meticulously unpack the complexities of MAB for you.

This blog will be presented in a multipart series:

  • Part 1: Overview of “Multi-Armed Bandits (MAB)”
  • Part 2: In-depth Exploration of Epsilon-greedy and Epsilon-decay
  • Part 3: A Comprehensive Analysis of Thompson Sampling
  • Part 4: Delving Deep into the Upper Confidence Bound Method
  • Part 5: Probing the e-greedy Annealing Algorithm
  • Part 6: An Exhaustive Examination of the Gradient Bandit Algorithm
  • Part 7: Understanding Non-stationary Behavior and Modeling

Without further ado, let’s embark on this journey with Part 1!

Background of MAB

The term “Multi-Armed Bandits” refers to a problem in which an agent must repeatedly choose among several options (arms) without knowing their payoffs in advance. It is named after the slot machines (“one-armed bandits”) in casinos, where each lever (or “arm”) can yield a different reward.

The challenge? Determining the optimal arm to pull to maximize the total reward over a series of trials.

Use Cases of MAB

From online recommendation systems to clinical trials, the MAB problem is omnipresent.

Online platforms, like Netflix or Amazon, use it to decide which movie or product to recommend, ensuring users get the most relevant content.

In clinical trials, MAB assists researchers in allocating patients to the most effective treatments without prior knowledge of outcomes.

Strengths and Weaknesses of MAB Algorithms

Strengths:

  • Simplicity: Many MAB algorithms, especially the basic ones, are easy to understand and implement.
  • Online Learning: They adapt over time, improving as more data is collected.
  • Efficient Exploration-Exploitation Trade-off: They balance well between trying new arms and sticking to the best-known ones.

Weaknesses:

  • Assumption of Stationarity: Some MAB algorithms assume the environment doesn’t change, which isn’t always true.
  • Scalability: As the number of arms increases, some methods might not scale well.

List of Popular MAB Algorithms

Epsilon-greedy:

  • A straightforward method where the agent primarily chooses the best-known arm.
  • Occasionally, with a small probability (ε), it selects an arm at random, allowing for exploration (a minimal sketch follows below).
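
To make this concrete, here is a minimal epsilon-greedy sketch in Python. It assumes Bernoulli (click / no-click) rewards; the arm probabilities, step count, and ε value are arbitrary choices for illustration rather than part of any particular library.

```python
import numpy as np

def epsilon_greedy(true_means, epsilon=0.1, steps=1000, seed=0):
    """Epsilon-greedy on a Bernoulli bandit: exploit the best-known arm, explore with probability epsilon."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    counts = np.zeros(k)   # how many times each arm was pulled
    values = np.zeros(k)   # sample-average estimate of each arm's reward
    total_reward = 0.0

    for _ in range(steps):
        if rng.random() < epsilon:
            arm = int(rng.integers(k))        # explore: pick a random arm
        else:
            arm = int(np.argmax(values))      # exploit: pick the best-known arm
        reward = float(rng.random() < true_means[arm])        # Bernoulli reward
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]   # incremental average
        total_reward += reward
    return values, total_reward

# Three hypothetical arms with unknown success probabilities
estimates, total = epsilon_greedy([0.05, 0.12, 0.08])
print(estimates, total)
```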

Thompson Sampling:

  • Maintains a probabilistic model of each arm’s reward, often a Beta distribution over its success probability.
  • At each step, a value is sampled from every arm’s distribution and the arm with the highest sample is pulled, which naturally balances exploring uncertain arms with exploiting promising ones (sketched below).
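
As a rough illustration, here is a minimal Beta-Bernoulli Thompson sampling sketch; the uniform Beta(1, 1) prior, arm probabilities, and step count are illustrative assumptions.

```python
import numpy as np

def thompson_sampling(true_means, steps=1000, seed=0):
    """Beta-Bernoulli Thompson sampling: sample each arm's posterior, pull the arm with the highest draw."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    alpha = np.ones(k)   # Beta posterior parameters: successes + 1
    beta = np.ones(k)    # Beta posterior parameters: failures + 1

    for _ in range(steps):
        samples = rng.beta(alpha, beta)           # one draw per arm from its posterior
        arm = int(np.argmax(samples))
        reward = float(rng.random() < true_means[arm])
        alpha[arm] += reward                      # update the chosen arm's posterior
        beta[arm] += 1.0 - reward
    return alpha / (alpha + beta)                 # posterior mean estimate per arm

print(thompson_sampling([0.05, 0.12, 0.08]))
```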

Upper Confidence Bound (UCB):

  • Selects the arm with the highest upper confidence bound, considering both the potential reward and uncertainty.
  • As more is learned about an arm, its uncertainty shrinks and decisions become more exploitative (see the sketch below).
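
A minimal UCB1-style sketch, under the same Bernoulli-reward assumption as above; the exploration coefficient c is an arbitrary choice.

```python
import numpy as np

def ucb1(true_means, steps=1000, c=2.0, seed=0):
    """UCB1: pull the arm whose estimated value plus uncertainty bonus is highest."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    counts = np.zeros(k)
    values = np.zeros(k)

    for t in range(1, steps + 1):
        if t <= k:
            arm = t - 1                                  # pull every arm once to initialise
        else:
            bonus = np.sqrt(c * np.log(t) / counts)      # shrinks as an arm is pulled more often
            arm = int(np.argmax(values + bonus))
        reward = float(rng.random() < true_means[arm])
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
    return values

print(ucb1([0.05, 0.12, 0.08]))
```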

e-greedy Annealing:

  • Similar to Epsilon-greedy, but the exploration probability (ε) decreases over time.
  • As more is learned, the algorithm relies more on exploitation and less on exploration (illustrated in the sketch below).
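
A possible sketch of epsilon-greedy with annealing; the multiplicative decay schedule with a small floor is just one common choice, not the only one.

```python
import numpy as np

def epsilon_greedy_annealing(true_means, steps=1000, decay=0.995, eps_min=0.01, seed=0):
    """Epsilon-greedy where epsilon shrinks every step, shifting from exploration to exploitation."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    counts = np.zeros(k)
    values = np.zeros(k)
    epsilon = 1.0                                  # start fully exploratory

    for _ in range(steps):
        if rng.random() < epsilon:
            arm = int(rng.integers(k))
        else:
            arm = int(np.argmax(values))
        reward = float(rng.random() < true_means[arm])
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
        epsilon = max(eps_min, epsilon * decay)    # anneal epsilon toward its floor
    return values, epsilon

print(epsilon_greedy_annealing([0.05, 0.12, 0.08]))
```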

Gradient Bandit:

  • Estimates a preference for each arm rather than its value, updating the preferences based on received rewards.
  • Arms are chosen using a soft-max distribution over the preferences, balancing exploration and exploitation (a code sketch follows).
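
A minimal sketch of the gradient bandit update (preferences plus soft-max), following the form described in Sutton and Barto; the step size and arm probabilities are illustrative assumptions.

```python
import numpy as np

def gradient_bandit(true_means, steps=1000, step_size=0.1, seed=0):
    """Gradient bandit: learn per-arm preferences, choose arms via a soft-max over preferences."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    prefs = np.zeros(k)          # preferences H(a), not value estimates
    baseline = 0.0               # running average reward, used as a baseline

    for t in range(1, steps + 1):
        probs = np.exp(prefs - prefs.max())
        probs /= probs.sum()                           # soft-max action probabilities
        arm = int(rng.choice(k, p=probs))
        reward = float(rng.random() < true_means[arm])
        baseline += (reward - baseline) / t            # incremental average reward
        indicator = np.eye(k)[arm]                     # 1 for the chosen arm, 0 elsewhere
        # raise the chosen arm's preference if reward beats the baseline, lower the others
        prefs += step_size * (reward - baseline) * (indicator - probs)
    return prefs

print(gradient_bandit([0.05, 0.12, 0.08]))
```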

High-Level Understanding of MAB Methods

Imagine an online advertising scenario where a marketer has three different advertisements (Ad1, Ad2, Ad3) and wants to determine which one is the most effective.

5.1 Random Trial (Explore): Initially, the marketer may choose to display each ad randomly to different users. This stage is purely explorative, aiming to gather data on the performance of each ad.

5.2 Rewards and Action Value: As users interact with the ads, the marketer receives feedback. If a user clicks on an ad, it’s a reward. Over time, each ad accumulates an “action value” — an average of the rewards.
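
For instance, each ad’s action value can be kept as a running average and updated incrementally, without storing every individual click; this tiny sketch assumes a 1/0 click reward.

```python
def update_action_value(q, n, reward):
    """Incremental sample average: Q <- Q + (reward - Q) / n after the n-th observation."""
    n += 1
    q += (reward - q) / n
    return q, n

# Ad2 shown three times: click, no click, click
q, n = 0.0, 0
for r in [1, 0, 1]:
    q, n = update_action_value(q, n, r)
print(q)   # ~0.667, the estimated click-through rate so far
```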

5.3 Exploit and Explore Together: With sufficient data, the marketer may notice Ad2 has the highest action value. But should they exclusively display Ad2? This is where the explore-exploit dilemma comes into play. The marketer can choose the best-performing ad most of the time (exploit) but occasionally test other ads (explore).

5.4 Convergence: As more and more data is collected, the marketer will have a clearer idea of the best-performing ad.

The algorithm will, over time, converge to the optimal ad, ensuring the maximum cumulative reward.

Stay tuned for Part 2, where we’ll dive deeper into the fascinating world of Epsilon-greedy and Epsilon-decay, with a detailed walkthrough of these classical MAB methods!

About the Author: Bhupen (bhupen@gridflowai.com)

I’m an avid AI enthusiast with a deep-rooted passion for teaching — from foundational concepts to hands-on implementation in ML, DL, RL, and NLP. My journey in the IT realm spans over 31 years, with 14 cherished years dedicated to teaching and mentoring. The world of AI is vast, and I take immense pleasure in making it accessible and comprehensible to all.
