Reinforcement Learning Chapter 2: Multi-Armed Bandits (Part 1 — Introduction)
Chapter 2 Series:
- Part 1 — Introduction
- Part 2 — Action-Value Methods
- Part 3 — Nonstationary Rewards
- Part 4 — Upper-Confidence-Bound Action Selection
- Part 5 — Gradient Bandit Algorithms
- Part 6 — Associative Search
Code: https://github.com/nums11/rl
The k-armed Bandit Problem
Chapter 2 discusses the k-armed bandit problem, a simplified version of the full RL problem that is used to highlight the exploration vs. exploitation dilemma.
The name “k-armed bandit” is by analogy with a slot machine (a “one-armed bandit”) that has k levers instead of one. Each action is like pulling a lever, and each lever pays out a reward drawn from its own probability distribution.
The goal is to maximize total winnings by learning which levers pay out the highest rewards most reliably and pulling those as much as possible.
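To make the setup concrete, here is a minimal sketch of a k-armed bandit environment in Python. The class name, the Gaussian reward distributions, and the 10-armed default are illustrative assumptions (in the spirit of the book’s testbed), not code from the linked repo.

```python
import numpy as np

class KArmedBandit:
    """Illustrative slot machine with k levers; each lever pays a noisy reward."""

    def __init__(self, k=10, seed=None):
        self.rng = np.random.default_rng(seed)
        # True mean reward of each lever, hidden from the learner (assumed Gaussian).
        self.q_star = self.rng.normal(0.0, 1.0, size=k)

    def pull(self, a):
        # Reward for lever a: its true mean plus unit-variance noise.
        return self.rng.normal(self.q_star[a], 1.0)

bandit = KArmedBandit(k=10, seed=0)
print(bandit.pull(3))  # sample reward from pulling lever 3
```

Pulling the same lever repeatedly gives different rewards; only its mean is fixed, which is exactly what makes this a learning problem.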
Terminology
To formally define the problem, some basic mathematical notation is introduced.
Aₜ = action taken at time t
Rₜ = reward at time t
q*(a) = the value of any action a, defined as the expected reward for that action if it’s chosen
- Formally, q*(a) = E[Rₜ | Aₜ = a]
- In words: the expected value (E) of the reward at time t (Rₜ) given that the action taken at time t is a (Aₜ = a)
Qₜ(a) = estimated value of action a at timestep t
- Since we don’t know the actual value of any action, we have to estimate it
The goal is to get Qₜ(a) as close to q*(a) as possible.
- If we can closely estimate the true value of each action, we know which actions to take to maximize reward (the sketch below shows this idea numerically).
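As a quick illustration of the notation, the sketch below reuses the assumed KArmedBandit class from earlier and estimates the value of a single lever by averaging sampled rewards; as the number of samples grows, the average approaches q*(a) = E[Rₜ | Aₜ = a].

```python
bandit = KArmedBandit(k=10, seed=0)
a = 3
rewards = [bandit.pull(a) for _ in range(1000)]

Q_a = sum(rewards) / len(rewards)  # sample-average estimate of Q_t(a)
print(Q_a, bandit.q_star[a])       # estimate vs. the true value q*(a)
```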
Exploration vs. Exploitation
At any point in time there will be one action whose estimated value is highest; we call this action the greedy action.
Exploiting: Choosing the greedy action is called exploiting. This is because we are making use of the knowledge we have gained thus far to choose the action we think will give us the highest reward.
Exploring: Choosing an action other than the greedy action is called exploring. This is because we are trying a different action to see if we can learn something new (maybe the action we think is the best isn’t actually the best).
Exploitation might maximize reward for a single timestep, but exploration could maximize reward in the long run. The conflict between choosing when to exploit your knowledge of the environment vs. when to explore the environment in hopes of gaining new knowledge is called the exploration-exploitation dilemma.
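One simple way to trade off the two, sketched below purely as an illustration (this kind of method is covered properly in Part 2), is to pick a random action with a small probability ε (explore) and otherwise pick the greedy action (exploit).

```python
import numpy as np

rng = np.random.default_rng(0)

def choose_action(Q, epsilon=0.1):
    # Explore: with probability epsilon, try a random lever.
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))
    # Exploit: otherwise take the greedy action (highest estimated value).
    return int(np.argmax(Q))

Q_estimates = np.zeros(10)          # current estimates Q_t(a) for 10 levers
print(choose_action(Q_estimates))   # usually greedy, occasionally random
```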
Now that we understand the multi-armed bandit problem, we can start looking at how it can be solved.