Reinforcement Learning Chapter 2: Multi-Armed Bandits (Part 2 — Action-Value Methods)
Chapter 2 Series:
- Part 1 — Introduction
- Part 2 — Action-Value Methods
- Part 3 — Nonstationary Rewards
- Part 4 — Upper-Confidence-Bound Action Selection
- Part 5 — Gradient Bandit Algorithms
- Part 6 — Associative Search
Code: https://github.com/nums11/rl
In the previous article, we learned about the Multi-Armed Bandits problem. Now we can introduce the first set of methods that can solve this problem: action-value methods.
Action-Value Methods
Action-value methods are a group of solutions to the Multi-Armed Bandits problem that focus on forming accurate estimates of the value of each action and using those estimates to decide which action to choose.
Action-value methods have 2 parts:
- Estimating the value of a given action
- Selecting an action based on the estimated values of all of the actions.
Estimating Action Values
The simplest way to estimate the value of any particular action is to average the rewards that were received every time this action was taken. This is called sample averaging.
This can formally be defined as:
Qₜ(a) = Σ Rᵢ · 1(Aᵢ = a) / Σ 1(Aᵢ = a), where both sums run over the time steps i = 1, ..., t-1
- The ‘1’ on the right side of the equation is an indicator function: it equals 1 if the predicate Aᵢ = a is true (action a was chosen at step i) and 0 otherwise, so the numerator adds up the rewards received when a was taken and the denominator counts how many times a was taken.
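To make the formula concrete, here is a tiny worked example in Python (the reward values are made up for illustration):

# Sample-average estimate for one action that has been taken three times.
rewards_for_action = [1, 0, 1]   # rewards received each time this action was taken
estimate = sum(rewards_for_action) / len(rewards_for_action)
print(estimate)                  # 0.666..., the sample-average value estimate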
Incremental Updates
The simplest implementation of sample averaging would be to store all of the rewards for each action as well as the number of times each action was taken; these could then be used to calculate the sample-averaging formula above.
The problem with this implementation is that the memory and computation grow as more rewards are seen. Instead, we’d like to avoid having to save all of the rewards to lower memory and computation needs. This can be done through incremental updating.
For any action:
Rᵢ = reward received after the ith selection of this action
Qₙ = estimate of this action’s value after it has been selected n-1 times
We can now calculate Qₙ₊₁, the new action-value estimate that takes into account the reward Rₙ that was just received:
Qₙ₊₁ = Qₙ + (1/n) [ Rₙ - Qₙ ]
This implementation requires memory only for Qₙ and n, and only a small, constant amount of computation for each new reward.
You can think of this update rule in the format of:
NewEstimate = OldEstimate + StepSize [ Target - OldEstimate ]
- The expression [ Target - OldEstimate ] is an error in the current estimate of the action’s value. We are moving our estimate toward the “Target” (the current reward) which represents a ground truth (though it may be noisy).
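As a small sketch (the function name is mine, not from the article's repository), the incremental update needs only the old estimate, the latest reward, and the count n:

def incremental_update(old_estimate, reward, n):
    # NewEstimate = OldEstimate + StepSize * (Target - OldEstimate),
    # where the step size is 1/n and the target is the reward just received.
    step_size = 1.0 / n
    return old_estimate + step_size * (reward - old_estimate)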
Action Selection: Greedy and Epsilon-Greedy
Now that we know how to estimate the value of actions, we can move on to the second part of action-value methods: choosing an action. There are 2 basic ways to choose an action:
Greedy Action Selection: The simplest way is to always choose the greedy action (the action with the highest estimated value). This can be thought of as always exploiting.
Epsilon-Greedy Action Selection: Another method is to choose the greedy action most of the time, but choose some other action with some probability ε (epsilon). ε could be any value of our choice (e.g. 0.1 in which case we’d choose the greedy action 90% of the time). In this method we still exploit most of the time but occasionally we explore other actions that may lead to better rewards.
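Both rules can be sketched in a few lines, assuming the current estimates live in a NumPy array (the function names here are illustrative, not from the article's repository):

import numpy as np

def greedy_action(q_estimates):
    # Always exploit: pick the action with the highest estimated value.
    return int(np.argmax(q_estimates))

def epsilon_greedy_action(q_estimates, epsilon=0.1):
    # With probability epsilon, explore by picking uniformly at random
    # (possibly re-picking the greedy action); otherwise exploit.
    if np.random.random() < epsilon:
        return np.random.randint(len(q_estimates))
    return int(np.argmax(q_estimates))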
Optimistic Initial Values
The final detail before we can implement a solution is to understand how we can initialize our estimates of the action values.
- The simplest way is to initialize them all to 0, which works fine.
- Another way is to initialize them optimistically: start them off with values much higher than any value they could ever take in reality.
For sample-average methods, whichever way we choose to initialize the estimates doesn’t really matter in the long run, because the bias disappears once every action has been taken at least once.
In the short run, however, optimistic initialization encourages exploration. Since the first rewards an action produces are guaranteed to fall below its inflated estimate, that estimate drops, and the agent moves on to other actions whose estimates are still optimistically high. In this way even a greedy agent is pushed to explore during its initial phase, which increases the likelihood of it finding the best action.
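As a toy illustration (all numbers below are made up), a purely greedy agent with optimistic estimates still tries every arm at least once, because each observed reward drags that arm's estimate below the untried arms' estimates:

import numpy as np

q = np.full(3, 5.0)              # optimistic estimates for 3 arms
counts = np.zeros(3, dtype=int)
mean_rewards = [0.2, 0.8, 0.5]   # hypothetical true mean rewards, all well below 5

for step in range(5):
    a = int(np.argmax(q))            # greedy selection
    r = mean_rewards[a]              # pretend reward for the chosen arm
    counts[a] += 1
    q[a] += (r - q[a]) / counts[a]   # incremental sample-average update
    print(step, a, q)                # the agent cycles through all arms, then settles on arm 1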
Implementation
Now that we know how to estimate the values of actions and how to select an action based on those estimations, we can combine these steps into a simple algorithm to solve the bandit problem.
Below is a Python implementation of two agents, both of which use sample averaging with incremental updates for action-value estimation. Action values are initialized optimistically.
- GreedyAgent: uses greedy action selection
- EpsilonGreedyAgent: uses epsilon-greedy action selection
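The author's actual code lives in the repository linked at the top; the sketch below follows the description above (sample averaging with incremental updates, optimistic initial values, and the two selection rules). The initial value of 5.0 and the default epsilon of 0.1 are illustrative choices, not taken from the repository.

import numpy as np

class GreedyAgent:
    # Sample-average estimation with incremental updates and greedy selection.
    def __init__(self, num_actions, initial_value=5.0):
        # Optimistic initial estimates (5.0 is an arbitrary "too high" value).
        self.q_estimates = np.full(num_actions, initial_value)
        self.action_counts = np.zeros(num_actions, dtype=int)

    def select_action(self):
        # Always exploit the action with the highest estimated value.
        return int(np.argmax(self.q_estimates))

    def update(self, action, reward):
        # Incremental sample-average update: Q(n+1) = Q(n) + (1/n)[R(n) - Q(n)]
        self.action_counts[action] += 1
        n = self.action_counts[action]
        self.q_estimates[action] += (reward - self.q_estimates[action]) / n

class EpsilonGreedyAgent(GreedyAgent):
    # Same estimation scheme, but explores a random action with probability epsilon.
    def __init__(self, num_actions, epsilon=0.1, initial_value=5.0):
        super().__init__(num_actions, initial_value)
        self.epsilon = epsilon

    def select_action(self):
        if np.random.random() < self.epsilon:
            return np.random.randint(len(self.q_estimates))
        return int(np.argmax(self.q_estimates))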
Performance
We can test these agents on different versions of the Multi-Armed Bandits problem implemented using the OpenAI Gym framework. An existing implementation is available in the JKCooper2/gym-bandits repository.
- You can install the package via pip: pip install git+https://github.com/JKCooper2/gym-bandits#egg=gym-bandits
- Then import it:
import gym_bandits
BanditTwoArmedDeterministicFixed-v0: This is the simplest version of the bandit problem where there are only 2 arms, one that always pays and one that doesn’t.
BanditTenArmedRandomRandom-v0: This is a more complex version of the problem where there are 10 arms each with a random probability assigned to both payouts and rewards.
The code below can be used to test out the agents (simply uncomment the environment).
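Here is a sketch of such a test script, assuming the agent classes above and the older gym step API that gym-bandits was written against (observation, reward, done, info):

import gym
import gym_bandits  # registers the bandit environments with gym

env = gym.make("BanditTwoArmedDeterministicFixed-v0")
# env = gym.make("BanditTenArmedRandomRandom-v0")  # uncomment to use the 10-armed version

num_actions = env.action_space.n
num_episodes = 1000

for agent in (GreedyAgent(num_actions), EpsilonGreedyAgent(num_actions, epsilon=0.1)):
    total_reward = 0
    for _ in range(num_episodes):
        env.reset()
        action = agent.select_action()
        _, reward, done, info = env.step(action)  # each pull is a one-step episode
        agent.update(action, reward)
        total_reward += reward
    print(type(agent).__name__, "cumulative reward:", total_reward)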
For the simple BanditTwoArmedDeterministicFixed-v0 environment both agents learn quickly and the greedy agent ends up with a slightly better cumulative reward (999 vs. 941 over 1000 episodes) because it never stops to explore once it finds the best action.
For the more complex BanditTenArmedRandomRandom-v0, the epsilon-greedy agent can beat the greedy agent or vice versa. It depends on whether the initial exploration forced on the greedy agent by the optimistic initial-value estimates is enough for it to stumble on the best action.
This is the simplest version of the Multi-Armed Bandits problem. We’ll discuss some harder variations of the problem and new solutions next.