Reinforcement Learning Chapter 2: Multi-Armed Bandits (Part 5 — Gradient Bandit Algorithms)

Numfor Tiapo
3 min read · Feb 22, 2023


Chapter 2 Series:

Code: https://github.com/nums11/rl

All of the methods for solving the Multi-Armed Bandits problem that we've covered so far have been action-value methods: methods that work by estimating the value of actions and then selecting actions based on those estimates. In this article, we'll cover a whole new class of methods: gradient bandit methods.

Gradient Bandit Algorithms

Unlike action-value methods, which estimate the value of each action, gradient bandit methods don't estimate action values at all. Instead, they learn a numerical preference for each action.

This preference has no interpretation in terms of reward; only the relative preference of one action over the others matters.

Formalization

The probability of selecting any action is given according to a soft-max distribution:
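In Sutton & Barto's formulation, the probability of selecting action a at timestep t is the soft-max of the action preferences Hₜ(a) (defined below):

πₜ(a) = e^Hₜ(a) / Σ_b e^Hₜ(b)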

  • If you’re not familiar with soft-max, it is a function that outputs probabilities over a set of numbers such that the probabilities sum to 1.
  • In this case, the numbers are the preferences of the actions, so each action is assigned a probability of being selected, and the probabilities of all the actions sum to 1.
  • For example, if there are 3 actions and the game is at some timestep t you could have: πₜ(1) = 0.3, πₜ(2) = 0.6, πₜ(3) = 0.1
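As a quick sanity check, here is a small Python snippet showing how preferences map to probabilities. The preference values are made up so that the resulting probabilities come out close to the example above:

```python
import numpy as np

# Made-up preference values chosen so the soft-max probabilities
# come out close to the example above (0.3, 0.6, 0.1)
preferences = np.array([0.5, 1.2, -0.6])

exp_prefs = np.exp(preferences)
probs = exp_prefs / exp_prefs.sum()

print(probs)        # ~[0.30, 0.60, 0.10]
print(probs.sum())  # 1.0
```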

The value Hₜ(a) represents the preference of selecting an action a over the other actions. On each step, after selecting the action Aₜ and receiving the reward Rₜ, the action preferences are updated:
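In Sutton & Barto's notation, the update is:

Hₜ₊₁(Aₜ) = Hₜ(Aₜ) + α(Rₜ − R̄ₜ)(1 − πₜ(Aₜ))
Hₜ₊₁(a) = Hₜ(a) − α(Rₜ − R̄ₜ)πₜ(a), for all a ≠ Aₜ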

  • The top part of the equation is the preference update for the action just taken (Aₜ), and the bottom part is the preference update for all of the other actions
  • α > 0 is a step-size parameter that determines the degree to which the preferences are changed at each step
  • R̄ₜ ("R bar at timestep t") is the average of all the rewards received up to and including timestep t. As with previous methods, it can be maintained incrementally using sample averaging, or with a constant-α step size that is better suited to nonstationary rewards.
  • R̄ₜ can be thought of as a baseline against which the current reward is compared. If the reward is higher than the baseline, the preference for the action just taken is increased; if the reward is lower than the baseline, its preference is decreased.
  • The bottom part of the equation moves the preferences for all of the other actions in the opposite direction.

Implementation

Below is a Python implementation of all of the previously discussed agents as well as a Gradient Bandit Agent (the full code is in the GitHub repo linked above).
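For reference, here is a minimal sketch of just the Gradient Bandit Agent. The class and method names are illustrative and may not match the ones used in the linked repo:

```python
import numpy as np

class GradientBanditAgent:
    """Minimal sketch of a gradient bandit agent (illustrative, not the
    exact class from the linked repo)."""

    def __init__(self, num_actions, alpha=0.1,
                 use_constant_baseline_step=False, baseline_alpha=0.1):
        self.num_actions = num_actions
        self.alpha = alpha                         # step size for preference updates
        self.preferences = np.zeros(num_actions)   # H_t(a), all start equal
        self.avg_reward = 0.0                      # baseline R̄_t
        self.t = 0
        self.use_constant_baseline_step = use_constant_baseline_step
        self.baseline_alpha = baseline_alpha

    def _action_probs(self):
        # Soft-max over preferences; subtract the max for numerical stability
        exp_prefs = np.exp(self.preferences - np.max(self.preferences))
        return exp_prefs / exp_prefs.sum()

    def select_action(self):
        probs = self._action_probs()
        return np.random.choice(self.num_actions, p=probs)

    def update(self, action, reward):
        self.t += 1
        probs = self._action_probs()

        # Update the baseline R̄_t: sample average, or a constant step size
        # that works better for nonstationary rewards
        if self.use_constant_baseline_step:
            self.avg_reward += self.baseline_alpha * (reward - self.avg_reward)
        else:
            self.avg_reward += (reward - self.avg_reward) / self.t

        # Preference update: push the chosen action up if the reward beat the
        # baseline (and down otherwise), and push all other actions the
        # opposite way
        one_hot = np.zeros(self.num_actions)
        one_hot[action] = 1.0
        self.preferences += self.alpha * (reward - self.avg_reward) * (one_hot - probs)
```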

Performance

We can test the Gradient Bandit Agent with different α values on the previously defined nonstationary environment.
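A test loop might look roughly like this; NonStationaryEnvironment is a placeholder for whatever environment class was defined in the earlier articles:

```python
# Hypothetical test loop. NonStationaryEnvironment and its step() method are
# placeholders, not necessarily the names used in the linked repo.
num_arms = 10
env = NonStationaryEnvironment(num_arms)
agent = GradientBanditAgent(num_actions=num_arms, alpha=0.1,
                            use_constant_baseline_step=True)

total_reward = 0
num_steps = 10000
for _ in range(num_steps):
    action = agent.select_action()
    reward = env.step(action)      # assumed to return only the reward
    agent.update(action, reward)
    total_reward += reward

print("Average reward:", total_reward / num_steps)
```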

Here are the results:

The best α value seems to be around 50, which scores an average reward of 0.84. That's worse than the UCB agent (0.98) and similar to the epsilon-greedy agent with constant-α estimation (0.88).

Now we've covered all of the basic solutions to the Multi-Armed Bandits problem. In the next article, we'll summarize what we've learned and look at how this problem differs from the full reinforcement learning problem.
