Reinforcement Learning Chapter 2: Multi-Armed Bandits (Part 3 — Nonstationary Rewards)

Numfor Tiapo
Feb 18, 2023


Code: https://github.com/nums11/rl

In the previous articles, we’ve been looking at the simplest version of the multi-armed bandit problem. Now we can increase the complexity and take one step closer to the full reinforcement learning problem.

Stationary vs. Nonstationary Rewards

So far, the environments we have discussed have had stationary rewards. Simply stated, this means the reward distributions do not change over time: an action gives out rewards with the same probabilities the 1st, 100th, and 1000th time it is selected.

In reality, most reinforcement learning problems have nonstationary rewards — the reward distributions change over time. The methods that we have learned so far struggle to handle this type of environment.

Consider the simple 5-arm bandit environment below, in which one arm outputs a constant reward while the other arms output no reward. The only difference from the stationary environments we have seen is that every 500 steps, the rewarding arm changes at random.
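The actual environment code is in the linked repo; here is a minimal sketch of what such an environment might look like (the class and attribute names are my own, not necessarily the repo’s):

```python
import numpy as np

class NonstationaryBanditEnv:
    """5-armed bandit: one arm pays a constant reward of 1, the rest pay 0.
    Every `switch_every` steps the paying arm moves to a random arm."""

    def __init__(self, num_arms=5, switch_every=500):
        self.num_arms = num_arms
        self.switch_every = switch_every
        self.steps = 0
        self.best_arm = np.random.randint(num_arms)

    def step(self, action):
        # Periodically move the reward to a different (random) arm.
        if self.steps > 0 and self.steps % self.switch_every == 0:
            self.best_arm = np.random.randint(self.num_arms)
        self.steps += 1
        return 1.0 if action == self.best_arm else 0.0
```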

The previously shown Greedy and Epsilon-Greedy methods struggle with this change in reward distribution. Over 10k episodes, the greedy agent scores an average reward of roughly 0.15. The e-greedy agent does better at 0.31, but that is still quite poor.

The Problem with Sample Averaging

The agents perform poorly when the reward distribution suddenly changes because their action-value estimations are heavily based on rewards from an old distribution.

Recall the sample-averaging method used to update the action-value estimations:

Q_{n+1} = Q_n + α (R_n − Q_n)

where α = (1/n)

At the start of the game, the best action could be action 1. Then 500 steps later it could change to action 4, and another 500 steps later to action 2. At this point, choosing action 1 is clearly bad, but the sample-averaging method would still rate it somewhat favorably. This is because at one point in time it was the best action, and sample averaging weights every reward an action has ever received equally, all the way back to the beginning.
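To make this concrete with made-up numbers: if action 1 paid a reward of 1 on each of its first 500 selections and then 0 on the next 500, its sample-average estimate would be (500 × 1 + 500 × 0) / 1000 = 0.5, so the arm would still look moderately valuable even after paying nothing for 500 straight selections.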

Constant Alpha

We’d like to be flexible to a changing reward distribution, so that when the distribution changes we can quickly learn which actions are now good to take and which are no longer good.

One way to accomplish this is with a constant α. In the same action-value update equation above, we can use a constant value between 0 and 1 instead of setting α = (1/n).
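Written out with a constant step size, the update has the same form as before. Unrolling it (a standard identity for this update) shows that each past reward is weighted by a power of (1 − α), so the influence of old rewards decays exponentially:

Q_{n+1} = Q_n + α (R_n − Q_n)
        = (1 − α)^n Q_1 + α Σ_{i=1..n} (1 − α)^{n−i} R_i

The most recent reward R_n gets weight α, the one before it α(1 − α), and so on, so rewards from an old distribution fade away geometrically.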

To understand why this works, consider what’s actually happening when α = (1/n): the ability to change our estimation for an action diminishes over time.

  • The 10th time an action is selected, α = (1/10) = 0.1
  • The 1000th time it is selected, α = (1/1000) = 0.001
  • As an action is chosen more and more, α shrinks, causing ever smaller changes to the estimation of that action’s value.
  • This means it can be difficult to change our estimation even if the rewards that action gives us change, and the estimation can be heavily biased toward the rewards it used to give.

Now consider the case when α is a constant value between 0 and 1 (e.g. α = 0.9).

  • Whether an action is being selected for the 10th time or the 1000th, α remains the same.
  • If an action that used to give a reward suddenly stops giving rewards, the constant α causes the estimation of that action’s value to quickly diminish.
  • Conversely, it can also cause an estimation to quickly increase if an action suddenly starts being rewarding.
  • In this way we can quickly respond to a changing reward distribution.
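A quick comparison of the two update rules on the same reward stream illustrates the difference; the numbers below are illustrative, not output from the repo’s agents:

```python
# An arm that pays 1 for 500 steps and then stops paying.
rewards = [1.0] * 500 + [0.0] * 500

q_sample, q_const, alpha = 0.0, 0.0, 0.9
for n, r in enumerate(rewards, start=1):
    q_sample += (1.0 / n) * (r - q_sample)  # sample averaging (alpha = 1/n)
    q_const += alpha * (r - q_const)        # constant alpha

print(round(q_sample, 3))  # 0.5 -- still anchored to the old distribution
print(round(q_const, 3))   # 0.0 -- quickly reflects that the arm stopped paying
```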

Performance

You can use the environment above with the agents and testing file from the linked repo to see how much better a constant-α agent performs in a nonstationary environment; a sketch of such an agent and test loop follows below.
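The agent classes and testing file live in the linked repo; a minimal sketch of a constant-α, e-greedy agent and a test loop (reusing the NonstationaryBanditEnv sketch from earlier, with names and defaults of my own choosing) might look like this:

```python
import numpy as np

class ConstantAlphaAgent:
    """e-greedy action selection with a constant-alpha action-value update."""

    def __init__(self, num_arms=5, alpha=0.9, epsilon=0.1):
        self.q = np.zeros(num_arms)
        self.alpha = alpha
        self.epsilon = epsilon

    def select_action(self):
        if np.random.random() < self.epsilon:
            return np.random.randint(len(self.q))  # explore
        return int(np.argmax(self.q))              # exploit

    def update(self, action, reward):
        # Constant step size: recent rewards dominate the estimate.
        self.q[action] += self.alpha * (reward - self.q[action])

# Simple test loop over 10k episodes (one step per episode).
env = NonstationaryBanditEnv()
agent = ConstantAlphaAgent()
total = 0.0
for _ in range(10_000):
    action = agent.select_action()
    reward = env.step(action)
    agent.update(action, reward)
    total += reward
print("average reward:", total / 10_000)
```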

Here are the average rewards for the 3 agents over 10k episodes:

  • Greedy: 0.09
  • E-Greedy: 0.37
  • C-Alpha: 0.88

It’s clear that the constant-α agent greatly outperforms agents using the sample-averaging method.

Now we’ve learned about nonstationary environments and how to use a constant α to update action-value estimations so that they remain flexible to new reward distributions. Next, we can learn about other methods for action selection besides just greedy and e-greedy.
