Fundamentals of Reinforcement Learning : The K-bandit Problem, Illustrated

Adrian Yijie Xu, PhD
Published in GradientCrescent
Sep 14, 2019

Welcome to GradientCrescent’s special series on reinforcement learning. This series introduces some of the fundamental concepts in reinforcement learning through digestible examples, drawn primarily from the “Reinforcement Learning” text by Sutton et al. and the University of Alberta’s “Fundamentals of Reinforcement Learning” course. Note that code in this series will be kept to a minimum; readers interested in implementations are directed to the official course, or to our Github. The secondary purpose of this series is to reinforce (pun intended) my own learning in the field.

Introduction

Reinforcement learning has quickly captured the imagination of the general public, with organisations such as DeepMind achieving success in games such as Go, Starcraft, and Quake III, along with more practical achievements such as disease detection and self-mapping. These achievements have helped move the field closer to artificial general intelligence, the holy grail of AI research: emulating human-level competence across a broad range of tasks.

But what is reinforcement learning? Wikipedia defines it as “an area of machine learning concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward.” The field stands alongside supervised and unsupervised learning as a third category of machine learning.

The K-armed bandit problem is a classic introductory problem within the field. Given K systems (or “arms”), each with its own reward distribution that produces a numeric reward when invoked, how do we identify the most rewarding system?
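Following the notation of Sutton et al., the true value of an action a can be written as the expected reward received when that action is selected:

q_*(a) = \mathbb{E}\left[ R_t \mid A_t = a \right]

The bandit problem then amounts to estimating q_*(a) for each arm from the rewards we observe.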

A five-armed bandit posse?

Let’s tackle this problem in more detail.

Discussion

For the sake of argument, let’s assume our problem is stationary, meaning that the reward distributions do not change over time.

Let’s illustrate the K-armed bandit problem with a few examples.

Imagine a slot machine with a single arm and three rows of symbols; this is commonly known as a “one-armed bandit”.

With each pull, the symbols spin and arrange themselves into some combination, yielding a variable reward depending on the symbols shown. We can extrapolate this to K machines, each yielding a different reward distribution. With each observed reward, we learn more about the expected reward of an action. A simple definition of this expected reward is the average reward, averaged over the number of times that action has been taken.

We denote this estimate of the expected reward as Q(a), which here is simply the average of the rewards received from that action so far.

Average expected reward from an action a, from Sutton et al.
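Written out in LaTeX, this sample-average estimate (following Sutton et al., with 1 denoting the indicator function, equal to 1 when its condition holds and 0 otherwise) is:

Q_t(a) = \frac{\text{sum of rewards when } a \text{ taken prior to } t}{\text{number of times } a \text{ taken prior to } t} = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i = a}}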

Let’s assume that we start pulling the levers on our machines. As we collect more and more data, our estimate Q(a) of the true average reward becomes more accurate. We can rewrite the equation above in an incremental form to account for each new observation:
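A reconstruction of that incremental form, following Sutton et al. (here Q_n is the estimate after n - 1 rewards and R_n is the n-th reward received):

Q_{n+1} = Q_n + \frac{1}{n}\left( R_n - Q_n \right)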

The equation above can be viewed as:
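In the general form given by Sutton et al.:

\text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize}\left[ \text{Target} - \text{OldEstimate} \right]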

Here, the expression [Target − OldEstimate] is the error in the estimate, which is reduced by taking a step towards the target, the latter being presumed to indicate a desirable direction in which to move. By adjusting the step size, we can also adjust the relative contribution of older versus newer rewards, which is necessary when dealing with non-stationary reward distributions.

(While this update may look familiar from training fully connected neural networks, note that we are not aiming to converge to a single fixed target, as individual rewards may vary.)
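As a minimal sketch of this update rule (the function name update_estimate and the example rewards are our own, not from the course):

```python
def update_estimate(q_old, reward, step_size):
    """Incremental update: NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)."""
    return q_old + step_size * (reward - q_old)

# Sample-average step size (1/n) weights all rewards equally -- suited to stationary problems.
q, n = 0.0, 0
for reward in [1.0, 0.5, 2.0]:
    n += 1
    q = update_estimate(q, reward, 1.0 / n)

# A constant step size (e.g. 0.1) weights recent rewards more heavily,
# which is useful when the reward distributions drift over time (non-stationary).
q_const = 0.0
for reward in [1.0, 0.5, 2.0]:
    q_const = update_estimate(q_const, reward, 0.1)
```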

We’ve discussed the rewards from a single system, but what if we are dealing with 10 different systems, each with its own reward distribution?

10-armed testbed case: 10 actions with corresponding distributions (Sutton)

How do we decide which lever to pull? Broadly, there are two strategies:

  • Exploitation

Using our current knowledge of the Q(a) values, we aim to maximize our immediate reward by pulling the lever of the machine that has given us the best reward so far. A system adopting this approach is classified as greedy.

  • Exploration

We forgo choosing the lever with the highest estimated Q(a), instead choosing to learn more about the other machines, which may yield smaller (or larger) rewards. A system adopting this approach is classified as non-greedy.

Our objective is hence to identify the course of action that maximizes our reward. The dilemma lies in the fact that we cannot exploit and explore simultaneously. This trade-off can have serious consequences. Consider a medical trial with three medicines, all aiming to cure the same ailment.

Three different medicines, with three different outcomes. Source

A doctor then keeps track of the treatment outcome of each patient under each medicine. However, as the physiologies of different individuals vary, so too can the treatment outcomes. The trial’s objective is to find the best medicine, but how many patients must suffer on the poorer-performing alternatives before you are satisfied with your conclusions?

One way to strike a balance between the two is to incorporate a set probability, epsilon, of taking an explorative action on any given step. The figure below (from Sutton et al.) displays the average reward on a 10-armed bandit testbed for three separate epsilon values, over 1000 steps and averaged over 2000 runs.

Note how the inclusion of an explorative component allows the system to reach a higher average reward more quickly. Naturally, these patterns may vary depending on the reward distributions of your system.
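As a minimal sketch of an epsilon-greedy agent on a stationary 10-armed Gaussian testbed (the setup mirrors the spirit of Sutton et al.’s testbed, but the variable names and exact code are our own):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 10                         # number of arms
q_true = rng.normal(0, 1, k)   # true action values, drawn as in the 10-armed testbed
q_est = np.zeros(k)            # estimated action values Q(a)
counts = np.zeros(k)           # number of times each action has been taken
epsilon = 0.1                  # probability of taking an exploratory action
total_reward = 0.0

for step in range(1000):
    # Explore with probability epsilon, otherwise exploit the current best estimate.
    if rng.random() < epsilon:
        action = int(rng.integers(k))
    else:
        action = int(np.argmax(q_est))

    # Reward is drawn from the chosen arm's distribution (unit-variance Gaussian).
    reward = rng.normal(q_true[action], 1.0)
    total_reward += reward

    # Incremental sample-average update of the estimate.
    counts[action] += 1
    q_est[action] += (reward - q_est[action]) / counts[action]

print(f"Average reward over 1000 steps: {total_reward / 1000:.3f}")
```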

There are additional ways to encourage exploration, beyond the use of a set probability, that rely on the principle of optimism in the face of uncertainty.

  • Reward Inflation

We can artificially inflate (or, more formally, optimistically initialize) the estimated rewards of our actions before even executing them for the first time. Even if the initial real reward of an action is low, the resulting average estimate remains relatively high, encouraging exploration of actions that would otherwise be penalized by the inherent variation of the reward distributions.

  • Upper Confidence Bound (UCB) Selection

Here, instead of using the average value alone, we select actions according to their maximum plausible value, given by an upper confidence bound on the estimate. This can be calculated as follows:
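In LaTeX form, the upper-confidence-bound selection rule from Sutton et al. is (c controls the degree of exploration, and N_t(a) is the number of times action a has been selected prior to time t):

A_t = \arg\max_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]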

In essence, we calculate the maximum plausible reward value of an action, and use that as the basis of our choice.
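A minimal sketch of this selection rule in Python (reusing the q_est and counts arrays from the epsilon-greedy sketch above; the exploration constant c and the handling of untried arms are our own assumptions):

```python
import numpy as np

def ucb_action(q_est, counts, t, c=2.0):
    """Select the action with the highest upper confidence bound.

    Untried actions (counts == 0) are treated as maximally uncertain and
    selected first, mirroring optimism in the face of uncertainty.
    """
    counts = np.asarray(counts, dtype=float)
    untried = np.flatnonzero(counts == 0)
    if untried.size > 0:
        return int(untried[0])
    bonus = c * np.sqrt(np.log(t) / counts)
    return int(np.argmax(np.asarray(q_est) + bonus))

# Example usage with three arms after a few pulls (hypothetical numbers):
print(ucb_action(q_est=[0.4, 0.9, 0.1], counts=[3, 10, 2], t=15))
```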

That wraps up this introduction to the K-armed bandit problem. For a detailed implementation in Python, we recommend that the reader consult the official course, or alternatively our attempt at the course’s tutorial, available on Github. Note that, due to Coursera’s Honor Code, we recommend you complete the tutorial yourself before consulting our attempt.

We hope you enjoyed this article. To stay up to date with the latest updates to GradientCrescent, please consider following the publication.

References

Sutton et al., Reinforcement Learning: An Introduction

White et al., Fundamentals of Reinforcement Learning, University of Alberta
