How greedy should your reinforcement learning algorithm be? A working example in Ruby

Miyama Yuta
Dec 27, 2016

This is part of a series of posts on reinforcement learning algorithms.
Part 1 is available here:

Introducing a K-armed bandit machine

So, once you know the basics of constructing a simple AI that learns from its actions, what’s next?

We should take a look at the different kinds of algorithms out there, so that we know where the room for improvement lies.

To keep things simple, let’s assume a basic environment: the “k-armed bandit problem”. Named after the casino slot machine known as a “one-armed bandit”, this problem is a good way to demonstrate the effectiveness of some reinforcement learning (RL) algorithms.

http://www.winneratslots.com/wp-content/uploads/2014/07/One-Armed-Bandit.jpg

The k-armed bandit assumes a machine with k arms (or buttons, or whatever accepts input from a player), where each arm corresponds to a different reward (the number of coins you get, if this were a real casino machine).

Because there are k options that each return a different reward, a player’s aim is to figure out the reward tendency of each arm. In other words, the more games you play, the better you should become at getting coins out of the machine.

Thus, this problem tells us how well an RL algorithm performs at maximizing rewards over many plays against a specific bandit machine.

By the way, this post follows the book “Reinforcement Learning: An Introduction” (Sutton & Barto). I’d recommend reading through it.

What kinds of RL algorithms are we testing here?

In the previous post, we discussed that an RL algorithm decides which action to take based on the “past action-reward mapping” it has learned.

When an algorithm strictly follows the best-known action-reward mapping, we call it a “pure greedy” algorithm: it never risks choosing the unknown (apparently less rewarding) actions, and always plays it safe.

Although that may sound smart enough, we should remember that the past reward mapping is merely a sample of probabilistic values. Most likely, always choosing the apparently higher-reward action prevents a player from experimenting with other actions that might give a higher value.

Experimenting with actions in this way, with uncertainty in mind, is called ‘exploration’. The algorithm explores other moves and learns from them, making its estimates more reliable.

Following the book’s terminology, we call this kind of algorithm, characterized by its exploratory moves, a ‘near-greedy’ algorithm.

A near-greedy algorithm sometimes takes an exploratory move (that is, it chooses an action regardless of which one has the highest estimated reward), and the RL trainer sets a parameter controlling how likely it is to enter this exploratory mode. If this parameter (let’s call it greediness) is 0.9, the algorithm makes an exploratory move 10% of the time.
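To make this concrete, here is a minimal sketch of that selection rule in Ruby (the names choose_arm, estimates, and greediness are illustrative, not taken from the actual script):

# Pick an arm given the current per-arm reward estimates and a greediness
# value following this post's convention (0.9 greediness = 10% exploration).
def choose_arm(estimates, greediness)
  if rand < greediness
    # Greedy move: the arm with the highest estimated reward so far.
    estimates.index(estimates.max)
  else
    # Exploratory move: any arm, uniformly at random.
    rand(estimates.size)
  end
end

A pure greedy player is simply the special case where greediness is 1.0.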

Let’s make an AI play bandit!

With these different strategies in mind, let’s build a testing environment. For now, we test three different algorithms: a pure greedy one, a near-greedy one with 0.9 greediness, and one with 0.99 greediness.

To prepare the bandit machines, we make the true action-reward mapping follow the normal (Gaussian) distribution. When a machine is first created, we set a “default reward value” for each arm; this value is a random number drawn from a normal distribution (mean 0, variance 1).

When a player plays a machine (pulls a lever), the reward is again drawn from a normal distribution, but this time the mean is that arm’s pre-set default reward value (the variance is still 1).
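Here is a minimal sketch of such a bandit machine, assuming a hypothetical Bandit class; Ruby’s standard library has no Gaussian sampler, so the sketch uses the Box-Muller transform:

class Bandit
  def initialize(arms)
    # Each arm's "true" mean reward is drawn once from N(0, 1).
    @true_rewards = Array.new(arms) { gaussian(0.0, 1.0) }
  end

  # Pulling an arm returns a noisy reward drawn from N(true mean, 1).
  def pull(arm)
    gaussian(@true_rewards[arm], 1.0)
  end

  private

  # Box-Muller transform: turn two uniform samples into a normal sample.
  def gaussian(mean, stddev)
    theta = 2 * Math::PI * rand
    rho   = Math.sqrt(-2 * Math.log(1 - rand))
    mean + stddev * rho * Math.cos(theta)
  end
end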

Show me the code

So here’s the code in Ruby; you can run it to observe the performance of each algorithm against the bandit environment described above.
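The core of it looks roughly like this (a self-contained sketch reusing the Bandit class and choose_arm helper from above; the arm count, number of pulls, and output format are illustrative, not the script’s exact settings):

ARMS  = 10
PULLS = 1_000

# greediness 1.0 is the pure greedy player.
[1.0, 0.99, 0.9].each do |greediness|
  bandit    = Bandit.new(ARMS)
  estimates = Array.new(ARMS, 0.0)
  counts    = Array.new(ARMS, 0)
  total     = 0.0

  PULLS.times do
    arm    = choose_arm(estimates, greediness)
    reward = bandit.pull(arm)

    # Incremental sample-average update of this arm's estimated reward.
    counts[arm]    += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]
    total          += reward
  end

  puts format("greediness %.2f -> average reward %.3f", greediness, total / PULLS)
end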

The output from the Ruby program indeed illustrates the characteristics of each algorithm. Please be aware that the script’s running time and the statistical quality of the results are a trade-off: more runs give smoother averages but take longer.

https://plot.ly/~kenzan100/10/

Observation

The pure greedy algorithm, because it never risks pulling the unknown levers, settles really fast, and it never achieves a reward value higher than x.

Looking at the near-greedy algorithms with different probabilities of exploratory moves, the one with 0.9 greediness (a 10% chance of an exploratory move) is the most likely to reach the biggest reward within 1,000 games against the same bandit environment.

With a greediness of 0.99 (a 1% chance of an exploratory move), the algorithm does not learn as fast as the one with 0.9, but it too keeps learning and gradually improves its average reward.

These observations suggest that greediness should not be a constant, but rather an elastic value that responds to the dynamics of the environment the algorithm is interacting with.
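As a hypothetical illustration (not something the script above does), greediness could start low and grow toward 1.0 as the player gathers more evidence:

# Start with 50% exploratory moves and decay exploration over time.
# The schedule and constants here are arbitrary, purely for illustration.
def greediness_at(step, start: 0.5, rate: 0.001)
  1.0 - (1.0 - start) * Math.exp(-rate * step)
end

greediness_at(0)      # => 0.5    (half the moves are exploratory at first)
greediness_at(5_000)  # => ~0.997 (almost purely greedy later on)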

Conclusion

In this post, we explored different parameter settings for simple RL algorithms: the value controlling how likely an algorithm is to ‘risk’ exploring other moves.
There are more sophisticated ways of setting this parameter, and we ourselves should explore more of the field.

See you next time, and Happy Holidays!
