Alpha Shuro
Aug 27, 2017 · 2 min read

Thank you; this was a fun and very interesting post to read, and it really opened my eyes to reinforcement learning. You have created a significant checkpoint in my understanding of this subject.

I do have some feedback on your chosen “n-armed bandit” problem:

The terms “bandit(s)” and “slot machine(s)” were loosely defined in the post, which caused a lot of confusion about what they meant and how the abstractions related to the variables used in the code. In the end I converted all my code and comments to use “slot machines”, and suddenly it all made sense: the array of values in the first variable (bandits in your example) is actually an array of slot machines that the agent must choose from; each “action” is the index of the slot machine chosen to play; the reward function (pullBandit in your code example) pulls the selected slot machine’s lever to see whether we win anything; and the reward is that slot machine’s “win” or “lose”. The final result the agent settles on is the slot machine with the highest chance of a win.

I think applying this change would make the visualization simpler and keep the abstractions to a minimum, giving the reader a much better chance of understanding the code quickly.
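For illustration, here is a minimal sketch of how the setup might read with that renaming applied. The values, the function name pull_lever, and the print statement are my own placeholders, not copied from the original post:

```python
import numpy as np

# One entry per slot machine; the lower the value, the more likely a pull is to win.
# (These values and names are placeholders, not taken from the original post.)
slot_machines = [0.2, 0.0, -0.2, -5.0]
num_machines = len(slot_machines)

def pull_lever(machine):
    """Pull the chosen slot machine's lever: +1 for a win, -1 for a loss."""
    result = np.random.randn()
    return 1 if result > machine else -1

# An "action" is just the index of the slot machine the agent decides to play.
action = np.random.randint(num_machines)
reward = pull_lever(slot_machines[action])
print(f"played machine {action}, reward {reward}")
```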

Also, after experimenting with the starting values of the weights, I finally realized why you chose tf.ones: an array of random values would fail to accurately reflect the accumulating scores of the machines over such a small number of episodes given the tiny learning rate, and tf.zeros would make the loss blow up, since tf.log(0) is -inf and the negated loss therefore becomes inf. Most mathematicians would probably already know this, but since the post’s target audience seems to be beginners, I think it would be useful to include this information in the post.
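A small NumPy sketch of that effect, assuming the post’s loss has the form -log(chosen weight) * reward (my paraphrase of the TensorFlow code, not a copy of it):

```python
import numpy as np

def policy_loss(chosen_weight, reward):
    """Loss for the chosen machine, paraphrasing the post's -log(weight) * reward form."""
    return -np.log(chosen_weight) * reward

print(policy_loss(1.0, 1))  # tf.ones-style start: log(1) = 0, so the loss begins at a finite 0
print(policy_loss(0.0, 1))  # tf.zeros-style start: log(0) = -inf, so the loss is immediately inf
```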

