Why I call reinforcement learning the method of two tables

or reinforcement learning without math

Huynh Nguyen
Red Gold
3 min read · May 27, 2018


There are plenty of tutorials about reinforcement learning (RL), and most of them do a good job of introducing the concepts, the mathematical equations, or the importance of RL. Yet this series of articles aims to make RL concise, simple, and vivid, taking less than 3 minutes of a reader's time.

Let's take a chess game as an example. Given a chess board layout as the input to an RL model, the model represents each distinct layout as a unique State, and the number of states can run to infinity since there is a practically unlimited number of layouts. Each new move of a chess piece (a new Action) as the output results in a new game situation with a reward: capture no piece (zero points), capture a knight (say its value is 3 points), lose a Rook (a negative reward of -5 points), checkmate, or get checkmated. There are 6 different kinds of pieces in a chess game, so we can label each with whatever reward value we like, but winning the game is always the infinite reward, the Goal. Now, we keep each (state, action, reward) tuple in a table called the environment table.
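To make the environment table concrete, here is a minimal sketch in Python; the state encoding, the move notation, and the exact reward numbers are illustrative assumptions, not a real chess engine.

```python
# A minimal sketch of the "environment table": it simply records the
# reward observed for a (state, action) pair. The state strings and
# moves below are made-up placeholders, not real chess logic.

environment_table = {}  # maps (state, action) -> observed reward

def record_transition(state, action, reward):
    """Store the reward observed after taking `action` in `state`."""
    environment_table[(state, action)] = reward

# Illustrative entries, using the reward values from the text:
record_transition("start_position", "Nf3", 0)      # no capture: 0 points
record_transition("midgame_layout_7", "Nxd4", 3)   # we capture a knight: +3
record_transition("midgame_layout_9", "Qxb7", -5)  # a move that loses our Rook: -5
```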

The Reward is good, but the Goal is to win. We could move a Knight to take the opponent's Queen and still lose the game on the next move. Over a large number of trials, RL tries to predict the actual Value that each move, in each state, contributes to the final outcome of a game. We call the RL method for predicting values the Policy. This Value could be calculated by accumulating the Rewards received after each move; however, it is better to reduce each reward with a discount factor based on the number of steps, to avoid the case where the policy tries to play a never-ending game (there are other reasons as well). Now we have another table that keeps the (state, action, value) tuples, called the policy table.
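As a rough illustration, here is a short Python sketch of how a discounted Value could be computed from the rewards that follow a move, and of the policy table keeping (state, action, value); the 0.9 discount factor and the reward sequence are assumptions for illustration, not the only way to learn a policy.

```python
# A minimal sketch of the "policy table": it keeps a predicted Value for
# each (state, action), where the Value accumulates later rewards reduced
# by a per-step discount factor. The 0.9 discount is an assumed example.

DISCOUNT = 0.9

def discounted_value(rewards):
    """Sum the rewards received after a move, discounted by step count."""
    return sum((DISCOUNT ** step) * reward for step, reward in enumerate(rewards))

policy_table = {}  # maps (state, action) -> predicted Value

# Suppose one trial saw rewards 0, +3, -5 on the moves following this action:
policy_table[("midgame_layout_7", "Nxd4")] = discounted_value([0, 3, -5])
print(policy_table)  # value is roughly -1.35
```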

It is worth remembering that the RL policy chooses the action with the highest predicted value, not the reward it observes from the environment. If the model encounters a new state, the environment table keeps no record of any action in this state, but the policy table can still generate a predicted value for each action in that same state. Eventually, the predicted values become accurate, and then we have a good policy. We judge the quality of an RL algorithm by the number of games it takes to fill up a good policy table.
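Below is a small sketch of that greedy choice, assuming the policy table from the previous sketch and a hypothetical predict_value fallback for (state, action) pairs the table has never seen.

```python
# A minimal sketch of greedy action selection: pick the action with the
# highest predicted value, not the raw reward. `predict_value` is a
# hypothetical stand-in for however the policy estimates unseen pairs.

policy_table = {("midgame_layout_7", "Nxd4"): -1.35,
                ("midgame_layout_7", "O-O"): 0.4}

def predict_value(state, action):
    """Placeholder estimate for a (state, action) pair not in the table."""
    return 0.0

def choose_action(state, legal_actions):
    return max(legal_actions,
               key=lambda a: policy_table.get((state, a), predict_value(state, a)))

print(choose_action("midgame_layout_7", ["Nxd4", "O-O", "Qd2"]))  # -> "O-O"
```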

However, a chess game does not stop after a single action; we should also anticipate the opponent's action, which creates the next state. So the Value of the current action cannot be naively defined by the current state alone, but also by the next state. Let's discuss this in the next post.

Any contribution to this article is very welcome.
