If we know the chance of the opponent's next move, everything is simple

Huynh Nguyen · Published in Red Gold · 3 min read · Jun 3, 2018

Reinforcement learning without math (part 2)

It would be simple if the game ended after one move, but it does not. There are normally multiple potential next states, depending on the opponent's next move. Finding an accurate Value requires that all future states, and the chance of each state transition, be estimated correctly.

What if we have a small problem such as a 3x3 tic-tac-toe game, and we play it multiple times against the same person who follows the same strategy? Then we can estimate the chance of the opponent's next move, which also gives us the probability of each next state occurring. With that information, we can draw a graph representing the state transitions. It also allows us to count the possible next states of an action and estimate the probability of each one. Markov Decision Process (MDP) is the technical term for this setting.
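
To make that concrete, here is a minimal Python sketch (not from the original post) of how one could estimate those transition probabilities simply by counting what the same opponent does after each of our moves. The names record_transition and transition_probabilities are illustrative, not part of any library.

```python
from collections import defaultdict

# Count how often each (state, action) pair led to each next state
# while playing many games against the same opponent.
transition_counts = defaultdict(lambda: defaultdict(int))

def record_transition(state, action, next_state):
    """Record one observed transition (state, action) -> next_state."""
    transition_counts[(state, action)][next_state] += 1

def transition_probabilities(state, action):
    """Estimate P(next_state | state, action) from the recorded counts."""
    counts = transition_counts[(state, action)]
    total = sum(counts.values())
    if total == 0:
        return {}  # nothing observed yet for this (state, action)
    return {s: c / total for s, c in counts.items()}
```

After enough games against a fixed strategy, these estimated probabilities are exactly the transition graph described above.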

Representing a Markov Decision Process as an environment table

Under the MDP assumption about the environment, all possible future states of an action and the chance of each state transition are known and unchanged. That gives us an advantage: we can estimate a State Value table, where each State Value is the sum of the Rewards received after each state transition, weighted by the chance of that transition. Those State Values are then used to estimate each Action Value in the Policy table by summing the State Values of all future states, weighted by the chance of each state transition. The accuracy of an Action Value therefore relies on the correctness of the State Value table.
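
As a rough sketch of that sentence in code (again with made-up names, and an added discount factor gamma that the post does not mention), an Action Value is just the expected value over next states of the Reward plus the State Value:

```python
def action_value(state, action, state_values, transition_probs, rewards, gamma=0.9):
    """Estimate the value of taking `action` in `state`: sum over all possible
    next states of the transition probability times (reward received plus the
    discounted State Value of that next state)."""
    return sum(
        prob * (rewards.get(next_state, 0.0) + gamma * state_values.get(next_state, 0.0))
        for next_state, prob in transition_probs[(state, action)].items()
    )
```

Here transition_probs is the dictionary of estimated probabilities from the counting sketch above, and rewards maps each state to the Reward observed on entering it; both are assumptions made for illustration.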

Illustration of a one-step update of the Policy table using the State Value table

Since the initial values in the State Value table are expected to be wrong (we have not yet seen the Reward of any state), we need to update the State Value table with the received Rewards, which do not all arrive at the same time. That requires running an action multiple times to collect all the state Rewards, and it raises the question of when to update the Values: either after each action (value iteration) or after waiting until the values converge (policy iteration).
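
A rough sketch of that repeated updating, reusing the hypothetical action_value helper above, might look like the following. Each sweep replaces a state's value with the best Action Value available from that state, in the style of value iteration; after enough sweeps the values stop changing and the Policy table can simply pick the highest-valued action in each state. The actions(state) callable is another assumption for illustration.

```python
def update_state_values(states, actions, state_values, transition_probs, rewards,
                        gamma=0.9, sweeps=100):
    """Repeatedly refresh the State Value table until the values settle."""
    for _ in range(sweeps):
        for state in states:
            # Set the state's value to the best Action Value reachable from it.
            state_values[state] = max(
                action_value(state, a, state_values, transition_probs, rewards, gamma)
                for a in actions(state)
            )
    return state_values
```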

However, the MDP approach is not very practical in real life, due to the explosion of states and the difficulty of predicting the chance of each state transition. It is like trying to count all the possible outcomes of a move in a chess game with respect to the opponent's next action. The model-free approach is more practical as it requires fewer assumptions about the environment, and we will discuss a famous model-free approach in the next post.
