[ Archived Post ] Random Notes for MDP and State Value Function

Please note that this post is for my own educational purposes.

Interesting viewpoint → any RL task that has those components is an MDP → yeah, I guess so. And the solution to that MDP can be found with either value iteration or policy iteration.

POMDP → in this case we need the history. Also nice to take note that partially observable is not the same thing as having some randomness.

Good summary of the Markov property. And second-order Markov just extends the above dependency one more step back, to x at time t-2.

And when things are Markovian → the transition only depends on the current state → not all of the history.
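Writing the two cases out helps me keep them apart. The notation here is my own assumption (x_t for the state at time t), not taken from the original slides:

$$
\text{first order:}\quad P(x_t \mid x_{t-1}, x_{t-2}, \dots, x_0) = P(x_t \mid x_{t-1})
$$

$$
\text{second order:}\quad P(x_t \mid x_{t-1}, x_{t-2}, \dots, x_0) = P(x_t \mid x_{t-1}, x_{t-2})
$$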

Again, the policy is the solution to the given MDP.

Real world example of value iteration → the values will get updated for state X C first then A.

How many states we have, as well as how many actions we have for each state → we have 16 and 4 → so in combination we have 16 × 4 = 64 state-action pairs.

But for each action there is some randomness → a probability of ending up somewhere else → and this makes the total number of (state, action, next state) combinations larger.
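A quick sketch to count these for myself. Everything here is an assumption for illustration: a 4x4 grid with no holes, hypothetical helper names `move` and `slippery_outcomes`, and a slip model where the agent goes in the intended direction or one of the two perpendicular directions with probability 1/3 each.

```python
# Hypothetical 4x4 grid world: 16 states, 4 actions (all names are illustrative).
N_ROWS, N_COLS = 4, 4
ACTIONS = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}  # left, down, right, up


def move(state, action):
    """Deterministic move; bumping into a wall leaves the state unchanged."""
    row, col = divmod(state, N_COLS)
    d_row, d_col = ACTIONS[action]
    new_row = min(max(row + d_row, 0), N_ROWS - 1)
    new_col = min(max(col + d_col, 0), N_COLS - 1)
    return new_row * N_COLS + new_col


def slippery_outcomes(state, action):
    """Assumed slip model: intended or either perpendicular direction, 1/3 each."""
    candidates = [(action - 1) % 4, action, (action + 1) % 4]
    return [(1.0 / 3.0, move(state, a)) for a in candidates]


n_states = N_ROWS * N_COLS
n_actions = len(ACTIONS)

# Deterministic view: one entry per (state, action) pair.
state_action_pairs = n_states * n_actions

# Stochastic view: one entry per (state, action, possible outcome).
transition_entries = sum(
    len(slippery_outcomes(s, a))
    for s in range(n_states)
    for a in range(n_actions)
)

print(state_action_pairs)   # 64
print(transition_entries)   # 192
```

So under this assumed slip model the table the algorithm has to sweep over is three times larger than the plain state-action count.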

For every state → for every action → the transition probabilities, the reward, and the discount factor are used to calculate the state value for being in that state.
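As a formula, this is roughly the value iteration update. The symbols are my own assumption (P for the transition probability, R for the reward, γ for the discount factor, V_k for the value estimate after k sweeps):

$$
V_{k+1}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[\, R(s, a, s') + \gamma\, V_k(s') \,\bigr]
$$

The sum over s' is where the randomness of each action comes in, and the max over a picks the best action for that state.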

How the state value function changes over time → in the first iteration only state 14 increases → this is because there is no negative reward for the hole, so the only signal comes from the reward at the goal.
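To check the state 14 observation, here is a minimal value iteration sketch. It is not the code from the references. It assumes the standard 4x4 FrozenLake layout, a reward of 1 only for entering the goal, a discount factor of 0.9, and a 1/3 slip model. The helper names (`tile`, `move`, `transitions`, `sweep`) are hypothetical.

```python
# Assumed layout: 4x4 FrozenLake map (S start, F frozen, H hole, G goal).
LAKE = ["SFFF",
        "FHFH",
        "FFFH",
        "HFFG"]
N = 4
GAMMA = 0.9  # assumed discount factor
MOVES = [(0, -1), (1, 0), (0, 1), (-1, 0)]  # left, down, right, up


def tile(state):
    """Map character for a state index (row-major order)."""
    return LAKE[state // N][state % N]


def move(state, action):
    """Deterministic move; walls keep the agent in place."""
    row, col = divmod(state, N)
    d_row, d_col = MOVES[action]
    row = min(max(row + d_row, 0), N - 1)
    col = min(max(col + d_col, 0), N - 1)
    return row * N + col


def transitions(state, action):
    """(probability, next_state, reward) triples under the assumed slip model."""
    result = []
    for a in ((action - 1) % 4, action, (action + 1) % 4):
        nxt = move(state, a)
        reward = 1.0 if tile(nxt) == "G" else 0.0  # holes give 0, not a penalty
        result.append((1.0 / 3.0, nxt, reward))
    return result


def sweep(values):
    """One synchronous Bellman backup over all states."""
    new_values = [0.0] * (N * N)
    for s in range(N * N):
        if tile(s) in "HG":  # terminal states keep value 0
            continue
        new_values[s] = max(
            sum(p * (r + GAMMA * values[nxt]) for p, nxt, r in transitions(s, a))
            for a in range(4)
        )
    return new_values


first_values = sweep([0.0] * (N * N))
first_nonzero = [s for s, v in enumerate(first_values) if v > 0]
print(first_nonzero)   # [14]

second_values = sweep(first_values)
second_nonzero = [s for s, v in enumerate(second_values) if v > 0]
print(second_nonzero)  # [10, 13, 14]
```

After one sweep only state 14 is non-zero, since it is the only non-terminal state that can reach the goal in one step. On the second sweep the value spreads to its neighbours 10 and 13, which is the "changes over time" picture.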


  1. Implement Reinforcement learning using Markov Decision Process [Tutorial]. (2018). Medium. Retrieved 8 January 2019, from https://medium.com/coinmonks/implement-reinforcement-learning-using-markov-decision-process-tutorial-272012fdae51
  2. hollygrimm/markov-decision-processes. (2019). GitHub. Retrieved 8 January 2019, from https://github.com/hollygrimm/markov-decision-processes