[ Archived Post ] Random Notes for MDP and State Value Function
Please note that this post is for my own educational purposes.
Interesting viewpoint → any RL task that has those components is an MDP → yeah, I guess so → and the solution to that MDP can be found with either value iteration or policy iteration.
POMDP → in this case we also need the history. Also nice to note that being partially observable is not the same thing as having some randomness.
Good summary of the Markov property. And second-order Markov just extends the above dependency one step further back, to time t-2.
And when things are Markovian → the transition only depends on the current state → not all of the history.
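To make the first-order vs. second-order distinction concrete, here is a small sketch in LaTeX. The symbol x_t for the state at time t is my own notation, not something taken from the referenced tutorial, and actions are left out to keep it short.

```latex
% First-order Markov: the next state depends only on the current state.
P(x_t \mid x_{t-1}, x_{t-2}, \dots, x_1) = P(x_t \mid x_{t-1})

% Second-order Markov: the dependency reaches one step further back.
P(x_t \mid x_{t-1}, x_{t-2}, \dots, x_1) = P(x_t \mid x_{t-1}, x_{t-2})
```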
Again policy is the solution to the given MDP.
Real world example of value iteration → the values will get updated for state X C first then A.
How many states we have as well as how many actions we have for each state → we have 16 states and 4 actions → so in combination, we have 16 × 4 = 64 state-action choices.
But since for each action → there is some randomness → a probability of going somewhere else → the total number of (state, action, next state) outcomes is larger than the number of state-action pairs.
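A rough sketch of that counting in Python, assuming a 4x4 grid in the style of FrozenLake where the agent moves in the intended direction or slips to one of the two perpendicular directions, each with probability 1/3. The slip model, the action encoding and all names here are assumptions for illustration, and holes and the goal are ignored since only the counts matter.

```python
from collections import defaultdict

N_ROWS, N_COLS = 4, 4
# Assumed action encoding: 0 = left, 1 = down, 2 = right, 3 = up.
ACTIONS = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}


def move(state, action):
    """Deterministic move on the grid; walking into a wall keeps the agent in place."""
    row, col = divmod(state, N_COLS)
    d_row, d_col = ACTIONS[action]
    row = min(max(row + d_row, 0), N_ROWS - 1)
    col = min(max(col + d_col, 0), N_COLS - 1)
    return row * N_COLS + col


def build_transitions():
    """Map each (state, action) pair to a dict of next_state -> probability."""
    transitions = {}
    for state in range(N_ROWS * N_COLS):
        for action in ACTIONS:
            outcomes = defaultdict(float)
            # Intended direction plus the two perpendicular ones, 1/3 each.
            for actual in ((action - 1) % 4, action, (action + 1) % 4):
                outcomes[move(state, actual)] += 1 / 3
            transitions[(state, action)] = dict(outcomes)
    return transitions


transitions = build_transitions()
n_pairs = len(transitions)  # 16 states x 4 actions
n_outcomes = sum(len(outcomes) for outcomes in transitions.values())
print(n_pairs, n_outcomes)
```

The number of state-action pairs stays at 64, while the number of possible outcomes grows because each pair can land in more than one next state.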
For every state → every action → use the transition probability as well as the discount factor and reward to calculate the value of being in that state.
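Written out, the update described above looks roughly like the standard value iteration backup. The symbols (P for the transition probability, R for the reward, γ for the discount factor, V_k for the value estimate at iteration k) are my own choice of notation rather than the tutorial's.

```latex
V_{k+1}(s) = \max_{a} \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V_k(s') \right]
```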
How the state value function changes over time → after the first iteration only state 14 increases → this is because the only reward is for reaching the goal and there is no negative reward for the hole, so every other state stays at 0.
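A minimal sketch to check this behaviour, assuming the standard 4x4 FrozenLake layout, slippery moves (intended direction or one of the two perpendicular ones, 1/3 each), a reward of 1 only for entering the goal, and a discount factor of 0.9. The map string, the discount value and the function names are all assumptions for illustration, not the code from the referenced repository.

```python
# Assumed 4x4 map, row-major: S start, F frozen, H hole, G goal.
LAKE = "SFFFFHFHFFFHHFFG"
N = 4
MOVES = [(0, -1), (1, 0), (0, 1), (-1, 0)]  # left, down, right, up
GAMMA = 0.9  # assumed discount factor


def step(state, action):
    """Deterministic move; walking into a wall keeps the agent in place."""
    row, col = divmod(state, N)
    d_row, d_col = MOVES[action]
    row = min(max(row + d_row, 0), N - 1)
    col = min(max(col + d_col, 0), N - 1)
    return row * N + col


def sweep(values):
    """One synchronous value iteration sweep over all states."""
    new_values = list(values)
    for state, tile in enumerate(LAKE):
        if tile in "HG":
            continue  # holes and the goal are terminal, their value stays 0
        action_values = []
        for action in range(4):
            total = 0.0
            # Slippery dynamics: intended move or a perpendicular slip, 1/3 each.
            for actual in ((action - 1) % 4, action, (action + 1) % 4):
                nxt = step(state, actual)
                reward = 1.0 if LAKE[nxt] == "G" else 0.0  # no penalty for holes
                total += (reward + GAMMA * values[nxt]) / 3
            action_values.append(total)
        new_values[state] = max(action_values)
    return new_values


values = [0.0] * len(LAKE)
after_first = sweep(values)
after_second = sweep(after_first)

changed_first = [s for s, v in enumerate(after_first) if v > 0]
changed_second = [s for s, v in enumerate(after_second) if v > 0]
print(changed_first)  # [14]
print(changed_second)
```

State 14 is the only non-terminal state that can reach the goal in one step, so it is the only one that picks up value in the first sweep. In later sweeps the value spreads backwards to its neighbours.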
- Implement Reinforcement learning using Markov Decision Process [Tutorial]. (2018). Medium. Retrieved 8 January 2019, from https://medium.com/coinmonks/implement-reinforcement-learning-using-markov-decision-process-tutorial-272012fdae51
- hollygrimm/markov-decision-processes. (2019). GitHub. Retrieved 8 January 2019, from https://github.com/hollygrimm/markov-decision-processes