Markov Decision Processes & Model-Based Algorithms

Figure. 1

Markov Processes

As shown in Figure 1, a Markov Process (MP) is a sequence of states with the Markov property. The full definition:

  1. S is a finite set of states (s ∈ S)
  2. P is the dynamics model of the world, which specifies P(sₜ₊₁ = s′ | sₜ = s). If Xₜ is the state representation at time step t, then P is an N×N matrix, where N is the number of states.
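As a concrete sketch, here is a small Markov chain in NumPy. The three states and all transition probabilities below are invented purely for illustration; the only requirement is that each row of P sums to 1.

```python
import numpy as np

# Hypothetical 3-state chain; row i gives P(s_{t+1} = j | s_t = i).
P = np.array([
    [0.8, 0.15, 0.05],
    [0.3, 0.4,  0.3 ],
    [0.2, 0.3,  0.5 ],
])

rng = np.random.default_rng(0)

def sample_chain(P, start, steps, rng):
    """Roll out a trajectory by repeatedly sampling the next state from row P[s]."""
    states = [start]
    for _ in range(steps):
        states.append(int(rng.choice(len(P), p=P[states[-1]])))
    return states

trajectory = sample_chain(P, start=0, steps=10, rng=rng)
print(trajectory)
```

Because the next state depends only on the current state (the row we index into), the rollout has the Markov property by construction.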

Markov Reward Processes

In an MRP we make a small twist: we add a reward function to the Markov chain. We still have the dynamics model as before, now paired with a reward function.
Definition of an MRP:

  1. S is a finite set of states (s ∈ S)
  2. P is the dynamics model that specifies P(sₜ₊₁ = s′ | sₜ = s)
  3. R is the reward function, R(sₜ = s) = E[rₜ | sₜ = s]
  4. γ is the discount factor, γ ∈ [0, 1]
Figure. 2
  1. If the process is deterministic, then Gₜ and V are the same; if the process is stochastic, they will differ.
  2. If episode lengths are always finite, you can use γ = 1.

There are two ways to compute the value function V:

  1. Estimate it by simulation: generate a large number of episodes, compute the return of each, and average them to estimate the expected return.
  2. If the world is Markovian, we can decompose the value function into
    V(s) = R(s) + γ ∑ₛ′ P(s′|s) V(s′)
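Since this decomposition is linear in V, rearranging it gives (I − γP)V = R, which can be solved directly. A minimal NumPy sketch, with a toy 3-state MRP whose numbers are made up for illustration:

```python
import numpy as np

# Toy 3-state MRP (hypothetical transition probabilities and rewards).
P = np.array([
    [0.5, 0.5, 0.0],
    [0.1, 0.6, 0.3],
    [0.0, 0.0, 1.0],   # absorbing terminal state
])
R = np.array([1.0, 0.5, 0.0])   # R(s) = E[r_t | s_t = s]
gamma = 0.9

# V = R + gamma * P @ V  rearranges to  (I - gamma * P) V = R.
V = np.linalg.solve(np.eye(3) - gamma * P, R)

# Sanity check: V satisfies the Bellman equation.
assert np.allclose(V, R + gamma * P @ V)
print(V)
```

For γ < 1 the matrix (I − γP) is always invertible, so this closed-form solve works for any finite MRP; for large state spaces, iterative methods are used instead.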

Markov Decision Processes

MDPs were first described in the 1950s by Richard Bellman. They resemble Markov chains but with a twist: at each step, an agent can choose one of several possible actions, and the transition probabilities depend on the chosen action. Moreover, some state transitions return a reward (positive or negative), and the agent’s goal is to find a policy that maximizes reward over time.
In short, an MDP is just an MRP plus actions.
Definition of an MDP:

  1. S is a finite set of states (s ∈ S)
  2. A is a set of actions (a ∈ A)
  3. P is the dynamics model for each action, which specifies P(sₜ₊₁ = s′ | sₜ = s, aₜ = a)
  4. R is the reward function, R(sₜ = s, aₜ = a) = E[rₜ | sₜ = s, aₜ = a]
  5. γ is the discount factor, γ ∈ [0, 1]
Figure. 3
  1. Initialize V₀(s) = 0 for all s
  2. For k = 1, 2, … until convergence:
    for all s ∈ S:
      Vₖ(s) = R(s, π(s)) + γ ∑ₛ′ P(s′|s, π(s)) Vₖ₋₁(s′)
    if maxₛ |Vₖ(s) − Vₖ₋₁(s)| < ϵ, break
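The policy-evaluation loop above can be sketched in NumPy. The 2-state, 2-action MDP at the bottom is a made-up example just to exercise the function:

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, eps=1e-8):
    """Iterative policy evaluation for a deterministic policy pi.

    P[a] : (S, S) transition matrix under action a
    R    : (S, A) reward table, R[s, a] = E[r | s, a]
    pi   : length-S array, pi[s] = action taken in state s
    """
    n = R.shape[0]
    V = np.zeros(n)
    while True:
        # V_k(s) = R(s, pi(s)) + gamma * sum_s' P(s'|s, pi(s)) V_{k-1}(s')
        V_new = np.array([R[s, pi[s]] + gamma * P[pi[s]][s] @ V for s in range(n)])
        if np.max(np.abs(V_new - V)) < eps:
            return V_new
        V = V_new

# Toy MDP with 2 states and 2 actions (hypothetical numbers).
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # P[0]: transitions under action 0
    [[0.1, 0.9], [0.8, 0.2]],   # P[1]: transitions under action 1
])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])       # R[s, a]
pi = np.array([0, 1])            # action 0 in state 0, action 1 in state 1
V_pi = policy_evaluation(P, R, pi)
```

Since γ < 1, this iteration is a contraction and is guaranteed to converge to the value of π.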


We don’t want to just evaluate a policy; we want to learn a good one. That is the control problem: computing the optimal policy.

Value Iteration.

Let’s define the Bellman backup operator:

BV(s) = maxₐ [R(s, a) + γ ∑ₛ′ P(s′|s, a) V(s′)]
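Value iteration simply applies this backup operator to V until it stops changing, then reads off the greedy policy. A minimal NumPy sketch, reusing the same made-up toy MDP shape as before:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, eps=1e-8):
    """Repeatedly apply the Bellman backup B to V until convergence.

    P[a] : (S, S) transition matrix under action a
    R    : (S, A) reward table
    """
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = R(s, a) + gamma * sum_s' P(s'|s, a) V(s')
        Q = R + gamma * np.stack([P[a] @ V for a in range(n_actions)], axis=1)
        V_new = Q.max(axis=1)                 # BV(s) = max_a Q[s, a]
        if np.max(np.abs(V_new - V)) < eps:
            return V_new, Q.argmax(axis=1)    # V* and a greedy optimal policy
        V = V_new

# Toy 2-state, 2-action MDP (hypothetical numbers, for illustration only).
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.1, 0.9], [0.8, 0.2]],
])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
V_star, pi_star = value_iteration(P, R)
```

Because B is a γ-contraction in the max norm, the iterates converge to the unique fixed point V*, and the greedy policy with respect to V* is optimal.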


In this article we introduced Markov Decision Processes and model-based algorithms, which are used to find the best action to take in a given situation (state) when the dynamics model (the model of the world) is known. We introduced the Policy Iteration and Value Iteration algorithms, which, as we showed, converge to the optimal policy if run for enough iterations. In the next article we will introduce model-free algorithms, where the value function can be learned rather than requiring a model for planning.


[1] E. Brunskill. CS234: Reinforcement Learning, Winter 2019.


