Markov Decision Process

Grid World

A maze like problem:

  • The agent lives in a grid.
  • Walls block the agent’s path.

Noisy movement: actions do not always go as planned.

  • 80% of the time, the action North takes the agent North.
  • 10% of the time, North takes the agent West; 10% of the time, North takes the agent East.
  • If there is a wall in the direction the agent would have been taken, the agent stays put.

The agent receives rewards each time step:

  • Small ‘living’ reward each step (can be negative)
  • Big rewards come at the end (good or bad)

Goal: Maximize sum of rewards

An MDP is defined by -

  • A set of states s∈S
  • A set of actions a∈A
  • A transition function T(s,a,s’) . Also called the model or the dynamics. P(s’|s,a) = T(s,a,s’)
  • The reward function R(s,a,s’)
  • A start state
  • Maybe a terminal state
  • MDP’s are non-deterministic search problems

  • “Markov” generally means that given the present state, the future and the past are independent.
  • This is just like search, where the successor function could only depend on the current state (not the history)
  • In the deterministic single-agent search problems, we wanted an optimal plan or sequence of actions, from start to goal.
  • For MDPs we want an optimal policy Π* : S ⇒ A
  • A policy Π gives an action for each state.
  • An optimal policy is one that maxmizes expected utility if followed.
  • An explicit policy defines a reflex agent.


  • Its reasonable to maximize the sum of rewards.
  • Its also reasonable to prefer rewards now to rewards later
  • One solution: values of rewards decay exponentially.

How to discount?
 Each time we descend a level. we multiply in the discount once.

Why discount?
 Sooner rewards probably do have higher utility than later rewards. Also helps algorithm converge.

Example: discount of 0.5
 U([1,2,3]) = 1*1 +2*0.5 + 3*o.25

What if the game lasts forever?

  • Finite horizon: (similar to depth-limited search)
  • Gives non stationary policies (Π depends on time left)
  • Discounting: use 0<ϒ<1. Smaller ϒ means smaller “horizon”.

Absorbing state- guarantee that for every policy, a terminal state will eventually be reached.

How to solve MDPs?

The value (utility) of a state S:

V*(s) = expected utility starting in S and acting optimally.
 (tells us how good each state is)

Q*(s,a) = expected utility starting out having taken action ‘a’ from state ‘s’ and (thereafter) acting optimally.
 (‘s’ is fixed and we vary ‘a’)

The optimal policy:
 Π*(s) = optimal action from state s = argmax Q*(s,a)
 (its the action that achieves the maximum)

Values of States:

Fundamental operation : compute the (expectimax) value of the state.

  • Expected utility under optimal action.
  • Average sum of (discounted) rewards
  • This is just what expectimax computed!

Recursive definition of value:

V*(s) = max Q*(s,a)

Q*(s,a) = max Σ T(s,a,s’)[R(s,a,s’) + V*(s)] — Bellman’s equation


Policy = map of states to actions.
Utility = sum of discounted rewards.
Values = expected future utility from a state (max node)
Q-values = expected future utility from a q-state (chance node)

Both value iteration and policy iteration compute the same thing (all optimal values)

In Value iteration -

  • Every iteration updates both the values and (implicitly) the policy.
  • We dont track the policy, but taking the max over actions implicitly recompute it.

In policy iteration -

  • We do several passes that update utilities with fixed policy (each pass is fast because we consider only one action and not all of them)
  • After the policy is evaluated, a new policy is chosen (slow like a value iteration pass)
  • The new policy will be better (or we’re done)

Alright that’s it for now! Thank you for spending your time. Cheers!

One clap, two clap, three clap, forty?

By clapping more or less, you can signal to us which stories really stand out.