Markov Decision Process
A maze-like problem:
- The agent lives in a grid.
- Walls block the agent’s path.
Noisy movement: actions do not always go as planned.
- 80% of the time, the action North takes the agent North.
- 10% of the time, North takes the agent West; 10% of the time, North takes the agent East.
- If there is a wall in the direction the agent would have been taken, the agent stays put.
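The noisy dynamics above can be written as a transition function returning outcome probabilities. A minimal sketch (the grid size, coordinate scheme, and function names are made up; only the 80/10/10 split and the "stay put at walls" rule come from the notes):

```python
# Directions as (dx, dy) steps; "left of North" is West, etc.
NORTH, SOUTH, EAST, WEST = (0, 1), (0, -1), (1, 0), (-1, 0)
LEFT_OF = {NORTH: WEST, WEST: SOUTH, SOUTH: EAST, EAST: NORTH}
RIGHT_OF = {v: k for k, v in LEFT_OF.items()}

def transition(state, action, walls, width=4, height=3):
    """Return a list of (next_state, probability) pairs."""
    outcomes = {}
    for direction, prob in [(action, 0.8),          # intended direction
                            (LEFT_OF[action], 0.1),  # slip left
                            (RIGHT_OF[action], 0.1)]:  # slip right
        x, y = state[0] + direction[0], state[1] + direction[1]
        nxt = (x, y)
        # Blocked by a wall or the grid boundary: the agent stays put.
        if nxt in walls or not (0 <= x < width and 0 <= y < height):
            nxt = state
        outcomes[nxt] = outcomes.get(nxt, 0.0) + prob
    return sorted(outcomes.items())
```

From the corner (0, 0), going North succeeds with probability 0.8, slips East with 0.1, and slips West into the boundary (so the agent stays) with 0.1.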
The agent receives rewards each time step:
- Small ‘living’ reward each step (can be negative)
- Big rewards come at the end (good or bad)
Goal: Maximize sum of rewards
An MDP is defined by -
- A set of states s∈S
- A set of actions a∈A
- A transition function T(s,a,s′), also called the model or the dynamics: T(s,a,s′) = P(s′|s,a), the probability that action a taken in state s leads to s′.
- The reward function R(s,a,s’)
- A start state
- Maybe a terminal state
- MDPs are non-deterministic search problems.
- “Markov” generally means that given the present state, the future and the past are independent.
- This is just like search, where the successor function could only depend on the current state (not the history)
- In the deterministic single-agent search problems, we wanted an optimal plan or sequence of actions, from start to goal.
- For MDPs we want an optimal policy Π*: S → A
- A policy Π gives an action for each state.
- An optimal policy is one that maximizes expected utility if followed.
- An explicit policy defines a reflex agent.
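The components listed above fit in a plain container. A sketch using a made-up two-state example (the state names, rewards, and probabilities are purely illustrative):

```python
# A minimal MDP container: states, actions, transition function T,
# reward function R, a start state, and a terminal state.
mdp = {
    "states": ["cool", "overheated"],
    "actions": ["slow", "fast"],
    # T[(s, a)] -> list of (s', P(s'|s, a)) pairs
    "T": {
        ("cool", "slow"): [("cool", 1.0)],
        ("cool", "fast"): [("cool", 0.5), ("overheated", 0.5)],
    },
    # R[(s, a, s')] -> reward for that transition
    "R": {
        ("cool", "slow", "cool"): 1.0,
        ("cool", "fast", "cool"): 2.0,
        ("cool", "fast", "overheated"): -10.0,
    },
    "start": "cool",
    "terminal": {"overheated"},
}
```

A quick sanity check on such a model: for each (s, a), the outgoing probabilities must sum to 1.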
- It’s reasonable to maximize the sum of rewards.
- It’s also reasonable to prefer rewards now to rewards later.
- One solution: values of rewards decay exponentially.
How to discount?
Each time we descend a level, we multiply in the discount once.
Sooner rewards probably do have higher utility than later rewards. Discounting also helps the algorithms converge.
Example: discount of 0.5
U([1,2,3]) = 1*1 + 2*0.5 + 3*0.25 = 2.75
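A quick check of that arithmetic (the helper name is ours):

```python
def discounted_utility(rewards, gamma):
    """Sum of rewards, discounting by gamma once per time step."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))

print(discounted_utility([1, 2, 3], 0.5))  # -> 2.75
```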
What if the game lasts forever?
- Finite horizon: (similar to depth-limited search)
- Gives non-stationary policies (Π depends on the time left)
- Discounting: use 0 < γ < 1. A smaller γ means a smaller “horizon”.
Absorbing state: guarantee that for every policy, a terminal state will eventually be reached.
How to solve MDPs?
The value (utility) of a state s:
V*(s) = expected utility starting in s and acting optimally.
(tells us how good each state is)
Q*(s,a) = expected utility of starting in state ‘s’, taking action ‘a’, and (thereafter) acting optimally.
(‘s’ is fixed and we vary ‘a’)
The optimal policy:
Π*(s) = optimal action from state s = argmax_a Q*(s,a)
(it’s the action that achieves the maximum)
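Given a table of Q-values, extracting the policy is just that argmax per state. A sketch with made-up Q-values:

```python
def extract_policy(q_values):
    """q_values: dict mapping (state, action) -> Q*(s, a).
    Returns the policy pi*(s) = argmax over a of Q*(s, a)."""
    policy = {}
    for (s, a), q in q_values.items():
        if s not in policy or q > q_values[(s, policy[s])]:
            policy[s] = a
    return policy

q = {("s0", "north"): 0.5, ("s0", "east"): 0.7, ("s1", "north"): 0.1}
print(extract_policy(q))  # -> {'s0': 'east', 's1': 'north'}
```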
Values of States:
Fundamental operation : compute the (expectimax) value of the state.
- Expected utility under optimal action.
- Average sum of (discounted) rewards
- This is just what expectimax computed!
Recursive definition of value:
V*(s) = max_a Q*(s,a)
Q*(s,a) = Σ_s′ T(s,a,s′)[R(s,a,s′) + γV*(s′)] — Bellman’s equations
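In code, one Bellman backup looks like this. A sketch: the dict-based layout for T and R is an assumption, with T[(s,a)] a list of (s′, probability) pairs and R keyed by (s, a, s′):

```python
def q_value(s, a, T, R, V, gamma):
    """Q*(s,a) = sum over s' of T(s,a,s') * [R(s,a,s') + gamma * V*(s')]."""
    return sum(p * (R[(s, a, s2)] + gamma * V[s2])
               for s2, p in T[(s, a)])

def state_value(s, actions, T, R, V, gamma):
    """V*(s) = max over a of Q*(s,a)."""
    return max(q_value(s, a, T, R, V, gamma) for a in actions)
```

Note the expectation over s′ happens inside Q*, and the max over actions happens only in V*.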
Policy = map of states to actions.
Utility = sum of discounted rewards.
Values = expected future utility from a state (max node)
Q-values = expected future utility from a q-state (chance node)
Both value iteration and policy iteration compute the same thing (all optimal values)
In value iteration -
- Every iteration updates both the values and (implicitly) the policy.
- We don’t track the policy explicitly, but taking the max over actions implicitly recomputes it.
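Putting those two bullets together, a compact value-iteration sketch (the dict-based model format and the fixed sweep count are assumptions; terminal states are modeled as states with no entries in T):

```python
def value_iteration(states, T, R, gamma=0.9, iters=100):
    """T[(s, a)] -> list of (s', prob); R[(s, a, s')] -> reward."""
    # V starts at 0 everywhere; each sweep applies one Bellman backup per state.
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        new_V = {}
        for s in states:
            qs = [sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)])
                  for (s0, a) in T if s0 == s]
            # Terminal states have no actions in T and keep value 0.
            new_V[s] = max(qs) if qs else 0.0
        V = new_V
    return V
```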
In policy iteration -
- We do several passes that update utilities with a fixed policy (each pass is fast because we consider only one action per state, not all of them)
- After the policy is evaluated, a new policy is chosen (slow like a value iteration pass)
- The new policy will be better (or we’re done)
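The two alternating steps described above can be sketched in the same dict-based format (hypothetical layout; terminal states are states with no entries in T):

```python
def evaluate_policy(pi, states, T, R, gamma, iters):
    # Policy evaluation: the policy is fixed, so there is no max over actions.
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        new_V = {}
        for s in states:
            if s in pi:
                new_V[s] = sum(p * (R[(s, pi[s], s2)] + gamma * V[s2])
                               for s2, p in T[(s, pi[s])])
            else:
                new_V[s] = 0.0  # terminal state: no actions, value stays 0
        V = new_V
    return V

def policy_iteration(states, actions, T, R, gamma=0.9, eval_iters=50):
    # Start from an arbitrary policy: the first available action in each state.
    pi = {s: next(a for a in actions if (s, a) in T)
          for s in states if any((s, a) in T for a in actions)}
    while True:
        V = evaluate_policy(pi, states, T, R, gamma, eval_iters)
        # Policy improvement: one greedy max step (like a value-iteration pass).
        new_pi = {s: max((a for a in actions if (s, a) in T),
                         key=lambda a, s=s: sum(p * (R[(s, a, s2)] + gamma * V[s2])
                                                for s2, p in T[(s, a)]))
                  for s in pi}
        if new_pi == pi:
            return pi, V  # the policy stopped changing, so we're done
        pi = new_pi
```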
Alright that’s it for now! Thank you for spending your time. Cheers!