# Markov Decision Process

**Grid World**

A maze-like problem:

- The agent lives in a grid.
- Walls block the agent’s path.

Noisy movement: actions do not always go as planned.

- 80% of the time, the action North takes the agent North.
- 10% of the time, North takes the agent West; 10% of the time, North takes the agent East.
- If there is a wall in the direction the agent would have moved, the agent stays put.
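
Here is a minimal Python sketch of this transition noise. The grid layout, wall marker, and helper names are all my own inventions for illustration:

```python
import random

# Hypothetical grid; '#' marks a wall cell, '.' is open. States are (row, col).
GRID = [
    list("...."),
    list(".#.."),
    list("...."),
]

MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}
# Perpendicular "slip" directions for each intended action.
SLIPS = {"N": ("W", "E"), "S": ("E", "W"), "E": ("N", "S"), "W": ("S", "N")}

def step(state, action):
    """Sample a successor: 80% intended direction, 10% each perpendicular slip."""
    roll = random.random()
    if roll < 0.8:
        direction = action
    elif roll < 0.9:
        direction = SLIPS[action][0]
    else:
        direction = SLIPS[action][1]
    dr, dc = MOVES[direction]
    r, c = state[0] + dr, state[1] + dc
    # If the chosen direction runs off the grid or into a wall, the agent stays put.
    if not (0 <= r < len(GRID) and 0 <= c < len(GRID[0])) or GRID[r][c] == "#":
        return state
    return (r, c)
```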

The agent receives rewards each time step:

- Small ‘living’ reward each step (can be negative)
- Big rewards come at the end (good or bad)

Goal: Maximize sum of rewards

An MDP is defined by -

- A set of states s∈S
- A set of actions a∈A
- A transition function T(s,a,s’), also called the model or the dynamics: T(s,a,s’) = P(s’|s,a)
- The reward function R(s,a,s’)
- A start state
- Maybe a terminal state
- MDPs are non-deterministic search problems
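
These pieces fit in a few plain Python dicts. Below is a toy two-state example; the state names, actions, and numbers are all made up for illustration, and the same dicts are reused by the value iteration and policy iteration sketches further down:

```python
# States, and the actions available in each (the terminal state has none).
S = ["cool", "overheated"]
A = {"cool": ["slow", "fast"], "overheated": []}

# Transition function: T[s][a] maps each successor s' to P(s'|s,a); rows sum to 1.
T = {
    "cool": {
        "slow": {"cool": 1.0},
        "fast": {"cool": 0.5, "overheated": 0.5},
    },
}

# Reward function R(s,a,s'): going fast pays more but risks a big penalty.
R = {
    "cool": {
        "slow": {"cool": 1.0},
        "fast": {"cool": 2.0, "overheated": -10.0},
    },
}
```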

- “Markov” generally means that given the present state, the future and the past are independent.
- This is just like search, where the successor function could only depend on the current state (not the history)
- In the deterministic single-agent search problems, we wanted an optimal plan or sequence of actions, from start to goal.
- For MDPs we want an optimal policy Π* : S → A
- A policy Π gives an action for each state.
- An optimal policy is one that maximizes expected utility if followed.
- An explicit policy defines a reflex agent.

**Discounting:**

- It’s reasonable to maximize the sum of rewards.
- It’s also reasonable to prefer rewards now to rewards later.
- One solution: values of rewards decay exponentially.

How to discount?

Each time we descend a level, we multiply in the discount once.

Why discount?

Sooner rewards probably do have higher utility than later rewards. Discounting also helps our algorithms converge.

Example: discount of 0.5

U([1,2,3]) = 1*1 + 2*0.5 + 3*0.25 = 2.75
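
As a quick sanity check in Python (discounted_utility is just a throwaway helper name of mine):

```python
def discounted_utility(rewards, gamma):
    # Each step down the tree multiplies in the discount once more.
    return sum(r * gamma**t for t, r in enumerate(rewards))

print(discounted_utility([1, 2, 3], 0.5))  # 1*1 + 2*0.5 + 3*0.25 = 2.75
```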

What if the game lasts forever?

Solution:

- Finite horizon (similar to depth-limited search)
- Gives non-stationary policies (Π depends on the time left)
- Discounting: use 0 < γ < 1. Smaller γ means a smaller “horizon”.

Absorbing state: guarantee that for every policy, a terminal state will eventually be reached.

**How to solve MDPs?**

The value (utility) of a state s:

V*(s) = expected utility starting in s and acting optimally.

(tells us how good each state is)

Q*(s,a) = expected utility starting out having taken action ‘a’ from state ‘s’ and (thereafter) acting optimally.

(‘s’ is fixed and we vary ‘a’)

The optimal policy:

Π*(s) = optimal action from state s = argmax_a Q*(s,a)

(it’s the action that achieves the maximum)
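
In code, extracting the policy is just an argmax over a Q* table. The table and its numbers below are made up for illustration:

```python
# Made-up Q* values for a single state s0.
Q = {("s0", "N"): 0.72, ("s0", "E"): 0.68, ("s0", "S"): 0.41, ("s0", "W"): 0.55}

def policy(Q, s, actions):
    # The optimal action is whichever a maximizes Q*(s, a).
    return max(actions, key=lambda a: Q[(s, a)])

print(policy(Q, "s0", ["N", "E", "S", "W"]))  # -> "N"
```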

**Values of States:**

Fundamental operation: compute the (expectimax) value of a state.

- Expected utility under optimal action.
- Average sum of (discounted) rewards
- This is just what expectimax computed!

Recursive definition of value:

V*(s) = max_a Q*(s,a)

Q*(s,a) = Σ_s’ T(s,a,s’)[R(s,a,s’) + γV*(s’)] (Bellman’s equation)
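
Here is a small value iteration sketch that repeatedly applies this Bellman equation, written against the toy dicts (S, A, T, R) from the MDP definition above; the iteration count and γ = 0.9 are arbitrary choices of mine:

```python
def value_iteration(S, A, T, R, gamma=0.9, iters=100):
    V = {s: 0.0 for s in S}                              # start all values at zero
    for _ in range(iters):
        new_V = {}
        for s in S:
            # V*(s) = max over a of Q*(s,a); terminal states (no actions) stay at 0.
            new_V[s] = max(
                (sum(p * (R[s][a][s2] + gamma * V[s2])   # Bellman backup over s'
                     for s2, p in T[s][a].items())
                 for a in A[s]),
                default=0.0,
            )
        V = new_V
    return V

print(value_iteration(S, A, T, R))
```

On that toy MDP this converges to V(cool) ≈ 10 (always going slow earns 1 per step, discounted at 0.9) and V(overheated) = 0.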

**Quantities:**

Policy = map of states to actions.

Utility = sum of discounted rewards.

Values = expected future utility from a state (max node)

Q-values = expected future utility from a q-state (chance node)

Both value iteration and policy iteration compute the same thing (all optimal values)

In Value iteration -

- Every iteration updates both the values and (implicitly) the policy.
- We don’t track the policy, but taking the max over actions implicitly recomputes it.

In policy iteration -

- We do several passes that update utilities with the policy fixed (each pass is fast because we consider only one action, not all of them)
- After the policy is evaluated, a new policy is chosen (slow like a value iteration pass)
- The new policy will be better (or we’re done)
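
And a matching policy iteration sketch over the same toy dicts; again, the function names, γ, and the number of evaluation passes are my own choices:

```python
def policy_iteration(S, A, T, R, gamma=0.9, eval_passes=50):
    # Start from an arbitrary policy; terminal states get no action.
    pi = {s: (A[s][0] if A[s] else None) for s in S}
    while True:
        # Policy evaluation: fast passes, since each considers only pi(s).
        V = {s: 0.0 for s in S}
        for _ in range(eval_passes):
            V = {s: (sum(p * (R[s][pi[s]][s2] + gamma * V[s2])
                         for s2, p in T[s][pi[s]].items())
                     if pi[s] is not None else 0.0)
                 for s in S}
        # Policy improvement: one slow, value-iteration-like pass over all actions.
        def q(s, a):
            return sum(p * (R[s][a][s2] + gamma * V[s2]) for s2, p in T[s][a].items())
        new_pi = {s: (max(A[s], key=lambda a: q(s, a)) if A[s] else None) for s in S}
        if new_pi == pi:             # policy is stable: we're done
            return pi, V
        pi = new_pi

print(policy_iteration(S, A, T, R))  # -> ({'cool': 'slow', 'overheated': None}, ...)
```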

Alright, that’s it for now! Thanks for your time. Cheers!