Policy Iteration — Easy Example

Pe Supavish
Feb 4, 2019


Policy iteration is a method for finding the optimal policy for a given set of states and actions.

Let us assume we have a policy (𝝅 : S → A) that assigns an action to each state. Action 𝝅(s) is chosen every time the system is in state s.

The idea of policy iteration

1. Evaluate a given policy (e.g. a policy initialised arbitrarily for all states s ∊ S) by calculating the value function for all states s ∊ S under that policy.

[Equation image, source: Emilio Frazzoli, 2010]

Value function = the expected reward collected at the first step + the expected discounted value at the next state

2. Improve the policy: find a better action for each state s ∊ S.

[Equation image, source: Emilio Frazzoli, 2010]

3. Repeat steps 1 and 2 until the value function converges to the optimal value function.
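
For reference, the two steps above are usually written as follows. This is the standard textbook form, not a transcription of the lecture slides; the symbols T (transition probability), R (reward) and 𝝅' (improved policy) are notation assumed here, and 𝛄 is the discount factor.

```latex
% Step 1, policy evaluation: value of following policy \pi from state s
V^{\pi}(s) = \sum_{s'} T\bigl(s, \pi(s), s'\bigr)
             \Bigl[ R\bigl(s, \pi(s), s'\bigr) + \gamma \, V^{\pi}(s') \Bigr]

% Step 2, policy improvement: pick the action with the best one-step lookahead
\pi'(s) = \arg\max_{a \in A} \sum_{s'} T(s, a, s')
          \Bigl[ R(s, a, s') + \gamma \, V^{\pi}(s') \Bigr]
```

The first line matches the verbal definition in step 1: the bracket holds the reward collected at the first step plus the discounted value of the next state, and the sum takes the expectation over next states.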

Policy evaluation example

Find the optimal policy for a planning problem (4x4 grid)

Three states s(x,y): s(2,2), s(2,3), s(3,2)
Four actions 𝝅(s): go up, go down, go left, go right

For the action 𝝅(s) chosen by the policy, the probability that this action is actually carried out is 0.70. Each of the other three actions is carried out with probability 0.10.
If the agent is at the goal s(3,2), it stops with probability 1.
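
The noisy action model described above can be sketched in a few lines of Python. The function name and its parameters are made up for illustration; only the 0.70 / 0.10 split comes from the example.

```python
ACTIONS = ["up", "down", "left", "right"]


def action_distribution(intended, p_intended=0.70, p_other=0.10):
    """Return {action: probability} when the policy asks for `intended`.

    The intended action is executed with probability 0.70 and each of the
    three other actions with probability 0.10, as stated in the example.
    """
    if intended not in ACTIONS:
        raise ValueError(f"unknown action: {intended}")
    return {a: (p_intended if a == intended else p_other) for a in ACTIONS}


dist = action_distribution("right")
for action, prob in dist.items():
    print(f"{action:>5}: {prob:.2f}")
```

The four probabilities add up to 1 (0.70 + 3 × 0.10), so this is a valid distribution for every choice of intended action.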

Let the discount factor (𝛄) equal 0.9.

Step 1: Evaluate a given policy

Start with a simple policy 𝝅 : Always go right

[Figure: probability of each action under the given policy]

[Figure: value function equations for the simple policy 𝝅]

Solving, we get:

V(3,2) = 10
V(2,2) = 9
V(2,3) = 4.265
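
To show how such a system of value equations can be solved in code, here is a minimal sketch of iterative policy evaluation. The transition lists and rewards in `MODEL` are hypothetical (the rewards used in the example are not stated in the text), so the numbers it produces will not match the values above. Only 𝛄 = 0.9 and the absorbing goal are taken from the example.

```python
GAMMA = 0.9  # discount factor from the example

# HYPOTHETICAL model under one fixed policy:
# state -> list of (probability, next_state, reward)
MODEL = {
    "s22": [(0.7, "s32", 10.0), (0.3, "s22", 0.0)],
    "s23": [(0.7, "s23", 0.0), (0.3, "s22", 0.0)],
    "s32": [(1.0, "s32", 0.0)],  # goal: the agent stays put with probability 1
}


def evaluate_policy(model, gamma=GAMMA, tol=1e-10, max_sweeps=10_000):
    """Apply the Bellman expectation backup until the values stop changing."""
    values = {s: 0.0 for s in model}
    for _ in range(max_sweeps):
        delta = 0.0
        for s, outcomes in model.items():
            # expected first-step reward + expected discounted next value
            new_v = sum(p * (r + gamma * values[nxt]) for p, nxt, r in outcomes)
            delta = max(delta, abs(new_v - values[s]))
            values[s] = new_v
        if delta < tol:
            break
    return values


for state, value in evaluate_policy(MODEL).items():
    print(f"V({state}) = {value:.3f}")
```

Because the policy is fixed, the equations are linear in the values, so a direct linear solve would give the same result. The sweep is used here only because it needs no extra libraries.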

Step 2: Improve policy

[Figure: the simple policy]

[Figure: the updated policy]
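
A rough sketch of the improvement step: for each state, compare the one-step lookahead value of every available action and keep the best one. The values plugged in are the Step 1 results quoted above, but the transition lists, the zero step rewards and the "stay" action at the goal are assumptions made for illustration, so the resulting policy is not necessarily the one in the original figure.

```python
GAMMA = 0.9

# Values from Step 1 of the example
VALUES = {"s22": 9.0, "s23": 4.265, "s32": 10.0}

# HYPOTHETICAL model: MODEL[state][action] -> [(probability, next_state, reward)]
MODEL = {
    "s22": {
        "right": [(0.7, "s32", 0.0), (0.3, "s22", 0.0)],
        "up": [(0.7, "s23", 0.0), (0.3, "s22", 0.0)],
    },
    "s23": {
        "right": [(0.7, "s23", 0.0), (0.3, "s22", 0.0)],
        "down": [(0.7, "s22", 0.0), (0.3, "s23", 0.0)],
    },
    "s32": {"stay": [(1.0, "s32", 0.0)]},
}


def q_value(model, values, state, action, gamma=GAMMA):
    """One-step lookahead: expected reward plus discounted next-state value."""
    return sum(p * (r + gamma * values[nxt]) for p, nxt, r in model[state][action])


def improve_policy(model, values, gamma=GAMMA):
    """Greedy policy: in each state pick the action with the largest Q-value."""
    return {
        s: max(model[s], key=lambda a: q_value(model, values, s, a, gamma))
        for s in model
    }


for state, action in improve_policy(MODEL, VALUES).items():
    print(f"{state}: {action}")
```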

Step 3: Repeat until convergence

Repeat steps 1 and 2 until nothing changes any more, i.e. the improvement step returns the same policy it started with.
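
Putting the three steps together, the whole loop might look like the sketch below. As before, `MODEL` is a hypothetical toy model (its transitions, rewards and the "stay" action are not from the example). The stopping rule is the one described above: stop when the improved policy equals the current one.

```python
GAMMA = 0.9

# HYPOTHETICAL model: MODEL[state][action] -> [(probability, next_state, reward)]
MODEL = {
    "s22": {
        "right": [(0.7, "s32", 10.0), (0.3, "s22", 0.0)],
        "up": [(0.7, "s23", 0.0), (0.3, "s22", 0.0)],
    },
    "s23": {
        "right": [(0.7, "s23", 0.0), (0.3, "s22", 0.0)],
        "down": [(0.7, "s22", 0.0), (0.3, "s23", 0.0)],
    },
    "s32": {"stay": [(1.0, "s32", 0.0)]},
}


def evaluate(model, policy, gamma=GAMMA, tol=1e-10):
    """Step 1: compute the value of every state under the current policy."""
    values = {s: 0.0 for s in model}
    while True:
        delta = 0.0
        for s in model:
            outcomes = model[s][policy[s]]
            new_v = sum(p * (r + gamma * values[n]) for p, n, r in outcomes)
            delta = max(delta, abs(new_v - values[s]))
            values[s] = new_v
        if delta < tol:
            return values


def improve(model, values, gamma=GAMMA):
    """Step 2: choose the greedy action in every state."""
    def q(s, a):
        return sum(p * (r + gamma * values[n]) for p, n, r in model[s][a])
    return {s: max(model[s], key=lambda a: q(s, a)) for s in model}


def policy_iteration(model, policy):
    """Step 3: alternate evaluation and improvement until the policy is stable."""
    while True:
        values = evaluate(model, policy)
        new_policy = improve(model, values)
        if new_policy == policy:
            return policy, values
        policy = new_policy


start = {"s22": "right", "s23": "right", "s32": "stay"}  # "always go right"
final_policy, final_values = policy_iteration(MODEL, start)
print(final_policy)
```

With a finite number of states and actions there are only finitely many policies, and each improvement step never makes the policy worse, so the loop terminates.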

I hope this post has been useful. If I missed anything, please let me know.

Ref: Principles of Autonomy and Decision Making, Emilio Frazzoli 2010

Love you Dad and Mom
