Policy Iteration — Easy Example
Policy iteration is an algorithm for finding the optimal policy of a Markov decision process with given states and actions.
Let us assume we have a policy (𝝅 : S → A) that assigns an action to each state. Action 𝝅(s) is chosen each time the system is in state s.
The idea of policy iteration
1. Evaluate a given policy (e.g. initialise the policy arbitrarily for all states s ∊ S) by calculating the value function for all states s ∊ S under that policy.
Value function = the expected reward collected at the first step + the expected discounted value at the next state
2. Improve the policy: for each state s ∊ S, find an action with a higher expected value under the current value function.
3. Repeat steps 1 and 2 until the value function converges to the optimal value function.
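The three steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the grid problem from this post: the two-state MDP (states `A` and `B`, actions `stay` and `move`), its rewards, and the names `policy_iteration`, `P` and `R` are all made up for the example.

```python
def policy_iteration(states, actions, P, R, gamma=0.9, tol=1e-8):
    """P[s][a] is a dict {next_state: probability}; R[s][a] is the expected immediate reward."""

    def q_value(s, a, V):
        # Expected first-step reward + expected discounted value of the next state
        return R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())

    policy = {s: actions[0] for s in states}  # arbitrary initial policy
    V = {s: 0.0 for s in states}
    while True:
        # Step 1: evaluate the current policy (sweep until the values stop changing)
        while True:
            delta = 0.0
            for s in states:
                v = q_value(s, policy[s], V)
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < tol:
                break
        # Step 2: improve the policy where a strictly better action exists
        stable = True
        for s in states:
            best = max(actions, key=lambda a: q_value(s, a, V))
            if q_value(s, best, V) > q_value(s, policy[s], V) + 1e-12:
                policy[s] = best
                stable = False
        # Step 3: stop once no state changed its action
        if stable:
            return policy, V


# Hypothetical toy MDP: staying in B pays a reward of 1, everything else pays 0.
states = ["A", "B"]
actions = ["stay", "move"]
P = {"A": {"stay": {"A": 1.0}, "move": {"B": 1.0}},
     "B": {"stay": {"B": 1.0}, "move": {"A": 1.0}}}
R = {"A": {"stay": 0.0, "move": 0.0},
     "B": {"stay": 1.0, "move": 0.0}}

policy, V = policy_iteration(states, actions, P, R)
print(policy, V)
```

In this toy problem the loop ends with the policy "move" in `A` and "stay" in `B`. The evaluation step here uses repeated sweeps rather than an exact solve, which keeps the sketch short.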
Policy evaluation example
Find the optimal policy for a planning problem (4x4 grid)
Three states s(x,y): s(2,2), s(2,3), s(3,2)
Four actions 𝝅(s): go up, go down, go left, go right
For the action 𝝅(s) chosen by the policy, the probability that this action is actually executed is 0.70; each of the other three actions is executed with probability 0.10.
If the agent is at the goal s(3,2), it stops with probability 1.
Let the discount factor (𝛄) equal 0.9.
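As a quick check of this action model, the probabilities can be written out in Python. The names `ACTIONS` and `action_distribution` are only illustrative; the 0.70 and 0.10 values are the ones given above.

```python
ACTIONS = ["up", "down", "left", "right"]


def action_distribution(intended, p_intended=0.70, p_other=0.10):
    """Probability that each action is actually executed, given the action chosen by the policy."""
    return {a: (p_intended if a == intended else p_other) for a in ACTIONS}


# Under the policy "always go right", the agent goes right with 0.70
# and slips to each of the other three directions with 0.10.
dist = action_distribution("right")
print(dist)
print(sum(dist.values()))  # 0.70 + 3 * 0.10, i.e. 1 up to floating-point rounding
```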
Step 1: Evaluate a given policy
Start with a simple policy 𝝅: always go right
Probability of actions for the given policy
Calculate the value function for the simple policy 𝝅
Solving, we get:
V(3,2) = 10
V(2,2) = 9
V(2,3) = 4.265
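Values like these come from solving one linear equation per state, because for a fixed policy the value function appears linearly on both sides of its definition. In general form (the vector and matrix notation is an assumption added here: R^𝝅 is the vector of expected first-step rewards and P^𝝅 the state transition matrix under 𝝅):

```latex
V^{\pi}(s) = R^{\pi}(s) + \gamma \sum_{s'} P^{\pi}(s' \mid s)\, V^{\pi}(s')
\quad\Longrightarrow\quad
V^{\pi} = \left(I - \gamma P^{\pi}\right)^{-1} R^{\pi}
```

With three states this is a 3x3 system, small enough to solve by hand or by substitution.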
Step 2: Improve policy
A simple policy :
Update it :
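The update for a single state can be sketched as a one-step lookahead: compute the expected value of every action under the current value function and keep the best one. The state names (`x`, `L`, `R`), the transition probabilities and the values below are hypothetical numbers chosen for illustration, not the grid from this post.

```python
def improve_state(s, actions, P, R, V, gamma=0.9):
    """Return the best action in state s under value function V, plus the value of every action."""
    q = {a: R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
         for a in actions}
    return max(q, key=q.get), q


# Hypothetical state "x" with two neighbours "L" and "R"; moves succeed with
# probability 0.7 and end up on the other side with probability 0.3.
V = {"L": 4.0, "R": 10.0}
P = {"x": {"left": {"L": 0.7, "R": 0.3},
           "right": {"R": 0.7, "L": 0.3}}}
R = {"x": {"left": 0.0, "right": 0.0}}

best, q = improve_state("x", ["left", "right"], P, R, V)
print(best, q)
```

Here going right has the higher value (about 7.38 against 5.22 for going left), so the improved policy picks "right" in state `x`.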
Step 3: Repeat until convergence
Repeat steps 1 and 2 until nothing changes, i.e. the policy (and therefore its value function) is the same from one iteration to the next.
I hope this post has been useful. If I missed anything, please let me know.
Ref: Principles of Autonomy and Decision Making, Emilio Frazzoli 2010
Love you Dad and Mom