Reinforcement Learning Chapter 4: Dynamic Programming (Part 4 — Asynchronous DP & Generalized Policy Iteration)
Chapter 4 Series:
- Part 1 — Policy Iteration
- Part 2 — Policy Iteration in Grid World
- Part 3 — Value Iteration
- Part 4 — Asynchronous DP & Generalized Policy Iteration
Code: https://github.com/nums11/rl
In the last few articles, we’ve learned about Dynamic Programming methods and seen how they can be applied to a simple RL environment. In this article, I’ll discuss one more modification to these algorithms and how DP plays a role in more advanced RL methods.
Summary
Thus far, we have learned 2 basic Dynamic Programming methods: Policy Iteration and Value Iteration.
Policy Iteration
- Starts with a state value function v and a policy π
- Iteratively evaluates v using the Bellman update for state value functions until v is an accurate representation of vπ
- Improves π by making it greedy with respect to v
- Repeats until convergence (a minimal sketch of this loop follows below)
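Here is a minimal sketch of that loop in Python. The tiny 2-state, 2-action MDP, the P[s][a] transition format ((probability, next_state, reward) tuples), and the function names are illustrative assumptions of mine, not code from the repo linked above.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP used only for illustration.
# P[s][a] is a list of (probability, next_state, reward) tuples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.7, 1, 1.0), (0.3, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
n_states, n_actions, gamma = 2, 2, 0.9

def policy_evaluation(policy, V, theta=1e-8):
    """Apply the Bellman update for v_pi until V stops changing."""
    while True:
        delta = 0.0
        for s in range(n_states):
            v_new = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

def policy_improvement(V):
    """Make the policy greedy with respect to V."""
    policy = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
             for a in range(n_actions)]
        policy[s] = int(np.argmax(q))
    return policy

def policy_iteration():
    V = np.zeros(n_states)
    policy = np.zeros(n_states, dtype=int)
    while True:
        V = policy_evaluation(policy, V)
        new_policy = policy_improvement(V)
        if np.array_equal(new_policy, policy):  # policy stable -> converged
            return new_policy, V
        policy = new_policy

print(policy_iteration())
```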
Value Iteration
- Starts with a state value function v
- Iteratively evaluates and improves v by making greedy expected updates: the value of a state is set to the expected return of the best available action
- Repeats until convergence. Once v has converged, the optimal policy is derived trivially by acting greedily with respect to v (see the sketch below)
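For comparison, here is the same hypothetical toy MDP solved with Value Iteration, again only a sketch under the same assumptions as above. The structural difference is that each sweep takes a max over actions directly, collapsing evaluation and improvement into a single update.

```python
import numpy as np

# Same hypothetical 2-state, 2-action MDP as in the Policy Iteration sketch.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.7, 1, 1.0), (0.3, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
n_states, n_actions, gamma = 2, 2, 0.9

def q_values(V, s):
    """Expected return of each action in state s under the current V."""
    return [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
            for a in range(n_actions)]

def value_iteration(theta=1e-8):
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):          # one sweep over the state set
            v_new = max(q_values(V, s))    # greedy expected update
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # Once V has converged, the optimal policy is just the greedy policy.
    policy = np.array([int(np.argmax(q_values(V, s))) for s in range(n_states)])
    return policy, V

print(value_iteration())
```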
Asynchronous Dynamic Programming
A major drawback of the DP algorithms we’ve seen so far is that they require systematic sweeps of the entire state set (every state’s value is updated once per sweep).
- In some environments (e.g. backgammon, with over 10²⁰ states), even a single sweep is prohibitively expensive
Asynchronous DP algorithms drop the systematic sweep: they update the values of states in any order whatsoever, using whatever values of other states happen to be available.
- Could update the value of only a single state on each step (in-place asynchronous value iteration)
- Could update some subset of the states on each step rather than sweeping all of them
- Could update the values of some states many times before the values of other states are updated even once
Avoiding systematic sweeps doesn’t mean there is less computation overall; it means the algorithm doesn’t need to get locked into a hopelessly long sweep before it can make progress on improving a policy. To converge correctly, an asynchronous algorithm must still keep updating the values of all states, but this flexibility can allow it to progress substantially faster and in a more focused manner in environments with large state spaces, as sketched below.
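As a rough illustration of one asynchronous variant (a sketch of in-place asynchronous value iteration on the same hypothetical toy MDP, not an algorithm from the linked repo), the loop below updates a single randomly chosen state per step instead of sweeping. Any selection rule that keeps visiting every state could be substituted, e.g. prioritizing the states an agent actually encounters.

```python
import numpy as np

# Same hypothetical toy MDP as in the sketches above.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.7, 1, 1.0), (0.3, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
n_states, n_actions, gamma = 2, 2, 0.9

def async_value_iteration(n_updates=10_000, seed=0):
    """Value iteration without systematic sweeps: update one state at a
    time, in place, using whatever values the other states currently have."""
    rng = np.random.default_rng(seed)
    V = np.zeros(n_states)
    for _ in range(n_updates):
        s = int(rng.integers(n_states))  # pick any state; the order is arbitrary
        V[s] = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                   for a in range(n_actions))
    return V

print(async_value_iteration())
```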
Generalized Policy Iteration
Generalized Policy Iteration (GPI) describes the general idea of letting policy evaluation and policy improvement interact, regardless of the granularity at which the two processes are interleaved.
- In regular Policy Iteration, there is a clear separation of these 2 processes
- In Value Iteration, only a single sweep of policy evaluation is performed between improvements
- In asynchronous DP methods, the two processes are interleaved at an even finer grain
Almost all RL methods can be described as GPI: they have identifiable policies and value functions, with the policy repeatedly improved with respect to the value function and the value function repeatedly driven toward the true value function of the policy.
- When the two processes stabilize, i.e. a policy is found that is greedy with respect to its own value function, the Bellman optimality equation holds and v* and q* have been found (see the sketch below)
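To make the granularity point concrete, here is a sketch (same hypothetical toy MDP; the eval_sweeps knob is my own illustrative parameter) that runs a fixed number of evaluation sweeps between improvements. A very large eval_sweeps behaves like Policy Iteration, eval_sweeps=1 behaves much like Value Iteration, and everything in between is still GPI.

```python
import numpy as np

# Same hypothetical toy MDP as in the earlier sketches.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.7, 1, 1.0), (0.3, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
n_states, n_actions, gamma = 2, 2, 0.9

def generalized_policy_iteration(eval_sweeps=3, iterations=200):
    """eval_sweeps controls the granularity of the evaluation/improvement
    interaction: large values approximate Policy Iteration, 1 is close to
    Value Iteration."""
    V = np.zeros(n_states)
    policy = np.zeros(n_states, dtype=int)
    for _ in range(iterations):
        # Evaluation: drive V toward v_pi for a few sweeps.
        for _ in range(eval_sweeps):
            for s in range(n_states):
                V[s] = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
        # Improvement: make the policy greedy with respect to V.
        for s in range(n_states):
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                 for a in range(n_actions)]
            policy[s] = int(np.argmax(q))
    return policy, V

print(generalized_policy_iteration())
```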
Two key properties of DP methods are that they require a complete model of the environment and that they bootstrap: they update their estimates based on other estimates. In the next chapter, we’ll take a look at methods that neither require a complete model of the environment nor bootstrap.