Reinforcement Learning Chapter 4: Dynamic Programming (Part 2 — Policy Iteration in Grid World)
Chapter 4 Series:
- Part 1 — Policy Iteration
- Part 2 — Policy Iteration in Grid World
- Part 3 — Value Iteration
- Part 4 — Asynchronous DP & Generalized Policy Iteration
Code: https://github.com/nums11/rl
In the previous article, we learned about Dynamic Programming and the Policy Iteration algorithm. In this article, we'll walk through a Python implementation of the algorithm in a simple RL environment.
Grid World
The book (Sutton and Barto's *Reinforcement Learning: An Introduction*) defines a simple environment called Grid World.
- The agent has 4 possible actions: up, down, left, and right
- Actions that would take the agent off the grid leave it in the same state
- The agent moves until it reaches one of the goal (terminal) states in the top-left or bottom-right corner.
- There is a -1 reward for every step, which penalizes long trajectories and favors shorter paths to the goal.
Here is a Python implementation of Grid World with two small tweaks:
- The agent starts in the top-right corner
- There is a +5 reward for reaching a goal state
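The full code lives in the linked repo; as a minimal sketch, a version of this tweaked Grid World could look like the following (the class and method names here are my own, not necessarily the repo's):

```python
class GridWorld:
    """4x4 Grid World with terminal goal states in the top-left and
    bottom-right corners.

    Tweaks from the book's version: the agent starts in the top-right
    corner, and reaching a goal yields +5. Every other step costs -1,
    and moves that would leave the grid keep the agent in place.
    """
    ACTIONS = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}

    def __init__(self, size=4):
        self.size = size
        self.goals = {(0, 0), (size - 1, size - 1)}
        self.reset()

    def reset(self):
        self.state = (0, self.size - 1)  # start in the top-right corner
        return self.state

    def step(self, action):
        dr, dc = self.ACTIONS[action]
        r, c = self.state
        nr, nc = r + dr, c + dc
        # An off-grid move is a no-op: the agent stays where it is
        if not (0 <= nr < self.size and 0 <= nc < self.size):
            nr, nc = r, c
        self.state = (nr, nc)
        done = self.state in self.goals
        reward = 5 if done else -1
        return self.state, reward, done
```

The `reset`/`step` interface mirrors the familiar gym-style loop, which makes the environment easy to drive from any agent.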
Policy Iteration Agent
Here is a Python implementation of a Policy Iteration agent.
- The environment model is encoded in the "State" class, which knows the transitions out of each state
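The agent's actual code is in the repo; the sketch below folds the model into a `model()` method instead of a separate "State" class, but follows the same evaluate-then-improve loop from the previous article (γ = 0.9, the convergence threshold, and all names are my choices, not the repo's):

```python
class PolicyIterationAgent:
    """Policy iteration on the tweaked 4x4 Grid World (illustrative sketch).

    The environment model is encoded in `model()`: for each state-action
    pair it returns (next_state, reward), playing the role of the
    article's "State" class.
    """
    def __init__(self, size=4, gamma=0.9, theta=1e-6):
        self.size, self.gamma, self.theta = size, gamma, theta
        self.goals = {(0, 0), (size - 1, size - 1)}
        self.actions = {'up': (-1, 0), 'down': (1, 0),
                        'left': (0, -1), 'right': (0, 1)}
        states = [(r, c) for r in range(size) for c in range(size)]
        self.V = {s: 0.0 for s in states}                 # value estimates
        self.policy = {s: 'up' for s in states if s not in self.goals}

    def model(self, state, action):
        """Deterministic dynamics: (next_state, reward) for a state-action."""
        dr, dc = self.actions[action]
        nr, nc = state[0] + dr, state[1] + dc
        if not (0 <= nr < self.size and 0 <= nc < self.size):
            nr, nc = state                                # off-grid: stay put
        ns = (nr, nc)
        return ns, (5 if ns in self.goals else -1)

    def evaluate_policy(self):
        """Iterative policy evaluation: sweep until values stop changing."""
        while True:
            delta = 0.0
            for s in self.policy:
                ns, r = self.model(s, self.policy[s])
                v_new = r + self.gamma * self.V[ns]
                delta = max(delta, abs(v_new - self.V[s]))
                self.V[s] = v_new
            if delta < self.theta:
                return

    def improve_policy(self):
        """Greedy improvement w.r.t. V; returns True once the policy is stable."""
        stable = True
        for s in self.policy:
            returns = {}
            for a in self.actions:
                ns, r = self.model(s, a)
                returns[a] = r + self.gamma * self.V[ns]
            best = max(returns, key=returns.get)
            if returns[best] > returns[self.policy[s]]:   # strict improvement only
                self.policy[s] = best
                stable = False
        return stable

    def train(self):
        """Alternate evaluation and improvement until the policy is stable."""
        iterations = 0
        while True:
            self.evaluate_policy()
            iterations += 1
            if self.improve_policy():
                return iterations
```

Changing the policy only on a *strict* improvement avoids flip-flopping between equally-valued actions, so `train()` is guaranteed to terminate.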
Testing
The agent can be tested on the Grid World environment with the following code:
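That test harness isn't reproduced here; as a self-contained stand-in, a hypothetical rollout helper like the one below exercises the same environment dynamics against any policy function:

```python
def rollout(policy, size=4, max_steps=20):
    """Run `policy` (a function mapping a (row, col) state to one of
    'up'/'down'/'left'/'right') through the tweaked Grid World.
    Returns the visited states and the total (undiscounted) reward."""
    moves = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}
    goals = {(0, 0), (size - 1, size - 1)}
    state = (0, size - 1)                      # start in the top-right corner
    path, total = [state], 0
    for _ in range(max_steps):
        dr, dc = moves[policy(state)]
        nr, nc = state[0] + dr, state[1] + dc
        if 0 <= nr < size and 0 <= nc < size:  # off-grid moves are no-ops
            state = (nr, nc)
        path.append(state)
        if state in goals:
            total += 5                         # +5 for reaching a goal
            break
        total -= 1                             # -1 per non-terminal step
    return path, total
```

For example, the hand-written policy `lambda s: 'left'` walks straight from the start to the top-left goal: `rollout(lambda s: 'left')` visits four states and earns a total reward of -1 - 1 + 5 = 3.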
Results for the agent are shown below:
- You can see that after 5 iterations, the agent estimates the optimal value function and derives an optimal policy to the goal states.
In the next article, we'll learn about Value Iteration, an alternative Dynamic Programming algorithm that addresses some of the shortcomings of Policy Iteration.