Reinforcement Learning Chapter 4: Dynamic Programming (Part 2 — Policy Iteration in Grid World)
Chapter 4 Series:
- Part 1 — Policy Iteration
- Part 2 — Policy Iteration in Grid World
- Part 3 — Value Iteration
- Part 4 — Asynchronous DP & Generalized Policy Iteration
Code: https://github.com/nums11/rl
In the previous article, we learned about Dynamic Programming and the Policy Iteration algorithm. In this article, we'll walk through a Python implementation of the algorithm in a simple RL environment.
Grid World
The book (Sutton and Barto's *Reinforcement Learning: An Introduction*) defines a simple environment called Grid World.
- The agent has 4 possible actions: up, down, left, and right
- Actions that would take the agent off the grid leave it in the same state
- The agent moves until it reaches one of the goal (terminal) states in the top-left or bottom-right corner.
- There is a -1 reward for every step, which penalizes long trajectories and favors shorter paths to the goal.
Here is a Python implementation of Grid World with two small tweaks:
- The agent starts in the top-right corner
- There is a +5 reward for reaching a goal state
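The full code lives in the linked repo; as a minimal sketch, a version of this tweaked Grid World could look like the following (the class and method names here are my own, not necessarily the repo's):

```python
class GridWorld:
    """4x4 Grid World with terminal goal states in the top-left and
    bottom-right corners.

    Tweaks from the book's version: the agent starts in the top-right
    corner, and reaching a goal yields +5. Every other step costs -1,
    and moves that would leave the grid keep the agent in place.
    """
    ACTIONS = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}

    def __init__(self, size=4):
        self.size = size
        self.goals = {(0, 0), (size - 1, size - 1)}
        self.reset()

    def reset(self):
        self.state = (0, self.size - 1)  # start in the top-right corner
        return self.state

    def step(self, action):
        dr, dc = self.ACTIONS[action]
        r, c = self.state
        nr, nc = r + dr, c + dc
        # An off-grid move is a no-op: the agent stays where it is
        if not (0 <= nr < self.size and 0 <= nc < self.size):
            nr, nc = r, c
        self.state = (nr, nc)
        done = self.state in self.goals
        reward = 5 if done else -1
        return self.state, reward, done
```

The `reset`/`step` interface mirrors the familiar gym-style loop, which makes the environment easy to drive from any agent.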
Policy Iteration Agent
Here is a Python implementation of a Policy Iteration agent.
- The environment model is encoded in the "State" class, which knows the transitions out of each state
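The agent's actual code is in the repo; the sketch below folds the model into a `model()` method instead of a separate "State" class, but follows the same evaluate-then-improve loop from the previous article (γ = 0.9, the convergence threshold, and all names are my choices, not the repo's):

```python
class PolicyIterationAgent:
    """Policy iteration on the tweaked 4x4 Grid World (illustrative sketch).

    The environment model is encoded in `model()`: for each state-action
    pair it returns (next_state, reward), playing the role of the
    article's "State" class.
    """
    def __init__(self, size=4, gamma=0.9, theta=1e-6):
        self.size, self.gamma, self.theta = size, gamma, theta
        self.goals = {(0, 0), (size - 1, size - 1)}
        self.actions = {'up': (-1, 0), 'down': (1, 0),
                        'left': (0, -1), 'right': (0, 1)}
        states = [(r, c) for r in range(size) for c in range(size)]
        self.V = {s: 0.0 for s in states}                 # value estimates
        self.policy = {s: 'up' for s in states if s not in self.goals}

    def model(self, state, action):
        """Deterministic dynamics: (next_state, reward) for a state-action."""
        dr, dc = self.actions[action]
        nr, nc = state[0] + dr, state[1] + dc
        if not (0 <= nr < self.size and 0 <= nc < self.size):
            nr, nc = state                                # off-grid: stay put
        ns = (nr, nc)
        return ns, (5 if ns in self.goals else -1)

    def evaluate_policy(self):
        """Iterative policy evaluation: sweep until values stop changing."""
        while True:
            delta = 0.0
            for s in self.policy:
                ns, r = self.model(s, self.policy[s])
                v_new = r + self.gamma * self.V[ns]
                delta = max(delta, abs(v_new - self.V[s]))
                self.V[s] = v_new
            if delta < self.theta:
                return

    def improve_policy(self):
        """Greedy improvement w.r.t. V; returns True once the policy is stable."""
        stable = True
        for s in self.policy:
            returns = {}
            for a in self.actions:
                ns, r = self.model(s, a)
                returns[a] = r + self.gamma * self.V[ns]
            best = max(returns, key=returns.get)
            if returns[best] > returns[self.policy[s]]:   # strict improvement only
                self.policy[s] = best
                stable = False
        return stable

    def train(self):
        """Alternate evaluation and improvement until the policy is stable."""
        iterations = 0
        while True:
            self.evaluate_policy()
            iterations += 1
            if self.improve_policy():
                return iterations
```

Changing the policy only on a *strict* improvement avoids flip-flopping between equally-valued actions, so `train()` is guaranteed to terminate.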
Testing
The agent can be tested on the Grid World environment with the following code:
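That test harness isn't reproduced here; as a self-contained stand-in, a hypothetical rollout helper like the one below exercises the same environment dynamics against any policy function:

```python
def rollout(policy, size=4, max_steps=20):
    """Run `policy` (a function mapping a (row, col) state to one of
    'up'/'down'/'left'/'right') through the tweaked Grid World.
    Returns the visited states and the total (undiscounted) reward."""
    moves = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}
    goals = {(0, 0), (size - 1, size - 1)}
    state = (0, size - 1)                      # start in the top-right corner
    path, total = [state], 0
    for _ in range(max_steps):
        dr, dc = moves[policy(state)]
        nr, nc = state[0] + dr, state[1] + dc
        if 0 <= nr < size and 0 <= nc < size:  # off-grid moves are no-ops
            state = (nr, nc)
        path.append(state)
        if state in goals:
            total += 5                         # +5 for reaching a goal
            break
        total -= 1                             # -1 per non-terminal step
    return path, total
```

For example, the hand-written policy `lambda s: 'left'` walks straight from the start to the top-left goal: `rollout(lambda s: 'left')` visits four states and earns a total reward of -1 - 1 + 5 = 3.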
Results for the agent are shown below:
- You can see that after 5 iterations, the agent estimates the optimal value function and derives an optimal policy to the goal states.
In the next article, we'll learn about Value Iteration, an alternative Dynamic Programming algorithm that addresses some of the shortcomings of Policy Iteration.