My Journey Into Reinforcement Learning (Part 3) — Dynamic Programming
Welcome to another chapter in my quest to understand reinforcement learning. Continuing from Part 2, we'll be looking at dynamic programming. As usual, I'll pop the resources that guided me at the bottom of this post.
Dynamic programming refers to a collection of algorithms for solving Markov decision processes. They sweep over every possible state to compute optimal policies, which makes them computationally expensive, and they assume a perfect model of the environment. There are more efficient ways of solving MDPs, but understanding the fundamentals of dynamic programming is still important.
To assist us in this process, let’s get familiar with a simple land called gridworld.
In gridworld, there are 14 non-terminal states, and the grayed-out boxes mark the goals, or terminal states. There are four possible actions: up, down, left, and right. If an action would move the agent off the grid, the agent stays where it is but still receives the reward for the action. The reward is -1 for every transition until a terminal state is reached, so the expected reward function is r(s, a, s') = -1 for all states s, s' and actions a.
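To make the setup concrete, here's a minimal sketch of gridworld's dynamics in Python. The 4x4 layout, state numbering, and the `step` function are my own illustration of the rules described above, not code from any particular library:

```python
# A minimal sketch of the 4x4 gridworld.
# States are numbered 0..15; 0 and 15 are the grayed-out terminal states,
# leaving 14 non-terminal states. Every transition earns a reward of -1.

GRID_SIZE = 4
TERMINAL_STATES = {0, 15}
ACTIONS = {
    "up": -GRID_SIZE,   # move one row up
    "down": GRID_SIZE,  # move one row down
    "left": -1,         # move one column left
    "right": 1,         # move one column right
}

def step(state, action):
    """Return (next_state, reward) for taking `action` in `state`."""
    if state in TERMINAL_STATES:
        return state, 0  # the episode has ended; no further reward

    row, col = divmod(state, GRID_SIZE)
    next_state = state + ACTIONS[action]

    # If the move would take the agent off the grid, it stays put
    # but still receives the -1 reward for the transition.
    off_grid = (
        (action == "up" and row == 0)
        or (action == "down" and row == GRID_SIZE - 1)
        or (action == "left" and col == 0)
        or (action == "right" and col == GRID_SIZE - 1)
    )
    if off_grid:
        next_state = state

    return next_state, -1

# Example: from state 1 (top row), moving up keeps the agent in place,
# while moving left reaches terminal state 0 -- both transitions cost -1.
print(step(1, "up"))    # (1, -1)
print(step(1, "left"))  # (0, -1)
```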