Reinforcement Learning Chapter 4: Dynamic Programming (Part 1 — Policy Iteration)

Numfor Tiapo
6 min read · Mar 4, 2023


Code: https://github.com/nums11/rl

In the previous article, we defined the full Reinforcement Learning Problem as a finite Markov Decision Process. In this article, we’ll learn about our first set of solutions — Dynamic Programming Solutions.

Dynamic Programming

Dynamic Programming (DP) refers to a collection of algorithms that can be used to compute policies to solve an MDP, given a perfect model of the environment.

  • Specifically, a perfect model of the environment means that the transition probabilities (the likelihood of ending up in a given state when taking an action from another state) as well as the reward model (the likelihood of getting a particular reward for being in a particular state) are known for the entire environment.

Dynamic programming methods are often limited in practice for two reasons:

  1. The assumption of a perfect model, which is unavailable for most problems
  2. Computational expense

They are still, however, useful because they serve as a theoretical foundation on which most advanced RL methods are based.

  • Most advanced RL methods can be thought of as attempts to achieve the same effect as DP just with less computation and without the assumption of a perfect model.

The key idea behind DP is the use of value functions to organize and structure the search for good policies.

  • DP will be used to compute v* and/or q*. Once we have these optimal value functions, deriving an optimal policy is trivial.
  • DP algos are obtained by turning the previously shown Bellman equations into update rules for improving approximations of the desired functions.
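
For example, the Bellman equation for the state value function vπ (covered in the previous article) can be rewritten as an update rule, where k indexes successive approximations of the value function (v₀, v₁, v₂, …):

vₖ₊₁(s) = Σ_a π(a|s) Σ_{s′,r} p(s′, r | s, a) [ r + γ·vₖ(s′) ]

This is exactly the update used by the policy evaluation algorithm in the next section.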

Policy Evaluation

Consider the case where you have some policy π. How can we evaluate this policy? One way would be to get the state value function vπ for this policy, which would tell us how valuable this policy thinks each state is.

  • This is called policy evaluation, or the prediction problem

The intuition behind the policy evaluation algorithm is as follows:

  • Start with some random state value function v₀
  • Iteratively update the expected value of each of the states in v₀ using the previously shown Bellman equation for state value functions and the policy we are trying to evaluate
  • Every time we update all the state values, we consider that a new version of the value function (v₁, v₂, v₃, etc.)
  • As we keep changing the value function, it will converge to vπ, the actual state value function for our policy π
  • Since we are updating our value function “iteratively” until it reaches vπ, this is called iterative policy evaluation (IPE). A sketch of the algorithm appears after this list
  • The updates done to the value function are called “expected updates” because they are based on the expected rewards the agent will receive. In other words, the agent never actually steps through the environment but rather uses its perfect model of the environment to estimate what rewards it will receive.
  • IPE can be done with 2 separate arrays or in place on a single array; the sketch below uses the in-place version
  • The algo runs until the state values stop changing by more than a small threshold θ; at that point it has approximated vπ
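
To make this concrete, here is a minimal Python sketch of in-place IPE (it is not the book's pseudocode). It assumes a Gym-style tabular model where env.P[s][a] is a list of (prob, next_state, reward, done) tuples, policy[s][a] is the probability of taking action a in state s, and env.nS is the number of states; these names are assumptions for the example, not part of the article:

import numpy as np

def iterative_policy_evaluation(env, policy, gamma=0.9, theta=1e-8):
    V = np.zeros(env.nS)  # v0: arbitrary initial value function (all zeros)
    while True:
        delta = 0.0
        for s in range(env.nS):
            new_v = 0.0
            for a, action_prob in enumerate(policy[s]):
                for prob, next_state, reward, done in env.P[s][a]:
                    # expected update: weight each outcome by π(a|s) * p(s', r | s, a)
                    new_v += action_prob * prob * (reward + gamma * V[next_state])
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v  # in-place update on a single array
        if delta < theta:  # values stopped changing by more than θ
            return V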

Policy Improvement

The main reason for computing the value function (vπ) for a policy is to help find better policies. Once we know the value function for a policy, how can we improve on that policy?

Consider the case of an agent being in some state s and having access to some policy π and its state value function vπ. π already tells the agent which action to select but could one of the other actions be better?

  • One way to verify this is to choose some other action a that is not what π would choose and then continue to follow π afterward.
  • If this leads to a greater expected value than the current expected value for the state — vπ(s) — this implies that it’s better to select this action when in s than the action that π currently says to select.
  • In other words, changing the policy to select this other action a instead of the action that the policy currently says to select would mark an improvement to the policy.

We can encapsulate the value of selecting any action a from state s (and following π afterward) as qπ(s, a), where:

qπ(s, a) = Σ_{s′,r} p(s′, r | s, a) [ r + γ·vπ(s′) ]

  • To calculate the expected value of taking action a from state s, we look at all the states s′ the agent could end up in by taking this action.
  • For each of those states s′, we take the probability of ending up in s′ and multiply it by the reward we would get for ending up in s′ plus the discounted value of being in s′.
  • Summing this over all possible next states gives us the value of taking action a from state s, i.e. the action value.
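
In Python, this one-step lookahead could look like the sketch below, under the same assumed env.P[s][a] model format used in the IPE sketch above (the function name is just for illustration):

def action_value(env, V, s, a, gamma=0.9):
    # Expected return of taking action a in state s and then following the
    # policy whose state value function is V.
    q = 0.0
    for prob, next_state, reward, done in env.P[s][a]:
        # probability of reaching s' times (reward + discounted value of s')
        q += prob * (reward + gamma * V[next_state])
    return q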

This is evaluating a change in the policy at one state to one particular action, but we can extend this to consider changes at all states and to all actions, selecting at each state the action that looks best according to qπ(s, a). This is formalized as:

π′(s) = argmax_a qπ(s, a) = argmax_a Σ_{s′,r} p(s′, r | s, a) [ r + γ·vπ(s′) ]

  • The new greedy policy π’(s) takes the action that looks best in the short term after one-step lookahead according to vπ.

This process of making a new policy that improves on an original policy by making it greedy with respect to the original policy’s value function is called policy improvement.

  • π’ will always be at least as good as π
  • Policy improvement must give a strictly better policy except when the original policy is already optimal
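
A minimal Python sketch of this improvement step, reusing the hypothetical action_value helper from above and representing the greedy policy π′ as a one-hot array over actions (env.nS and env.nA are the assumed state and action counts):

import numpy as np

def improve_policy(env, V, gamma=0.9):
    # Build the policy that is greedy with respect to V via one-step lookahead.
    new_policy = np.zeros((env.nS, env.nA))
    for s in range(env.nS):
        q_values = [action_value(env, V, s, a, gamma) for a in range(env.nA)]
        best_action = int(np.argmax(q_values))  # the action that looks best under vπ
        new_policy[s][best_action] = 1.0
    return new_policy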

Policy Iteration

Once a policy π has been improved using vπ to yield a better policy π′, we can compute vπ′ using IPE and improve it again to yield an even better policy π′′, and so on, and so forth.

  • The process looks like π₀ → vπ₀ → π₁ → vπ₁ → π₂ → … → π* → v*, where each arrow into a value function is a policy evaluation (E) and each arrow into a policy is a policy improvement (I)

This method of using policy evaluation and then policy improvement is called policy iteration.

  • Policy Iteration is just a back-and-forth between Policy Evaluation and Policy Improvement
  • In IPE, V can be initialized to the state value function of the previous policy, which speeds up convergence.
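
Putting the two together, a sketch of the policy iteration loop under the same assumptions might look like this; for simplicity it re-initializes V from zeros on every evaluation rather than warm-starting from the previous value function, and it stops when the greedy policy no longer changes:

import numpy as np

def policy_iteration(env, gamma=0.9):
    # Start from a uniform random policy over the nA actions.
    policy = np.ones((env.nS, env.nA)) / env.nA
    while True:
        V = iterative_policy_evaluation(env, policy, gamma)  # E: evaluate the current policy
        new_policy = improve_policy(env, V, gamma)           # I: act greedily with respect to V
        if np.array_equal(new_policy, policy):               # policy is stable, so it is optimal
            return policy, V
        policy = new_policy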

In this article, we learned about the basics of Dynamic Programming and how Iterative Policy Evaluation and Policy Improvement can be combined into the policy iteration algorithm. In the next article I’ll show an implementation of this algorithm in Python on a simple environment.
