Reinforcement learning: model-based ADP learner with code implementation

Nan
14 min read · Jan 29, 2022


In today’s story we focus on building a model-based adaptive dynamic programming (ADP) agent to learn an MDP. As we explained in great detail in previous stories, we can use either policy iteration or value iteration to solve an MDP if we know its reward function and transition model. However, in real-world applications this is not always the case, so directly applying good old policy iteration or value iteration may not be possible. An ADP learner, today’s topic, provides one way to solve an MDP even if we do not know its reward function or transition model.

The code in this story is part of our MAD from scratch project where MAD stands for machine learning, artificial intelligence, and data science. The complete code used in this story can be found in this repo: https://github.com/clumsyhandyman/mad-from-scratch

Table of Contents:

  1. MDP as a black box interacting with an agent
  2. Learning r(s) and p(s′| s, a) as a supervised problem
  3. Choose next action: exploitation or exploration?
  4. Pseudo-code for model-based ADP learner
  5. Implement ADP learner in Python
  6. Effects of exploration-exploitation dilemma
  7. Model-based vs model-free

MDP as a black box interacting with an agent

Let’s continue with the grid world example we used in previous stories. In policy iteration and value iteration, we have access to the reward function and transition model. In other words, when we solve the MDP, what we are doing is planning: we know everything about the world, and we are just planning what we should do in it.

The reward function and transition model may not be accessible in real-world applications

In contrast, if we do not have access to the reward function or transition model, we are not able to do planning. We are essentially standing in the middle of a mist, unable to see anything beyond our current state. Rather than planning, what we need to do is learning. Through learning, we try to figure out what the world looks like and what we should do in it.

In this case, the MDP environment is essentially behaving like a black box. Our agent inputs a pair of state and action (s, a) and the MDP outputs a pair of reward and next state (r, s′).

In our grid world example, the interaction is something like this:

  • MDP: I want to play a game. You must choose an action from 1 to 4 in each step.
  • Agent: OK.
  • MDP: You are starting at state 8. Which action do you choose?
  • Agent: I choose action 4.
  • MDP: You are in state 9 now. You received a reward of -0.04.
  • Agent: I choose action 2.
  • MDP: You are in state 9 now. You received a reward of -0.04.
  • Agent: I choose action 4.
  • MDP: You are in state 10 now. You received a reward of -0.04.
  • MDP: You are in state 2 now. You received a reward of -0.04.
  • Agent: I choose action 2.
  • MDP: You are in state 3 now. You received a reward of 1. You win the game.

From this interaction, the agent has no idea that state 9 is to the right of state 8. In fact, the agent does not even know that each state represents a location in a grid world. By the same token, the agent does not know what each action means either. All the agent knows is the sequence s = 8, a = 4, r = -0.04, s = 9, a = 2, r = -0.04, s = 9, a = 4, r = -0.04…

Maybe this is a Pac-Man game moving in a 2D grid world with actions representing moving north, south, east, and west. Maybe this is a vacuum robot moving in your home with actions representing charge, vacuum, empty bin, and so on. Maybe this is a stock trading program with actions representing buy, sell, hold, and so on.

The point is our agent will never know what the states and actions actually mean. Our agent only knows a sequence of states, actions, and rewards. This [s, a, r, s, a, r, s, a, r…] is the only game our agent knows. How should our agent figure out this game and eventually win this game? This is the main question reinforcement learning tries to answer.

Learning r(s) and p(s′| s, a) as a supervised problem

Given a bunch of sequences such as [s₁, a₁, r₁, s₂, a₂, r₂, s₃, a₃, r₃…], how can we estimate the reward function r(s) and transition model p(s′| s, a)?

The reward function is straightforward. As r(s) is a function of state only and we already have pairs of r and s in the given sequences, we can directly read off the relationship between r and s. For instance, for the given sequence [s₁, a₁, r₁, s₂, a₂, r₂, s₃, a₃, r₃…], at time t = 1 we stood at s₁ and chose the action a₁. As a result, we received a reward of r₁ and arrived at s₂. In other words, arriving at s₂ is associated with a reward of r₁. Therefore, the reward of state s₂ is r(s₂) = r₁. Similarly, the reward of state s₃ is r(s₃) = r₂, and so on.

The transition model is not that straightforward but still quite easy. From the given sequences, we have groups of (s, a, s′). By counting the occurrences of each (s, a, s′), we can estimate their probability, which is p(s′| s, a) by definition. For instance, s₁ and a₁ are associated with s′ = s₂ which accounts for one occurrence of (s₁, a₁, s₂), s₂ and a₂ are associated with s₃ which accounts for one occurrence of (s₂, a₂, s₃), and so on.

From all the given sequences, the number of occurrences of a certain (s, a, s′) is counted as n(s, a, s′). Then the transition model can be estimated as

p(s′| s, a) = n(s, a, s′) / Σ n(s, a, s″),

where the sum in the denominator runs over all possible next states s″.

For instance, suppose there are 4 states [s₁, s₂, s₃, s₄] and that, for the pair (s = s₁, a = a₁), the observed counts of each next state s′ are (to pick some illustrative numbers) n(s₁, a₁, s₁) = 0, n(s₁, a₁, s₂) = 8, n(s₁, a₁, s₃) = 2, and n(s₁, a₁, s₄) = 0.

Then the total count is 0 + 8 + 2 + 0 = 10.

The transition model for (s = s₁, a = a₁) is then

p(s₁| s₁, a₁) = 0, p(s₂| s₁, a₁) = 8/10 = 0.8, p(s₃| s₁, a₁) = 2/10 = 0.2, p(s₄| s₁, a₁) = 0.

By repeating this for every pair of (s, a), we can estimate every element of the transition matrix p(s′| s, a).
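
Written out as code, this counting scheme only takes a few lines. Below is a minimal sketch in Python, assuming states and actions are plain integer indices (all names here are illustrative, not the repo’s):

```python
import numpy as np

# Illustrative sizes for the 3 x 4 grid world: 12 states, 4 actions.
num_states, num_actions = 12, 4

n_counts = np.zeros((num_states, num_actions, num_states))  # n(s, a, s')
reward = np.zeros(num_states)                               # estimated r(s)

def record(s, a, r, s_next):
    """Fold one observed (s, a, r, s') tuple into the counts."""
    n_counts[s, a, s_next] += 1
    reward[s_next] = r  # the reward is attributed to the state we arrive at

def estimated_transition(s, a):
    """Return the estimated p(s'|s, a) as a vector over next states."""
    total = n_counts[s, a].sum()
    if total == 0:
        return np.ones(num_states) / num_states  # no data yet: uniform guess
    return n_counts[s, a] / total
```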

After we have an estimated reward function and an estimated transition model, we have an estimated picture of the whole MDP. Since we are trying to estimate the model of the MDP, this type of approach is usually called model-based. We then solve the estimated MDP using value iteration or policy iteration in the same way as we explained in previous stories. As value iteration and policy iteration are both versions of dynamic programming, this approach is called adaptive dynamic programming (ADP).

Choose next action: exploitation or exploration?

In theory, using the procedure described above, we can estimate r(s) and p(s′| s, a) as accurately as we like if we have an infinite amount of [s, a, r, …] input sequences. Nevertheless, an infinite amount of input sequences is clearly impractical. The best we can get is a reasonably large number of input sequences. From these sequences, we need enough samples of every pair of (s, a) to estimate p(s′| s, a) and enough samples of every s to estimate r(s). This means we should visit each and every pair of (s, a) a large number of times. To ensure this, in theory, every action should have some probability of being chosen as the next action. This is called exploration and is usually done by repeatedly playing the game and choosing actions randomly.

However, exploration is not free. For instance, in our grid world example, generating one sequence means playing the game once. Therefore, generating a large number of sequences means playing the game a large number of times. Intuitively, we will not perform very well at the beginning, so we will lose a lot of these games and receive a penalty of -1 at the end of each of them. As a result, we will incur a lot of penalties.

Let’s look at other examples in real life. Learning how to drive can be treated as learning an MDP: in this situation (state), what should a driver do (action)? To learn this MDP, we need to explore it repeatedly. However, exploring some of these pairs of state and action is very costly and dangerous. After experiencing one or two occurrences of these dangerous situations, the next time we are in them we should choose the safest action we currently know instead of exploring random actions. Similarly, we can think of the stock market as an MDP: under a certain market condition (state), what should an investor do (action)? Exploring this MDP can sometimes be extremely costly. In these cases, we have an incentive to utilize the knowledge we have learned so far as soon as possible to avoid excessive loss. This is called exploitation.

Now we have a dilemma: exploration vs exploitation. If we keep doing exploration without ever doing exploitation, we are gaining a lot of knowledge, but we never utilize that knowledge. As a result, during this exploration, we may lose a lot of games and incur a lot of cost or penalty. Also, what is the point of learning knowledge if we do not have a chance to use that knowledge?

In contrast, if we choose to do just a little exploration and then immediately change into exploitation mode, we are utilizing the knowledge we learned but that knowledge may not be complete or accurate. As a result, during the following exploitation, we have a high risk of committing to a sub-optimal solution and sticking there forever.

To balance exploration and exploitation, a practical way is to emphasize exploration at the beginning and gradually shift to exploitation towards the end. For instance, if we are allowed to play 100 games of our grid world Pac-Man, we should focus on randomly exploring the game in the earliest games, say games 1 to 30. At that stage, we do not know much about the game, and we should explore as much as we can.

Then, for games 31 to 70, maybe we should do half exploration and half exploitation. We have learned something from the first 30 games, but we are not that sure about it. Therefore, we utilize that knowledge roughly half of the time and keep exploring and learning new things the other half of the time.

Finally, for games 71 to 100, we may assume we have learned enough, and it is time for us to win. We utilize the knowledge learned in the previous 70 games and choose the best actions we know of. Our target is to win these remaining games and gain enough rewards to hopefully cover the losses we incurred in the earlier games.

Following this idea, in practice we can use several methods to implement this transition of emphasis from exploration to exploitation. For instance, inspired by the concept of simulated annealing, in each step we explore with some probability by choosing a random action; otherwise, we exploit by choosing the best action we know so far. By setting this probability very high at the beginning and gradually reducing it towards zero using methods such as geometric decay, we can achieve a smooth transition of focus from exploration to exploitation.
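
As a small sketch of this idea (the names epsilon and decay below are illustrative rather than taken from the repo):

```python
import random

epsilon, decay = 1.0, 0.9  # start fully exploratory, then decay geometrically

def choose_action(state, policy, num_actions=4):
    """Explore with probability epsilon, otherwise exploit the current policy."""
    if random.random() < epsilon:
        return random.randrange(num_actions)  # exploration: random action
    return policy[state]                      # exploitation: best known action

# after each game (episode), shift the balance toward exploitation
epsilon *= decay
```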

Pseudo-code for model-based ADP learner

Our learner is an agent interacting with the environment, which in this case is a black-box MDP. As an agent, our learner needs to be able to percept as well as actuate: by percept we mean taking in information as input, and by actuate we mean producing an action as output.

The input of the percept function of our learner is a tuple (s, a, r, s′). This tells our learner the following: if you stand at state s and choose action a, you receive a reward of r and arrive at state s′. The task of our learner is then to analyze what this information indicates and incorporate it into its knowledge, which is the estimated reward function and transition model.

Pseudo-code for percept function of ADP learner
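
A minimal Python sketch along these lines, written as a method of the learner class outlined later (attribute names are illustrative and may differ from the repo’s):

```python
def percept(self, s, a, r, s_next):
    """Fold one observed (s, a, r, s') tuple into the estimated model."""
    self.n_counts[s, a, s_next] += 1  # update n(s, a, s')
    self.reward[s_next] = r           # reward observed on arriving at s'
    # re-estimate p(s'|s, a) from the counts
    self.p[s, a] = self.n_counts[s, a] / self.n_counts[s, a].sum()
```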

The input of the actuate function of our learner is the current state s. We are asking our learner a question: hey, you are at state s, what action do you choose? The task of our learner is then to decide the proper action a to return, balancing exploration and exploitation during different stages of learning.

Pseudo-code for actuate function of ADP learner
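
Correspondingly, a sketch of the actuate step might look like this (again, the names are illustrative):

```python
import numpy as np

def actuate(self, s):
    """Return an action for state s, balancing exploration and exploitation."""
    if np.random.rand() < self.epsilon:
        return np.random.randint(self.num_actions)  # explore: random action
    return self.policy[s]                           # exploit: best action known so far
```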

In the percept function, our learner updates the estimated reward function and transition model so that they get closer and closer to the true ones. However, the policy is not updated there. If the policy is never updated, we are exploring forever. To make exploitation possible, our learner needs to update its policy regularly, for example after each step or each game. To do that, our learner has a separate function to update the policy. Each time the policy is updated, the probability threshold ϵ is multiplied by the decay factor ξ.
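
A rough sketch of this policy update, using value iteration on the estimated model (the discount factor and number of sweeps are placeholder assumptions; policy iteration would work just as well):

```python
import numpy as np

def policy_update(self, gamma=0.95, n_sweeps=100):
    """Re-plan on the estimated model, then decay the exploration threshold."""
    v = np.zeros(self.num_states)
    for _ in range(n_sweeps):
        # q[s, a] = r(s) + gamma * sum over s' of p(s'|s, a) * v(s')
        q = self.reward[:, None] + gamma * self.p.dot(v)
        v = q.max(axis=1)
    self.policy = q.argmax(axis=1)  # greedy policy on the estimated model
    self.epsilon *= self.decay      # shift from exploration toward exploitation
```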

The learner interacts with the MDP black box in the following way: during each step of a game (episode), the learner percepts the information and then actuates. After each game, the learner updates its policy, which becomes better and better as the learner experiences more and more episodes. To monitor the learning process, the learner outputs the reward achieved in each episode.

Implement ADP learner in Python

The ADP learner is implemented as a class with percept, actuate, and policy update functions.
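
The complete implementation lives in the linked repo. As a rough outline only (attribute names match the sketches above rather than the repo), the class could be set up like this, with percept, actuate, and policy_update attached as methods:

```python
import numpy as np

class ADPLearner:
    """Model-based ADP learner: estimates r(s) and p(s'|s, a), then plans on them."""
    def __init__(self, num_states, num_actions, epsilon=1.0, decay=0.9):
        self.num_states, self.num_actions = num_states, num_actions
        self.epsilon, self.decay = epsilon, decay  # exploration schedule
        self.n_counts = np.zeros((num_states, num_actions, num_states))  # n(s, a, s')
        self.reward = np.zeros(num_states)                               # estimated r(s)
        self.p = np.ones((num_states, num_actions, num_states)) / num_states
        self.policy = np.random.randint(num_actions, size=num_states)    # arbitrary start

    # percept(self, s, a, r, s_next), actuate(self, s), and policy_update(self)
    # as sketched in the previous section
```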

The learning interaction with an MDP black box is also implemented as a class. The problem is an MDP grid world, and the learner is an ADP learner.

For instance, the complete code in the repo can be used to solve our 3 × 4 grid world example. The probability threshold starts at 1.0 and is multiplied by a decay factor of 0.9 each time the policy is updated.
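
As a rough sketch of that interaction, assuming a black-box environment object exposing reset() and step(action) (this interface is an assumption for illustration, not necessarily the repo’s):

```python
def train(env, learner, n_episodes=100):
    """Play n_episodes games against a black-box MDP and record the reward per game."""
    episode_rewards = []
    for _ in range(n_episodes):
        s = env.reset()                       # start a new game at the initial state
        total, done = 0.0, False
        while not done:
            a = learner.actuate(s)            # explore or exploit
            s_next, r, done = env.step(a)     # black box returns reward and next state
            learner.percept(s, a, r, s_next)  # update the estimated model
            total += r
            s = s_next
        learner.policy_update()               # re-plan and decay epsilon after each game
        episode_rewards.append(total)
    return episode_rewards

# e.g. rewards = train(grid_world, ADPLearner(12, 4, epsilon=1.0, decay=0.9))
```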

The learning process can be visualized in the following figure: the bottom subfigure plots the reward received in each game, from game 1 to game 100, and the top subfigure plots the cumulative sum of rewards up to a given game. For instance, in game 1 we received around 0.25, in game 2 we received -2.5, and so on. Summing the rewards of games 1 to 20, we accumulated a total reward of around 0, while summing the rewards of games 1 to 40, we accumulated a total reward of around 20, and so on.

At the beginning, we are learning the MDP with a heavy focus on exploration. As a result, we tend to lose the game and receive negative rewards. As we play more and more, we eventually figure out the MDP and start to play optimally. Therefore, the curve of total reward shows a V-shape: losing at the beginning and winning towards the end.

In our policy iteration story, we evaluated the optimal policy and concluded that its expected reward is around 0.7. In other words, if we knew the whole MDP in advance and played the game optimally 100 times, the expected total reward would be around 70. Using our ADP learner, without knowing anything about the MDP at the beginning, we still achieved a total reward of around 60 at the end. This shows our ADP learner can learn this MDP grid world effectively.

Effects of exploration-exploitation dilemma

In our algorithm, the decay factor ξ adjusts how fast we transition from exploration to exploitation. At the beginning, with ϵ = 1.0, we are 100% focused on exploration. After the first game, ϵ = ξϵ = ξ, meaning we have a probability of ξ to explore and a probability of 1 − ξ to exploit. By the same token, after n games ϵ = ξⁿ, meaning the probability of exploration is ξⁿ.

Let’s see: if ξ = 0.9, after 100 games,

ϵ = ξ¹⁰⁰ = 0.9¹⁰⁰ ≈ 2.7 × 10⁻⁵.

In this case, the probability of exploration is very small, indicating we almost always choose exploitation rather than exploration.

In contrast, if ξ = 0.99, after 100 games,

ϵ = ξ¹⁰⁰ = 0.99¹⁰⁰ ≈ 0.366.

In this case, we have a probability of 36.6% of choosing exploration and the remaining 63.4% of choosing exploitation. We prefer exploitation, but we still have a chance of continuing to explore.

Our good old 3 × 4 grid example seems too small to show any significant differences between decay factors. Let’s use a bigger example. This grid world maze is also adapted from the book Artificial Intelligence: A Modern Approach by Stuart Russell and Peter Norvig.

To make an easier comparison with our previous example, we implement similar rules: red square is a losing ending state with a reward of -2.5, green square is a winning ending state with a reward of +10.0, and all other white squares are passable states with a reward of -0.04.

The stochasticity of the MDP is the same as in the previous example: a probability of 0.8 of executing the action as intended, a probability of 0.1 of going 90 degrees clockwise, and a probability of 0.1 of going 90 degrees counterclockwise.

As this 19 × 32 grid world is much bigger than our previous 3 × 4 world, we expect our learner to require more games to learn it. Let’s see what happens when we train our ADP learner for 1000 games in this bigger grid world.

When the decay factor is set to 0.99, our ADP learner seems to work pretty well. The expected reward for a single game is around 6.9 (obtained by cheating, i.e., running policy iteration or value iteration on the true model), and our ADP learner achieved a total reward of around 4,000 over 1000 games, resulting in an average reward of around 4 per game.

However, if the decay factor is too large, for instance 0.999, our learner focused too much on exploration. As a result, the left part of the V-shaped curve of total rewards is very wide. In short, our learner wasted too many games on unnecessary exploration, and the final total reward of around 2,000 is not so good compared to the roughly 4,000 in the previous case.

In contrast, if the decay factor is too small, for instance 0.9, our learner did not spend enough games on exploration. Instead, it prematurely committed to a sub-optimal solution. In other words, our learner jumped into exploitation way too early. As a result, the left part of the V-shaped curve of total rewards is extremely narrow, and the final total reward of around 1,500 is not satisfying either.

This simple comparison shows the importance of the exploration-exploitation dilemma. Balancing the two properly is one of the key requirements for a successful implementation of reinforcement learning.

Model-based vs model-free

Our ADP learner is a model-based method precisely because it estimates the transition model p(s′| s, a) and uses this estimated model when updating the policy. This method works pretty well in our grid worlds and may work well in many other real-world applications. More specifically, model-based methods tend to work well when the transition model is well defined, as in Pac-Man, chess, or other games. In contrast, if the transition model is unknown or very hard to learn, such as for a robot in the real world or the stock market, then model-based methods may not be a good choice. These cases are usually handled by model-free methods, in which the transition model is not required. We will cover model-free methods in future stories. Stay tuned!
