Markov decision process: complete explanation of basics with a grid world example

Nan · Dec 4, 2021

Markov decision process (MDP) is an important concept in AI and is also part of the theoretical foundation of reinforcement learning. In today’s story we introduce the basic concepts of MDP using a grid world example from the book Artificial Intelligence: A Modern Approach by Stuart Russell and Peter Norvig.

The code in this story is part of our MAD from scratch project where MAD stands for machine learning, artificial intelligence, and data science. The complete code used in this story can be found in this repo: https://github.com/clumsyhandyman/mad-from-scratch

Table of Contents:

  1. Our version of the Pac-Man game
  2. Definition of MDP
  3. Code implementation of a grid world
  4. Reward function
  5. Transitional model
  6. Policy
  7. Execute a policy
  8. Execute more policies
  9. How to find an optimal policy?

Our version of the Pac-Man game

Here we have a 4×3 grid world somewhat like Pac-Man. We have a robot starting at the bottom-left corner and moving inside this 2D world to play a game.

Grid world example

Our robot can move in four directions: up, down, left, and right, exactly like Pac-Man. Another similarity with Pac-Man is that our world is surrounded by non-passable walls. There is also a non-passable wall inside the boundary, represented by the black square.

The green diamond in the square at the upper right corner represents a finish line. If we reach this square, we win this game and earn a lot of points (+1 in this example).

In Pac-Man there are ghosts always trying to hurt you. In our game, we have a square with a red poison. If we step into this square, we lose the game and incur a large penalty (-1 in this example).

All the other white squares are passable squares. Each time we step into one of them, we lose a small number of points (-0.04 in this example). If we move randomly in the hope of eventually reaching the green diamond by luck, we lose 0.04 points for each move we wander and thus cumulatively lose a lot of points. One way to think about this is the electricity consumed by our robot: each move of the robot costs us a certain amount of money.

For simplicity, we assume our robot always starts at the bottom-left corner, as represented by the yellow Pac-Man.

In summary, when playing this game, we want to earn the +1 as quickly as possible without paying -0.04 too many times along the way, and we definitely want to avoid ending the game on the red poison square with its -1 penalty.

Definition of MDP

In Artificial Intelligence: A Modern Approach, MDP is defined as

a sequential decision problem for a fully observable, stochastic environment with a Markovian transition model and additive rewards is called a Markov decision process, or MDP, and consists of a set of states (with an initial state s₀); a set ACTIONS(s) of actions in each state; a transition model P(s′ | s, a); and a reward function R(s).

Let’s break down some keywords in this definition using our example.

Agent refers to the robot, the Pac-Man, or whatever we put inside this world that looks around for possible moves, weighs the pros and cons of those moves, decides on the next move, and executes it.

Environment refers to what the world looks like and the rules of the world. In our example, the environment includes how big our world is, where the walls are placed, what happens if we bump into a wall, where the diamond is and how many points we get if we reach it, where the poison is and how many points we lose if we land on it, and so on.

State in MDP refers to the state of our agent, not the state of our environment. In MDP the environment is given and does not change. In contrast, our agent can change state from one move to the next. In our example, our agent has 12 states, which represent the 12 squares that our agent can possibly be in. Technically our agent is never able to be in the black square, but for simplicity we still count the black square as a state. In MDP we use s to denote state, and in our example we call them s = 0, 1, 2, …, 11.

Action refers to the moves that can be taken by our agent in each state. The set of possible actions is not necessarily the same for each state. In our example, we have four possible actions A(s) = {up, down, left, right} for every state s. We use a to denote actions.

States and rewards of the grid world example

Reward refers to the points we receive when we move into a state. In our example, that is +1, -1, or -0.04 for different states. The reward depends on the state only, and this function from state to reward is denoted as R(s). We receive the same reward when entering a certain state regardless of our previous state. For instance, when s = 1, r = R(s) = R(1) = -0.04. Whether we move right into it from s = 0 or left into it from s = 2, we receive the same -0.04 reward.

Additive rewards mean the rewards from different moves are cumulative. In our example, our final score (total reward) is the points received from the diamond or the poison plus the points received from the passable squares along the way. Not only the reward received in the final square but also all the small rewards received along the way matter. In MDP, we assume we can add these rewards together (as opposed to, for example, multiplying them together).

Transitional model tells us which next state we will move into if we choose a certain action at a certain state. This is usually represented as a table P. If we are at a state s and we choose an action a, then the probability of s′ being the next state is P(s′| s, a).

If we assume our world is deterministic, our intended moving direction is always fulfilled. For example, if we are at s = 10 and we choose a = UP, we are guaranteed to move into a new state s′ = 6.

This can be written as

P(s′ = 6 | s = 10, a = UP) = 1, and P(s′ | s = 10, a = UP) = 0 for every other s′.
As a probability distribution, the sum of P(s′ | s = 10, a = UP) over all 12 possible s′ always equals 1.

Stochastic means our world is not deterministic as we assumed above. From a certain state, if we choose the same action, we are not guaranteed to move into the same next state. In our game, we can think of this as our robot having some probability of malfunctioning. For instance, if it decides to go left, there is a high probability that it actually goes left. Nevertheless, there is a small probability, however tiny, that it goes wild and moves in a direction other than left.

Following the example from the book, the probability of actually moving in the intended direction is 0.8. There is a 0.1 probability of moving 90 degrees to the left of the intended direction and another 0.1 probability of moving 90 degrees to the right of it.

For example, if we are at s = 10 and we choose a = UP, we have a probability of 0.8 to move into the intended new state s′ = 6. We also have a probability of 0.1 to arrive at s′ = 9 and a probability of 0.1 to arrive at s′ = 11.

This can be written as

P(s′ = 6 | s = 10, a = UP) = 0.8, P(s′ = 9 | s = 10, a = UP) = 0.1, and P(s′ = 11 | s = 10, a = UP) = 0.1.
This is the transition model when s = 10 and a = UP. Now for every pair of state s and action a, we can write out the probability of arriving at each new state like this. Combining all these together, we get the transition model P. We have 12 possible states and 4 possible actions. Therefore, the dimension of our transitional model P is 12×4×12, with a total of 576 probability values.

Fully observable means our agent somehow knows what the world looks like and where it currently is in that world. In some sense, we can say our agent has a God’s-eye view. It has access to all the knowledge needed to find the optimal decision during each move. The knowledge here refers to our reward function R(s) and transitional model P(s′| s, a).

Sequential means our current situation is affected by previous decisions. By the same token, all future situations are affected by our current decision. For instance, if we start from s = 8 and decide to move up as our first move, then we arrive at s = 4. There is no way we can reach s = 10 in the next move. However, if we choose to move right as our first move from s = 8, then we arrive at s = 9. Now we can reach s = 10 by moving right. In other words, whether we can reach s = 10 in the second move depends on what we chose in the first move.

Markovian means our world is memoryless. This may seem contrary to our definition of sequential, but the two actually have completely different meanings. Sequential means whether we can reach s = 10 in our second move depends on the choice we made in our first move.

But what does memoryless mean?

It means that no matter how we reached the state s = 10, whether from s = 9 or s = 11 or s = 6, once we are in that state, we always face the same situation with the same set of possible actions. If we make a decision, what happens next does not depend on what happened in the past. In other words, it just means R(s) and P(s′| s, a) do not change during the entire game, regardless of what we did earlier in the game.

When we reach s = 10 for the 100th time, we face the same choices of actions as if we reach it for the first time. It is déjà vu forever. If we choose a certain action at this state, the outcome always follows the same probability distribution. If a certain action is the optimal action for the state s = 10 when we reach this state for the first time, then this action is always the optimal action for this state.

Code implementation of a grid world

We use a csv file to represent the map of our world, with 0 as white passable squares, 1 as green squares, 2 as red squares, and 3 as black squares. For our example, the csv file looks like this:

0,0,0,1
0,3,0,2
0,0,0,0

We use the following code to implement the grid world example. The reward of each type of square is represented by the dictionary reward.
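The full implementation lives in the repo linked above; the snippet below is only a minimal sketch of the idea, in which the file name gridworld.csv and the variable names are illustrative assumptions rather than the repo's actual code.

```python
import numpy as np

# 0 = white passable, 1 = green diamond, 2 = red poison, 3 = black wall
grid = np.loadtxt('gridworld.csv', delimiter=',', dtype=int)  # shape (3, 4)
num_rows, num_cols = grid.shape
num_states = num_rows * num_cols  # 12 states, s = 0, 1, ..., 11

# reward received when entering each type of square
# (the black wall square is never entered; 0.0 is just a placeholder)
reward = {0: -0.04, 1: 1.0, 2: -1.0, 3: 0.0}
```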

Reward function

As we explained above, reward function R(s) is only dependent on state s. We use the following code to generate the results of the reward function R(s) for each s.
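Continuing the sketch with the same assumed variable names, R(s) can be stored as a simple lookup array indexed by state:

```python
# R[s] = reward of the square that state s corresponds to
R = np.array([reward[grid[s // num_cols, s % num_cols]]
              for s in range(num_states)])
# [-0.04, -0.04, -0.04, 1.0, -0.04, 0.0, -0.04, -1.0, -0.04, -0.04, -0.04, -0.04]
```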

Transitional model

Let’s recap the rules of our game. If we bump into a wall, we are bounced back and stay where we were. If we want to move up or down, we have 0.8 probability to actually go up or down, 0.1 probability to go left, and 0.1 probability to go right. Similarly, if we want to move left or right, we have 0.8 probability to actually go left or right, 0.1 probability to go up, and 0.1 probability to go down.

We use the following code to generate the transitional model.
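One way these rules could be turned into code is sketched below, continuing the assumed variables above (not necessarily the repo's exact implementation):

```python
UP, DOWN, LEFT, RIGHT = 0, 1, 2, 3
moves = {UP: (-1, 0), DOWN: (1, 0), LEFT: (0, -1), RIGHT: (0, 1)}
# the two possible 90-degree slips for each intended action
slips = {UP: (LEFT, RIGHT), DOWN: (LEFT, RIGHT), LEFT: (UP, DOWN), RIGHT: (UP, DOWN)}

P = np.zeros((num_states, 4, num_states))  # P[s, a, s'], 12 x 4 x 12

for s in range(num_states):
    row, col = s // num_cols, s % num_cols
    if grid[row, col] != 0:
        continue  # no outgoing moves from the wall or terminal squares
    for a in range(4):
        # 0.8 for the intended direction, 0.1 for each perpendicular slip
        for direction, prob in [(a, 0.8), (slips[a][0], 0.1), (slips[a][1], 0.1)]:
            dr, dc = moves[direction]
            r2, c2 = row + dr, col + dc
            # bounce back if we would leave the grid or hit the black square
            if not (0 <= r2 < num_rows and 0 <= c2 < num_cols) or grid[r2, c2] == 3:
                r2, c2 = row, col
            P[s, a, r2 * num_cols + c2] += prob
```

As a quick sanity check, P[10, UP] puts 0.8 on s′ = 6 and 0.1 each on s′ = 9 and s′ = 11, matching the example above.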

The transitional model can be visualized like this:

Transitional model of grid world example

Let’s check if our transitional model makes sense. For instance, if we look at s = 8, when action is down, we have 0.8 probability to go down and then hit a wall and bounce back to s′ = 8, 0.1 probability to go left and then hit a wall and back to s′ = 8, and 0.1 probability to go right and reach s′ = 9. Therefore, s′ = 8 corresponds to p = 0.8 + 0.1 = 0.9 and s′ = 9 corresponds to p = 0.1.
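With the sketch above, this check is a one-liner:

```python
# expected: probability 0.9 at index 8, 0.1 at index 9, zeros elsewhere
print(P[8, DOWN])
```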

Policy

Policy refers to the instructions for what action to take in each state. It is usually denoted as π and it is a function of s.

For instance, if s = 2 and π(s) = π(2) = RIGHT, then whenever our agent is in state s = 2, the policy says it should choose the action to go right. Therefore, the agent will execute the action RIGHT every single time it is in state s = 2. Whether it actually ends up in the square on the right is another story, which depends on the probabilities in the transition model. The point is, regardless of the outcome, as long as the policy says to go right, our agent will choose to go right every single time.

Our starting point is simply to generate a random policy and visualize it, as in the sketch below.
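A minimal sketch (again with assumed names): a random policy is just one action drawn uniformly at random for each state.

```python
rng = np.random.default_rng()
policy = rng.integers(0, 4, size=num_states)  # one of UP/DOWN/LEFT/RIGHT per state
action_names = ['UP', 'DOWN', 'LEFT', 'RIGHT']
print([action_names[a] for a in policy])
```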

Let’s say this random process generates the following policy.

Random policy #1

We can explicitly write this policy out: π(0) = RIGHT, π(1) = RIGHT, π(2) = LEFT, … π(11) = UP.

At first glance, this randomly generated policy does not seem to work. For instance, we start at s = 8 and the policy says to go right; we move right into s = 9, and the policy then says to go down. However, below s = 9 is a wall. We just hit that wall and bounce back to s = 9. It seems we are stuck in s = 9.

The reason we won’t be trapped in s = 9 forever is that we are in a stochastic world. Even though the policy tells us to go down and we follow it, there is still a probability of 0.1 that we end up in s = 8 and a probability of 0.1 that we end up in s = 10.

Execute a policy

Now we at least have a policy even though it may not seem to be very good. Let’s see what happens if we execute it.

If we are currently in state s, we have a = π(s) from our policy. Since our world is stochastic, we use numpy.random.choice to choose s′ based on the probability distribution given by P(s′| s, π(s)).

Because this process is stochastic, different runs may have different results. Here let’s just try it once.
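Here is a sketch of how a single run could look. It assumes states 3 (diamond) and 7 (poison) end the game, and it counts the reward of the starting square, which matches the -1.4 computed below.

```python
def run_episode(policy, P, R, start=8, terminal=(3, 7)):
    s = start
    history = [s]
    total_reward = R[s]  # the -0.04 of the starting square is counted
    while s not in terminal:
        # sample the next state from P(s' | s, pi(s))
        s = np.random.choice(num_states, p=P[s, policy[s]])
        history.append(s)
        total_reward += R[s]
    return history, total_reward

history, total = run_episode(policy, P, R)
print(history, total)
```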

The history of states in this particular run is [8, 9, 9, 9, 9, 10, 10, 6, 10, 11, 7]. We start at 8 and, with that 0.8 probability, move into 9; we then try to go down three times and are bounced back to 9 each time; with that 0.1 probability we slip into 10, wander among 10 and 6 for a few moves, reach 11, and finally, unfortunately, end up in 7. Along this path we use 10 moves. The total reward we receive, counting the -0.04 of the starting square and of every square we pass through, is 10 × (-0.04) + (-1) = -1.4.

Of course this is just one run, and we may simply have been unlucky this time. Let’s use a Monte Carlo approach to evaluate how good or bad this policy is. For this example, let’s just execute the policy 10,000 times and record the total reward we receive in each run.
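A sketch of this Monte Carlo evaluation, reusing the run_episode function assumed above:

```python
returns = [run_episode(policy, P, R)[1] for _ in range(10_000)]
print('best  :', max(returns))
print('worst :', min(returns))
print('mean  :', sum(returns) / len(returns))
```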

Results of executing random policy #1 for 10,000 times

During these 10,000 repeated runs, the highest total reward we got was -1.16 and the lowest was around -8.5. On average, we scored about -1.7 per game. Obviously this is not very good. Let’s check some other policies.

Execute more policies

Here are the results of our random policy #2.

Results of executing random policy #2 for 10,000 times

The highest is 0.8, which is in fact as good as we can get: the shortest path to the diamond collects -0.04 five times before the +1, giving 5 × (-0.04) + 1 = 0.8. The average is -1.2, which is better than the previous random policy #1. However, the lowest is -55, which is far worse than the -8.5 of random policy #1.

Let’s try another one.

Results of executing random policy #3 for 10,000 times

This one seems pretty good. The highest is 0.8, the lowest is -1.64, and the average is 0.7. Overall, this one seems much better than the previous two.

How to find an optimal policy?

Looking back at our three random policies, we can say that #3 seems to be the best one. Is there another one that is even better? If so, how do we find it?

Obviously, we have a finite number of possible policies. In our example, we have 9 states needing an assigned action and 4 possible actions for each state. That is a total of 4⁹ policies. Each one of these policies has a certain expected total reward if we execute it many times. Therefore, one of them (or more, if there are ties) will have a higher expected total reward than the others. Such a policy is an optimal policy π*.

To find the optimal policy, one naïve way is to try all 4⁹ = 262,144 policies and pick the best one. Obviously this brute-force approach quickly becomes impractical as the number of states grows. In practice, we have two methods: policy iteration and value iteration. We will talk about these two methods in future stories. Stay tuned!
