Markov Decision Process (MDP) Simplified

Bibek Chaudhary
5 min read · Sep 18, 2018


An MDP gives the mathematical formulation of the reinforcement learning problem.

A Markov Decision Process (MDP) is an environment with Markov states; a Markov state satisfies the Markov property: the state contains all the relevant information from the past needed to predict the future. Mathematically,

Markov property, taken from David Silver's lecture slides
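For reference, the formula on that slide is the standard statement of the Markov property:

\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \ldots, S_t]

That is, the probability of the next state given the current state equals the probability of the next state given the entire history.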

So if I say that the state S_t is Markov, that means it captures everything important about the environment from the previous states (which means you can throw away all the previous states). Think of it this way: once you have your boarding pass, you no longer need your ticket to board the plane; the boarding pass already contains all the necessary boarding information.

An MDP is formally defined as:

MDP tuple
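For readers who want the symbols, the tuple in Silver's notation is

\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle

where \mathcal{S} is a finite set of states, \mathcal{A} is a finite set of actions, \mathcal{P} is the state-transition probability matrix with \mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a], \mathcal{R} is the reward function with \mathcal{R}^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a], and \gamma \in [0, 1] is the discount factor.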

Let’s take an example to develop intuition about MDP.

Student MDP example from David Silver's lecture slides

Suppose that you are a student and the figure above portrays the scenario of one of your days at school. The circles and the square represent the states you can be in, and the words in red are the actions you can take depending on the state you are in; for example, in the state Class 1 you can choose whether you want to study or check your Facebook, and depending on the action you take, a numerical reward is given. There is also an action node (black dot in the figure) from which you can end up in different states depending on the transition probability; for example, after you decide to go to the Pub from Class 3, you have a 0.2 probability of ending up back in Class 1. This node captures the randomness of the environment over which you have no control. In all other cases the transition probability is 1, and if the discount factor is 1, then the MDP can be defined as:

MDP Example
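To make the example concrete, here is a minimal sketch of how this student MDP could be written down in Python. The state and action names, rewards, and transition probabilities below are my reading of the figure, so treat them as assumptions and check them against the diagram.

```python
# Student MDP as plain dictionaries:
# state -> action -> (reward, {next_state: probability})
# Numbers are my reading of David Silver's student MDP figure (please verify).
STUDENT_MDP = {
    "Class 1":  {"Study":    (-2, {"Class 2": 1.0}),
                 "Facebook": (-1, {"Facebook": 1.0})},
    "Class 2":  {"Study":    (-2, {"Class 3": 1.0}),
                 "Sleep":    ( 0, {"Sleep": 1.0})},
    "Class 3":  {"Study":    (10, {"Sleep": 1.0}),
                 "Pub":      ( 1, {"Class 1": 0.2, "Class 2": 0.4, "Class 3": 0.4})},
    "Facebook": {"Facebook": (-1, {"Facebook": 1.0}),
                 "Quit":     ( 0, {"Class 1": 1.0})},
    "Sleep":    {},  # terminal state: no actions available
}
GAMMA = 1.0  # discount factor, as in the example above
```

The Pub action is the only place where the environment is stochastic; every other action leads to its next state with probability 1, exactly as described above.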

Now that we have an MDP, we need to solve it to find the best path that maximizes the sum of rewards, which is the goal of solving a reinforcement learning problem. Formally, we need to find an optimal policy that maximizes the overall reward the agent can get.

To solve an MDP, we first have to know about the policy and the value function.

In simple terms, a policy tells you which action to take in each state. It is defined as:

Policy definition, taken from David Silver's lecture slides
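In symbols (as on the slide), a policy is a distribution over actions given states:

\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]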

For MDPs, the policy depends only on the current state.

The value function can be defined in two ways: the state-value function and the action-value function. The state-value function tells you “how good” the state you are in is, whereas the action-value function tells you “how good” it is to take a particular action in a particular state. The “how good” of a state (or state-action pair) is defined in terms of expected future rewards.

The state-value function is defined as:

State-value function definition, taken from David Silver's lecture slides
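In symbols, with the return defined as G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots, the state-value function is the expected return when starting from state s and following policy \pi:

v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]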

Similarly, the action-value function is defined as:

Action-value function definition, taken from David Silver's lecture slides
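That is, the expected return after taking action a in state s and following policy \pi thereafter:

q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]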

If we take the maximum of the value function over all policies, we get the optimal value function. Once we know the optimal value function, we can solve the MDP to find the best policy.
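In symbols:

v_*(s) = \max_\pi v_\pi(s) \qquad q_*(s, a) = \max_\pi q_\pi(s, a)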

The value functions that we defined above satisfy the Bellman equation; it states: “the value of the start state must equal the (discounted) value of the expected next state, plus the reward expected along the way.”

For example, if we take the path from Class 1 to Class 2, we can write the Bellman equation in the following way:

Bellman equation for value function
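In its general form, the Bellman expectation equation for the state-value function is

v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma \, v_\pi(S_{t+1}) \mid S_t = s]

For the deterministic Class 1 → Class 2 transition (taking the Study action with reward −2 and, as above, γ = 1), this reduces to v(Class 1) = −2 + v(Class 2) under a policy that chooses Study in Class 1.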

The Bellman optimality equation can be written in a similar way:

Bellman optimality equation for Value function

These concepts can be easily extended to multiple paths, with different actions leading to different states. In this case, the Bellman optimality equation is:

Optimal state-value function
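Written out with the tuple notation from above, this is

v_*(s) = \max_a \Big( \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} \, v_*(s') \Big)

that is, act greedily: pick the action whose immediate reward plus discounted expected optimal value of the successor states is largest.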

Using the above equation, we can find the optimal value function for each state in our student MDP example.

Optimal state-value function
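As a sanity check, here is a short value-iteration sketch (my own addition, not part of the original post) that repeatedly applies the Bellman optimality backup to the STUDENT_MDP dictionary defined earlier until the values stop changing:

```python
def value_iteration(mdp, gamma=1.0, tol=1e-8):
    """Repeatedly apply the Bellman optimality backup until values converge."""
    v = {s: 0.0 for s in mdp}                 # initialise every state value to 0
    while True:
        delta = 0.0
        for s, actions in mdp.items():
            if not actions:                   # terminal state keeps value 0
                continue
            # max over actions of: reward + discounted expected value of successors
            best = max(r + gamma * sum(p * v[s2] for s2, p in nxt.items())
                       for r, nxt in actions.values())
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < tol:
            return v

# Reuses STUDENT_MDP and GAMMA from the earlier sketch.
print(value_iteration(STUDENT_MDP, GAMMA))
# Under the numbers assumed above, this converges to roughly:
# Class 1 = 6, Class 2 = 8, Class 3 = 10, Facebook = 6, Sleep = 0
```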

The optimal action-value function can be expressed in a similar fashion:

Optimal action-value function
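Written out, the Bellman optimality equation for the action-value function is

q_*(s, a) = \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} \, \max_{a'} q_*(s', a')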

This equation gives the following result in our student MDP example.

Optimal action-value function

Once we have the optimal action-value function, we can find the optimal policy by choosing, in each state, the action that maximizes it. Formally, it would be:

Optimal policy
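One common way to write this greedy, deterministic policy (as in Silver's slides) is

\pi_*(a \mid s) = \begin{cases} 1 & \text{if } a = \arg\max_{a' \in \mathcal{A}} q_*(s, a') \\ 0 & \text{otherwise} \end{cases}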

The optimal policy, which maximizes the reward for our student, is shown by the red arcs in the figure below.

Optimal policy

Summary:

An MDP represents the reinforcement learning problem mathematically, and the goal of solving an MDP is to find an optimal policy that maximizes the expected sum of rewards. Finding an optimal policy becomes easier once we have the optimal action-value function, and the intuition behind the Bellman equation simplifies the process of finding that action-value function.

References:

  1. Reinforcement Learning: An Introduction, Sutton and Barto
  2. David Silver's course on Reinforcement Learning

PS: I wrote this post based on my understanding of Reinforcement Learning. Any suggestions or improvements regarding the content and/or style of writing will be appreciated.
