Markov Decision Processes and Bellman Equations

The Perceptive Agent · Published in Analytics Vidhya · Apr 17, 2020

In the previous post, we dived into the world of Reinforcement Learning and learnt some basic but important terminology of the field. Today, I would like to discuss how we can frame a task as an RL problem, and also discuss the Bellman Equations. Bellman Equations are an absolute necessity when trying to solve RL problems, so I was extra careful while writing about this topic.

Types of RL tasks

All RL tasks can be divided into two types:
1. Episodic tasks: Taking the learning-to-walk example from the previous post, the agent must learn to walk to a destination point on its own. What happens when the agent successfully reaches the destination point? Since that was all there is to the task, the agent can start at the starting position again and try to reach the destination more efficiently. This is an example of an episodic task. In such tasks, the agent-environment interaction breaks down into a sequence of episodes. Episodic tasks are mathematically easier because each action affects only the finite number of rewards received during the rest of the episode.
2. Continuing tasks: I am sure the readers will be familiar with endless running games like Subway Surfers and Temple Run. Now, imagine an agent trying to learn to play these games to maximize the score. But these games have no end. This is an example of a continuing task. Another example is an agent that must assign incoming HTTP requests to various servers across the world; this task goes on for as long as the servers are online, so it too is a continuing task.
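This distinction shows up directly in how the return (the quantity the agent tries to maximize) is written. As a quick reference, using the standard Sutton and Barto notation, where T is the terminal time step and γ is the discount factor (an assumption I am introducing here, since discounting has not been defined yet in this post):

```latex
% Episodic task: the return is a finite sum, ending at the terminal step T
G_t = R_{t+1} + R_{t+2} + \dots + R_T

% Continuing task: an infinite sum, kept finite by discounting with 0 \le \gamma < 1
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
```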

Markov Decision Processes — The future depends on what I do now!

In Reinforcement Learning, all problems can be framed as Markov Decision Processes (MDPs). A fundamental property of all MDPs is that the future states depend only upon the current state. This is because the current state is assumed to summarize all the relevant information about the past and the present, and hence the future depends only on the current state.

In more technical terms, the future and the past are conditionally independent, given the present.
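Written as an equation, the Markov property says that once the current state and action are known, the earlier history adds no further information about the next state:

```latex
% Markov property: the future is conditionally independent of the past, given the present
P(S_{t+1} = s' \mid S_t, A_t, S_{t-1}, A_{t-1}, \dots, S_0, A_0) = P(S_{t+1} = s' \mid S_t, A_t)
```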

A typical Agent-Environment interaction in a Markov Decision Process. (Source: Sutton and Barto)

Let us look at an example of an MDP:

Example of a simple MDP with three states (green circles) and two actions (orange circles), with two rewards (orange arrows). Source

In the above image, there are three states: S₀, S₁, S₂, and two possible actions in each state: a₀, a₁. The numbers on the arrows represent the transition probabilities. For example, if an agent starts in state S₀ and takes action a₀, there is a 50% probability that the agent lands in state S₂ and another 50% probability that the agent returns to state S₀. In this MDP, two rewards can be obtained: by taking a₁ in S₂ or by taking a₀ in S₁.
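To make the structure concrete, here is a minimal Python sketch of how such an MDP could be represented. Only the 50/50 split for taking a₀ in S₀ comes from the description above; every other probability and reward value is an illustrative placeholder, not the actual numbers from the figure.

```python
import random

# Transition model: P[state][action] -> list of (probability, next_state, reward).
# The (0.5, 0.5) split for taking a0 in S0 is from the example above;
# all other numbers are placeholders chosen just for illustration.
P = {
    "S0": {"a0": [(0.5, "S0", 0.0), (0.5, "S2", 0.0)],
           "a1": [(1.0, "S2", 0.0)]},
    "S1": {"a0": [(0.7, "S0", 5.0), (0.3, "S1", 0.0)],    # reward for a0 in S1
           "a1": [(1.0, "S2", 0.0)]},
    "S2": {"a0": [(1.0, "S0", 0.0)],
           "a1": [(0.6, "S0", -1.0), (0.4, "S2", 0.0)]},  # reward for a1 in S2
}

def step(state, action):
    """Sample the next state and reward, using only the current state (Markov property)."""
    probs, outcomes = zip(*[(p, (s_next, r)) for p, s_next, r in P[state][action]])
    return random.choices(outcomes, weights=probs, k=1)[0]

next_state, reward = step("S0", "a0")  # lands in S0 or S2 with probability 0.5 each
```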

Bellman Equations

A fundamental property of value functions used throughout reinforcement learning and dynamic programming is that they satisfy recursive relationships as shown below:

Bellman Equation for the value function. (Source: Sutton and Barto)

We know that the value of a state is the total expected reward from that state up to the final state. It can also be thought of in the following manner: if we take an action a in state s and end up in state s’, then the value of state s is the expected sum of the reward obtained by taking action a in state s and the discounted value of the state s’.
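Written out in the Sutton and Barto notation, the Bellman expectation equation for the state-value function under a policy π is:

```latex
v_{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \, v_{\pi}(s') \right]
```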

Similarly for Q-value:
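In the same notation, the Q-value version reads:

```latex
q_{\pi}(s, a) = \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \sum_{a'} \pi(a' \mid s') \, q_{\pi}(s', a') \right]
```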

Bellman Expectation Equations

This recursive property of the Bellman equations lets us update both the state-value and the action-value functions as the agent progresses from state to state while following policy π:

The recursive property allows the state-value and action-value functions to be updated from state to state. (Source)

Bellman Optimality Equations

If we are interested only in the optimal values, we take the maximum over actions instead of averaging over the actions suggested by a policy π.

Source
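For reference, the Bellman optimality equations in the same notation simply replace the average over the policy’s actions with a maximum:

```latex
v_{*}(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \, v_{*}(s') \right]

q_{*}(s, a) = \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \max_{a'} q_{*}(s', a') \right]
```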

It should be clear that if the agent knows the dynamics of the environment, finding the optimal values is possible. But the transition probabilities Pᵃₛₛ’ and the rewards R(s, a) are unknown for most problems. Still, the Bellman Equations form the basis of many RL algorithms.

Dynamic Programming — Finding the optimal policy when the environment’s model is known

If the model of the environment is known, Dynamic Programming can be used along with the Bellman Equations to obtain the optimal policy. This requires two basic steps:

Policy Evaluation

Compute the state-value function Vπ for a given policy π. This is called Policy Evaluation.

where π(a|s) is the probability of taking action a in state s under policy π, and the expectations are subscripted by π to indicate that they are conditional on π being followed.
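As a concrete sketch, iterative policy evaluation can be written as a sweep over states that applies the Bellman expectation equation as an update rule until the values stop changing. The code below reuses the transition-model layout from the earlier MDP sketch (P[s][a] is a list of (probability, next_state, reward) tuples), which is my own assumed layout rather than anything fixed by this article.

```python
def policy_evaluation(P, policy, gamma=0.9, theta=1e-6):
    """Iteratively compute V_pi for a given policy.

    P:      P[s][a] -> list of (prob, next_state, reward) tuples (model of the MDP)
    policy: policy[s][a] -> probability of taking action a in state s
    """
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Bellman expectation backup for state s
            v_new = sum(
                policy[s][a] * sum(p * (r + gamma * V[s_next])
                                   for p, s_next, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:   # stop when the largest update is tiny
            return V
```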

Policy Improvement

Suppose we have determined the value function Vπ for an arbitrary deterministic policy π. For some state s, we would like to know whether or not we should change the policy to deterministically choose an action a ≠ π(s). One way to answer this is to select a once in s and thereafter follow the existing policy π.

Suppose that choosing an action a ≠ π(s) once in s and thereafter following the existing policy π is better than following the current policy all the time, i.e. Qπ(s, a) > Vπ(s). Then it is expected to be better still to choose action a every time state s is encountered, which results in a better overall policy. This is the policy improvement theorem.
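A minimal sketch of the improvement step, under the same assumed data layout: for each state we do a one-step lookahead to compute the action-values from Vπ and then act greedily with respect to them.

```python
def policy_improvement(P, V, gamma=0.9):
    """Return a deterministic policy that is greedy with respect to V."""
    new_policy = {}
    for s in P:
        # One-step lookahead: q(s, a) = sum over (s', r) of p * (r + gamma * V(s'))
        q = {a: sum(p * (r + gamma * V[s_next]) for p, s_next, r in P[s][a])
             for a in P[s]}
        best_action = max(q, key=q.get)
        new_policy[s] = {a: 1.0 if a == best_action else 0.0 for a in P[s]}
    return new_policy
```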

Policy Iteration

Once a policy, π, has been improved using Vπ to yield a better policy, π’, we can then compute Vπ’ and improve it again to yield an even better π’’. We can thus obtain a sequence of monotonically improving policies and value functions:

Policy Iteration to obtain the optimal policy
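Putting the two steps together, policy iteration alternates evaluation and improvement until the policy stops changing. This sketch simply chains the two hypothetical helper functions defined above:

```python
def policy_iteration(P, gamma=0.9):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    # Start from a uniformly random policy
    policy = {s: {a: 1.0 / len(P[s]) for a in P[s]} for s in P}
    while True:
        V = policy_evaluation(P, policy, gamma)
        new_policy = policy_improvement(P, V, gamma)
        if new_policy == policy:     # no state changed its greedy action
            return new_policy, V
        policy = new_policy
```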

Say we have a policy π and generate an improved version π′ by acting greedily with respect to Vπ. The value of this improved π′ is guaranteed to be at least as good because:

Source

This is it for this one. I did not cover Dynamic Programming in detail because this series is going to be more focused on Model-Free algorithms.
In the next tutorial, let us talk about Monte Carlo methods.
