Bellman Equation and Dynamic Programming

Sanchit Tanwar · Published in Analytics Vidhya · Jan 28, 2019 · 3 min read

Sign up for my live computer vision course: https://bit.ly/cv_coursem

This is a series of articles on reinforcement learning; if you are new and have not read the earlier ones, please do (links at the end of this article). So far we have discussed only the basics of reinforcement learning and how to formulate a reinforcement learning problem as a Markov decision process (MDP). From now on we will work on solving the MDP.

Bellman Equation

If you have read anything related to reinforcement learning, you must have encountered the Bellman equation somewhere. The Bellman equation is the basic building block for solving reinforcement learning problems and is omnipresent in RL. It helps us solve the MDP, where solving means finding the optimal policy and value functions.

The optimal value function V*(s) is the one that yields the maximum value.

According to the Bellman equation, the value of a given state equals the maximum over actions of the reward for taking that action in the given state, plus the discount factor multiplied by the value of the next state.

V(s) = maxₐ(R(s,a) + γV(s'))

Bellman equation for a deterministic environment

Let's understand this equation. V(s) is the value of being in a certain state, and V(s') is the value of the next state that we end up in after taking action a. R(s, a) is the reward we get after taking action a in state s. Since we can take different actions, we take the maximum over actions, because our agent wants to end up in the optimal state. γ is the discount factor, as discussed earlier. This is the Bellman equation for a deterministic environment (discussed in part 1); it will be slightly different for a non-deterministic, or stochastic, environment.
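To make the backup concrete, here is a tiny Python sketch of the deterministic case. The lookup tables next_state and reward below are hypothetical placeholders standing in for an environment's dynamics; they are not from the original article.

# A minimal sketch of the deterministic Bellman backup.
# next_state[s][a] and reward[s][a] are hypothetical lookup tables; gamma is the discount factor.
def bellman_backup(s, V, next_state, reward, actions, gamma=0.9):
    # V(s) = max over a of ( R(s, a) + gamma * V(s') ), where s' is fully determined by (s, a)
    return max(reward[s][a] + gamma * V[next_state[s][a]] for a in actions)

# Toy usage with two states and two actions.
next_state = {"s0": {"left": "s0", "right": "s1"}, "s1": {"left": "s0", "right": "s1"}}
reward = {"s0": {"left": 0.0, "right": 1.0}, "s1": {"left": 0.0, "right": 0.0}}
V = {"s0": 0.0, "s1": 0.0}
print(bellman_backup("s0", V, next_state, reward, ["left", "right"]))  # prints 1.0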

V(s) = maxₐ(R(s,a) + γ Σₛ' P(s,a,s')V(s'))

Bellman equation for a stochastic environment

In a stochastic environment, taking an action does not guarantee that we will end up in a particular next state; instead, there is a probability of landing in each possible state. P(s, a, s') is the probability of ending up in state s' from s by taking action a, and the sum runs over all possible next states. For example, suppose that by taking an action in state s we can end up in three states s₁, s₂, and s₃ with probabilities 0.2, 0.2, and 0.6. The Bellman equation will then be

V(s) = maxₐ(R(s,a) + γ(0.2*V(s₁) + 0.2*V(s₂) + 0.6*V(s₃)))
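Plugging in the numbers from this example (with made-up values for the reward and the three next-state values, just for illustration), the term inside the max for this particular action works out like this:

gamma = 0.9                                 # discount factor (illustrative value)
R_sa = 1.0                                  # hypothetical immediate reward R(s, a)
V_s1, V_s2, V_s3 = 0.5, 0.2, 1.0            # hypothetical values of the three next states
expected_future = 0.2 * V_s1 + 0.2 * V_s2 + 0.6 * V_s3   # = 0.74
value_for_a = R_sa + gamma * expected_future              # = 1.0 + 0.666 = 1.666
print(value_for_a)  # V(s) is the maximum of this quantity over all available actions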

We can solve the Bellman equation using a special technique called dynamic programming.

Dynamic Programming

Dynamic programming (DP) is a technique for solving complex problems. In DP, instead of attacking a complex problem all at once, we break it into simple subproblems, then compute and store the solution to each subproblem. If the same subproblem occurs again, we do not recompute it; instead, we reuse the already computed solution.
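As a quick, generic illustration of this idea (Fibonacci is just a stand-in example, not something from the article), caching subproblem solutions turns repeated work into a single lookup:

from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    # Each subproblem fib(k) is solved once and stored; later calls reuse the cached answer.
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(50))  # fast, because overlapping subproblems are never recomputed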

We solve the Bellman equation using two powerful algorithms:

  • Value Iteration
  • Policy Iteration

Value Iteration

We will learn it using diagrams and programs.

Value iteration

In value iteration, we start off with a random value function. Since a randomly initialized value table is not optimal, we improve it iteratively by repeatedly applying the Bellman equation until the values converge.

Let's start with the programming. We will use OpenAI Gym and NumPy for this.

Value Iteration
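Since the original code embed is not shown here, below is a minimal value iteration sketch. It assumes the FrozenLake-v1 environment from OpenAI Gym and access to its transition table via env.unwrapped.P; the exact environment id and attribute can differ across Gym versions, so treat this as an illustrative sketch rather than the article's original code.

import numpy as np
import gym

def value_iteration(env, gamma=0.99, theta=1e-8, max_iters=10000):
    # Repeatedly apply the Bellman backup V(s) = max_a sum_s' P(s,a,s') * (R + gamma * V(s'))
    n_states = env.observation_space.n
    n_actions = env.action_space.n
    P = env.unwrapped.P                      # P[s][a] -> list of (prob, next_state, reward, done)
    V = np.zeros(n_states)                   # start from an arbitrary (here all-zero) value table
    for _ in range(max_iters):
        delta = 0.0
        for s in range(n_states):
            q = [sum(prob * (r + gamma * V[s_next])
                     for prob, s_next, r, done in P[s][a])
                 for a in range(n_actions)]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:                    # stop once the largest update is negligible
            break
    # Extract a greedy policy with respect to the converged value table
    policy = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        q = [sum(prob * (r + gamma * V[s_next])
                 for prob, s_next, r, done in P[s][a])
             for a in range(n_actions)]
        policy[s] = int(np.argmax(q))
    return V, policy

env = gym.make("FrozenLake-v1")
V, policy = value_iteration(env)
print(V.reshape(4, 4))
print(policy.reshape(4, 4))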

Policy Iteration


In policy iteration, the actions the agent needs to take are decided (initialized) first, and the value table is created according to that policy; the policy is then improved using the new values, and the two steps repeat until the policy stops changing.

Code for policy iteration:

Policy Iteration
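As with value iteration, the original embed is not shown, so here is a minimal policy iteration sketch under the same assumptions (FrozenLake-v1 from OpenAI Gym, transition table via env.unwrapped.P):

import numpy as np
import gym

def policy_evaluation(policy, env, gamma=0.99, theta=1e-8):
    # Compute V for a fixed policy by sweeping the Bellman expectation backup until convergence.
    P = env.unwrapped.P
    V = np.zeros(env.observation_space.n)
    while True:
        delta = 0.0
        for s in range(env.observation_space.n):
            v_new = sum(prob * (r + gamma * V[s_next])
                        for prob, s_next, r, done in P[s][policy[s]])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

def policy_iteration(env, gamma=0.99, max_iters=1000):
    # Alternate policy evaluation and greedy policy improvement until the policy is stable.
    P = env.unwrapped.P
    n_states, n_actions = env.observation_space.n, env.action_space.n
    policy = np.zeros(n_states, dtype=int)         # start from an arbitrary policy (all zeros)
    for _ in range(max_iters):
        V = policy_evaluation(policy, env, gamma)
        new_policy = np.array([
            int(np.argmax([sum(prob * (r + gamma * V[s_next])
                               for prob, s_next, r, done in P[s][a])
                           for a in range(n_actions)]))
            for s in range(n_states)])
        if np.array_equal(new_policy, policy):     # no change -> the policy is stable
            return V, policy
        policy = new_policy
    return V, policy

env = gym.make("FrozenLake-v1")
V, policy = policy_iteration(env)
print(policy.reshape(4, 4))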

Reinforcement learning series

  1. Introduction to reinforcement learning.
  2. Markov chains and Markov decision process.
  3. Bellman equation and dynamic programming → You are here.

References

  1. Hands-On Reinforcement Learning with Python by Sudarshan Ravichandran
  2. https://medium.com/@taggatle/02-reinforcement-learning-move-37-the-bellman-equation-254375be82bd
