Reinforcement Learning Basics (Part 2): Solving a Finite MDP Using Python

Mani
4 min read · May 31, 2023


Read the first part of this article for the background behind this example 👉 https://medium.com/@manikantan.srinivasan2001/reinforcement-learning-basics-solving-finite-mdps-5d748d6ec122

In this part, I will show you how to solve a simple reinforcement learning problem using value iteration and policy improvement. To understand what a finite MDP is and how it can be solved, refer to the first part.

Environment

As described in the first part, let us construct an environment first:

states = [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
          [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
          [20, 21, 22, 23, 24, 25, 26, 27, 28, 29]]

terminal_states = [9,19,29] #states which will terminate the game

rewards = [[1, 0, 1, 1, 1, 0, 0, 1, 1, 1],
           [1, 1, 0, 1, 1, 1, 1, 0, 1, 1],
           [1, 0, 0, 0, 0, 1, 0, 1, 0, 1]] #1 for positive reward/pathway, 0 for obstacle

The states variable holds a 3x10 grid (30 states in total) with 27 non-terminal states and 3 terminal states (9, 19, 29). The rewards matrix assigns a reward to each of the 30 states: 1 marks a pathway and 0 marks an obstacle. The agent receives a reward of 1 for moving through a pathway and 0 for colliding with an obstacle.
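Since each state id encodes its own row and column (row = id // 10, column = id % 10), a small helper makes it easy to map between the two. This is just a convenience sketch of my own, with hypothetical names; the original notebook indexes the lists directly.

# Convenience helpers (not part of the original notebook): map a state id to
# its grid position and look up its reward.
def to_row_col(state_id):
    return state_id // 10, state_id % 10   # e.g. state 17 -> row 1, column 7

def reward_of(state_id):
    i, j = to_row_col(state_id)
    return rewards[i][j]

print(to_row_col(17))  # (1, 7)
print(reward_of(17))   # 0, i.e. an obstacle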

We next define a dummy policy that assigns the 'down' action to each of the 30 states. This simply gives the algorithm a starting point.

policy = [['down', 'down', 'down', 'down', 'down', 'down', 'down', 'down', 'down', 'down'],
          ['down', 'down', 'down', 'down', 'down', 'down', 'down', 'down', 'down', 'down'],
          ['down', 'down', 'down', 'down', 'down', 'down', 'down', 'down', 'down', 'down']]

values = [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
          [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
          [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]

Similarly, we initialize the value function (values) with zeros, as shown above.
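If you prefer not to type out the literals, the same initialization can be written more compactly with list comprehensions (an equivalent sketch, not the notebook's code):

# Equivalent, more compact initialization of the dummy policy and the zero values
policy = [['down'] * 10 for _ in range(3)]
values = [[0] * 10 for _ in range(3)]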

Next, we define a few hyperparameters. To understand what each of these does, refer to the first part.

delta = 1e-400    # convergence threshold (1e-400 underflows to 0.0 in Python, so the loop below always runs all episodes; a value like 1e-10 would allow an early stop)
episodes = 10000  # maximum number of value-iteration sweeps
gamma = 0.9       # discount factor

Now, let us look at the logic for updating the value of each state according to the reward it can obtain and the action taken by the agent.

for a in actions:
    val = rewards[i][j]
    if a == 'up':
        val += 0.33 * (gamma * V[0][j+1])
    elif a == 'middle':
        val += 0.33 * (gamma * V[1][j+1])
    else:
        val += 0.33 * (gamma * V[2][j+1])

The val variable accumulates the immediate reward plus the discounted value of the next state for a given action. The number 0.33 is the probability of selecting the action: since there are 3 actions ('down', 'up', 'middle'), each has probability 1/3. We multiply this by the discount factor gamma, whose value is 0.9. You can play with this hyperparameter, but the norm is to keep it close to 1 when the task is episodic, as in our case.

We multiply gamma by the value function from the previous iteration (V) and index into the next column, since that is the state the action leads to. Hence the term V[i][j+1], where i is 0, 1, or 2 depending on the row the action moves the agent to.
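To see the update in isolation, here is a small helper that computes the same quantity for one cell and one action. The function name action_value and the row_for_action mapping are my own, not from the original code, which writes the three branches inline as shown above.

# A standalone sketch of the update above (names are mine, not from the notebook):
# the value of taking action a from cell (i, j), given the previous value function V.
row_for_action = {'up': 0, 'middle': 1, 'down': 2}

def action_value(i, j, a, V, gamma=0.9, p=0.33):
    # immediate reward of the current cell plus the discounted value of the
    # cell in the next column that the action leads to
    return rewards[i][j] + p * gamma * V[row_for_action[a]][j + 1]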

The complete loop is given below:

actions = ['up', 'middle', 'down']  # each action moves the agent into the corresponding row of the next column
V = values                          # start from the zero-initialized value function

for episode in range(episodes):
    max_diff = 0
    V_new = [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
             [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
             [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
    for i in range(len(states)):
        for j in range(len(states[i])):
            max_val = 0
            if states[i][j] in terminal_states:
                continue
            else:
                for a in actions:
                    val = rewards[i][j]
                    if a == 'up':
                        val += 0.33 * (gamma * V[0][j+1])
                    elif a == 'middle':
                        val += 0.33 * (gamma * V[1][j+1])
                    else:
                        val += 0.33 * (gamma * V[2][j+1])
                    max_val = max(max_val, val)
                    if V[i][j] < val:
                        policy[i][j] = a  # update policy according to the best action
                V_new[i][j] = max_val
                max_diff = max(max_diff, abs(V[i][j] - V_new[i][j]))
    V = V_new

    if max_diff < delta:
        break

We update the policy matrix with the best action found during value iteration. To understand the algorithm better, you can refer to the Sutton and Barto book linked at the end of this article.
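Note that the loop updates policy[i][j] inside the action sweep by comparing each candidate value against the old V[i][j]. An alternative, shown below as a sketch of my own (extract_policy and row_for_action are hypothetical names, not from the notebook), is to run value iteration to convergence first and then extract the greedy policy in a single pass.

# Optional alternative (a sketch, not the original notebook's approach):
# extract the greedy policy from a converged value function in one pass.
def extract_policy(V, gamma=0.9, p=0.33):
    row_for_action = {'up': 0, 'middle': 1, 'down': 2}
    greedy = [['down'] * 10 for _ in range(3)]
    for i in range(3):
        for j in range(10):
            if states[i][j] in terminal_states:
                continue
            # one-step value of each action from cell (i, j)
            q = {a: rewards[i][j] + p * gamma * V[row_for_action[a]][j + 1]
                 for a in row_for_action}
            greedy[i][j] = max(q, key=q.get)  # action with the highest value
    return greedy

greedy_policy = extract_policy(V)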

Output Policy

The policy we obtain after 10,000 iterations of the value-iteration algorithm is:

#Policy (optimal)
['middle', 'up', 'middle', 'middle', 'down', 'middle', 'down', 'middle', 'down', 'down']
['middle', 'up', 'middle', 'middle', 'down', 'middle', 'down', 'middle', 'down', 'down']
['middle', 'up', 'middle', 'middle', 'down', 'middle', 'down', 'middle', 'down', 'down']

#V-values (optimal)
[[1.422449529502159, 0.42238898822275767, 1.4221851455311705, 1.4214988065022571, 1.419187900681, 0.41140707300000007, 0.3852090000000001, 1.2970000000000002, 1.0, 0],
[1.422449529502159, 1.4223889882227576, 0.42218514553117037, 1.4214988065022571, 1.419187900681, 1.4114070730000001, 1.3852090000000001, 0.29700000000000004, 1.0, 0],
[1.422449529502159, 0.42238898822275767, 0.42218514553117037, 0.421498806502257, 0.4191879006810001, 1.4114070730000001, 0.3852090000000001, 1.2970000000000002, 0, 0]]

This policy tells us what action to take when the agent is in a particular state. The policy can be visualized as follows:

[Figure: the agent follows the path (in red) according to the policy we have obtained]

You can verify this by looking at the reward matrix: every cell the agent passes through contains a 1, giving us the maximum expected return.
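You can also check it programmatically by rolling the policy out from a start state and collecting the rewards along the way. The rollout helper below is a sketch of my own (not from the original notebook), assuming the agent starts in column 0 and that each action moves it to the corresponding row of the next column, exactly as in the update rule above.

# Rollout sketch (hypothetical helper, not from the original notebook): follow the
# learned policy from column 0 until a terminal state is reached.
def rollout(start_row=0):
    row_for_action = {'up': 0, 'middle': 1, 'down': 2}
    i, j = start_row, 0
    path, total_reward = [states[i][j]], rewards[i][j]
    while states[i][j] not in terminal_states:
        i = row_for_action[policy[i][j]]  # the chosen action picks the row of the next column
        j += 1
        path.append(states[i][j])
        total_reward += rewards[i][j]
    return path, total_reward

print(rollout(start_row=0))  # every visited cell should carry a reward of 1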

You can get the complete code on GitHub here: https://github.com/mani2001/TDS/blob/main/MDP_article_tds/MDP_solve.ipynb

Conclusion

To navigate the vast sea of knowledge in reinforcement learning, it is important to get your fundamentals straight, especially topics like MDPs, Monte Carlo methods, dynamic programming, and the Bellman equation. The RL book by Richard S. Sutton and Andrew G. Barto is the best one out there in this field. Please share and like the post if you found it useful! :)

Sutton and Barto RL book: https://www.andrew.cmu.edu/course/10-703/textbook/BartoSutton.pdf

First part of the article: https://medium.com/@manikantan.srinivasan2001/reinforcement-learning-basics-solving-finite-mdps-5d748d6ec122

GitHub: https://github.com/mani2001/TDS/blob/main/MDP_article_tds/MDP_solve.ipynb
