Reinforcement Learning Basics (Part 1): Solving Finite MDPs

Mani
May 31, 2023


If you want to skip directly to the Python example, visit here, though I would recommend reading this article first.

The most basic cases of reinforcement learning can be represented using a set of states and actions. I will explain this with a visual example. Let's say you are playing a game like Flappy Bird, but with 3 controls: you can go 'up', go 'down', or 'stay in the middle'. To play this game, let us construct an environment. To keep things simple, we create a matrix-like diagram to define it:

a simple flappy-bird like environment

In the image above, we have a 3x5 grid, where the first column is the starting position of the bird (marked in blue) and the last column is the winning point (except for the 1st row). We have pipes (marked in red) at random locations in the grid, which the bird must avoid to win the game.

This may look trivial to us, but for an AI agent it is not. Just as a child would try the game and fail a couple of times before understanding how to win, a machine needs to 'practice' or 'learn' to beat the game. To understand how to do this, we must get a little technical.

To represent the above grid and the game mathematically, we would need a set of variables namely — states(S), actions(A), policy(pi), and rewards(R).

The states are the cells of the grid in our case. We have a 3x5 grid and thus a total of 15 states. The actions we can take are simple: at every step, the bird moves to the next column of the grid and can either go down, stay in the middle, or go to the top (as shown in the figure). Thus, the states are S = {0, 1, …, 14} and the actions are A = {'up', 'middle', 'down'}.
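As a quick illustration, here is how the states and actions might be written down in Python. The row-major numbering of the cells (0 to 14) and the variable names are my own assumptions, not taken from the companion code:

```python
# States and actions for the 3x5 grid above.
n_rows, n_cols = 3, 5
states = list(range(n_rows * n_cols))      # 15 states, one per grid cell
actions = ['up', 'middle', 'down']         # the three controls of the bird
```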

Rewards

Rewards are like treats given to pets. When the agent makes a correct decision, we reward it with some points (a scalar value); when it does not, we give it either no reward or a negative reward. In our example, the bird gets a negative reward if it hits an obstacle. Rewards play an important role because they are the basis of the value function, which is the expected sum of all future rewards from the current state. To learn more about this function, we must first define the meaning of a policy.
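One simple way to encode such rewards is a per-cell reward table. The exact values and pipe locations below are illustrative assumptions for this sketch, not the article's actual environment:

```python
import numpy as np

# Illustrative reward table for the 3x5 grid: a penalty for hitting a pipe,
# a bonus for reaching a winning cell in the last column, 0 everywhere else.
rewards = np.zeros((3, 5))
rewards[1, 2] = -10.0   # hypothetical pipe location
rewards[2, 3] = -10.0   # hypothetical pipe location
rewards[1, 4] = 10.0    # winning cell (last column, 2nd row)
rewards[2, 4] = 10.0    # winning cell (last column, 3rd row)
```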

Policy

A policy is formally defined as a mapping from states to actions.

A policy is like a 'guide' or 'map' that tells us what action to take in a particular state. It is the most important aspect of a reinforcement learning problem. When we say that the agent has learnt to solve a task, we usually mean that it has learnt an optimal policy, a 'navigation manual' that tells the agent which action to take at every state it can be in. To find an optimal policy, we must first know how good a certain state is and whether moving to that state is useful in the long run. This is where a value function comes into play.
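Concretely, a deterministic policy for our small grid can be stored as a plain lookup from state to action. The values below are placeholders for illustration, not an optimal policy:

```python
# A deterministic policy stored as a simple state -> action lookup.
policy = {s: 'middle' for s in range(15)}   # start with a dummy policy
policy[0] = 'up'                            # e.g. "from the start cell, go up"
action = policy[7]                          # "what should I do in state 7?"
```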

Value Function

A value function assigns a single scalar value to every state in the environment. It quantifies the state; in simple words, it tells us how useful it is to be in a certain state when following a given policy. Mathematically, it can be defined as:

value function under policy pi
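For reference, the standard definition from Sutton and Barto that this caption refers to is the expected discounted return when starting in state s and following the policy π thereafter:

```latex
v_{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[\, G_t \mid S_t = s \,\right]
          \;=\; \mathbb{E}_{\pi}\!\left[\, \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \;\middle|\; S_t = s \,\right]
```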

The goal is to find the optimal value of each state, so that we know whether visiting that state is worthwhile. These values will define the policy for the agent. To understand this, consider the grid of optimal values for each state:

optimal value for each state

The more negative a state's value, the more the agent must avoid that state. In our Flappy game example, these are the states with obstacles that will end the task in failure if touched. The policy therefore tells us to take actions such that not only is the immediate reward high, but the agent also reaches its goal (a high total reward, or return). Finding an optimal value function is the backbone of most reinforcement learning methods.

Finite MDP and Optimal policy evaluation

Most of the simple things we do in life are Markov decision processes, also known as MDPs. A problem is said to be a Markov decision process if it follows the Markov Property.

Informally, the Markov property tells us that the future states and rewards of an agent depend only on the current state and the action taken, and not on the agent's past history.
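In symbols (using the notation of Sutton and Barto), the dynamics of the process depend only on the most recent state and action:

```latex
\Pr\{ S_{t+1} = s',\, R_{t+1} = r \mid S_t, A_t, S_{t-1}, A_{t-1}, \ldots, S_0, A_0 \}
  \;=\; \Pr\{ S_{t+1} = s',\, R_{t+1} = r \mid S_t, A_t \}
```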

For example, if you were playing football, only the current position of the ball and the force applied to it would determine where the ball lands, not the past history of states the ball has been through. This is the gist of the Markov property.

Most of the tasks we face are either Markov decision processes or can be approximated as one. Since we also have a finite number of states in the game described above (15 cells), we can say that the problem is a finite Markov decision process: only where the bird currently is matters, not where it was in the past.

To get an optimal policy, we must evaluate multiple policies given by different value functions until we find the best possible function. The optimal value function then gives us the best policy to follow.

There are many ways to solve an MDP. One can use tabular methods like dynamic programming, or function approximation if the number of states is too high. Since our problem has only 15 states, we can use dynamic programming for policy evaluation. The method for doing this is called value iteration and is part of the policy evaluation process.

Policy Evaluation

To perform policy evaluation and value iteration, the following algorithm is used:

Policy evaluation (from Sutton and Barto)

I will try to break this down for you. We first initialise an array (or matrix) V, the value function, to store the values of all states s ∈ S. We then define a variable Δ and set it to 0.

V(s) gives us the total expected return of the rest of the task, from state s (via its successor states s') up to the terminal state. The values are computed recursively, which creates a dependency between the different states.

Next, we define the update rule for V(s) for every state in the set of states. This is done using the Bellman equation:
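The equation image has not survived here; for reference, the value-iteration form of this update, as given in Sutton and Barto, is:

```latex
V(s) \;\leftarrow\; \max_{a} \sum_{s',\, r} p(s', r \mid s, a)\,\bigl[\, r + \gamma V(s') \,\bigr]
```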

In this equation, p(s', r | s, a) is the probability of reaching state s' and obtaining reward r given the current state s and action a. The variable gamma (γ) is the discount factor, which multiplies the value of the next state. This weights immediate rewards more heavily than future ones. For a more formal definition and derivation of this equation, refer to this book.

This update is repeated until V(s) converges to the optimal value for all s ∈ S and all actions a ∈ A. We stop the loop when the difference between the old and new V values is very small (given by the last line of the algorithm). This gives us the optimal value function, which in turn gives us the best policy.
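To make the loop concrete, here is a minimal value-iteration sketch in Python. The transitions(s, a) function, returning (probability, next_state, reward, done) tuples, is a hypothetical interface introduced only for this illustration; it is not from the article's companion code.

```python
import numpy as np

def value_iteration(n_states, actions, transitions, gamma=0.9, theta=1e-6):
    """Value iteration for a finite MDP.

    `transitions(s, a)` is assumed to return a list of
    (probability, next_state, reward, done) tuples -- an interface
    assumed for this sketch.
    """
    V = np.zeros(n_states)                      # initialise V(s) to 0
    while True:                                 # sweep until convergence
        delta = 0.0
        for s in range(n_states):
            v_old = V[s]
            # Bellman update: best expected return over all actions
            V[s] = max(
                sum(p * (r + gamma * V[s_next] * (not done))
                    for p, s_next, r, done in transitions(s, a))
                for a in actions
            )
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:                       # stop on convergence
            break
    return V
```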

Summary

The process of solving the MDP goes in the following order:

  1. Define the rewards, states, actions, and other environment variables.
  2. Initialise the policy and value function to 0, None, or a dummy value.
  3. Iterate for a number of sweeps over all the states.
  4. In each sweep, update V(s) using the Bellman equation, which uses the previous iteration's value V(s') of the successor state in a recursive manner. (Refer to this book for more details.)
  5. Stop the loop upon convergence (when the difference between the new and old values is very small).
  6. Update the policy array with the actions corresponding to the optimal value function for each state (a small sketch of this step follows below).
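Once V(s) has converged, step 6 can be sketched as a greedy read-out of the value function, reusing the same hypothetical transitions(s, a) interface from the sketch above:

```python
def greedy_policy(V, n_states, actions, transitions, gamma=0.9):
    """Recover the greedy policy from a converged value function V."""
    policy = {}
    for s in range(n_states):
        # expected return of each action, one step ahead
        q_values = {
            a: sum(p * (r + gamma * V[s_next] * (not done))
                   for p, s_next, r, done in transitions(s, a))
            for a in actions
        }
        policy[s] = max(q_values, key=q_values.get)   # best action in state s
    return policy
```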

To understand this through a coding example, check out the second part of this article: https://medium.com/@manikantan.srinivasan2001/reinforcement-learning-basics-solving-a-finite-mdp-using-python-61f2c0d96bb0

Refer to this book for details on the mathematics: https://www.andrew.cmu.edu/course/10-703/textbook/BartoSutton.pdf
