Reinforcement Learning formulation for Markov Decision Process and Multi-Armed Bandit

Mehul Gupta
Data Science in your pocket
5 min read · Feb 26, 2020


I explored the basics of Reinforcement Learning in the previous post & will now go a step deeper. Reinforcement Learning problems come in many forms, which I shall point out below.

For any term appearing ghostly👻, kindly check out here!!

Before starting, let's dive deeper into the epsilon-greedy approach.

The policy starts by generating a random number. If it is greater than some threshold (the epsilon we set), we choose the action with the maximum estimated reward (exploit); otherwise, we take a random action (explore). We take such an approach to maintain the explore-exploit tradeoff discussed in my previous blog.
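A minimal sketch of this policy in Python (the q_values list and the epsilon value here are placeholders I am assuming, not anything fixed by the method):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit the best estimate."""
    if random.random() > epsilon:
        # exploit: the action with the highest estimated reward so far
        return max(range(len(q_values)), key=lambda a: q_values[a])
    # explore: any action, chosen at random
    return random.randrange(len(q_values))

print(epsilon_greedy([0.2, 0.9, 0.4], epsilon=0.1))  # usually prints 1
```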

Consider the below environment.

Here, if we are at (900,700), moving forward to (900,450) takes us closer to the top red star (the end, at 1200,100), so let's consider moving forward the highest-rewarding action for this state. On generating a random number, if it is greater than the set epsilon, we move forward; otherwise we select any action at random (it could even be left or right, which isn't possible here), and we might get punished for it.

Now coming back on track

Multi-Armed Bandit

This is what an armed bandit looks like!!

Consider yourself in a casino. You are given 5 such slot machines with an arm attached to each machine. You would definitely want to win big!!

But how would you know which arm to pull for the maximum reward?

I know a trick!! Reinforcement Learning.

MAB (Multi-Armed Bandit) problems are stateless Reinforcement Learning problems where we take one action (pull the bandit) & get a reward. It is not like we need to take a sequence of steps to get the final reward (as in Snakes & Ladders, where you move closer to the final reward step by step). It can also be taken as a system with only 2 states, initial & final.

Solving a MAB is quite easy.

  • Initialize an array M[x] with 0, where x = the total number of slot machines (5 in our case)
  • Choose an action (depending on the policy chosen). If it is the first time a particular slot machine is chosen, set M[x] = Reward; otherwise update M[x] to the running average of all rewards received from that machine so far.
  • Repeat step 2 for as long as you wish to train the system (a sketch of this loop follows below).
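A rough Python sketch of those three steps, assuming 5 slot machines whose payouts I am simulating with random numbers (the pull() function and its payout values are made up purely for illustration):

```python
import random

n_machines = 5
M = [0.0] * n_machines          # step 1: average reward per machine, initialised to 0
counts = [0] * n_machines       # how many times each machine was pulled

def pull(machine):
    """Hypothetical slot machine: returns a noisy reward (stand-in for the real casino)."""
    true_payouts = [0.2, 0.5, 0.1, 0.8, 0.3]
    return true_payouts[machine] + random.gauss(0, 0.1)

epsilon = 0.1
for _ in range(1000):           # step 3: repeat as long as you wish to train
    # step 2: choose a machine with the epsilon-greedy policy
    if random.random() > epsilon:
        machine = max(range(n_machines), key=lambda m: M[m])
    else:
        machine = random.randrange(n_machines)
    reward = pull(machine)
    counts[machine] += 1
    # running average of all rewards seen for this machine
    M[machine] += (reward - M[machine]) / counts[machine]

print("Estimated value per machine:", [round(v, 2) for v in M])
```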

Markov Decision Process

You are familiar with the above picture. Snakes & Ladders!!!

Consider you are on 37. What happens if you roll a 1? You take the ladder. But does it have anything to do with when you were at 35? Not really. Your future position (taking the ladder to 43) depends only on your present state (37) & not on any past position (35, 27, or whatever).

Such a system, where states obey the Markov Property (the future depends only on the present state and not on any past state), is what we call a Markov Decision Process. MDPs are examples of stateful Reinforcement Learning problems, as we have more than 2 states in the system, i.e. we also have intermediate states like 20, 30, 40, etc., apart from the initial state (1) & final state (50), and the journey can't end without passing through intermediate states.
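One way to see the Markov Property in code: the next position in Snakes & Ladders depends only on where you are now and the dice roll, never on how you got there. The jumps dictionary below is a made-up fragment of a board, not the real one (only the 38 → 43 ladder comes from the example above):

```python
# A few illustrative jumps: ladder 38 -> 43 (from the example), snake 45 -> 20 (assumed)
jumps = {38: 43, 45: 20}

def next_position(current, dice):
    """The future depends only on the present state + action; no history is needed."""
    landed = current + dice
    return jumps.get(landed, landed)

print(next_position(37, 1))  # 38 has a ladder, so we end up at 43
```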

The above example is a Finite Markov Decision Process, as the number of states is finite (50 states, from 1 to 50). If the state space were infinite, it would be an infinite MDP instead.

When we train an agent to play Snakes & Ladders, we want our policy to give less preference to reaching 45 (a snake), as it takes us away from our goal (50). Likewise, we want our policy to give more preference to 38 (a ladder).

But such a preference can only be given when we know how close a state can take us to the end; in short, when we know the future rewards we may receive if this step is followed. Hence we use the:

Action-Value Function

It is a (value | state, action) mapping that we keep updating to know how much we should prefer taking a given action in the present state. This 'preference' is generally made by estimating the future rewards we might get from that state-action pair.

  • Return = E(R1 + aR2 + a²R3 + ….. | given state, chosen action), where 'a' is the Discount Factor (a sort of exponential weighting: the farther away the reward, the less its weight). Future rewards are given less weight because we haven't actually received them; we only assume we might.
  • E() denotes the expected value over the rewards collected while following the chosen policy, which at each step decides whether the greedy action is followed or a random one is chosen. Epsilon-greedy can be used here. Do look at the algorithm below for more clarification.
  • A value function refers to (Value | State). It must be noted that there is no Action here. (A tiny worked example of the discounted return follows this list.)
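A small worked example of that discounted sum, assuming a made-up reward sequence and a discount factor a = 0.9:

```python
a = 0.9                      # discount factor
rewards = [1, 0, 0, 5]       # R1, R2, R3, R4 (illustrative values only)

# Return = R1 + a*R2 + a^2*R3 + a^3*R4: farther rewards get exponentially less weight
ret = sum((a ** i) * r for i, r in enumerate(rewards))
print(ret)  # 1 + 0 + 0 + 0.729*5 = 4.645
```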

One more problem exists!!

How to generate rewards for future states?

Dynamic Programming is the key. Not aware of it? Do check it out here.

I will pen down a small algo first (assuming we follow the e-greedy policy):

  • Initialize a matrix M(s,a) to 0 (setting all value|state,action entries to 0) for s → S, a → A, where S → set of states, A → set of actions
  • Function Reward(current_state):
    - Decide on the action to be taken using the e-greedy policy (any other policy can also be followed)
    - Return a reward based on the chosen action: a small reward for a legal move, a negative reward for an illegal move (out of the maze, onto a blocked path), and a big reward on reaching final_state. The magnitudes can be set by the dev.
  • Function updating_action_value(current_state):
    if current_state = final_state: return 0
    M(current_state, chosen_action) = Reward(current_state) + a * updating_action_value(next_state_reached)
  • For x → number_of_epochs/episodes_for_learning: call updating_action_value(initial_state)

Going through it one by one: we first initialize a matrix M (dimension S×A, S = total states, A = all possible actions) with 0. These are the initial values of the Action-Value function for each state-action pair in the environment we are using.

updating_action_value: the first line is the termination check of the Dynamic Programming recursion. If the final state hasn't been reached, we update the matrix M for the current state and chosen action: the first term is the present reward for the action taken, while the second term accounts for future rewards. 'a' is the discount factor.

Now, for learning purposes, we run updating_action_value(initial_state) for any number of epochs (as in Deep Learning). For determining future rewards, we can again follow e-greedy, or pure exploitation that always picks the best-known action.
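Putting the whole algorithm together in a hedged Python sketch. The environment here (a small 1-D corridor standing in for a maze, the reward magnitudes, epsilon, the discount factor) is entirely my own placeholder, and I have split Reward() into choosing an action and an environment step for readability; only the structure (the M matrix, the e-greedy choice, the recursive present-reward + a*future update, and the epoch loop) mirrors the steps above:

```python
import random

# Hypothetical 1-D corridor: states 0..10, goal at state 10 (stand-in for a maze).
actions = [+1, -1]                      # move right, move left
final_state = 10
a = 0.9                                 # discount factor
epsilon = 0.2

# Step 1: the matrix M(s, a), every value|state,action entry set to 0
M = {(s, act): 0.0 for s in range(final_state + 1) for act in actions}

def choose_action(state):
    """e-greedy: exploit the best-known action, otherwise explore."""
    if random.random() > epsilon:
        return max(actions, key=lambda act: M[(state, act)])
    return random.choice(actions)

def step(state, action):
    """Environment transition + reward: small reward for a legal move,
    a punishment for an illegal one, a big reward for reaching the end."""
    nxt = state + action
    if nxt < 0 or nxt > final_state:
        return state, -1.0              # illegal move: stay put, get punished
    if nxt == final_state:
        return nxt, 10.0                # reached the goal
    return nxt, 0.1                     # ordinary legal move

def updating_action_value(state):
    """Recursive DP update: present reward + discounted future value."""
    if state == final_state:
        return 0.0
    action = choose_action(state)
    nxt, reward = step(state, action)
    M[(state, action)] = reward + a * updating_action_value(nxt)
    return M[(state, action)]

for _ in range(200):                    # epochs/episodes of learning
    updating_action_value(0)

print(max(M, key=M.get))                # the most valued (state, action) pair
```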

For a practical experience, do explore the Maze game using RL(Q Learning).

Now that we are done with how to formulate a Reinforcement Learning problem, I will pen down Monte Carlo and Off- & On-Policy Q-Learning next.
