RL—Markov Decision Processes

Ganesh Walavalkar
Apr 24, 2020 · 6 min read

There are three main types of learning:
1. Supervised Learning: labels for the data are available.
2. Unsupervised Learning: labels for the data are not available.
3. Reinforcement Learning: an agent learns which actions to take in order to maximize cumulative reward.

RL focuses on finding a balance between exploring uncharted territory and exploiting current knowledge.

The environment (or the world) is typically stated in the form of a Markov Decision Process (MDP).

Markov Decision Process

An MDP is a framework that consists of the following elements:
States : S
Model : T(s, a, s') ~ Pr(s' | s, a)
Actions : A(s), A
Reward : R(s), R(s, a), R(s, a, s')
Policy : π(s) → a (The policy is the solution. It is a function that takes in a state and returns the action to be taken. It is not a complete solution in the sense of a fixed sequence of actions; rather, it is a solution for whatever state you are currently in.)
where T is the transition model and Pr is the probability.
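
As a concrete illustration, here is a minimal sketch of how these elements might be written down in Python for a made-up two-state world. The state names, actions, probabilities, and rewards below are toy values chosen for illustration (loosely anticipating the beach example later in the post), not anything from the lecture itself.

```python
# Minimal sketch of the MDP elements (S, A, T, R) and a policy as plain Python dicts.
# All names and numbers are made-up toy values.

S = ["beach", "shade"]                          # states
A = {"beach": ["stay", "walk"],                 # actions available in each state
     "shade": ["stay", "walk"]}

# Model / transition function T(s, a, s') ~ Pr(s' | s, a)
T = {
    ("beach", "stay"): {"beach": 1.0},
    ("beach", "walk"): {"shade": 0.9, "beach": 0.1},
    ("shade", "stay"): {"shade": 1.0},
    ("shade", "walk"): {"beach": 0.9, "shade": 0.1},
}

# Reward R(s) received for being in each state
R = {"beach": -4.0, "shade": 100.0}

# A policy pi(s) -> a: one action prescribed for every state
pi = {"beach": "walk", "shade": "stay"}
```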

In an MDP only the present state matters; the past does not, and almost anything can be made Markovian. The rules, the world, and the model are stationary. Rewards encode the domain knowledge. One always knows the state one is in and the reward one will receive for moving to the next desired, legal state. A sequence of actions derived from the policy is called a plan.

Each action is taken in expectation of the cumulative reward of the whole sequence of actions, and it is this cumulative reward that should be optimized. Deciding which actions deserve credit for the eventual reward is a very hard problem, called the credit assignment problem. Since credit accumulates over time, it is more precisely the temporal credit assignment problem.

Optimal Policy

The optimal policy for a state is the one that yields the least negative (or most positive) reward while moving towards the goal and avoiding undesired terminal states. So, if the end reward is significantly larger, say 100, and there is a penalty of -4 for each step, the optimal policy is to move to the next state that takes the agent to the end goal as quickly as possible.

The rewards should be tuned carefully, because badly chosen rewards give the player an incentive not to move towards the end goal. Instead of a hot beach that forces the player to move towards the shade, a pleasant beach full of diamonds will keep the player on the beach picking up diamonds.

Sequence of Rewards

We are assuming infinite horizons. In the real world, however, the number of steps or the time left is finite, and reaching the goal quickly matters much more. With an infinite horizon one tries to avoid risk as much as possible, but if the infinite-horizon assumption is abandoned, taking risks may become worthwhile, which changes the policy.

The next assumption is the utility of sequences:
if U(S0, S1, S2, …) > U(S0, S'1, S'2, …)
then U(S1, S2, …) > U(S'1, S'2, …)
This is called stationarity of preferences: the preference between sequences does not depend on time. It forces the utility of a sequence to be the sum of the rewards of its states.

Consider a situation with two unending reward sequences: the first is 1, 1, 1, …, while the second is 2, 1, 2, …. Since both are unending, both sums are eventually infinite, so by a plain sum the two sequences look equally good. This can be fixed by discounting the rewards: discounting makes the terms shrink geometrically, so the summation becomes finite. Mathematically:

U(S0, S1, S2, …) = ∑ R(St)          (sum over t = 0 to ∞)
becomes, with discounting,
U(S0, S1, S2, …) = ∑ γ^t R(St)      (sum over t = 0 to ∞, with 0 ≤ γ < 1)
                 ≤ ∑ γ^t Rmax = Rmax / (1 − γ)

When the rewards are discounted by γ, the terms shrink geometrically as explained above. So even though there can be infinitely many steps, the sum converges to a finite number.

Extending what we discussed:
∑ γ^t Rmax = Rmax (γ⁰ + γ¹ + γ² + …)
Let X = γ⁰ + γ¹ + γ² + …
Since the series is recursive, shifting it one step in time gives
X = γ⁰ + γX
X − γX = γ⁰
X(1 − γ) = 1
X = 1/(1 − γ)
so ∑ γ^t Rmax = Rmax/(1 − γ)   (bringing Rmax back into the equation)
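
As a quick numerical sanity check of this closed form (γ and Rmax below are arbitrary toy values, not anything from the lecture):

```python
# Check numerically that sum_{t >= 0} gamma^t * Rmax approaches Rmax / (1 - gamma).
# gamma and Rmax are arbitrary toy values.
gamma, Rmax = 0.9, 10.0

partial_sum = sum(gamma**t * Rmax for t in range(1000))  # truncation of the infinite sum
closed_form = Rmax / (1 - gamma)

print(partial_sum, closed_form)  # both print ~100.0
```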

Policies

The optimal policy is the one that maximizes the long-term expected reward.

π* = argmax_π E[ ∑ γ^t R(St) | π ]

This equation reads: the policy that maximizes the expected value of the sum of discounted rewards over time, given that we follow policy π. We can start at any state. The same idea can be expressed as a utility function:

U^π(S) = E[ ∑ γ^t R(St) | π, S0 = S ]

Note that this is the utility of that state. The reward for a state is not the same as its utility: the reward is immediate, while the utility is the long-term accumulated reward. Consider going to college: you pay $40K for the education (an immediate negative reward) because the long-term utility is much greater. It is fine to accept a short-term negative reward for a long-term positive utility.

Let us rewrite the optimal policy in terms of utilities:

π*(s) = argmax_a ∑ T(s, a, s') U(s')     [summation is over s']

Note that U(s') is the utility of the next state if we follow the optimal policy; the problem is that we do not know the optimal policy yet.
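
As a minimal sketch of this greedy choice, assuming utility estimates U are available from somewhere (for example, from the value iteration discussed below) and that S, A, T are in the dictionary form of the toy example sketched earlier:

```python
# Greedy policy extraction: pi(s) = argmax_a sum_{s'} T(s, a, s') * U(s')
# S, A, T follow the dictionary form of the earlier toy sketch;
# U maps each state to an estimate of its utility.
def extract_policy(S, A, T, U):
    return {
        s: max(A[s], key=lambda a: sum(p * U[s2] for s2, p in T[(s, a)].items()))
        for s in S
    }
```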

The optimal action at a state is the action that maximizes the expected utility from that state. This is a bit circular, but it can be solved with recursion.

U(s) = R(s) + γ max_a ∑ T(s, a, s') U(s')     ← Bellman Equation     [summation is over s']

The true utility of a state is the reward in that state plus the discounted utility of everything that follows from that point. The definition is recursive: U(s') is in turn defined via U(s''), and so on.

This is somewhat similar to discounted cash flow in finance, though not exactly: in finance the discounting is not computed recursively, while here it is.

In total there are n equations in n unknowns (one per state), so the system would be solvable if it were linear. However, the max makes it non-linear.

So how do we solve the equation when it cannot be solved directly?
- start with arbitrary utilities
- update each state's utility based on its neighbours
- repeat until convergence

Hence the new update equation is:

Ût+1(s) = R(s) + γ max_a ∑ T(s, a, s') Ût(s')     [summation is over s']

This method is called value iteration.

Why will this work?
- R(s) is the truth: it is the true reward received on entering a state, and each update propagates this true information outward. This happens for every state.
- We may start with completely wrong utilities, but as we iterate (recurse) over this equation the estimates move towards the truth, and the influence of the wrong initial utilities diminishes.
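
Here is a minimal value-iteration sketch under the same toy-dictionary assumptions as before; the discount γ and the stopping tolerance are arbitrary choices, not values from the lecture.

```python
# Value iteration: repeatedly apply the Bellman backup
#   U(s) <- R(s) + gamma * max_a sum_{s'} T(s, a, s') * U(s')
# until the utilities stop changing (within a small tolerance).
# S, A, T, R follow the dictionary form of the earlier toy sketch.
def value_iteration(S, A, T, R, gamma=0.9, tol=1e-6):
    U = {s: 0.0 for s in S}                      # arbitrary initial utilities
    while True:
        U_next = {
            s: R[s] + gamma * max(
                sum(p * U[s2] for s2, p in T[(s, a)].items()) for a in A[s]
            )
            for s in S
        }
        if max(abs(U_next[s] - U[s]) for s in S) < tol:
            return U_next
        U = U_next
```

The optimal policy can then be read off from the converged utilities using the greedy extraction sketched earlier.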

Finding Policy

The algorithm to find the policy can be given as follows:
- Start with some policy π0 ← an initial guess
- Evaluate: given πt, calculate Ut = U^πt ← how good is this policy? Find the utility of following it
- Improve: πt+1(s) = argmax_a ∑ T(s, a, s') Ut(s') ← once the utility is found, improve the policy by acting greedily with respect to it

The evaluation step solves

Ut(s) = R(s) + γ ∑ T(s, πt(s), s') Ut(s')     [summation is over s']

Since there is no max in this equation, it is linear and therefore always solvable; πt is fixed (a constant).
This class of algorithms is called policy iteration.

Converting the non-linear system into a linear one makes the equations easier to solve, and the procedure is guaranteed to converge, very much like how K-Means converges.
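
And a policy-iteration sketch under the same toy-dictionary assumptions. Note that the evaluation step below solves the linear system by simple iteration rather than with a linear solver; that is just one possible implementation choice, not the lecture's prescription.

```python
# Policy iteration: alternate policy evaluation and greedy policy improvement.
# S, A, T, R follow the dictionary form of the earlier toy sketch.
def policy_iteration(S, A, T, R, gamma=0.9, tol=1e-6):
    pi = {s: A[s][0] for s in S}                 # arbitrary initial policy
    while True:
        # Evaluation: U(s) = R(s) + gamma * sum_{s'} T(s, pi(s), s') * U(s')
        # (no max, so the system is linear; here it is solved by iteration).
        U = {s: 0.0 for s in S}
        while True:
            U_next = {
                s: R[s] + gamma * sum(p * U[s2] for s2, p in T[(s, pi[s])].items())
                for s in S
            }
            delta = max(abs(U_next[s] - U[s]) for s in S)
            U = U_next
            if delta < tol:
                break
        # Improvement: act greedily with respect to U.
        pi_next = {
            s: max(A[s], key=lambda a: sum(p * U[s2] for s2, p in T[(s, a)].items()))
            for s in S
        }
        if pi_next == pi:
            return pi, U
        pi = pi_next
```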

Conclusion

We discussed MDPs: states, rewards, actions, and transitions. The discount factor is a parameter that can be tuned, but it can also be considered part of the MDP. We looked at policies, and at the value (utility) function and the policy function. Finally we looked at the Bellman equation, which leads to value iteration and policy iteration.

References — Heavily borrowed from my notes on CS7641, so thanks to Prof Charles Isbell and Prof Michael Littman. Errors are all mine.
