# An introduction of Markov Decision Processes

Markov Decision Process (MDP) is one of the most important basic concept in Reinforcement learning. In Machine Learning (ML), there are three classification of algorithm, Supervised Learning (SL), Unsupervised Learning, and Reinforcement Learning (RL). RL is to learn a way to get the highest reward. Besides, the difference between RL and SL is that the label of SL is the correct behaviour, while the label in RL means the reward.

There are 4 parameters in MDP.

1. S (states) — a set of states
2. A (actions) — a set of actions
3. P — the probability of s move to s’ based on action
4. R — the reward can get when using an action in a state

MDP is based on the theory of Markov Process, but they are a little different because action is considered in MDP. In other words, the next state (s’) is related to the current state (s) and the action (a). In addition, we do not need to consider the previous state and previous action. It is only affected by the current situation and the action (which let s — >s’).

For RL, we sometimes cannot get all the information, so there are two categories according to different situations. One is “model-based” and the other is “model-free”. The former one is that we can know and save all the information in MDP while the later is that we need to explore unknown information in Markov Process.

The goal of RL is to decide a strategy, called it as “Policy” in RL. It is a kind of function. The current state is input and the probability of using this action (usually use pi(s,a)) is output. The aim is not to find the highest reward in current since it may not be the best strategy in the whole system. Therefore, a parameter (gamma) is to balance the current reward and the whole reward. It also can avoid the infinite result.

According to this concept and condition, we can define a descending reward expectation as below.

v(s) = E[sigma(rR)]

Next, people expand the application of state-action pair with values. It can be defined as

q(s,a) = R + E[sigma(rR)]

In conclusion, RL is to find a strategy (pi) which values in every state is the best. However, we are not sure the values of each state in one strategy is better than those values in another strategy. If so, our aim may become blurry. Fortunately, MDP give us this promise.

For any MDP, there exists an optimal policy pi* that is better than or equal to all the other policies.
pi* ≥ pi, given every pi