Deep Reinforcement Learning Part 2: Markov Decision Process

data_datum
Sep 12, 2018

In this blog post I’ll introduce concepts such as:

Markov Chain or Markov Process

What we observe are called states, and the system switches between states according to some rules of dynamics.

All possible states for a system form a set called the state space. Our observations over time form a sequence of states, or a chain, such as [sunny, sunny, rainy, rainy, …], which is called the history (1).

Fig 1. Example of a state Markov chain

Markov Property and Transition Matrix

The Markov property (MP) means that the future dynamics from any state depend on that state only. It requires that the states of the system be distinguishable from each other and unique.

If the dynamics of our system are more complex, we can extend the state to capture more dependencies in the model, at the cost of a larger state space.

Fig 2. Markov Chain Model and Transition Matrix (2).

From Fig 2 we can observe that the edges in the Markov chain are probabilities, which are collected in the transition matrix. A transition matrix is a square matrix of size N×N, where N is the number of states in our model. This matrix defines the system dynamics. If the probability of a transition is 0, we don’t draw an edge (there is no way to go from that state to the other) (1).

We can also define a Markov chain mathematically, in terms of its set of states and its transition matrix.

Fig 3. Mathematical definition of a Markov chain (3)
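To make this concrete, here is a minimal sketch in Python of a two-state weather chain and its transition matrix; the state names and probabilities are made up for illustration:

```python
import numpy as np

# Hypothetical two-state weather chain; the probabilities are made up for illustration.
states = ["sunny", "rainy"]

# Transition matrix P: P[i, j] is the probability of moving from state i to state j.
# It is N x N (here 2 x 2) and each row sums to 1.
P = np.array([
    [0.8, 0.2],   # sunny -> sunny, sunny -> rainy
    [0.4, 0.6],   # rainy -> sunny, rainy -> rainy
])

def sample_history(P, start=0, steps=8, seed=0):
    """Sample a history (a chain of states) by following the transition probabilities."""
    rng = np.random.default_rng(seed)
    s, history = start, [start]
    for _ in range(steps):
        s = rng.choice(len(P), p=P[s])
        history.append(s)
    return [states[i] for i in history]

print(sample_history(P))  # e.g. ['sunny', 'sunny', 'rainy', ...]
```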

Reward

A reward signal defines the goal of a reinforcement learning problem. The agent’s objective is to maximize the total reward it receives over the long run; the reward defines what the good and bad events are for the agent (4).

The expected cumulative reward, or return G, is the sum of future rewards discounted by gamma (γ), the discount factor. Gamma is a hyperparameter to tune in order to get optimum results; typical values range between 0.9 and 0.99. A lower value encourages short-term thinking, while a higher value emphasizes long-term results (5).

Fig 4. The Return equation (3).
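Written out, the return at time t is the discounted sum of the rewards that follow (notation as in (3)):

```latex
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots
    = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}, \qquad \gamma \in [0, 1]
```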

If we add a reward function and the discount factor to our definition of a Markov chain, we get a Markov reward process (MRP).

Fig 5. Mathematical definition of a Markov reward process (3)
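In the notation of (3), an MRP is the Markov chain tuple extended with a reward function R and a discount factor γ:

```latex
\text{MRP} = \langle \mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma \rangle,
\qquad \mathcal{P}_{ss'} = \mathbb{P}\left[ S_{t+1} = s' \mid S_t = s \right],
\qquad \mathcal{R}_s = \mathbb{E}\left[ R_{t+1} \mid S_t = s \right]
```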

Value Function

A value function, or function of states, estimates how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state). Value functions are defined with respect to particular ways of acting, called policies (4).

Fig 6. (State) Value Function Definition (3)
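Written out, the state-value function of an MRP is the expected return when starting from state s (3):

```latex
v(s) = \mathbb{E}\left[ G_t \mid S_t = s \right]
```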

Bellman’s Equation

Bellman’s equation (BE) is a linear equation that allows us to decompose the state-value function into the sum of an immediate reward [Rt+1] and the discounted value of the successor state [γv(St+1)]. It can also be expressed using matrix notation (3).

Fig 7. Bellman’s Equation is a linear function that can be expressed using matrices (3)
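Because the equation is linear, the values of a small MRP can be computed directly by solving the matrix form v = R + γPv. A minimal sketch in Python, with a made-up transition matrix, reward vector and discount factor:

```python
import numpy as np

# Bellman equation for an MRP in matrix form: v = R + gamma * P @ v,
# which is linear in v, so v = (I - gamma * P)^(-1) @ R.
# P, R and gamma below are made-up values for a two-state example.
P = np.array([[0.8, 0.2],
              [0.4, 0.6]])        # transition matrix
R = np.array([1.0, -0.5])         # expected immediate reward in each state
gamma = 0.9                       # discount factor

I = np.eye(len(P))
v = np.linalg.solve(I - gamma * P, R)   # state values v(s)
print(v)
```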

Markov Decision Process

Almost all reinforcement learning problems can be formalized as Markov decision processes (MDPs). Every state in an MDP is Markovian, i.e. it satisfies the Markov property, and the environment is fully observable (1)(3).

Fig 8. MDP definition (3)
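In the notation of (3), the MDP tuple adds a finite set of actions A to the MRP, and the transition probabilities and rewards now depend on the action taken:

```latex
\text{MDP} = \langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle,
\qquad \mathcal{P}^{a}_{ss'} = \mathbb{P}\left[ S_{t+1} = s' \mid S_t = s, A_t = a \right],
\qquad \mathcal{R}^{a}_{s} = \mathbb{E}\left[ R_{t+1} \mid S_t = s, A_t = a \right]
```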

The MDP framework is a considerable abstraction of the problem of goal-oriented learning from interaction. Any goal-directed behavior can be reduced to three signals passed between an agent and its environment: the actions, which represent the choices made by the agent; the states, which represent the basis on which those choices are made; and the rewards, which define the agent’s goal (4).

Policy

The policy is the set of rules that controls the agent’s behavior. It is defined as a probability distribution over actions for every possible state; defining it as a distribution, rather than as a fixed action, introduces randomness into the agent’s behavior. If the policy is fixed and never changes, the MDP reduces to an MRP (1). Concisely, we can say that a policy is a solution to the MDP problem (5).

Fig 9. Mathematical definition of a policy (3)
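Written out, a (stochastic) policy gives the probability of taking each action a in state s (3):

```latex
\pi(a \mid s) = \mathbb{P}\left[ A_t = a \mid S_t = s \right]
```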

Finally, we can define the optimal policy π* as the policy that maximizes the expected return (the total reward received, or expected to be received, over the agent’s lifetime).

For the first part of this series of posts, see Part 1.

References

  1. Lapan, Maxim. Deep Reinforcement Learning Hands-On (2018). http://bit.ly/2wosxGD
  2. Raval, Siraj. Introduction (Move 37), a free deep reinforcement learning course (2018). http://bit.ly/2CDyrJB
  3. Silver, David. Reinforcement Learning, DeepMind course (2015), Lecture 2. YouTube video: http://bit.ly/2CS4k1d and slides: http://bit.ly/2CDFcer
  4. Sutton, Richard & Barto, Andrew. Reinforcement Learning: An Introduction, 2nd edition draft (2018). http://bit.ly/2CLW5Dv
  5. Raval, Siraj. Move 37, a free deep reinforcement learning course. http://bit.ly/2x4kqhY
