Reinforcement Learning, Part 2: Introducing the Markov Process

Step one to understanding MDP: the Markov Process

dan lee
AI³ | Theory, Practice, Business
4 min read · Oct 24, 2019


Welcome back to my AI blog! In my last post, I gave a brief introduction to Reinforcement Learning. Today, I’ll help you continue your journey by introducing the Markov Process, which we will need to understand before broaching the Markov Decision Process (MDP) used in Reinforcement Learning.

By the end, you will have a basic knowledge of:

  • What the Markov Property and Markov Chain are;
  • How the Markov Property works;
  • How the Markov Chain puts the Markov Property into action.

Introducing the Markov Process

To open our discussion, let’s first lay out some key terms with their definitions from Wikipedia. Then we’ll dig a little deeper.

Markov Property: In probability theory and statistics, the term Markov Property refers to the memoryless property of a stochastic — or randomly determined — process.

Markov Chain: A Markov Chain is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.

Expanding on the Markov Property

To deepen our understanding of the Markov Property, we can view it as follows:

P(X(t+1) = j | X(0) = i₀, X(1) = i₁, …, X(t) = i) = P(X(t+1) = j | X(t) = i)

Put into words, the formula says that the state of X at time t+1 depends only on the immediately preceding state of X at time t, and is independent of the earlier states X(t−1), …, X(0).

Now let’s shed more light on this with a simple example.

For the string “easy”, the Markov Property gives us:

  • P(x3=y | x0=e, x1=a, x2=s) represents the probability that y appears at time 3 when e appears at time 0, a appears at time 1, and s appears at time 2
  • P(x3=y | x2=s) represents the probability that y appears at time 3 when s appears at time 2

So, under the Markov Property, P(easy) becomes easier to compute: we assume that y depends only on its immediate predecessor s and is independent of e and a. In other words, when the y in “easy” is generated, we only care about the transition probability from s to y instead of the transition probability from eas to y.

Of course, we know it may not work like this in the real world, but the assumption is useful nonetheless. It helps us make complicated situations computable, and most of the time it works quite well.
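To make this concrete, here is a minimal Python sketch of the idea. The probabilities are made up purely for illustration (none of these numbers come from a real model); the point is that, under the Markov assumption, P(easy) factors into a start probability times one transition probability per adjacent pair of letters.

```python
# A minimal sketch of the Markov assumption applied to the string "easy".
# All numbers below are made up purely for illustration.

start_prob = {"e": 0.4, "a": 0.3, "t": 0.3}   # hypothetical starting probabilities
transition_prob = {                            # hypothetical P(next letter | current letter)
    ("e", "a"): 0.5,
    ("a", "s"): 0.2,
    ("s", "y"): 0.6,
}

def markov_score(word):
    """P(word) ≈ P(x0) * P(x1|x0) * P(x2|x1) * ... under the Markov Property."""
    prob = start_prob.get(word[0], 0.0)
    for prev, curr in zip(word, word[1:]):
        prob *= transition_prob.get((prev, curr), 0.0)
    return prob

print(markov_score("easy"))  # 0.4 * 0.5 * 0.2 * 0.6 = 0.024
```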

Understanding the Markov Chain

When we put the Markov Property to work in a random process, we call it a Markov Chain.

Figure 1: Markov Chain

Here is the more formal definition of a Markov Chain: a sequence of random variables X(0), X(1), X(2), … over a set of states, together with transition probabilities between those states, such that the Markov Property above holds at every step.

Using Figure 1 above, we can demonstrate how a Markov Chain can generate words.

Assume we can start from state e, a, or t, with respective probabilities of 40%, 30%, and 30%. According to the Markov Property, a string can be generated letter by letter, taking into consideration only the letter immediately before the one being generated.

For example, we have a 40% probability of starting with e at time 0. Then we move from state e to state a at time 1 to get ea. To arrive at the word eat, we move directly from state a to state t at time 2, without regard for the earlier state e.

Multiplying the start probability by each transition probability along the path gives every string a score. With the transition probabilities in Figure 1, this Markov Chain gives eat and tea equally high scores, while aet gets the lowest score. These scores indicate that eat and tea are more like words, while aet appears not to be one at all.
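As a rough sketch of what Figure 1 describes, the snippet below builds a tiny Markov Chain over the states e, a, and t. The start probabilities (40%, 30%, 30%) come from the example above, but the transition probabilities are assumed stand-ins for the ones in the figure; the aim is only to show how such a chain scores and generates strings.

```python
import random

# A tiny Markov Chain over the states e, a, and t.
# Start probabilities follow the example in the text; the transition
# probabilities are assumed stand-ins for those shown in Figure 1.
start_prob = {"e": 0.4, "a": 0.3, "t": 0.3}
transition_prob = {
    "e": {"a": 0.9, "t": 0.1},
    "a": {"t": 0.6, "e": 0.4},
    "t": {"e": 0.8, "a": 0.2},
}

def score(word):
    """Start probability times one transition probability per step."""
    prob = start_prob.get(word[0], 0.0)
    for prev, curr in zip(word, word[1:]):
        prob *= transition_prob.get(prev, {}).get(curr, 0.0)
    return prob

def generate(length=3):
    """Generate a string letter by letter, looking only at the previous letter."""
    state = random.choices(list(start_prob), weights=list(start_prob.values()))[0]
    letters = [state]
    for _ in range(length - 1):
        options = transition_prob[state]
        state = random.choices(list(options), weights=list(options.values()))[0]
        letters.append(state)
    return "".join(letters)

for word in ("eat", "tea", "aet"):
    print(word, score(word))
# With these assumed numbers, eat and tea both score 0.216, while aet scores only 0.012.
print(generate())  # e.g. "eat" or "tea"
```

Run generate() a few times and it will mostly produce word-like sequences such as eat or tea, because the high-probability transitions chain those letters together.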

Summary

In this short introduction to the Markov Process, we’ve learned:

  • How the Markov Property and Chain are defined.
  • How the Markov Property simplifies computing word probabilities.
  • How the Markov Chain can generate words.

Now, we’re primed and ready for our discussion of the Markov Decision Process. Coming next week; don’t miss it!


dan lee
AI³ | Theory, Practice, Business

NLP Engineer, Google Developer Expert, AI Specialist in Yodo1