Reinforcement Learning, Part 4: Optimal Policy Search with MDP

Training an agent to make decisions that will maximize rewards over time

dan lee
AI³ | Theory, Practice, Business
6 min read · Nov 9, 2019


Welcome back to my AI blog! We’ve already learned a lot, so let’s recap what we’ve covered in my Reinforcement Learning series so far:

Part 1: A Brief Introduction To Reinforcement Learning (RL)

Part 2: Introducing the Markov Process

Part 3: Markov Decision Process (MDP)

The last step in using MDP is an optimal policy search — which we’ll cover today.

By the end of this article, you will have a basic knowledge of:

  • How to evaluate an action;
  • How to implement an optimal policy search.

As you’ll recall from part three, we’ve been describing and evaluating states and actions for a young man named Adam.

Now, our agent is nearly ready to help him make the sequential decisions that will lead to earning the greatest possible amount of money. Let’s do it!

Optimal Policy Search: How Agents Choose the Best Path

Now that we understand MDP, we can think of it as the environment in which an agent works. To maximize rewards over time, this agent needs to find an optimal policy. That is, it must determine the best action to take at each state.

optimal policy: the best action to take at each state, for maximum rewards over time
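In code, a deterministic policy for a small, discrete MDP can be represented as nothing more than a per-state choice of action. Here is a sketch; the values are placeholders for illustration, not Adam’s actual optimal policy:

# A deterministic policy: policy[s] is the index of the action to take in state s.
# Placeholder values for illustration only.
policy = [2, 0, 0]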

To help our agent do this, we need two things:

  1. A way to determine the value of a state in MDP.
  2. An estimated value of an action taken at a particular state.

1. Bellman Optimality Equation

The Bellman Optimality Equation gives us the means to estimate the optimal value of each state, denoted V*(s). It defines a state’s value as the expected sum of discounted rewards the agent can collect starting from that state, assuming it acts optimally.

Here is the Bellman Optimality Equation (state-value function):

V*(s) = max a Σ s’ P(s, a, s’) · [R(s, a, s’) + 𝛾 · V*(s’)]

In which:

  • P(s, a, s’) is the transition probability from state s to state s’ when the agent chooses an action a.
  • R(s, a, s’) is the immediate reward from state s to state s’ when the agent chooses an action a.
  • 𝛾 is the discount rate, which lets the discounted reward be written recursively.

You may wonder how to translate the discounted reward we introduced in my last post into this recursive notation. The short expansion below shows how.
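Writing Gₜ for the discounted return collected from time step t onward, we can factor 𝛾 out of everything after the first reward:

Gₜ = Rₜ₊₁ + 𝛾Rₜ₊₂ + 𝛾²Rₜ₊₃ + … = Rₜ₊₁ + 𝛾(Rₜ₊₂ + 𝛾Rₜ₊₃ + …) = Rₜ₊₁ + 𝛾Gₜ₊₁

So the value of a state is the immediate reward plus 𝛾 times the value of whatever comes next, which is exactly the recursive term 𝛾 · V*(s’) in the equation above.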

In practice, you first initialize all state-value estimates to zero, then iteratively update them with the results of the formula above. The estimates will converge to the optimal values.
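Here is a minimal NumPy sketch of that iterative update (value iteration), assuming a transition array P and a reward array R of shape (s, a, s’) and a per-state list of available actions, like the ones we build later in this post:

import numpy as np

def value_iteration(P, R, actions, gamma=0.99, iterations=10):
    # Estimate V*(s) by repeatedly applying the Bellman Optimality Equation.
    n_states = P.shape[0]
    V = np.zeros(n_states)  # start with every state-value estimate at zero
    for _ in range(iterations):
        V_prev = V.copy()
        for s in range(n_states):
            # Expected return of each available action; keep the best one.
            V[s] = max(
                sum(P[s, a, s2] * (R[s, a, s2] + gamma * V_prev[s2])
                    for s2 in range(n_states))
                for a in actions[s]
            )
    return V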

2. Q-Value Iteration Algorithm

The optimal state value alone doesn’t tell the agent which action to take in each state. Fortunately, Bellman provides a similar formula, inspired by the optimal state value above, for estimating the value of state-action pairs, called the Q-Value.

Q-Value: the estimated value of action a taken at state s; denoted Q*(s, a)

The Q-Values explicitly tell the agent which action to choose at each state: the one with the highest score.

Here is how to compute the Q-Value:

Qk+1(s, a) = Σ s’ P(s, a, s’) · [R(s, a, s’) + 𝛾 · max a’ Qk(s’, a’)]

In which:

  • P(s, a, s’) is the transition probability from state s to state s’ when the agent chooses an action a.
  • R(s, a, s’) is the immediate reward from state s to state s’ when the agent chooses an action a.
  • 𝛾 is the discount rate, which lets the discounted reward be written recursively.
  • max a’ Qk(s’, a’) is the value of the best action a’ at the next state s’, according to the current estimate Qk.

Note: this function assumes the agent acts optimally after it chooses the action at the current state.
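Note also that the two quantities are directly linked: the optimal value of a state is simply the value of the best action available there, i.e., V*(s) = max a Q*(s, a).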

As with the Bellman Optimality Equation, in practice you start by initializing all the Q-Value estimates to zero, then update them iteratively with the results computed by this formula.

An MDP Implementation of Optimal Policy Search

Now let’s put what we’ve learned into action! We’ll use our friend Adam’s MDP from Part 3 to run an optimal policy search.

First, we input the 5-tuple (S, A, P, R, 𝛾) into our demo to create an MDP environment.

  • We use nan to mark the entries for actions that aren’t available at a given state, since those transitions can never occur.
  • The variable actions represents A, the action space: for each state s, the list of actions available from s.
  • P represents the transition probability from state s to state s’, choosing action a. Its shape should be (s, a, s’), a 3-dimensional array.
  • R represents the immediate reward received after a transition from state s to s’, due to action a. Its shape should be (s, a, s’), a 3-dimensional array.
import numpy as np

nan = np.nan

actions = [[0, 1, 2], [0, 2], [0]]

P = np.array([
    [[1.0, 0.0, 0.0], [0.2, 0.8, 0.0], [0.5, 0.5, 0.0]],
    [[0.8, 0.2, 0.0], [nan, nan, nan], [0.0, 0.0, 1.0]],
    [[1.0, 0.0, 0.0], [nan, nan, nan], [nan, nan, nan]],
])

R = np.array([
    [[20., 0.0, 0.0], [0.0, 0.0, 0.0], [-10., -10., 0.0]],
    [[40., 30., 0.0], [nan, nan, nan], [0.0, 0.0, -10.]],
    [[70., 0.0, 0.0], [nan, nan, nan], [nan, nan, nan]],
])

Now we have an MDP environment!
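A quick optional sanity check you can run on these arrays: for every action that is actually available in a state, the transition probabilities should sum to 1.

# Every valid state-action pair should define a proper probability distribution.
for s, available in enumerate(actions):
    for a in available:
        assert abs(P[s, a].sum() - 1.0) < 1e-9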

Let’s use the Q-Value Iteration Algorithm to get Q*(s, a), which contains the score of action a at state s.

  • We use -inf to represent the actions that can’t be taken at state s.
  • Initialize Q*(s, a) with zeros.
  • For each iteration, apply the Q-Value formula to each transition from state s to state s’, taking action a, and update Q*(s, a) with the new result.
Q = np.full((3, 3), -np.inf)     # -inf marks actions that can't be taken at state s
for s, a in enumerate(actions):
    Q[s, a] = 0.0                # valid state-action pairs start at zero

discount_factor = 0.99
iterations = 10
for i in range(iterations):
    Q_previous = Q.copy()
    for s in range(len(P)):
        for a in actions[s]:
            sum_v = 0
            for s_next in range(len(P)):
                sum_v += P[s, a, s_next] * (R[s, a, s_next] +
                    discount_factor * np.max(Q_previous[s_next]))
            Q[s, a] = sum_v
print(Q)

Running the code prints the Q we get. The rows represent states, the columns represent actions, and each number is the estimated value of taking action a at state s.
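To turn these scores into a policy, the agent simply picks, for each state, the action with the highest Q-Value:

# For each state, choose the action with the highest Q-Value score.
best_actions = np.argmax(Q, axis=1)
print(best_actions)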

Adam’s Long-Awaited Results

Here is what the demo tells us: The best course of action when Adam feels tired is to go get some sleep, then go to the gym and do a workout so he can be healthier, then go to work at peak efficiency.

Summary

In this series, we’ve learned about the Markov Decision Process, which builds on Markov theory and Markov chains and underpins Reinforcement Learning.

Understanding MDP is a necessary step in deepening your knowledge of RL — and now you’re well on your way! You should know how to:

  • Model a Markov Decision Process. Build the environment and set up ways to compute rewards from sequential decisions.
  • Discount rewards. Evaluate an action in RL based on the value of current and future rewards.
  • Implement an optimal policy search. Use the Bellman Optimality Equation and the Q-Value Iteration algorithm.

In my next post, we will step further into RL by exploring Q-learning. Follow me here to make sure you don’t miss it!
