Part 1: Reinforcement Learning — a comprehensive introduction

The second post of a series about the foundational mathematical concepts behind reinforcement learning

Luca Palmieri
Comet
7 min read · Sep 18, 2018


Note: This article originally appeared on Luca Palmieri’s website at https://www.lpalmieri.com/posts/rl-introduction-01/ and was reposted here with his permission. The post is part of a series introducing reinforcement learning that we highly encourage you to read as well!

In the previous post, we introduced the main ingredients of a reinforcement learning problem: the agent, the environment, and the signals they exchange at each step (states, actions, and rewards).

We remarked that states and rewards are environment-related random variables: the agent has no way to interfere with the reward mechanism or to modify the state transition that follows one of its actions.

Actions, instead, are the only domain entirely under the agent’s responsibility: in this post we are going to focus on how the agent chooses them.

In other words, we are going to formally introduce the concept of policy.

A bit of notation

To simplify, we shall assume from now on that the action space does not depend on the current state of the system; it does not make a substantial difference, but it allows the use of a cleaner notation.

We shall also write 𝑯𝑡 for the history of the simulation up to time 𝑡, i.e. the sequence of states, actions, and rewards observed so far. This will turn out to be a useful notation in upcoming computations.
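For concreteness, writing 𝑆 for states, 𝐴 for actions and 𝑅 for rewards, one common convention (the exact ordering and indexing of the symbols is a matter of taste) is

$$ H_t = (S_1, A_1, R_1, S_2, A_2, R_2, \ldots, S_t) $$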

Decision rules

As we anticipated in the recap, we want to come up with a flexible way to specify (and study) how the agent chooses its actions: the probability that, at time 𝑡, the agent picks a certain action given everything it has observed so far.

Using our brand-new history notation, we can write this as a conditional probability of the action given the current history 𝑯𝑡, which we will denote by d𝑡.

d𝑡 is usually called a decision rule because it models how the agent reacts to its current simulation experience.
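In symbols, and sticking to the convention that 𝑯𝑡 collects everything observed up to time 𝑡, a decision rule can be written as a conditional probability distribution over actions:

$$ d_t(a \mid h_t) = \mathbb{P}\left( A_t = a \mid H_t = h_t \right), \qquad a \in A $$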

Policies

A policy is a recipe for the behavior of our agent: what it is programmed to do under the set of circumstances it is currently experiencing. In other words, a policy specifies how the agent is going to act at each time instant 𝑡 ∈ {1, 2, …}.

Decision rules provide a quick and painless way to formally define it: a policy 𝛑 is nothing more than a sequence of decision rules, one for each time step.
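In symbols, using the decision rules introduced above:

$$ \pi = (d_1, d_2, d_3, \ldots) $$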

Now that we have a proper mathematical definition of policy, we can give a rigorous formulation of the reward hypothesis, which we have already stated in the previous post:

That all of what we mean by goals and purposes can be well thought of as maximization of the expected value of the cumulative sum of a received scalar signal (called reward).

In particular, we can shed some light on the expression “expected value of the cumulative sum of a received scalar signal”.

Returns

So how do we measure the performance of a policy?

In a Reinforcement Learning context, every action is followed by a reward (positive or negative). To measure the overall performance of a policy we need to combine all these rewards together; the resulting quantity is usually called the return in the literature.

The most common reward criterion is the one mentioned in the reward hypothesis: the cumulative return, i.e. the sum of all the rewards collected up to the end of the simulation, where the end time 𝑻 is itself a random variable (e.g. the total number of moves in a chess game).
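In formulas, writing 𝑅𝑡 for the reward received at step 𝑡, the cumulative return reads

$$ G = \sum_{t=1}^{T} R_t $$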

This definition is perfectly fine if 𝑻 is almost surely finite, but it becomes problematic for those simulations in which it makes sense to potentially have 𝑻 = +∞ — an exploration game, for example.

The issue can be promptly solved by introducing a discount rate 𝛾 ∈ [0, 1), which can be used to define the discounted cumulative return.
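With the same notation as above, it can be written as

$$ G = \sum_{t=1}^{\infty} \gamma^{\,t-1} R_t $$

so that rewards collected far in the future weigh less and less, and the sum stays finite as long as the rewards are bounded. As a quick sanity check, here is a minimal Python sketch (the function name and the finite-horizon truncation are illustrative choices, not something defined in this series):

```python
def discounted_return(rewards, gamma=0.9):
    """Discounted cumulative return of a finite reward sequence.

    An infinite simulation is approximated here by truncating the
    episode to the rewards actually observed.
    """
    return sum(gamma ** t * r for t, r in enumerate(rewards))


# Three steps of reward 1.0 with gamma = 0.5: 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```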

State-value and history-value functions

Let 𝒔 be the state our simulation starts from, i.e. 𝑺₁ = 𝒔, and let the rewards be discounted by 𝛾 as above.

Then the “expected value of the (discounted) cumulative sum of a received scalar signal” is the expected discounted return of an agent that starts in 𝒔 and follows 𝝿, which is usually called the state-value function for the policy 𝝿.
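Using the standard symbol 𝑣𝝿 for it, and sticking to the conventions adopted so far (the simulation starts in 𝒔 and rewards are discounted by 𝛾), it can be written as

$$ v_\pi(s) = \mathbb{E}_\pi\left[ \sum_{t=1}^{\infty} \gamma^{\,t-1} R_t \,\middle|\, S_1 = s \right] $$

A couple of remarks on the symbols appearing in this expression: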

  • The fancy “E”, 𝔼, stands for the probabilistic expectation of a random variable — look it up if this is the first time you encounter it (here’s a solid resource).
  • The 𝝿 subscript, instead, indicates that this expectation is computed under the hypothesis that the agent is following the policy 𝝿. What does this mean, mathematically?

Well, let’s unroll the definition!

Expected reward

First of all, let’s compute the expected value of the first reward under the policy 𝝿, given the initial state 𝒔.

The expectation of a random variable is nothing more than the sum of all its possible values weighted by their probability of being observed. Formally:
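For a reward signal taking values in a discrete set, and conditioning on the initial state 𝒔, this reads

$$ \mathbb{E}_\pi\left[ R_1 \mid S_1 = s \right] = \sum_{r} r \, \mathbb{P}_\pi\left( R_1 = r \mid S_1 = s \right) $$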

Using the law of total probability (link) we can decompose those probabilities with respect to the set of possible actions available to the agent (𝑨):
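Recalling that under 𝝿 the first action is drawn according to the decision rule d₁ (so that d₁(a | s) is the probability of picking action a when the history is just the initial state s), the decomposition looks like this:

$$ \mathbb{P}_\pi\left( R_1 = r \mid S_1 = s \right) = \sum_{a \in A} d_1(a \mid s) \, \mathbb{P}\left( R_1 = r \mid S_1 = s, A_1 = a \right) $$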

Thus, plugging this decomposition back into the definition of the expected value of a random variable, we get the following.
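Up to notational conventions, the expected first reward becomes a weighted average over actions:

$$ \mathbb{E}_\pi\left[ R_1 \mid S_1 = s \right] = \sum_{a \in A} d_1(a \mid s) \, \mathbb{E}\left[ R_1 \mid S_1 = s, A_1 = a \right] $$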

You can clearly see the two different contributions in the last equation above: the policy determines the probability of choosing a particular action, while the environment is responsible for the expected reward awaiting the agent if it were to choose that specific course of action. In fact, there is no 𝛑 subscript under the last expectation!

Expected return

We can now derive an explicit formula for the expected discounted return under policy 𝛑.

First of all, the expectation is linear, i.e. 𝔼[𝜶𝑨 + 𝜷𝑩] = 𝜶𝔼[𝑨] + 𝜷𝔼[𝑩], where 𝑨 and 𝑩 are random variables with finite expectation and 𝜶 and 𝜷 are real numbers.

We can then apply this property to the expected discounted return.
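Splitting off the first reward and collecting a factor of 𝛾 from all the remaining terms, we get something along these lines (★ marks the second expectation):

$$ \mathbb{E}_\pi\left[ \sum_{t=1}^{\infty} \gamma^{\,t-1} R_t \,\middle|\, S_1 = s \right] = \mathbb{E}_\pi\left[ R_1 \mid S_1 = s \right] + \gamma \, \underbrace{\mathbb{E}_\pi\left[ \sum_{t=2}^{\infty} \gamma^{\,t-2} R_t \,\middle|\, S_1 = s \right]}_{\bigstar} $$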

Let’s take a step back and look at what we got here: the term marked ★ is not that different from the state-value function we defined above. The only differences are that we are computing the expected discounted cumulative return starting from 𝑡 = 2 instead of 𝑡 = 1, and that the natural conditioning information for this tail return is no longer just the initial state, but everything that has happened up to time 𝑡 = 2.

We can formalize this intuition by introducing the history-value function for policy 𝛑 at time 𝑡:
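Up to notational conventions, it is the expected discounted return accumulated from time 𝑡 onwards, conditioned on the history observed so far:

$$ v_\pi^{t}(h_t) = \mathbb{E}_\pi\left[ \sum_{k=t}^{\infty} \gamma^{\,k-t} R_k \,\middle|\, H_t = h_t \right] $$

For 𝑡 = 1 the history reduces to the initial state, and we recover the state-value function.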

Thanks to the equations for (1) the expected reward, (2) the expected discounted return, and (3) the history-value function, we get our general formula for the expected cumulative discounted return under policy 𝛑.
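A sketch of that formula, with the conventions adopted above (the tail term ★ has been rewritten through the history-value function at time 𝑡 = 2, using the law of iterated expectations):

$$ v_\pi(s) = \sum_{a \in A} d_1(a \mid s) \, \mathbb{E}\left[ R_1 \mid S_1 = s, A_1 = a \right] + \gamma \, \mathbb{E}_\pi\left[ v_\pi^{2}(H_2) \,\middle|\, S_1 = s \right] $$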

This is a generalized version of what is usually called the Bellman equation in the framework of Markov Decision Processes.

Bellman equation

Using the same techniques we employed in the previous two subsections, we can prove that the following equation holds for every 𝒕 ∈ {1, 2, …}.
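Up to notational conventions, the recursion reads:

$$ v_\pi^{t}(h_t) = \mathbb{E}_\pi\left[ R_t \mid H_t = h_t \right] + \gamma \, \mathbb{E}_\pi\left[ v_\pi^{t+1}(H_{t+1}) \,\middle|\, H_t = h_t \right] $$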

If we further assume that the environment is Markovian and that the policy is Markovian and stationary, it can be proved (and we will, in a later episode) that the equation above takes a much simpler form.
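In the usual MDP notation, writing 𝝿(a | s) for the (now time-independent) decision rule, r(s, a) for the expected immediate reward and p(s′ | s, a) for the transition probabilities, the classical form of the Bellman equation reads:

$$ v_\pi(s) = \sum_{a \in A} \pi(a \mid s) \left[ r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \, v_\pi(s') \right] $$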

A closed relation!

We might then ask ourselves if there exists an optimal policy: a policy whose expected cumulative discounted return is greater than or equal to the one associated with any other policy. Does such a policy exist? Can we find it within the subset of Markovian stationary policies? Is there a practical algorithm to do so?

These and many other related questions will be the main topic of our next episode.

This is the second post of this series. Make sure to read Part 0!
