Key Takeaways 1 (Bandits, MDPs, and Notations) From Sutton and Barto RL Textbook

What you should know: Multi-armed bandits and Finite Markov Decision Processes

James Koh, PhD
MITB For All
13 min read · Aug 25, 2023



This is the first part of a series written to complement the seminal Reinforcement Learning (RL) textbook by Sutton and Barto.

Many years ago, when I was required to use RL for my first job, I started learning from this book. It is the textbook. You will see university courses built around it, and rightly so.

Years have passed, but the fundamental concepts remain unchanged. There will always be new state-of-the-art publications, but they will be built on these foundational ideas, which will remain evergreen.

Why This Series?

This series is written as I check my slides against the textbook before delivering them 'live' to a classroom of 40+ Masters students, mostly from the AI track. I figured that while the ideas are fresh in my mind, it would be fruitful to write a 'highlights' version of the things that come to mind when I get home at night.

This is in no way a replacement for, or summary of, the textbook; my aim is to recap key takeaways that I deem necessary for a good foundation in RL. At the same time, I have added my own comments and interpretations for you to cross-reference with your own understanding.

You may think of it as a progress checkpoint — once you are comfortable here, you should be confident of diving deeper into the subsequent chapters.

Things to Note

There are two versions: the 'in-progress' version (300+ pages) and the complete version (500+ pages). The first few chapters, under "I Tabular Solution Methods", are essentially the same. I worked closely with the 'in-progress' version, and my copy includes the notes I made while studying it myself. With fewer pages, I also find this version more palatable; hence, it is the one I will be referencing.

Knowledge of some LaTeX is assumed (just the math-related parts; you don't need to worry about formatting). The very basics, such as subscripts, superscripts, summations, and Greek letters, will be sufficient. You may refer to Wikibooks for a reader-friendly guide, or simply copy the LaTeX into Overleaf.

Multi-armed bandits (Chapter 2)

In this chapter, you will not be exposed to technical concepts such as environment, rewards, and returns (yet). Instead, you will be introduced to the ideas of stochasticity and action selection.

The bandit is basically a slot machine — it takes money away from a gambler. Rather than considering multiple slot machines next to each other, we imagine a single slot machine with multiple levers (arms), each associated with its own reward distribution. Instead of choosing between different machines, we now choose which lever to pull on a single machine.

After reading Chapter 3, you should be able to describe this scenario in RL language.

Single State

The bandit problem may be deemed the most basic case. There is just a single state (sometimes it is described as stateless, because we can drop the state entirely). It involves choosing one out of a number of possible actions.

Each lever has its own reward distribution, which we are trying to estimate by sampling (i.e., pulling the lever and observing the reward). We do not know the true action value of each lever; if it were known, the logical decision would be to select that particular action (i.e., lever) every time, and there would be nothing to learn.

The action value of each lever is unchanged by our previous actions and their observed rewards. Just because you have been getting poor payouts from some lever does not mean that it is due for a big win soon. Also, knowledge of previous state-action pairs is irrelevant (we do update the action values from the experience, though); we have only a single state, and it is like a fresh start each time.

You may notice there is not a single mention of ‘Markov’ in Chapter 2! It will be introduced in Chapter 3.

Equations

You will see an equation for Q_{t}(a) on page 33 (section 2.2). N_{t}(a) is the number of times action `a` was taken, and Q_{t}(a) is simply an average of all observed rewards corresponding to that particular action.

The authors talked about the greedy method, which is to select the action that corresponds to the highest Q (i.e., the highest action value). To incorporate exploration, the notion of ε-greedy is introduced: with a small probability ε, a random action is taken instead.

You will see another equation on page 37 (section 2.3). In plain English, it is sufficient to store the total rewards and the number of times each action is taken. We can use these to update the average; the individual rewards are not required. This saves memory and computation time.
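To tie these together, here is a minimal sketch of the sample-average method with ε-greedy selection on a made-up 10-armed testbed (the Gaussian reward distributions and all parameter values are my own assumptions, not from the book). The last line inside the loop is the incremental update idea from page 37.

import numpy as np

rng = np.random.default_rng(0)
k = 10                                   # number of arms
q_true = rng.normal(0.0, 1.0, size=k)    # hypothetical true action values, unknown to the agent

Q = np.zeros(k)          # Q_t(a): sample-average estimates
N = np.zeros(k)          # N_t(a): number of times each action has been taken
eps = 0.1                # probability of exploring

for t in range(1000):
    if rng.random() < eps:
        a = int(rng.integers(k))         # explore: pick a random arm
    else:
        a = int(np.argmax(Q))            # exploit: pick the arm with the highest estimate
    r = rng.normal(q_true[a], 1.0)       # sample a reward from that arm's distribution
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]            # incremental update of the running average

print(np.round(Q, 2))                    # estimates approach q_true for frequently pulled arms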

Nonstationary extension

In section 2.4, we hear about the concept of a non-stationary problem, i.e., a problem whose action values or transition probabilities change over time. A constant step size is suggested for such cases, which results in an exponential recency-weighted average, where recent observations are more heavily weighted.
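Concretely, the only change to the sketch above is to swap the 1/N(a) step size for a constant α. The function name and the numbers below are mine, purely for illustration.

import numpy as np

def constant_step_update(Q, a, r, alpha=0.1):
    # Exponential recency-weighted average: a constant step size means
    # recent rewards count for more, which suits drifting action values.
    Q[a] += alpha * (r - Q[a])

Q = np.zeros(10)
constant_step_update(Q, a=3, r=1.5)
print(Q[3])   # 0.15: the estimate moves a fixed fraction of the way towards the new reward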

Exploration

Moving beyond ε-greedy, we learn about Optimistic Initial Values (section 2.5). Here, we initialize all action-value estimates to be very high. Then, simply by adopting a greedy approach, all actions will inevitably be tried, i.e., exploration happens. Sutton and Barto warned, though, that "it is not well suited to nonstationary problems because its drive for exploration is inherently temporary."

You should be clear about the concept of UCB (Upper Confidence Bound) in section 2.6. In essence, we select actions for their potential to have high action values. The bound is larger when an action has been selected less frequently, because we do not yet know enough about it. Rather than selecting actions greedily based solely on Q_{t}(a), a bound, which represents the uncertainty and hence the upside potential, is added.
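The selection rule adds an uncertainty bonus of the form c * sqrt(ln t / N_t(a)) to Q_t(a). Here is a small sketch; the estimates, counts, and the choice c = 2 are made up for illustration.

import numpy as np

def ucb_action(Q, N, t, c=2.0):
    # Upper Confidence Bound selection: Q_t(a) plus an uncertainty bonus
    # that shrinks as N_t(a) grows.
    if np.any(N == 0):
        return int(np.argmax(N == 0))    # try each action at least once
    return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))

# Arm 2 has the best estimate, but rarely tried arm 0 gets the larger bonus.
Q = np.array([0.4, 0.5, 0.9])
N = np.array([1, 20, 50])
print(ucb_action(Q, N, t=71))   # 0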

Rounding up Chapter 2

For my course, I will be skipping Gradient Bandits (section 2.7) and Contextual Bandits (section 2.8). The key idea in section 2.7 is to learn preferences for different actions. When the reward is higher than expected, we increase the preference, and vice versa. This is done with gradient ascent. Note that a preference is not synonymous with the policy π, though you can apply a softmax to the preferences to determine the respective probabilities.
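Since the paragraph above mentions applying a softmax to the preferences, here is a quick sketch; the preference values are made up.

import numpy as np

def softmax(H):
    # Convert preferences H(a) into a probability distribution pi(a).
    # Subtracting the max is a standard trick for numerical stability.
    z = np.exp(H - np.max(H))
    return z / z.sum()

H = np.array([1.0, 2.0, 0.5])    # hypothetical preferences
print(softmax(H))                # higher preference -> higher probability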

Section 2.8 talks about different ‘bandit tasks’, where the ideal actions would differ depending on the context. Sutton and Barto gave the example of a slot machine changing its display color as it changes its action values. I view it as a preamble to dealing with different states.

Finite Markov Decision Processes (Chapter 3)

You should be clear about the definitions and have an idea of what you would define as an ‘agent’ vs ‘environment’, at least for the toy problems.

Never underestimate toy problems. These are where you test your RL algorithms and implementations, to see whether the benchmark performance can be achieved.

Sutton and Barto have a helpful `Summary of Notation` (pages xiii to xiv). You should appreciate the significance behind each notation's meaning; these math symbols are there to help us and to make things clear.

Don’t sweat over the bioreactor or the robots in examples 3.1 to 3.3.

Rewards (R) vs Returns (G)

You should be very clear about the difference between these two.

Formulating a suitable reward is a challenging problem in the ‘real world’. We often have multiple objectives in life—usually, we want to maximize something (or things), subject to some hard constraints, while minimizing certain costs. They may not be quantifiable in terms of each other. It is rarely the case that we could say 1 unit of X is worth 2 units of Y or 3 units of Z.

That is beyond the scope of this course. Don't worry about it for now. As a student, you will deal with toy problems that have clearly defined rewards (and lots of benchmarks).

I want to emphasize that the reward is your indication to the agent of what you want it to achieve, not how to achieve it.

Episodic and Continuing Tasks

What is an episodic task? It is a task that terminates, after which no further actions are taken and no additional rewards are received. Section 3.4 introduces the concept of "a special absorbing state that transitions only to itself and that generates only rewards of zero."

This allows equations to be expressed in a generic form, with summations up to infinity, so that episodic and continuing tasks can be handled with the same notation.
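As a small illustration of returns (my own, not from the book, using an arbitrary discount factor of 0.9): the discounted return can be computed backwards through an episode, and padding the episode with zero rewards from the absorbing state changes nothing, which is exactly why the summation-to-infinity form works.

def discounted_return(rewards, gamma=0.9):
    # G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...
    # Computed backwards: G_k = R_{k+1} + gamma * G_{k+1}.
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

# An episode's rewards; padding with zeros (the absorbing state) changes nothing.
print(discounted_return([1, 0, 2]))             # 1 + 0.9*0 + 0.81*2 = 2.62
print(discounted_return([1, 0, 2, 0, 0, 0]))    # same value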

Markov Property and MDP

We’ve now come to section 3.5!

Explained in one sentence, the Markov property means that the current state contains all the information that matters for the future; how this state came about does not matter.

See the example below. Suppose we have four particles, starting at positions (0,0), (0,1), (1,0), and (1,1). Each has its own unique trajectory. Eventually, all end up at position (0.5, 0.5), in this case by chance, since each simply follows a random walk. (The ending positions are near to, but not exactly at, the same point; this is just for illustration.)

In a Markov environment, the four particles are in identical states once they are at (0.5, 0.5); it does not matter how each particle got to that location. What counts is that it is now there.

Created by author using matplotlib and numpy.

A potential confusion is: “Are you saying all past observations are irrelevant?”

No. The observations of past trajectories help us get an idea of the environment's transitions and the potential rewards associated with each state-action pair. (Otherwise, why bother collecting experience through sampling or simulations?) It is just that, given our current state, the sequence of events that led us here does not matter.

Often, this is an approximation, but a fair one to make. Sutton and Barto wrote that, ideally, we want "a state signal that summarizes past sensations compactly, yet in such a way that all relevant information is retained. ... all that matters is in the current state signal."

If the sequence (of states, actions, and rewards) leading to our current state is important in the decision-making process, we can expand our state representation to include information from (t-1), (t-2), etc.
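For instance, here is a sketch (entirely my own construction, with hypothetical function and variable names) of folding the last few observations into the state representation:

from collections import deque

import numpy as np

def make_history_state(obs_history, n=3):
    # If a single observation is not Markov, we can define the state as the
    # concatenation of the last n observations (padding at the start of an episode).
    padded = list(obs_history)[-n:]
    while len(padded) < n:
        padded.insert(0, np.zeros_like(padded[-1]))
    return np.concatenate(padded)

history = deque(maxlen=3)
history.append(np.array([0.1, 0.2]))    # observation at t-1
history.append(np.array([0.3, 0.4]))    # observation at t
print(make_history_state(history))      # the state now carries information from t-1 and t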

Moving on to section 3.6, an MDP is an RL task that satisfies the Markov property. Sutton and Barto wrote that “they are all you need to understand 90% of modern reinforcement learning.”

That’s a couple of years back. Still, we are good at keeping to this scope. I believe in two things:

  1. All models are wrong. Some are useful. — George Box
  2. It is better to be vaguely right than precisely wrong. — John Maynard Keynes

Let’s focus on learning useful things and being vaguely right first.

You need to be able to make sense of equations (3.6) to (3.9) on page 67. They are really about the concept of weighted averages; you should understand them, not memorize them.
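As a toy illustration of the weighted-average idea (the numbers below are made up and not from the book): given the probabilities of each (next state, reward) outcome for a state-action pair, the expected reward and the state-transition probability are both just probability-weighted sums.

# Hypothetical dynamics for one state-action pair: a list of
# (probability, next_state, reward) outcomes whose probabilities sum to 1.
outcomes = [
    (0.7, "s1", 1.0),
    (0.2, "s2", 0.0),
    (0.1, "s3", -1.0),
]

# Expected immediate reward r(s, a): a probability-weighted average of rewards.
r_sa = sum(p * r for p, _, r in outcomes)
# Probability of landing in s1, marginalizing over rewards: p(s1 | s, a).
p_s1 = sum(p for p, s_next, _ in outcomes if s_next == "s1")

print(r_sa, p_s1)   # 0.6 and 0.7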

Value functions

Know that equations (3.10) and (3.11) in section 3.7 are the definitions of the state-value and action-value functions under a given policy π. Looking at the math for v_{\pi}(s) and q_{\pi}(s,a), you should be able to explain it in plain English.
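In LaTeX, and up to small notational differences from the in-progress edition, the two definitions are:

v_{\pi}(s) = \mathbb{E}_{\pi}\left[ G_t \mid S_t = s \right] = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\Big|\, S_t = s \right]

q_{\pi}(s,a) = \mathbb{E}_{\pi}\left[ G_t \mid S_t = s, A_t = a \right] = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\Big|\, S_t = s, A_t = a \right]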

Do not proceed until you can. Otherwise, you would struggle with other equations that you see moving forward.

On page 71, you have the Bellman equation for v_{\pi}. It relates the value of a state to the values of its successor states, weighted by the probabilities of actions and transitions.
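Written in the four-argument dynamics notation p(s', r | s, a) (the in-progress edition's symbols for the dynamics may differ slightly, but the structure is the same), the Bellman equation for v_{\pi} is:

v_{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \, v_{\pi}(s') \right]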

This means that just by having the values of the states immediately preceding a terminal state (which are easy to compute; each value is simply the expected immediate reward), we can iteratively determine the values everywhere else, as long as we have knowledge of the transition dynamics.

If we do not have complete knowledge of the transitions, that is also fine — we can sample observations. However, note that this would mean we are not dealing with the ‘pure’ Bellman equation. More on sampling in subsequent chapters.

I like the Gridworld scenario (example 3.8). It is perfect for checking one’s understanding.

We can solve it iteratively, or analytically via linear algebra by solving a set of simultaneous equations. For the former approach, you will know how to do it when you see the example code in subsequent articles of this series. For the latter, it can be done as follows.

First, just a recap on how to use np.linalg.solve.

import numpy as np

# Solves the simultaneous equations 2x + 3y = 11 and 3x + 8y = 13.
a = np.array([
    [2, 3],
    [3, 8],
])
b = np.array(
    [11, 13],
)
x = np.linalg.solve(a, b)
print(x)   # [ 7. -1.]

For the random policy, where the probability of moving in each of the four directions is equal, the value of each state can be expressed in terms of its immediate neighbors (and itself, for states at the edges).

Set of equations to solve simultaneously. Image by author

This can be expressed in the form of Ax = B, as follows.

The same set of equations, re-written in the form Ax = B. Image by author

The above can be represented using two numpy arrays, as shown below.

import numpy as np

def exact_solution():
    # Constructs the same `a` (25x25) and `b` (25,) shown in the image above,
    # just programmatically rather than entry by entry.
    # States are indexed 0..24 in row-major order over the 5x5 grid.
    # Special state A (index 1): every action gives +10 and moves to index 21.
    # Special state B (index 3): every action gives +5 and moves to index 13.
    # Any move off the grid leaves the state unchanged and gives a reward of -1.
    # Under the random policy, each of the four moves has probability 0.25.
    y = 0.9   # gamma
    a = np.eye(25)
    b = np.zeros((25,))
    for s in range(25):
        if s == 1:             # special state A
            a[s, 21] -= y
            b[s] = 10
            continue
        if s == 3:             # special state B
            a[s, 13] -= y
            b[s] = 5
            continue
        row, col = divmod(s, 5)
        for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:   # up, down, left, right
            nr, nc = row + dr, col + dc
            if 0 <= nr < 5 and 0 <= nc < 5:
                s_next, reward = 5 * nr + nc, 0.0    # normal move, reward 0
            else:
                s_next, reward = s, -1.0             # off the grid: stay put, reward -1
            a[s, s_next] -= 0.25 * y
            b[s] += 0.25 * reward

    x = np.linalg.solve(a, b)
    return x.reshape(5, 5)

exact_solution()

The solution gives the value function corresponding to the random policy.

Optimality

The final subtitle for this post!

We first talk about the optimal value functions in section 3.8. The optimal state-value function is denoted v_{*}(s). It is obtained by choosing the policy π that leads to the highest possible v_{\pi}(s). In a similar light, we have the optimal action-value function, denoted by q_{*}(s,a), obtained by choosing the π that leads to the highest possible q_{\pi}(s,a).

Since the Bellman equation can be applied to any policy, it can certainly be applied to the optimal policy as well. That is what you see on page 76.
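Written out (using the same dynamics notation as before), the definitions of the optimal value functions and the Bellman optimality equation for v_{*} are:

v_{*}(s) = \max_{\pi} v_{\pi}(s), \qquad q_{*}(s,a) = \max_{\pi} q_{\pi}(s,a)

v_{*}(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \, v_{*}(s') \right]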

Do not make false assumptions when interpreting the following point by Sutton and Barto that “any policy that is greedy with respect to the optimal evaluation function v_{*} is an optimal policy.”

Here, v_{*} is not just any value function. It is the value function of the optimal policy — there is no better policy. Except for the most trivial cases, we will not know the exact v_{*}, and can only work towards an approximation of it.

Even if we do have v_{*}, or a very good estimate of it, it is still necessary to perform a one-step lookahead, taking a weighted average across the environment's dynamics, before we can select the best action.

If we have q_{*}(s,a), we can choose a greedy action directly, but again, what we are working with is an approximation.
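To make the contrast of the last two paragraphs explicit:

\pi_{*}(s) = \arg\max_{a} \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \, v_{*}(s') \right] = \arg\max_{a} q_{*}(s,a)

The first form requires the one-step lookahead through the dynamics; the second is a direct argmax over the action values.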

Finally, in section 3.9, the authors talked about using parameterized functions (think neural networks) as an approximation. This is because a tabular method, where each state-action pair has its own value recorded, is not feasible for most problems, and even for some toy ones.

Conclusion

By now, you should understand the key points conveyed in Chapters 2 and 3 of the RL textbook by Sutton and Barto.

If you need any clarifications, you may post in the comments.

By next week, you should see a follow-up post on Chapters 4 and 5, which cover Dynamic Programming and Monte Carlo Methods.

Disclaimer: All opinions and interpretations are that of the writer, and not of MITB. I declare that I have full rights to use the contents published here, and nothing is plagiarized. I declare that this article is written by me and not with any generative AI tool such as ChatGPT. I declare that no data privacy policy is breached, and that any data associated with the contents here are obtained legitimately to the best of my knowledge. I agree not to make any changes without first seeking the editors’ approval. Any violations may lead to this article being retracted from the publication.
