Book Review — Reinforcement Learning: An Introduction

Dan Saunders
Apr 23 · 5 min read


Authors: Richard S. Sutton and Andrew G. Barto (go UMass!)

Edition: 2nd

Publication date: November 13, 2018

Pages: 552

My RL Background

Reinforcement learning (RL) was on the periphery of my university studies for quite some time. We did some exploratory work on framing supervised learning as an RL problem in spiking neural networks (SNNs), and ended up writing a paper on converting neural network-parameterized RL policies to SNNs in the BINDS lab.

SNNs seem especially well-suited to reward-based learning, as there are plenty of ideas to draw on from theoretical and experimental neuroscience that suggest that neural circuits learn in the presence of global neuromodulatory signals. These signals can be thought of as reward or reinforcement. There’s some good work on doing RL with SNNs (see e.g., here and here), but much work remains to be done before they can be used for RL in complex environments.

Book review

I can whole-heartedly recommend Reinforcement Learning: An Introduction. I started on the 1st edition in late 2018, realized a 2nd edition had been released, and switched over. Consequently, I missed some of the new material in the early parts of the new version, but I’m planning to read up on it sometime soon.

The writing and pacing of the book are extremely well done. I was especially impressed by the authors’ ability to tie in concepts from every part of the book throughout, which made for a great learning experience. Not only was I able to grasp the individual concepts as they were introduced, but I also felt I understood how they related to the bigger picture. There are enough details in the book to implement many of the standard algorithms in RL, and the ideas behind them are explained succinctly and carefully related to each other as they’re introduced. This allows the reader to visualize a “space” of reinforcement learning algorithms / approaches, and to see where the gaps in our knowledge lie.

Part I describes the RL problem: an agent seeks to maximize the cumulative (discounted) reward from an environment over time. At each timestep (a parameter of the agent-environment formulation), the agent receives an observation of the environment, selects an action conditioned on it (and on the history of their interaction), and obtains a reward. This agent-environment interaction is often formulated as a Markov Decision Process (MDP), “…a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker.” There are other such formalisms, like partially observable MDPs, but the book focuses on MDPs.
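
To make “cumulative (discounted) reward” concrete, here’s a minimal sketch of computing the discounted return for one episode. The function name and the example rewards are my own illustration, not code from the book.

```python
def discounted_return(rewards, gamma):
    """Discounted return of an episode: r_1 + gamma * r_2 + gamma^2 * r_3 + ...

    `rewards` is the list of rewards in the order they were received
    (a hypothetical episode); `gamma` is the discount factor in [0, 1].
    """
    g = 0.0
    # Work backwards so each step applies G_t = r_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

With gamma below 1, rewards further in the future count for less, which is what keeps the sum finite in continuing tasks.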

The authors describe the dimensions along which RL algorithms vary and how they affect learning: n-step (temporal difference) vs. “infinite-step” (Monte Carlo) returns, exploration vs. exploitation, value iteration vs. policy optimization, eligibility traces and how they generalize n-step returns, model-based vs. model-free approaches, on-policy vs. off-policy training, prediction vs. control, etc.
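
The n-step vs. Monte Carlo distinction is easy to see in code. The sketch below assumes an episodic task and a list of current value estimates; the function name, argument layout, and numbers are hypothetical choices of mine rather than anything prescribed by the book.

```python
def n_step_return(rewards, values, t, n, gamma):
    """n-step return starting at time t.

    rewards[k] is the reward received after acting at step k.
    values[k] is the current value estimate of the state at step k.
    The episode is assumed to terminate after len(rewards) steps.
    """
    T = len(rewards)
    end = min(t + n, T)
    g = 0.0
    # Sum up to n real (discounted) rewards.
    for k in range(t, end):
        g += gamma ** (k - t) * rewards[k]
    # Bootstrap from the value estimate unless the episode ended first.
    if end < T:
        g += gamma ** (end - t) * values[end]
    return g


rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.6, 0.7]
one_step = n_step_return(rewards, values, t=0, n=1, gamma=0.9)     # TD(0)-style target
monte_carlo = n_step_return(rewards, values, t=0, n=3, gamma=0.9)  # no bootstrapping
print(one_step, monte_carlo)
```

Once n reaches the remaining length of the episode, the bootstrap term disappears and the n-step return coincides with the Monte Carlo return.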

There is a brief discussion on dynamic programming, which requires complete knowledge of environment dynamics. The authors use this as a basis for the discussion of value approximation and policy gradient methods, which have no such requirement of known dynamics.
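
As a rough illustration of what “complete knowledge of environment dynamics” buys you, here’s a small value iteration sketch on a made-up three-state MDP. The transition table format and the toy MDP are assumptions of mine for illustration only.

```python
def value_iteration(P, gamma, tol=1e-10):
    """Value iteration on a fully known MDP.

    P[s][a] is a list of (probability, next_state, reward) outcomes.
    A state with no actions (empty dict) is treated as terminal.
    """
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            if not P[s]:
                continue  # terminal state keeps value 0
            # Bellman optimality backup: best expected one-step lookahead.
            best = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in P[s].values()
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V


# Toy chain: 0 -> 1 -> 2 (terminal); reaching state 2 pays a reward of 1.
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 0.0)]},
    1: {"stay": [(1.0, 1, 0.0)], "go": [(1.0, 2, 1.0)]},
    2: {},
}
print(value_iteration(P, gamma=0.9))  # {0: 0.9, 1: 1.0, 2: 0.0}
```

Note that the backup reads the full transition table `P`, which is exactly the information model-free methods don’t have.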

Model-based methods build a model of the environment and use it for planning many timesteps into the future, or simulate it to avoid querying a computationally demanding or dangerous real-life environment. Model-free methods learn directly from the environment, improving a value function to mirror the true value of states of the environment, or directly optimizing a policy to maximize cumulative discounted reward.
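
Here’s a loose sketch of the model-based idea in the spirit of Dyna-style planning: record real transitions in a table, then replay them to update action values without touching the environment again. This is a simplification under my own assumptions (a deterministic environment, sweeping over every stored transition rather than sampling them at random), and all names are hypothetical.

```python
from collections import defaultdict


def plan_with_model(model, Q, actions, gamma, alpha, sweeps):
    """Apply Q-learning updates to transitions replayed from a learned model.

    model maps (state, action) -> (reward, next_state, done).
    """
    for _ in range(sweeps):
        for (s, a), (r, s2, done) in model.items():
            if done:
                target = r
            else:
                target = r + gamma * max(Q[(s2, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q


# "Learn" the model by recording a couple of real transitions.
real_transitions = [(0, "go", 0.0, 1, False), (1, "go", 1.0, 2, True)]
model = {}
for s, a, r, s2, done in real_transitions:
    model[(s, a)] = (r, s2, done)

# Planning: two sweeps over the model, no further environment queries.
Q = defaultdict(float)
plan_with_model(model, Q, actions=["go"], gamma=0.9, alpha=1.0, sweeps=2)
print(Q[(0, "go")], Q[(1, "go")])  # 0.9 1.0
```

The second sweep is what carries the reward information back from state 1 to state 0, which a model-free learner would only get from another real episode.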

Temporal difference (TD) learning is a central component of reinforcement learning, and many parallels are drawn between RL algorithms (engineering), learning in animals (behavioral psychology), and learning in neural circuits (neuroscience). TD learning “…refers to a class of model-free reinforcement learning methods which learn by bootstrapping from the current estimate of the value function.” TD learning stipulates that error signals, or, more accurately, reward prediction error signals, convey a mismatch in expectation vs. reality, thus enabling a learning algorithm to make adjustments to reduce this mismatch.
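
The reward prediction error can be written in a few lines. Below is a minimal sketch of a tabular TD(0) update; the state names, step size, and function signature are illustrative assumptions of mine.

```python
def td0_update(V, s, r, s_next, done, alpha, gamma):
    """One tabular TD(0) update. Returns the TD (reward prediction) error."""
    # The target bootstraps from the current estimate of the next state's value.
    target = r if done else r + gamma * V[s_next]
    td_error = target - V[s]  # mismatch between "reality" and expectation
    V[s] += alpha * td_error  # nudge the estimate to reduce the mismatch
    return td_error


V = {"A": 0.0, "B": 1.0}
error = td0_update(V, "A", r=0.5, s_next="B", done=False, alpha=0.1, gamma=0.9)
print(error, V["A"])  # error is about 1.4, V["A"] moves to about 0.14
```

A positive error means things went better than predicted and the estimate goes up; a negative error pushes it down.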

On-policy algorithms collect experience with the policy being evaluated, while off-policy algorithms learn the value of a “target policy” independently of the “behavior policy” that gathers the experience. The behavior policy is often similar enough to the target policy that importance weighting can be used to apply experience gathered by the former to make parameter updates to the latter.
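
Importance weighting boils down to a ratio of action probabilities. The following is a small sketch of ordinary importance sampling over whole episodes; the function names and the probabilities are made up for illustration.

```python
def importance_ratio(target_probs, behavior_probs):
    """Product over an episode of pi(a_t | s_t) / b(a_t | s_t).

    Each list holds the probability the respective policy assigned to the
    action that was actually taken at each step.
    """
    rho = 1.0
    for p_target, p_behavior in zip(target_probs, behavior_probs):
        rho *= p_target / p_behavior
    return rho


def ordinary_is_estimate(returns, ratios):
    """Average of importance-weighted returns from the behavior policy."""
    return sum(rho * g for rho, g in zip(ratios, returns)) / len(returns)


# The target policy is greedier than the behavior policy on the first step.
rho = importance_ratio(target_probs=[1.0, 0.5], behavior_probs=[0.5, 0.5])
print(rho)  # 2.0
print(ordinary_is_estimate(returns=[1.0, 0.0], ratios=[rho, 0.0]))  # 1.0
```

An episode the target policy would never have produced gets a ratio of 0 and contributes nothing, while episodes it favors are weighted up.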

Machine learning methods are employed in the “Approximate Solution Methods” section, where features of observations from an environment are computed by an ML model or constructed by a human expert. These methods are used in cases where the environment’s observation space is combinatorially large, so finding an optimal policy is hopeless and the goal is instead to find a good approximation. For example, the recent success of RL on Atari game-playing relies on down-sampling, stacking, and processing in-game video frames with a convolutional neural network before actions can be selected.
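
The simplest case of this is a linear value function over hand-built features. Here’s a sketch of a semi-gradient TD(0) update for that setting; the feature vectors and step size are placeholder values I chose, and a real application would use far richer features.

```python
def linear_value(w, x):
    """Approximate state value as a dot product of weights and features."""
    return sum(wi * xi for wi, xi in zip(w, x))


def semi_gradient_td0(w, x, r, x_next, done, alpha, gamma):
    """One semi-gradient TD(0) step for a linear value function.

    x and x_next are feature vectors of the current and next state.
    Returns the updated weight vector.
    """
    target = r if done else r + gamma * linear_value(w, x_next)
    delta = target - linear_value(w, x)
    # For a linear model the gradient of the value w.r.t. w is just x.
    return [wi + alpha * delta * xi for wi, xi in zip(w, x)]


w = [0.0, 0.0]
w = semi_gradient_td0(w, x=[1.0, 0.0], r=1.0, x_next=[0.0, 1.0],
                      done=False, alpha=0.5, gamma=0.9)
print(w)  # [0.5, 0.0]
```

Because the weights are shared across states, one update generalizes to every state with overlapping features, which is the whole point when the state space is too large to enumerate.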

Policy gradient methods optimize parametric policies to maximize a scalar performance measure (often the cumulative discounted reward) by following the gradient of the performance measure with respect to policy parameters. Actor-critic methods learn both a policy (actor) and an approximation to the value function (critic), the latter of which is used as a baseline to reduce variance in policy gradient updates.
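
For a feel of what “following the gradient” looks like, here’s a minimal REINFORCE-style update for a softmax policy over a handful of actions in a single state (essentially a bandit). The baseline here is just a number passed in, standing in for a learned critic; names and values are my own assumptions.

```python
import math


def softmax(prefs):
    """Turn action preferences into a probability distribution."""
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def reinforce_step(prefs, action, reward, baseline, alpha):
    """One policy gradient step on softmax action preferences."""
    probs = softmax(prefs)
    advantage = reward - baseline  # the baseline reduces variance
    # d/d prefs[a] of log pi(action) is 1[a == action] - pi(a).
    return [
        p + alpha * advantage * ((1.0 if a == action else 0.0) - probs[a])
        for a, p in enumerate(prefs)
    ]


prefs = reinforce_step([0.0, 0.0], action=0, reward=1.0, baseline=0.0, alpha=0.1)
print(prefs)           # [0.05, -0.05]
print(softmax(prefs))  # action 0 is now slightly more likely
```

If the reward equals the baseline, the advantage is zero and the policy doesn’t move, which is how a good critic keeps updates from being dominated by noise.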

The exploration vs. exploitation dilemma refers to the trade-off between exploring one’s environment in order to collect information and reduce uncertainty, and exploiting one’s knowledge of the environment to maximize reward. Simply put, an RL agent’s knowledge of its environment could first be maximized, and then exploited for garnering maximum reward. This process can be iterated until a good (approximately optimal) policy is found. Exploration can be accomplished by following a policy with high entropy (e.g., actions sampled from a uniform distribution), or by following an intrinsic “curiosity-driven” reward; there are many possible methods. Exploitation consists of choosing actions that lead to the highest expected reward.
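
One common way to mix the two is epsilon-greedy action selection: explore uniformly at random with a small probability, and exploit otherwise. The sketch below is a minimal version with made-up action values; the function name and parameters are illustrative.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: uniform over actions
    # Exploit: the action with the highest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)
q_values = [0.1, 0.5, 0.9]
print(epsilon_greedy(q_values, epsilon=0.0, rng=rng))  # always greedy: 2
print(epsilon_greedy(q_values, epsilon=0.1, rng=rng))  # usually 2, sometimes random
```

Setting epsilon to 1 gives the maximum-entropy uniform policy, and annealing it toward 0 over time shifts the agent from exploring to exploiting.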

Coming from a bit of a neuroscience background, I particularly enjoyed the chapters on psychology and neuroscience, although the latter is much more compelling than the former. The classical and operant conditioning learning paradigms of behavioral psychology are related to prediction and control in reinforcement learning, respectively. It is hypothesized that dopaminergic neurons are responsible for delivering a far-ranging reward prediction error signal, suggesting a biological implementation of temporal difference learning.

The book is comprehensive, and I can’t hope to cover everything here. In sum, I’m glad to have had this as a resource when jump-starting my foray into research on and application of reinforcement learning. Give it a read and let me know what you think!

Written by Dan Saunders

MSc student in computer science at UMass Amherst. Likes machine learning and brain analogies.