RL Course by David Silver (Lectures 1 to 4)

A summary of a 15-hour dive into reinforcement learning

Cédric Bellet
Biffures
5 min read · Mar 18, 2018


I continue my quest to learn something about reinforcement learning in 60 days (this is day 20), with a 15-hour investment in DeepMind researcher David Silver’s course on reinforcement learning, which itself draws on Sutton and Barto’s 1998 book, Reinforcement Learning: An Introduction.

Lecture 1: Introduction to Reinforcement Learning

Andrey Markov (1856–1922) developed the technique now known as Markov chains in the early 20th century to examine the pattern of vowels and consonants in Alexander Pushkin’s verse novel Eugene Onegin (First Links in the Markov Chain, Hayes, 2013)
  • A state S is a Markov state if it contains all useful information from the history (see Markov property, also referred to as “memorylessness”).
  • Value is the prediction of a series of future rewards, calculated as the expected sum of discounted rewards, similar to the concept of present value in finance (a toy numerical sketch follows this list)
  • Model: P is the matrix storing the environment’s probabilities of moving from each state to every other state; R stores the expected immediate rewards associated with those transitions
  • Policy: the agent’s behaviour, i.e. how it chooses which action to take at each step
  • Prediction: evaluate the future given a policy; control: optimise the future, finding the best policy
  • Reinforcement learning: the environment is initially unknown, the agent interacts with it and improves its policy; planning: the model is known, the agent performs computations with the model to improve its policy
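
As a toy illustration of value as an expected sum of discounted rewards, and of the model (P, R), here is a minimal sketch built around a made-up three-state Markov reward process; the states, numbers and the closed-form solve are my own, not from the lecture.

```python
import numpy as np

# Toy illustration (not from the lecture): a 3-state Markov reward process.
# P[i, j] is the probability of moving from state i to state j;
# R[i] is the expected immediate reward received on leaving state i.
P = np.array([[0.0, 0.9, 0.1],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0]])   # state 2 is absorbing
R = np.array([1.0, -2.0, 0.0])
gamma = 0.9                       # discount factor

# Return G_t: the discounted sum of rewards along one sampled trajectory.
def discounted_return(rewards, gamma):
    return sum(gamma**k * r for k, r in enumerate(rewards))

# Value function: expected return from each state. Because the model (P, R)
# is known and tiny, we can solve v = R + gamma * P v directly.
v = np.linalg.solve(np.eye(3) - gamma * P, R)

print(discounted_return([1.0, -2.0, 0.0], gamma))  # 1 - 2*0.9 + 0 = -0.8
print(v)                                           # value of each state
```

Because the model is known here, the value function can be obtained by solving a linear system; the model-free methods of lecture 4 drop exactly that assumption.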

Lecture 2: Markov Decision Process

Richard Bellman (1920–1984) developed dynamic programming during his time at RAND, a research organisation working with the US government on defence and non-defence matters
  • A Markov Reward Process (MRP) is a tuple <S, P, R, γ> where S is a set of states, P the transition matrix, R the reward matrix, γ the discount factor
  • A Markov Decision Process (MDP) adds agency: <S, A, P, R, γ>, where A is a set of actions that can be taken in each state
  • A policy π is a distribution over actions given a state s, written π(a|s)
  • The state-value function is redefined as the expected present value of rewards starting from state s and following policy π
  • The action-value function is defined as the expected present value of rewards starting from state s, taking action a, and thereafter following policy π
  • Bellman equation for MDPs: the value of state s is the immediate reward for leaving state s plus the values of all possible successor states, weighted by their transition probabilities P and discounted by γ (see Bellman equation). Actions are factored in through the policy π; the equations are written out after this list.
  • Optimality: a policy is optimal if its state-value function is at least as good as that of every other policy, in every state. In any MDP there exists an optimal policy, better than or equal to all other policies, which achieves the optimal state-value and action-value functions
  • Optimal solutions for MDPs: no closed-form solution in general, but at least four iterative methods: (i) value iteration, (ii) policy iteration, (iii) Q-learning, (iv) Sarsa
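
In the lecture’s notation, with P^a_{ss'} the transition probabilities and R^a_s the expected immediate rewards, the Bellman expectation equations described above can be written as:

```latex
v_\pi(s) = \sum_{a \in A} \pi(a \mid s)\Big(R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \, v_\pi(s')\Big),
\qquad
q_\pi(s, a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \sum_{a' \in A} \pi(a' \mid s') \, q_\pi(s', a')
```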

Lecture 3: Planning by dynamic programming

  • Dynamic programming assumes full knowledge of the MDP <S, A, P, R, γ>. Evaluation: given a policy, can we determine the value function? Control: amongst all policies, what is the best possible policy?
  • Policy evaluation: start with an arbitrary value function, then sweep over all states, recalculating each value with the Bellman expectation equation. Repeat until convergence.
  • Control with policy iteration: evaluate the current policy π, then improve it by acting greedily, π’ = greedy(v_π), and repeat. Examples: 10x10 gridworld, car park transfer.
  • Control with value iteration: repeatedly apply the Bellman optimality backup to the value function, without maintaining an explicit policy. At convergence v = v*, from which an optimal policy can be extracted (a sketch of both schemes follows below).
An eloquent summary of the difference between value iteration and policy iteration: https://stackoverflow.com/questions/37370015/what-is-the-difference-between-value-iteration-and-policy-iteration (answered by zyxue)
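
To make the two control schemes concrete, here is a minimal sketch on a small made-up MDP (random transition and reward tables, not the lecture’s gridworld or car park examples); it assumes NumPy and the tabular setting, and is only an illustration of the backups, not a reference implementation.

```python
import numpy as np

# A made-up 4-state, 2-action MDP: P[s, a, s'] transition probabilities, R[s, a] rewards.
n_states, n_actions, gamma, theta = 4, 2, 0.9, 1e-8
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.normal(size=(n_states, n_actions))

def policy_evaluation(policy):
    """Iterative policy evaluation: sweep the Bellman expectation backup to convergence."""
    v = np.zeros(n_states)
    while True:
        q = R + gamma * P @ v            # q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] v[s']
        v_new = (policy * q).sum(axis=1) # average over actions under the policy
        if np.max(np.abs(v_new - v)) < theta:
            return v_new
        v = v_new

def policy_iteration():
    """Alternate full evaluation and greedy improvement until the policy stops changing."""
    policy = np.full((n_states, n_actions), 1.0 / n_actions)
    while True:
        v = policy_evaluation(policy)
        greedy = np.eye(n_actions)[np.argmax(R + gamma * P @ v, axis=1)]
        if np.allclose(greedy, policy):
            return greedy, v
        policy = greedy

def value_iteration():
    """Apply the Bellman optimality backup directly; no explicit policy is kept."""
    v = np.zeros(n_states)
    while True:
        v_new = np.max(R + gamma * P @ v, axis=1)
        if np.max(np.abs(v_new - v)) < theta:
            # extract a greedy (optimal) policy from the converged value function
            return np.eye(n_actions)[np.argmax(R + gamma * P @ v_new, axis=1)], v_new
        v = v_new

pi_pi, v_pi = policy_iteration()
pi_vi, v_vi = value_iteration()
print(np.allclose(v_pi, v_vi, atol=1e-4))  # both reach (approximately) v*
```

Both routines converge to (approximately) the same optimal value function; policy iteration runs full evaluations between greedy improvements, while value iteration folds the maximisation directly into each sweep.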

Lecture 4: Model-free prediction

Stanislaw Ulam (1909–1984) got the intuition for Monte Carlo methods while playing solitaire as he recovered from brain surgery, reflecting on how ENIAC could help him understand the game’s probabilities numerically
  • Lecture 4 focuses on evaluating a given policy in an unknown MDP; lecture 5 will explain how to find an optimal policy in an unknown MDP. For evaluation/prediction, three methods: (i) Monte Carlo learning, (ii) Temporal-Difference (TD) learning, (iii) TD(λ)
  • Monte Carlo: first-visit and every-visit Monte Carlo methods are equally useful; record the return along each path followed under the policy, then average over the number of visits to each state to estimate its value. The average can be maintained as an incremental mean, μ_k = μ_{k−1} + (1/k)(x_k − μ_{k−1}), which inspires an update of the value function based on the latest sampled return G_t: V(S_t) ← V(S_t) + α(G_t − V(S_t)) (a small sketch contrasting MC and TD(0) follows this list)
  • Temporal-Difference (TD) learning: learns from incomplete episodes by bootstrapping (updating a guess towards a guess). In TD learning, the estimated value of the next step helps estimate the value of the full path: “no need to wait until you crash to know it is going to be bad.”
TD exploits the Markov structure of the problem (we recognise the use of the Bellman equation).
TD(λ) methods “assign credit by means of the difference between temporally successive predictions” (Sutton, 1988)
  • Eligibility traces, forward view, backward view: more in Sutton and Barto’s book, Reinforcement Learning: An Introduction, which contains the details of everything discussed so far in these lectures.
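
As a minimal sketch of the two prediction methods, here is a made-up five-state random-walk chain (my own toy setup, not one of the lecture’s examples); it uses the constant step-size updates described above, assumes NumPy, and is only meant to show where each method gets its target.

```python
import numpy as np

# Toy setup: states 0..4, start in the middle, both ends terminal,
# a reward of +1 only on reaching the right end. The environment is
# only sampled, never solved, so this is model-free prediction.
n_states, gamma, alpha = 5, 1.0, 0.1
rng = np.random.default_rng(1)

def episode():
    """Sample one trajectory: list of (state, reward) pairs until termination."""
    s, traj = n_states // 2, []
    while 0 < s < n_states - 1:
        s_next = s + rng.choice([-1, 1])
        reward = 1.0 if s_next == n_states - 1 else 0.0
        traj.append((s, reward))
        s = s_next
    return traj

V_mc = np.zeros(n_states)
V_td = np.zeros(n_states)
for _ in range(5000):
    traj = episode()
    # Monte Carlo: wait for the full return G_t, then V(s) += alpha * (G_t - V(s)).
    G = 0.0
    for s, r in reversed(traj):
        G = r + gamma * G
        V_mc[s] += alpha * (G - V_mc[s])
    # TD(0): bootstrap from the next state's current estimate after every step.
    for (s, r), (s_next, _) in zip(traj, traj[1:] + [(None, 0.0)]):
        target = r + gamma * (0.0 if s_next is None else V_td[s_next])
        V_td[s] += alpha * (target - V_td[s])

print(np.round(V_mc[1:-1], 2))  # true values are 0.25, 0.5, 0.75
print(np.round(V_td[1:-1], 2))
```

The MC update waits for the complete return G_t, whereas the TD(0) target r + γV(s’) is available after a single step, which is what lets TD learn from incomplete episodes.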

For reference, a link to the book is provided below, from Sutton’s own website:

Source: http://incompleteideas.net/book/bookdraft2017nov5.pdf
