RL Course by David Silver (Lectures 1 to 4)
A summary of 15 hours into reinforcement learning
5 min read · Mar 18, 2018
I continue my quest to learn something about reinforcement learning in 60 days (this is day 20), with a 15-hour investment in DeepMind researcher David Silver's course on reinforcement learning, which itself draws on Sutton and Barto's 1998 book, Reinforcement Learning: An Introduction.
Lecture 1: Introduction to Reinforcement Learning
- A state S is a Markov state if it contains all useful information from the history (see Markov property, also referred to as “memorylessness”).
- Value is the prediction of a series of rewards, and is calculated as the expected sum of discounted rewards — similar to the concept of present value in finance.
- Model: P is the matrix storing environment probabilities to get from one state to all others; R is the matrix storing all immediate rewards for moving from one state to all other states
- Policy: the agent's behaviour; the rule for choosing which action to take in each state
- Prediction: evaluate the future given a policy; control: optimise the future, finding the best policy
- Reinforcement learning: environment unknown, agent interacts, improves policy; planning: model is known, agent performs computations with the model
Lecture 2: Markov Decision Process
- A Markov Reward Process (MRP) is a tuple <S, P, R, γ> where S is a set of states, P the transition matrix, R the reward matrix, γ the discount factor
- A Markov Decision Process (MDP) adds agency: <S, A, P, R, γ>, where A is a set of actions that can be taken in each state
- A policy π is a distribution over actions given a state s
- The state-value function is redefined as the expected present value of rewards in state s given policy π
- The action-value function is defined as the expected value of rewards from state s when taking action a and then following policy π
- Bellman equation for MDPs: the value of state s is the immediate reward for leaving s plus the discounted (γ) values of all successor states, weighted by their transition probabilities P (see Bellman equation). Actions are factored in through policy π.
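In symbols, following the lecture's notation, the Bellman expectation equation for the state-value function reads:

```latex
v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)
\Big( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \, v_\pi(s') \Big)
```

The inner sum is the expectation over next states s' for a given action a; the outer sum is the expectation over the actions the policy π might choose.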
- Optimality: a policy is optimal if its state-value function is at least as good as that of every other policy, in every state. In an MDP there always exists an optimal policy that is better than or equal to all other policies, and it achieves both the optimal value function and the optimal action-value function
- Optimal solutions for MDPs: no closed form solution (in general), but at least four iterative methods: (i) value iteration, (ii) policy iteration, (iii) Q-learning, (iv) Sarsa
Lecture 3: Planning by dynamic programming
- Dynamic programming assumes full knowledge of the MDP <S, A, R, P, γ>. Evaluation: Given a policy, can we determine the value function? Control: Amongst all policies, what is the best possible policy?
- Policy evaluation: start with a value function, then iterate over all states to calculate the new value function in single sweeps using the Bellman equation. Repeat until convergence.
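The sweep described above can be sketched in a few lines of Python. The MDP here (a 5-state corridor with a -1 reward per move) is an illustrative assumption of mine, not the lecture's example:

```python
# Hypothetical 5-state corridor MDP: states 0..4, state 4 terminal,
# actions "left"/"right", reward -1 per move (all illustrative assumptions).
N, TERMINAL, GAMMA = 5, 4, 1.0
ACTIONS = ("left", "right")

def step(s, a):
    """Deterministic dynamics; the terminal state absorbs with zero reward."""
    if s == TERMINAL:
        return s, 0.0
    s2 = max(0, s - 1) if a == "left" else min(N - 1, s + 1)
    return s2, -1.0

def policy_evaluation(policy, theta=1e-10):
    """Sweep all states with the Bellman expectation equation until the
    value function stops changing (in-place, sweep-by-sweep updates)."""
    v = [0.0] * N
    while True:
        delta = 0.0
        for s in range(N):
            new_v = 0.0
            for a in ACTIONS:
                s2, r = step(s, a)
                new_v += policy[s][a] * (r + GAMMA * v[s2])
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v

# Uniform random policy: each action with probability 0.5 in every state.
random_policy = [{"left": 0.5, "right": 0.5} for _ in range(N)]
v = policy_evaluation(random_policy)
```

Under the random policy the values are simply minus the expected number of steps to reach the terminal state, so they grow more negative the further a state is from the goal.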
- Control with policy iteration: evaluate a policy π, then improve the policy by acting greedily, π' = greedy(v_π), and repeat. Examples from the lecture: a small gridworld and the car-rental problem.
- Control with value iteration: apply the Bellman optimality backup directly, without maintaining an explicit policy. At convergence v = v*, and we can extract the optimal policy by acting greedily with respect to v*
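A minimal sketch of value iteration with greedy policy extraction, on a toy 5-state corridor MDP of my own (an illustrative assumption, not the lecture's example):

```python
# Hypothetical 5-state corridor MDP: states 0..4, state 4 terminal,
# actions "left"/"right", reward -1 per move (all illustrative assumptions).
N, TERMINAL, GAMMA = 5, 4, 1.0
ACTIONS = ("left", "right")

def step(s, a):
    """Deterministic dynamics; the terminal state absorbs with zero reward."""
    if s == TERMINAL:
        return s, 0.0
    s2 = max(0, s - 1) if a == "left" else min(N - 1, s + 1)
    return s2, -1.0

def q(s, a, v):
    """One-step lookahead value of taking action a in state s."""
    s2, r = step(s, a)
    return r + GAMMA * v[s2]

# Value iteration: sweep states applying the Bellman optimality backup
#   v(s) <- max_a [ R(s,a) + gamma * v(s') ]
# No explicit policy is maintained during the sweeps.
v = [0.0] * N
while True:
    delta = 0.0
    for s in range(N):
        best = max(q(s, a, v) for a in ACTIONS)
        delta = max(delta, abs(best - v[s]))
        v[s] = best
    if delta < 1e-10:
        break

# At convergence v == v*; the optimal policy is greedy with respect to v*.
policy = [max(ACTIONS, key=lambda a: q(s, a, v)) for s in range(N)]
```

On this corridor the optimal values are just minus the distance to the goal, and the extracted policy moves right from every non-terminal state.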
Lecture 4: Model-free prediction
- Lecture 4 focuses on evaluating a policy in an unknown MDP; lecture 5 will explain how to find an optimal policy for unknown MDPs. For evaluation/prediction, three methods: (i) Monte Carlo, (ii) Temporal-Difference (TD) learning, (iii) TD(λ)
- Monte Carlo: first-visit and every-visit Monte Carlo methods are equally useful; record the return for each path followed under a policy, then average over the number of visits to get an estimate of the value function in a given state. We can use incremental means to update the value function over iterations.
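In the lecture's notation, the incremental-mean update after observing a return G_t from a visit to state S_t is:

```latex
N(S_t) \leftarrow N(S_t) + 1, \qquad
V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)} \big( G_t - V(S_t) \big)
```

Replacing 1/N(S_t) with a fixed step size α gives a running mean that forgets old episodes, which is useful in non-stationary problems.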
- Temporal-Difference: learns from incomplete episodes, by bootstrapping (updates a guess towards a guess). In TD learning, the value of the next step helps understand value of a full path: “no need to wait until you crash to know it is going to be bad.”
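A sketch of the TD(0) update on a simple random-walk chain (a standard illustration of TD prediction; the chain, rewards, and step size here are my assumptions, not taken verbatim from the lecture):

```python
import random

random.seed(0)

# Hypothetical random-walk chain: non-terminal states 1..5, terminals 0 and 6;
# reward +1 only on reaching state 6. Under the random policy the true
# value of state i is i/6.
ALPHA, GAMMA, EPISODES = 0.05, 1.0, 5000
V = [0.0] * 7  # V[0] and V[6] are terminal and stay at 0

for _ in range(EPISODES):
    s = 3  # every episode starts in the middle of the chain
    while s not in (0, 6):
        s2 = s + random.choice((-1, 1))
        r = 1.0 if s2 == 6 else 0.0
        # TD(0) update: move V(s) towards the bootstrapped target
        # r + gamma * V(s2), without waiting for the episode to end.
        V[s] += ALPHA * (r + GAMMA * V[s2] - V[s])
        s = s2
```

Because each step updates V(s) from the estimate V(s2) of the next state, learning happens online, mid-episode, rather than after the full return is known.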
- TD(λ) is a method proposed by Richard Sutton in 1988 that generalizes Monte-Carlo and the above TD example by offering a spectrum between the two.
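In symbols, TD(λ) weights the n-step returns G_t^{(n)} geometrically to form the λ-return, which then drives the update:

```latex
G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)},
\qquad
V(S_t) \leftarrow V(S_t) + \alpha \big( G_t^{\lambda} - V(S_t) \big)
```

Setting λ = 0 recovers one-step TD, while λ = 1 recovers Monte-Carlo evaluation.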
- Eligibility traces, forward view, backward view: more in Sutton and Barto’s book, Reinforcement Learning: an Introduction, which contains the details to everything discussed so far in these lectures.
For reference, the book is available from Sutton's own website.