RL Course by David Silver (Lectures 5 to 7)

This is day 45 of my 60-day reinforcement learning challenge, and we continue the series of lectures by DeepMind's David Silver.

Cédric Bellet · Published in Biffures · Apr 13, 2018

The lectures draw from and complement Sutton and Barto’s book, Reinforcement Learning: An Introduction, which is available for free on Rich Sutton’s site, http://incompleteideas.net/.

Lecture 5: Model-Free Control

  • We still operate in the context of an unknown Markov Decision Process, and we apply generalised policy iteration (GPI) to find optimal policies.
  • ε-greedy policies take the greedy action with probability 1-ε and a uniformly random action with probability ε, so every action keeps a non-zero chance of being selected. They allow for continued exploration of all states and actions. We can show that the ε-greedy policy π’ with respect to q_π is an improvement over the initial ε-greedy policy π, i.e. v_π’(s) ≥ v_π(s).
  • Greedy in the Limit with Infinite Exploration (GLIE): exploration decays over time while every state-action pair is still visited infinitely often; for example, an ε-greedy policy whose ε decays as 1/N, where N is the number of visits to a given state (or the episode count). GLIE Monte Carlo control combines Monte Carlo evaluation with such a GLIE ε-greedy improvement step, and converges to the optimal action-value function.
  • Alternatives to Monte Carlo. On-policy: SARSA (Rummery & Niranjan, 1994), off-policy: Q-learning (Watkins, 1989); hybrid: Expected SARSA
SARSA updates the action-value function using the value of the action it would take in the next state; Q-learning updates towards the best possible (greedy) action in the next state; Expected SARSA takes into account all likely actions given the policy, and as a result “moves deterministically in the same direction as SARSA moves in expectation” (Sutton & Barto). See the minimal update sketch after this list.
  • Sarsa(λ), Q(λ): n-step SARSA uses the n-step Q-return; as n goes to infinity, n-step SARSA becomes Monte Carlo. As in lecture 4, we can generalise the n-step approach into a λ version by averaging over all n-step returns with geometrically decaying weights.
  • We discuss forward-view and backward-view Sarsa(λ) and eligibility traces; at this stage things get confusing and I go back to Sutton & Barto’s book.
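To make the difference between the three targets concrete, here is a minimal tabular sketch of one update step for each method. The Q table (a NumPy array indexed by state and action), the step size alpha, the discount gamma and the ε-greedy policy assumption are mine for illustration, not code from the lecture.

```python
import numpy as np

def epsilon_greedy_probs(Q, state, epsilon, n_actions):
    """Action probabilities under an epsilon-greedy policy in `state`."""
    probs = np.full(n_actions, epsilon / n_actions)
    probs[np.argmax(Q[state])] += 1.0 - epsilon
    return probs

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    # On-policy: bootstrap on the action actually taken in the next state.
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    # Off-policy: bootstrap on the best (greedy) action in the next state.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def expected_sarsa_update(Q, s, a, r, s_next, alpha, gamma, epsilon, n_actions):
    # Bootstrap on the expectation over the policy's action probabilities.
    probs = epsilon_greedy_probs(Q, s_next, epsilon, n_actions)
    target = r + gamma * probs @ Q[s_next]
    Q[s, a] += alpha * (target - Q[s, a])
```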

Lecture 6: Value Function Approximation

  • Aside: by now, after following David’s course and reading bits of Sutton’s book, I think we ideally want a “Dyna, double Q learning, Expected SARSA(λ)” system, i.e. one that learns from both model and experience (Dyna), reduces bias by using two value functions (double learning), takes into account all possible paths (Expected) and all their future rewards, bootstrapped and real (λ approach).
  • In lecture 6 we discuss how to address situations where the state space is too big to be solved in a tabular way, and we use value function approximation (e.g., ANNs, Fourier bases) to solve those problems. We write v_hat and q_hat for the approximators, and ϕ(s,a) for the features of a state (or state-action pair) that serve as inputs to those functions; we focus on differentiable approximators. Table lookup can be seen as a special case of value function approximation, so this lecture generalises the previous ones.
  • State-value function approximation: if we know the actual value function (an ‘oracle’ tells us what it is), then we can use gradient descent to minimise the squared error between our approximator and the real value function. Without an oracle, we use Monte Carlo or temporal difference to establish what the target might be, and substitute an MC or TD target (or better, an n-step TD or TD(λ) target) for the oracle. A minimal sketch of linear TD(0) prediction follows this list.
  • Action-value function approximation: same idea, but the target is now a Monte-Carlo, SARSA, Q-learning, Expected SARSA, Expected SARSA(λ) or other control target. Learning the action-value function allows us to do control.
  • Experience replay: to be data-efficient and to break the correlation between consecutive experiences, we store experiences (s, a, r, s’, a’) in a replay buffer and train the approximator on random samples drawn from that buffer (see the minimal buffer sketch after this list).
  • Deep Q-Networks (DQN): DQN was proposed by DeepMind in a 2015 Nature paper (Human-level control through deep reinforcement learning, Mnih, Kavukcuoglu, Silver et al., 2015). DQN uses two action-value estimators, an online network and a periodically frozen target network (not unlike double learning), together with experience replay, a TD(0)-type target, and stochastic gradient descent. The combination is stable even with deep neural networks as approximators, whereas naive Q-learning with non-linear approximation can oscillate or diverge.
DQN algorithm as described in Mnih, Kavukcuoglu, Silver et al., 2015
  • In the case of linear value function approximation, we can also solve for the least-squares solution directly in closed form (least-squares prediction and control), instead of iterating with gradient descent.
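As a concrete illustration of the state-value case, here is a minimal sketch of linear TD(0) prediction, where the TD target stands in for the missing oracle. The Gymnasium-style reset()/step() environment API, the feature map phi and the hyper-parameters are assumptions for the sketch, not the lecture’s code.

```python
import numpy as np

def semi_gradient_td0(env, phi, n_features, policy=None,
                      episodes=500, alpha=0.05, gamma=0.99):
    """Linear TD(0) prediction: v_hat(s) = phi(s) . w."""
    w = np.zeros(n_features)
    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            a = policy(s) if policy else env.action_space.sample()
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # The TD(0) target replaces the missing oracle value.
            target = r if done else r + gamma * phi(s_next) @ w
            # Semi-gradient step: we differentiate v_hat(s) but not the target.
            w += alpha * (target - phi(s) @ w) * phi(s)
            s = s_next
    return w
```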
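And a minimal replay buffer of the kind described above; the capacity and the transition layout are arbitrary choices for illustration.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of transitions; uniform sampling breaks the
    correlation between consecutive experiences."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```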

Lecture 7: Policy Gradient

  • Policy gradient reinforcement learning optimises a policy directly instead of estimating an action-value function from which we infer the policy. In the rest of lecture 7, we parameterise the policy π with parameters θ, and π_θ(s, a) denotes the probability of taking action a in state s, given parameters θ.
  • Optimising π_θ: to optimise π_θ, we define an objective function J that measures the policy’s quality and that we want to maximise. Below, d^{π_θ}(s) denotes the stationary distribution of the Markov chain induced by π_θ. To optimise J, we can use gradient-free methods (hill climbing, simplex, genetic algorithms) or gradient-based methods (gradient ascent, conjugate gradient, etc.).
The candidate objectives are the start value J_1(θ) = V^{π_θ}(s_1), used for episodic MDPs, and the average value J_avV(θ) = Σ_s d^{π_θ}(s) V^{π_θ}(s) or the average reward per time-step J_avR(θ) = Σ_s d^{π_θ}(s) Σ_a π_θ(s, a) R_s^a, both used for continuing MDPs.
  • Computing the gradient of J: finite differences. We can approximate ∇J by computing (J(θ + ε u_k) - J(θ)) / ε for each dimension k, where u_k is the unit vector along that dimension and ε is a small perturbation. Simple, noisy, and inefficient when there are many parameters (e.g., millions), but it works even for non-differentiable policies and is sometimes effective; a sketch follows this list.
  • Computing the gradient of J: using score functions. We exploit the likelihood-ratio identity ∇_θ π_θ = π_θ (∇_θ π_θ / π_θ) = π_θ ∇_θ ln π_θ. We call ∇_θ ln π_θ(s, a) the score function, a quantity convenient to compute in some scenarios (see the softmax and Gaussian policies below). We can show that for any differentiable policy π_θ and any of the policy objective functions above, the policy gradient is ∇_θ J(θ) = 𝔼_{π_θ}[ ∇_θ ln π_θ(s, a) Q^{π_θ}(s, a) ] (the policy gradient theorem).
  • Softmax policy. Consider a policy in which action probabilities are given by a softmax over linear combinations of features ϕ(s,a)·θ: the probability of action a is π_θ(s, a) = exp(ϕ(s,a)·θ) / Σ_b exp(ϕ(s,b)·θ); then the score function is ∇_θ ln π_θ(s, a) = ϕ(s,a) - 𝔼_{π_θ}[ϕ(s,·)].
  • Gaussian policy. Consider a policy in which the action is drawn from a Gaussian (normal) distribution 𝓝(μ(s), σ²), where the variance σ² is fixed or parameterised, and we assume the mean is a linear combination of state features, μ(s) = ϕ(s)·θ. Then ∇_θ ln π_θ(s, a) = (a - μ(s)) ϕ(s) / σ².
  • Monte-Carlo Policy Gradient (REINFORCE) is an algorithm proposed by Ronald J. Williams in 1992 (Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning), which applies the policy gradient theorem. As with any Monte Carlo method, we wait until the end of an episode to update the parameters. Since we know nothing of the action-value function Q, we use the actual return v_t as an unbiased sample of it (see the sketch after this list).
REINFORCE algorithm (David Silver, lecture 7)
  • Actor-Critic algorithms combine policy learning (lecture 7) and action-value learning (lectures 5 and 6): the critic updates the action-value function, the actor updates the policy parameters in the direction suggested by the critic. As in previous lectures, the critic can learn Q through TD(0), MC, TD(λ)… Actor-critic improves on Q-learning in the sense that we do not follow an ε-greedy policy but learn a more nuanced, explicitly parameterised policy; and it improves on pure policy gradient through the use of a learnt action-value function Q_w (a one-step sketch follows this list).
  • Baseline and advantage function: in the actor-critic update of θ, we subtract a baseline B(s) from the action-value function: Δθ = α ∇_θ ln π_θ(s, a) (Q_w(s, a) - B(s)). We can show that setting B(s) = V_π(s) reduces the variance of the update without changing its expected direction of ascent. We call A_π(s, a) = Q_π(s, a) - V_π(s) the advantage function, and we interpret it as a measure of “how much better than usual is action a in state s”. We update the policy in the direction that encourages actions showing the greatest advantage.
  • Simplified actor-critic. The update above seems to require both the action-value and the state-value functions. However, the TD error δ = r + γ V_π(s’) - V_π(s) is an unbiased estimate of the advantage A_π(s, a), so we can use δ in place of the advantage and only need to learn the state-value function.
  • Natural policy gradient, natural actor-critic: more in A Natural Policy Gradient (Kakade, 2002). The premise is that standard gradient descent is dimensionally inconsistent, and the natural gradient is a more complicated but dimensionally consistent gradient that results in “drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris.”
  • More on policy gradient methods: http://www.scholarpedia.org/article/Policy_gradient_methods
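For reference, a minimal sketch of the finite-difference estimate mentioned above. Here J is any (possibly noisy) scalar evaluation of the policy with parameters theta, and theta is assumed to be a flat NumPy vector; both are illustrative assumptions.

```python
import numpy as np

def finite_difference_gradient(J, theta, eps=1e-4):
    """Estimate grad J(theta), one dimension at a time.

    One extra policy evaluation is needed per dimension, which is why the
    method scales poorly when there are many parameters.
    """
    grad = np.zeros_like(theta)
    J0 = J(theta)
    for k in range(theta.size):
        theta_k = theta.copy()
        theta_k[k] += eps          # perturb a single dimension
        grad[k] = (J(theta_k) - J0) / eps
    return grad
```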
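Next, a sketch of the softmax score function and of a REINFORCE update from a completed episode. The feature map phi(s, a), the discrete action list, the step size and the episode layout (a list of (state, action, reward) triples) are assumptions for the sketch.

```python
import numpy as np

def softmax_probs(theta, phi, state, actions):
    """pi(a|s) proportional to exp(phi(s, a) . theta)."""
    prefs = np.array([phi(state, a) @ theta for a in actions])
    prefs -= prefs.max()                      # for numerical stability
    expd = np.exp(prefs)
    return expd / expd.sum()

def softmax_score(theta, phi, state, action, actions):
    """Score function: grad log pi = phi(s, a) - E_pi[ phi(s, .) ]."""
    probs = softmax_probs(theta, phi, state, actions)
    expected_phi = sum(p * phi(state, b) for p, b in zip(probs, actions))
    return phi(state, action) - expected_phi

def reinforce_update(theta, phi, episode, actions, alpha=0.01, gamma=0.99):
    """REINFORCE: one policy-gradient update from a finished episode."""
    G = 0.0
    for s, a, r in reversed(episode):
        G = r + gamma * G                     # return from step t onwards
        theta = theta + alpha * G * softmax_score(theta, phi, s, a, actions)
    return theta
```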
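Finally, a sketch of a single TD(0) actor-critic update that uses the TD error as the advantage estimate. The state-feature map phi_s, the score_fn argument and the step sizes are assumptions; this illustrates the idea rather than the lecture’s pseudocode.

```python
import numpy as np

def actor_critic_step(theta, w, phi_s, score_fn, s, a, r, s_next,
                      alpha_theta=0.01, alpha_w=0.1, gamma=0.99, done=False):
    """One step of TD(0) actor-critic with a linear critic V(s) = phi_s(s) . w.

    `score_fn(theta, s, a)` should return grad_theta log pi_theta(s, a),
    e.g. the softmax score from the previous sketch.
    """
    v_s = phi_s(s) @ w
    v_next = 0.0 if done else phi_s(s_next) @ w
    delta = r + gamma * v_next - v_s        # TD error, estimates the advantage
    w = w + alpha_w * delta * phi_s(s)      # critic: semi-gradient TD(0) on V
    theta = theta + alpha_theta * delta * score_fn(theta, s, a)  # actor step
    return theta, w
```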
