RL Course by David Silver (Lectures 8 & 9)

We finish the series of lectures by David Silver and move on to implementation exercises.

Cédric Bellet
Biffures
3 min read · Apr 13, 2018


Lecture 8: Integrating Learning and Planning

  • A model parameterised by η is (at least) a tuple <Pη, Rη> of state transitions and rewards, such that S_{t+1} ~ Pη(S_{t+1} | S_t, A_t) and R_{t+1} = Rη(R_{t+1} | S_t, A_t).
  • We learn the model by counting, over all visits to each state-action pair, the observed transitions and rewards. For example, from a two-state problem with states A and B, we can infer a model directly from experience (see the sketch after this list).
  • Once we have a model, we can use a planning algorithm (value iteration, policy iteration, tree search…); however, the recommended approach is to use the model only to generate samples and to apply model-free RL to those samples, e.g., MC control, Sarsa, Q-learning. Silver revisits the A, B problem to show how the learnt model can be used to create new samples.
  • Dyna (proposed by Rich Sutton): learn a model from real experience; learn and plan the value function from both real and simulated experience (a Dyna-Q sketch follows below).
Dyna as explained in Reinforcement Learning: An Introduction (Rich Sutton)
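
A minimal sketch of table-lookup model learning and sample generation for a two-state problem; the episodes, state names, and the single "go" action below are illustrative assumptions, not the lecture's exact data:

```python
from collections import defaultdict
import random

# Illustrative experience for a two-state problem (A, B); each episode is a
# list of (state, action, reward, next_state) transitions. 'T' marks termination.
episodes = [
    [("A", "go", 0, "B"), ("B", "go", 0, "T")],
    [("B", "go", 1, "T")],
    [("B", "go", 1, "T")],
    [("B", "go", 0, "T")],
]

# Table-lookup model: count transitions and average rewards per (s, a).
counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> next_state -> count
reward_sum = defaultdict(float)                  # (s, a) -> summed reward
visits = defaultdict(int)                        # (s, a) -> visit count

for episode in episodes:
    for s, a, r, s_next in episode:
        counts[(s, a)][s_next] += 1
        reward_sum[(s, a)] += r
        visits[(s, a)] += 1

def model(s, a):
    """Sample (reward, next_state) from the learnt model <P_eta, R_eta>."""
    next_states = list(counts[(s, a)].keys())
    probs = [counts[(s, a)][s2] / visits[(s, a)] for s2 in next_states]
    s_next = random.choices(next_states, weights=probs)[0]
    r = reward_sum[(s, a)] / visits[(s, a)]      # mean-reward estimate
    return r, s_next

print(model("B", "go"))   # simulated experience, usable by model-free RL
```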

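And a compact Dyna-Q sketch in the spirit of the algorithm from Sutton's book (the figure above): direct Q-learning from real experience, a table model that records observed outcomes, and n planning updates replayed from that model after every real step. The environment interface (reset, step, actions) and all hyperparameters are assumptions:

```python
import random
from collections import defaultdict

def dyna_q(env, n_planning=10, episodes=100, alpha=0.1, gamma=0.95, eps=0.1):
    """Dyna-Q: direct RL from real experience plus planning from a learnt model.
    `env` is assumed to expose reset() -> state, step(a) -> (state, reward, done),
    and a discrete action list env.actions."""
    Q = defaultdict(float)                 # (s, a) -> value
    model = {}                             # (s, a) -> (reward, next_state, done)

    def eps_greedy(s):
        if random.random() < eps:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = eps_greedy(s)
            s2, r, done = env.step(a)
            # (a) direct RL: Q-learning update from real experience
            target = r + (0 if done else gamma * max(Q[(s2, b)] for b in env.actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            # (b) model learning: remember the last observed outcome
            model[(s, a)] = (r, s2, done)
            # (c) planning: replay n simulated transitions from the model
            for _ in range(n_planning):
                (ps, pa), (pr, ps2, pdone) = random.choice(list(model.items()))
                ptarget = pr + (0 if pdone else gamma * max(Q[(ps2, b)] for b in env.actions))
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
            s = s2
    return Q
```
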
Simulation-based search: leveraging planning for better action choices

  • Forward, simulation-based search: build a search tree with the current state s at the root, using a model of the MDP to look ahead (solve sub-MDP starting from now)
  • Monte-Carlo simple search: given a model M and a simulation policy π, for each action a available in the current state s: (i) simulate K full episodes starting from s with action a, (ii) evaluate each action by its mean return, (iii) select the action with the highest value (a sketch follows this list).
  • Monte-Carlo Tree Search (MCTS): given a model M and a simulation policy π: (i) simulate K episodes from the current state s, (ii) build a search tree containing the visited states and actions, (iii) evaluate Q(s,a) by averaging the returns of the simulated episodes that pass through (s,a), (iv) select the action with the highest value at the root of the search tree (a UCT-style sketch also follows this list). More: https://en.wikipedia.org/wiki/Monte_Carlo_tree_search
  • Temporal-Difference search: also possible; apply TD learning (bootstrapping) to the simulated experience instead of Monte-Carlo evaluation.
  • Dyna-2: the agent stores two sets of feature weights: long-term memory, short-term memory. Long-term memory is updated from real experience using TD learning. Short-term memory is updated from simulated experience using TD search. More: Sample-based learning and search with permanent and transient memories (Silver, Sutton, Müller, 2008)
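
A sketch of simple Monte-Carlo search under assumed interfaces: model(s, a) returns (reward, next_state, done) and pi(s) is the simulation policy; K, gamma, and max_depth are illustrative parameters:

```python
def mc_simple_search(s, actions, model, pi, K=100, gamma=1.0, max_depth=50):
    """Evaluate each candidate action from state s by the mean return of K
    simulated episodes, then act greedily with respect to those estimates."""
    def rollout(s, a):
        total, discount = 0.0, 1.0
        r, s2, done = model(s, a)          # take the candidate action first
        total += discount * r
        depth = 0
        while not done and depth < max_depth:
            discount *= gamma
            r, s2, done = model(s2, pi(s2))  # then follow the simulation policy
            total += discount * r
            depth += 1
        return total

    q = {a: sum(rollout(s, a) for _ in range(K)) / K for a in actions}
    return max(q, key=q.get)               # real action = argmax of mean returns
```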

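And a bare-bones MCTS sketch in the UCT style, under the same assumed model and rollout-policy interfaces; the tree is stored implicitly as visit counts and mean returns per (s, a), and in-tree actions are chosen by an upper confidence bound:

```python
import math

def mcts(root, actions, model, pi, K=1000, gamma=1.0, c=1.4, max_depth=50):
    """Monte-Carlo tree search: grow a search tree from the root state and
    evaluate Q(s, a) by the mean return of simulations passing through (s, a)."""
    N = {}   # (s, a) -> visit count
    Q = {}   # (s, a) -> mean return

    def ucb_action(s):
        total = sum(N.get((s, a), 0) for a in actions) + 1
        def score(a):
            n = N.get((s, a), 0)
            if n == 0:
                return float("inf")          # try unvisited actions first
            return Q[(s, a)] + c * math.sqrt(math.log(total) / n)
        return max(actions, key=score)

    def rollout(s, depth):
        total, discount, done = 0.0, 1.0, False
        while not done and depth < max_depth:
            r, s, done = model(s, pi(s))
            total += discount * r
            discount *= gamma
            depth += 1
        return total

    def simulate(s, depth):
        if depth >= max_depth:
            return 0.0
        a = ucb_action(s)
        r, s2, done = model(s, a)
        if (s, a) not in N:                  # leaf: expand, then roll out
            ret = r + (0.0 if done else gamma * rollout(s2, depth + 1))
        else:                                # in-tree: recurse
            ret = r + (0.0 if done else gamma * simulate(s2, depth + 1))
        N[(s, a)] = N.get((s, a), 0) + 1     # back up mean return
        Q[(s, a)] = Q.get((s, a), 0.0) + (ret - Q.get((s, a), 0.0)) / N[(s, a)]
        return ret

    for _ in range(K):
        simulate(root, 0)
    return max(actions, key=lambda a: Q.get((root, a), float("-inf")))
```
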
Lecture 9: Exploration and Exploitation

  • Approaches: random exploration (e.g., ε-greedy); optimism in the face of uncertainty (prefer to explore states/actions with the highest uncertainty); information state space (look ahead to see how gathering information helps future reward)
  • For most of the lecture, see chapter 2 of Sutton’s book (http://incompleteideas.net/book/bookdraft2017nov5.pdf).
  • Random exploration: ε-greedy, softmax, Gaussian noise
  • Optimism in the face of uncertainty: optimistic initialisation, UCB, Thompson sampling (a bandit sketch comparing ε-greedy, UCB and Thompson sampling follows this list)
Optimism under uncertainty: UCB values action a at its upper confidence bound given the uncertainty in its estimate; Thompson sampling samples the available action values from their posteriors and picks the best action in the sample.
  • Information state space: Gittins indices, Bayes-adaptive MDPs
  • Can we quantify the value of information? How much reward would a decision-maker be prepared to pay in order to obtain that information? This trades long-term reward against immediate reward. At each step, S_tilde summarises all the information accumulated so far; each action A causes a transition to a new information state S'_tilde. Consider a coin-flip challenge (a Bernoulli bandit): the information state is <count of pulls where the reward was 0, count of pulls where the reward was 1>.
  • One successful approach to the exploration/exploitation trade-off: construct a model of the MDP; for unknown or poorly estimated states, replace the reward function with r_max (an exploration incentive); then solve the MDP (see the sketch below).
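
A small Bernoulli-bandit sketch comparing the three families of strategies above: ε-greedy (random exploration), UCB1 (optimism in the face of uncertainty), and Thompson sampling, which samples from a Beta posterior built from exactly the <0-reward count, 1-reward count> information state mentioned in the bullet above. Arm probabilities and the horizon are illustrative assumptions:

```python
import math, random

def eps_greedy(successes, pulls, eps=0.1):
    """Random exploration: with probability eps pick any arm, else the greedy one."""
    if random.random() < eps:
        return random.randrange(len(pulls))
    means = [s / n if n else 0.0 for s, n in zip(successes, pulls)]
    return max(range(len(pulls)), key=lambda i: means[i])

def ucb1(successes, pulls):
    """Optimism in the face of uncertainty: value each arm at its upper confidence bound."""
    t = sum(pulls) + 1
    def bound(i):
        if pulls[i] == 0:
            return float("inf")
        return successes[i] / pulls[i] + math.sqrt(2 * math.log(t) / pulls[i])
    return max(range(len(pulls)), key=bound)

def thompson(successes, pulls):
    """Probability matching: sample each arm's success rate from its Beta posterior,
    whose parameters are the <0-reward count, 1-reward count> information state."""
    samples = [random.betavariate(1 + s, 1 + (n - s)) for s, n in zip(successes, pulls)]
    return max(range(len(pulls)), key=lambda i: samples[i])

# Tiny simulation loop (arm probabilities and horizon are assumptions).
probs = [0.3, 0.5, 0.7]
pulls, successes = [0] * 3, [0] * 3
for _ in range(1000):
    i = thompson(successes, pulls)        # swap in eps_greedy or ucb1 to compare
    reward = 1 if random.random() < probs[i] else 0
    pulls[i] += 1
    successes[i] += reward
print(pulls)   # most pulls should concentrate on the best arm
```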

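Finally, a sketch of the r_max idea from the last bullet: build an empirical model, treat poorly visited state-action pairs as maximally rewarding, and solve the resulting optimistic MDP with value iteration. The visit threshold m, the r_max value, and the data layout are assumptions:

```python
def optimistic_model(counts, rewards, transitions, states, actions, r_max=1.0, m=5):
    """Return reward and transition estimates in which poorly known (s, a) pairs
    are treated as maximally rewarding, which drives the agent to explore them.
    counts[(s, a)] is the visit count; rewards[(s, a)] the summed reward;
    transitions[(s, a)][s2] the observed transition count."""
    R, P = {}, {}
    for s in states:
        for a in actions:
            n = counts.get((s, a), 0)
            if n < m:
                # Unknown pair: pretend it yields r_max and self-loops.
                R[(s, a)] = r_max
                P[(s, a)] = {s: 1.0}
            else:
                R[(s, a)] = rewards[(s, a)] / n
                P[(s, a)] = {s2: c / n for s2, c in transitions[(s, a)].items()}
    return R, P

def value_iteration(R, P, states, actions, gamma=0.95, iters=200):
    """Solve the optimistic MDP; acting greedily w.r.t. V explores unknown pairs."""
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {s: max(R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
                    for a in actions)
             for s in states}
    return V
```
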
I won't cover lecture 10 here, although it is interesting. The lecture video is here and the associated slides (which are not readable in the video) are here.

Additional resources
