Meta reinforcement learning can be particularly challenging because the agent must not only adapt to new incoming data but also find an efficient way to explore the new environment. Current meta-RL algorithms rely heavily on on-policy experience, which limits their sample efficiency. Worse still, most of them lack mechanisms to reason about task uncertainty when adapting to a new task, which limits their effectiveness in sparse-reward problems.
We discuss a meta-RL algorithm that attempts to address these challenges. In a nutshell, the algorithm, Probabilistic Embeddings for Actor-Critic RL (PEARL), proposed by Rakelly & Zhou et al. at ICML 2019, consists of two parts: it learns a probabilistic latent context that sufficiently describes a task, and, conditioned on that latent context, an off-policy RL algorithm learns to take actions. In this framework, the probabilistic latent context serves as the belief state of the current task. By conditioning the RL algorithm on the latent context, we expect the RL algorithm to learn to distinguish different tasks. Moreover, this disentangles task inference from action selection, which, as we will see later, makes an off-policy algorithm applicable to meta-learning.
The rest of the article is divided into three parts. First, we introduce the inference architecture, the cornerstone of PEARL. Based on that, we argue for the effectiveness of off-policy learning in PEARL and briefly discuss the specific off-policy method adopted by Rakelly & Zhou et al. Finally, we combine these two components to form the final algorithm, PEARL.
The inference network captures knowledge about how the current task should be performed in a latent probabilistic context variable Z, on which we condition the policy as 𝜋(a|s, z) in order to adapt its behavior to the task. In this section, we focus on how the inference network leverages data from a variety of training tasks to learn to infer the value of Z from a recent history of experience in the new task.
For a specific task, we sample a batch of recently collected transitions and encode each transition cₙ through a network 𝜙 to distill a probabilistic latent context 𝛹_𝜙(z|cₙ), typically a Gaussian factor. Then we compute the product of all these Gaussian factors to form the posterior over the latent context variable:
q_𝜙(z|c_{1:N}) ∝ ∏_{n=1}^{N} 𝛹_𝜙(z|cₙ)
The following figure demonstrates this process
Notice that the transitions used here are randomly sampled from a set of recently collected transitions, which differ from the transitions we later use to train the off-policy algorithm. The authors also experiment with other architectures and sampling strategies, such as an RNN over sequential transitions, but none of them exhibits superior performance.
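The product of Gaussian factors described above has a closed form: the precisions (inverse variances) add up, and the posterior mean is the precision-weighted average of the factor means. Below is a minimal pure-Python sketch assuming diagonal Gaussians; the function name and the `min_var` clamp are illustrative assumptions, not taken from the article.

```python
from typing import List, Sequence, Tuple


def product_of_gaussians(
    mus: Sequence[Sequence[float]],
    sigmas_sq: Sequence[Sequence[float]],
    min_var: float = 1e-7,  # assumed numerical safeguard against zero variance
) -> Tuple[List[float], List[float]]:
    """Combine N diagonal Gaussian factors into a single diagonal Gaussian.

    mus[n][d] and sigmas_sq[n][d] are the mean and variance that the encoder
    produced for transition n in latent dimension d.
    """
    if not mus or len(mus) != len(sigmas_sq):
        raise ValueError("need the same, non-zero number of means and variances")
    dim = len(mus[0])
    mean: List[float] = []
    var: List[float] = []
    for d in range(dim):
        # Precision of each factor in this dimension.
        precisions = [1.0 / max(s[d], min_var) for s in sigmas_sq]
        total_precision = sum(precisions)
        # Precisions add; the mean is the precision-weighted average.
        var.append(1.0 / total_precision)
        mean.append(sum(p * m[d] for p, m in zip(precisions, mus)) / total_precision)
    return mean, var
```

For instance, combining the two one-dimensional factors 𝒩(0, 1) and 𝒩(2, 1) yields mean 1 and variance 0.5: each additional transition shrinks the variance, so the belief about the task sharpens as context accumulates.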
We optimize the inference network q_𝜙(z|c) through the variational lower bound:
𝔼_𝒯 [ 𝔼_{z∼q_𝜙(z|c^𝒯)} [ R(𝒯, z) + 𝛽 D_KL( q_𝜙(z|c^𝒯) ‖ 𝒩(0, I) ) ] ]
where 𝒯 is a task drawn from the task distribution, c^𝒯 is its context, R is the objective of some downstream task, 𝛽 weights the KL term, and 𝒩(0, I) is a unit Gaussian prior. This objective can be derived in the same way as that of the 𝛽-variational autoencoder if we take R as the reconstruction loss. Rakelly & Zhou et al. found empirically that training the encoder to recover the state-action value function (with the Q-function) outperforms optimizing it to maximize actor returns (with the policy) or to reconstruct states and rewards (with a VAE structure).
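For a diagonal Gaussian posterior, the KL term against the unit Gaussian prior has a closed form, 0.5 · Σ_d (σ_d² + μ_d² − 1 − ln σ_d²). The sketch below computes it and adds it to a downstream loss. The function names and the default `beta` are hypothetical choices for illustration, and `task_loss` stands in for R (for example, the Q-function's Bellman error).

```python
import math
from typing import Sequence


def kl_to_unit_gaussian(mean: Sequence[float], var: Sequence[float]) -> float:
    """KL( N(mean, diag(var)) || N(0, I) ) for a diagonal Gaussian."""
    return 0.5 * sum(v + m * m - 1.0 - math.log(v) for m, v in zip(mean, var))


def encoder_objective(
    task_loss: float,
    mean: Sequence[float],
    var: Sequence[float],
    beta: float = 0.1,  # hypothetical KL weight; tune per problem
) -> float:
    """Downstream loss R plus the beta-weighted KL regularizer."""
    return task_loss + beta * kl_to_unit_gaussian(mean, var)
```

The KL term is zero exactly when the posterior equals the prior and grows as the posterior moves away from it, so 𝛽 controls how much task information the latent context is allowed to carry.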
Why Not Use A Deterministic Context?
The advantage of a probabilistic context is that it can model the belief state of the task, which is crucial for the downstream off-policy algorithm to achieve deep exploration. Deep exploration is particularly important in sparse-reward settings, in which a consistent exploration strategy is more efficient than random exploration. We refer the interested reader to Section 5 of Osband et al. 2016 for an illustrative example. The following figure compares these two contexts on a 2D navigation problem with sparse reward.
Combine Off-Policy RL with Meta-Learning
Modern meta-learning algorithms primarily rely on the assumption that the distribution of data used for adaptation will match across meta-training and meta-test. In RL, this implies that on-policy data should be used during meta-training, since on-policy data will be used for adaptation at meta-test time. PEARL removes this constraint by offloading the burden of task inference from the RL method onto the inference network. By doing so, PEARL no longer needs to fine-tune the RL method at meta-test time and can apply an off-policy method during meta-training. In fact, the only modification to the off-policy RL method is to condition each network on z; everything else is left unchanged.
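One common way to condition a network on z is to concatenate z to the network's input. The sketch below illustrates this with a toy linear function standing in for a policy or value network; `condition_on_latent` and `make_linear` are hypothetical helpers for illustration, not part of any library or of the official code.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]
Network = Callable[[List[float]], List[float]]


def condition_on_latent(network: Network) -> Callable[[Vector, Vector], List[float]]:
    """Wrap a network so that it receives the state concatenated with z."""
    def conditioned(state: Vector, z: Vector) -> List[float]:
        return network(list(state) + list(z))
    return conditioned


def make_linear(weights: Sequence[Sequence[float]]) -> Network:
    """Toy stand-in for a neural network: one linear layer without bias."""
    def forward(x: List[float]) -> List[float]:
        for row in weights:
            if len(row) != len(x):
                raise ValueError("input size does not match weight matrix")
        return [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return forward


# A "policy" over a 2-D state and a 1-D latent context.
policy = condition_on_latent(make_linear([[1.0, 0.0, 2.0]]))
action = policy([3.0, 4.0], [5.0])
```

With the same state, a different z produces a different action, which is how a single set of weights can represent distinct behaviors for distinct tasks.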
The official implementation of PEARL adopts Soft Actor-Critic (SAC), since SAC exhibits good sample efficiency and stability and has a probabilistic interpretation that integrates well with probabilistic latent contexts. Long story short, SAC consists of five networks: two state-value functions V and \bar V (\bar V is the target network of V), two action-value functions Q₁ and Q₂, and a policy function 𝜋. It optimizes these functions through the following loss functions:
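As a sketch, the z-conditioned SAC losses can be written in the following form. The symbols \mathcal{B} (replay buffer), \gamma (discount factor), and Z(s) (normalizing partition function) are notation assumed here, and constant factors and the entropy temperature are omitted, so the exact form in the paper and in the official code may differ.

```latex
\begin{align}
\mathcal{L}_V &= \mathbb{E}_{s \sim \mathcal{B},\, a \sim \pi,\, z \sim q_\phi(z|c)}
  \Big[ \big( V(s, \bar z) - \big( Q(s, a, \bar z) - \log \pi(a|s, \bar z) \big) \big)^2 \Big] \\
\mathcal{L}_{Q_i} &= \mathbb{E}_{(s, a, r, s') \sim \mathcal{B},\, z \sim q_\phi(z|c)}
  \Big[ \big( Q_i(s, a, z) - \big( r + \gamma\, \bar V(s', \bar z) \big) \big)^2 \Big],
  \quad i = 1, 2 \\
\mathcal{L}_\pi &= \mathbb{E}_{s \sim \mathcal{B},\, a \sim \pi,\, z \sim q_\phi(z|c)}
  \bigg[ D_{\mathrm{KL}}\bigg( \pi(a|s, \bar z) \,\bigg\|\, \frac{\exp Q(s, a, \bar z)}{Z(s)} \bigg) \bigg]
\end{align}
```

Only the Q-function loss takes z without the bar, so it is the only term whose gradients reach the inference network.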
where Q = min(Q₁, Q₂) and \bar z indicates that gradients are not computed through it. We refer the interested reader to my personal blog for more details about SAC.
Now that we have already introduced all the essential components, it is time to put them together and present the whole algorithm.
Several things are worth noting:
- The context c is a tuple (s, a, r); it may also include s’ for task distributions in which the dynamics change across tasks.
- There is an implicit for-loop wrapping Lines 6 & 7 such that z is resampled after each trajectory. The same goes for Lines 8 & 9. Also, notice that in many tasks we do not add data collected at Line 9 to the context buffer (num_steps_posterior is zero in most configurations); this suggests that the context c in Line 12 is collected by a policy conditioned on z sampled from the prior distribution. Rakelly & Zhou et al. found that this setting worked better for these shaped-reward environments, in which exploration does not seem to be crucial for identifying and solving the task.
- The inference network q_𝜙(z|c) is trained using gradients from the Bellman update of the Q-network as we stated before.
Unlike previous methods, PEARL does not fine-tune any network at meta-test; it relies on the generalizability of the inference network to adapt to new tasks.
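To make the meta-test procedure concrete, here is a minimal sketch of adaptation by posterior sampling: sample z from the current belief, collect a trajectory with the z-conditioned policy, add it to the context, and recompute the posterior, with no gradient step anywhere. The callables `encode` and `rollout` are hypothetical stand-ins for the trained inference network and for running the policy in the environment; this illustrates the idea and is not the official implementation.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Gaussian = Tuple[List[float], List[float]]  # (mean, variance) per latent dimension


def product_of_gaussians(factors: Sequence[Gaussian]) -> Gaussian:
    """Multiply diagonal Gaussian factors: precisions add, means are precision-weighted."""
    dim = len(factors[0][0])
    mean: List[float] = []
    var: List[float] = []
    for d in range(dim):
        precisions = [1.0 / f[1][d] for f in factors]
        total = sum(precisions)
        var.append(1.0 / total)
        mean.append(sum(p * f[0][d] for p, f in zip(precisions, factors)) / total)
    return mean, var


def adapt_at_meta_test(
    encode: Callable[[object], Gaussian],        # hypothetical: transition -> Gaussian factor
    rollout: Callable[[List[float]], list],      # hypothetical: z -> list of transitions
    latent_dim: int,
    num_trajectories: int,
    rng: random.Random,
) -> Gaussian:
    """Refine the belief over z from collected context, without any fine-tuning."""
    mean, var = [0.0] * latent_dim, [1.0] * latent_dim  # start from the unit Gaussian prior
    context: list = []
    for _ in range(num_trajectories):
        # Resample z once per trajectory so that exploration stays consistent within it.
        z = [rng.gauss(m, math.sqrt(v)) for m, v in zip(mean, var)]
        context.extend(rollout(z))
        if context:
            mean, var = product_of_gaussians([encode(c) for c in context])
    return mean, var
```

Because the belief is updated only through the context, all task-specific adaptation happens in the inference network while the policy and value networks stay fixed.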
The above figure demonstrates the task performance of different approaches on six continuous control environments. These locomotion task families require adaptation across reward functions (walking direction for Half-Cheetah-Fwd-Back, Ant-Fwd-Back, and Humanoid-Direc-2D; target velocity for Half-Cheetah-Vel; and goal location for Ant-Goal-2D) or across dynamics (random system parameters for Walker-2D-Params). We can see that PEARL outperforms prior algorithms in sample efficiency by 20–100X, as well as in asymptotic performance, on these tasks.
- Kate Rakelly, Aurick Zhou, Deirdre Quillen, Chelsea Finn, and Sergey Levine. Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables
- Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep Exploration via Bootstrapped DQN
- Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
- code: https://github.com/katerakelly/oyster