An overview of Meta-Q-Learning

Machine Intelligence and Deep Learning
Apr 27, 2022

This report is jointly written by Apan Dastider and Muhammad Rashedul Haq Rashed

YouTube presentation link : https://www.youtube.com/watch?v=9LTGz0uQ0lM&t=6s

A hypothetical conversation between our RL agent and an RL scientist, John

“RL Agent: Hey John, let’s do some bookkeeping of my old mistakes while learning a task and plan ahead for more optimal actions. Just reward me for the best actions and I will gradually develop my learning ability.
John: So, you are talking about Reinforcement Learning, right? We can implement it and build a robust and autonomous action-planning mind for you, so that you do not need to depend on anyone.
RL Agent: But what if the task changes a little, or I face something new while working? Do I need to learn again from scratch? Can’t I just exploit my prior learning and jump-start from it for something unknown? It would save me some time and resources.
John: Do not worry, Agent. I have meta-RL for you to handle that. You will be a super-AI with a new technique called MQL.
RL Agent: MQL? That would be great!”

We started with this hypothetical conversation between an AI scientist, John, and an RL agent because it captures the current need for intelligent, autonomous systems that can not only perform well on tasks seen during learning, but also adapt efficiently to new tasks by exploiting previous experience. In recent years, Reinforcement Learning (RL) and its neural-network-based variant, Deep Reinforcement Learning (DRL), have brought significant progress in solving problems ranging from video games to real-world robotics tasks. RL can be simply described as a process of incrementally developing a model of the system by learning from penalized mistakes as well as rewarded good actions. This is quite similar to the way our hypothetical RL agent wishes to become autonomous.

In recent years, DRL has produced excellent results in the field of robotics. Several robotics domains, such as manipulation and complex locomotion tasks, have been solved through DRL. Unfortunately, DRL suffers severely from a lack of generalization to newer tasks. In short, the learned rules or action-mapping policies for one task cannot easily be reused for an unforeseen task domain, or even for slightly perturbed system dynamics. Remember, this was the fear of our hypothetical RL agent when it first heard about RL from John. Moreover, real-world robotic environments are quite complex and sensitive to system parameters, and it is nearly impossible to replicate all possible environment configurations while learning the rules for a specific task or domain of tasks. Keeping all these facts in mind, John planned to implement a meta-reinforcement-learning (meta-RL) method for the RL agent. Adaptation in meta-RL happens with limited exposure to new system dynamics, by adjusting previously learned models in accordance with the new dynamics.

Before diving into more details about meta-RL algorithms, let’s give an overview of RL. In the RL framework, the learning agent operates in an environment and receives the current situation of its surroundings, often termed the ‘observation’ or ‘state’. This state vector contains all necessary information about the workspace. Based on this observation, the agent takes a certain ‘action’ suggested by an action-mapping policy π(s) → a ∈ A. The central goal of the RL controller is to develop this policy, or set of strategic rules, so that the agent can achieve pre-defined goals in its environment. Based on the sampled action at the current state, the agent either receives negative reinforcement (a penalty) if it takes a bad action or positive feedback (a reward) if it executes a good action. This feedback mechanism guides the optimization procedure for learning an optimal policy that maximizes reward in the long run.

Functional Block Diagram of Reinforcement Learning
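To make this loop concrete, here is a minimal sketch of the agent–environment interaction using the Gymnasium API, with a random policy standing in for π(s) → a. The environment name and the random action choice are only for illustration, not part of the original post.

```python
# Minimal sketch of the RL interaction loop: observe state, act, receive reward.
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)           # initial observation (state vector)

total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()  # a real agent would query its policy pi(s) here
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward              # feedback used to improve the policy
    done = terminated or truncated

print("episode return:", total_reward)
env.close()
```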

On-policy and Off-policy RL: Following [1], an on-policy method optimizes the same policy that is used for making decisions. For example, SARSA is an on-policy RL method. On the contrary, off-policy learning improves a target policy that is different from the behavior policy. The behavior policy is often used for exploration and for filling an experience buffer, which is later used to improve the target policy. Q-learning is an off-policy RL algorithm. In short: for on-policy RL, target policy = behavior policy; for off-policy RL, target policy != behavior policy.
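To make the distinction concrete, here is a sketch of the two tabular update rules, assuming the Q-values are stored in a NumPy array indexed by (state, action); the function and variable names are ours, not from [1].

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy: the bootstrap target uses the action actually taken next,
    # so the behavior policy and the target policy are the same.
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Off-policy: the bootstrap target uses the greedy action,
    # regardless of which action the behavior policy will take next.
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```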

Now we can concentrate on today’s discussion of meta reinforcement learning, and specifically on a recent method titled Meta-Q-Learning (MQL). This research paper was presented as a conference paper at the International Conference on Learning Representations (ICLR), 2020. Standard RL methods learn action-mapping policies for a specific task and often fail to generalize to new tasks. As a resolution to this issue, there has recently been a surge in meta reinforcement learning research. In simple words, meta-learning is introduced into RL frameworks in which training tasks and test tasks are different but are sampled from the same domain of tasks. Let’s look at a simple example of a robotic grasping task, shown below. Here, a Franka Emika Panda robotic manipulator learns to grasp geometrically and physically different types of objects. The objects may range from a bottle to an egg, which require sophisticated grasping techniques. In the test phase, however, we introduce objects such as a glass or painting tools, which have considerably different shapes than the training objects. Meta-RL empowers the agent to adapt to these slightly different, unseen objects by adjusting the model it learned during training.

Robotic problem solving through meta-RL

Our hypothetical RL agent would be happy with meta-RL, I think.

We will now discuss Meta-Q-Learning (MQL), a new off-policy algorithm for meta-RL, in more detail. MQL rests on three very simple ideas:
I. Off-policy Q-learning with a context-variable-based representation of past experience can match the performance of state-of-the-art meta-RL algorithms.
II. A simple multi-task objective, maximizing the average reward across all training tasks, is used to meta-train the action policies.
III. An importance-weighting technique, propensity score estimation, is used to efficiently recycle past trajectories stored in a replay buffer when the robotic agent adapts to a new task.

The authors empirically show that an off-policy algorithm like TD3 (Twin Delayed Deep Deterministic policy gradient), modified with a context representation of the trajectory, can perform very close to recent meta-RL algorithms such as MAML (Model-Agnostic Meta-Learning) and PEARL (Probabilistic Embeddings for Actor-critic RL). This result suggests that exhaustive meta-training of policies is not always necessary on existing benchmarks.

Average returns on validation tasks compared for two prototypical meta-RL algorithms, MAML and PEARL, with those of a vanilla Q-learning algorithm named TD3 (Fujimoto et al., 2018b) that was modified to incorporate a context variable that is a representation of the trajectory from a task (TD3-context)

Problem Formulation: Let’s consider a Markov decision process denoted by,
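x_{t+1} = f^k(x_t, u_t, \xi_t)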

where xₜ represents the state vector, uₜ is the action vector, and ξₜ is the uncontrolled noise in the system. Through the dynamics fᵏ, the system evolves to a new state xₜ₊₁ after the action is executed, and the agent receives a reward rᵏ(xₜ, u_θ(xₜ)). The action-value function qᵏ(x, u) measures how good an action u is in a state x. It is defined as
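q^k(x, u) = \mathbb{E}_{\xi}\Big[ \sum_{t \ge 0} \gamma^t \, r^k\big(x_t, u_\theta(x_t)\big) \;\Big|\; x_0 = x,\; u_0 = u \Big]

where γ ∈ (0, 1) is the discount factor (not named explicitly in the text above).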

Given a task k, the goal of RL is to find the parametric policy which maximizes the long-run reward from the environment, i.e.,
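\theta^k = \arg\max_{\theta} \; \mathbb{E}_{x}\big[ q^k(x, u_\theta(x)) \big]

(the expectation here is over the states the agent starts from in task k).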

In algorithms like Deep Deterministic Policy Gradient (DDPG), this objective is solved through a coupled optimization problem in which the value function is parameterized by φ and the policy is parameterized by θ,
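\theta^k = \arg\max_{\theta} \; \mathbb{E}_{x \sim D^k}\big[ q_\varphi(x, u_\theta(x)) \big],
\qquad
\varphi = \arg\min_{\varphi} \; \mathrm{TD}^2(\theta, \varphi)

where D^k denotes the replay buffer of transitions collected for task k.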

And the temporal-difference (TD) error is defined by
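\mathrm{TD}^2(\theta, \varphi) = \mathbb{E}_{(x, u, r, x') \sim D^k}\Big[ \big( q_\varphi(x, u) - r - \gamma\, q_\varphi(x', u_\theta(x')) \big)^2 \Big]

where x' is the state reached after taking action u in state x.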

In the discussion that follows, all attention is given to the policy parameters θ, while φ is optimized through TD learning. So far, we have only discussed handling a single task with RL. But we want to go beyond that, and meta-RL introduces the idea of handling a new task by training on a large number of training tasks. Given 𝒟ₘₑₜₐ = {Dᵏ}, k = 1, …, n training tasks, the learning objective of meta-RL is modified to
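\theta_{\text{meta}} = \arg\max_{\theta} \; \frac{1}{n} \sum_{k=1}^{n} \mathbb{E}_{\tau \sim D^k}\big[ \ell^{k}_{\text{meta}}(\theta) \big]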

where lᵏₘₑₜₐ is the meta-training loss, whose exact form depends on the method. So, we train the algorithm on the n training tasks and exploit this learning to adapt to a new task. In the MQL method, the authors present a new way of implementing meta-RL for solving various robotics tasks.

Meta-Q-Learning (MQL) :
1. Meta Training Phase : MQL performs meta-training using the multi-task objective. If we set the meta-training loss for each task as
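\ell^{k}_{\text{meta}}(\theta) = \mathbb{E}_{x \sim D^k}\big[ q_\varphi(x, u_\theta(x)) \big]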

then the parameters θₘₑₜₐ are trained to maximize the average return over all tasks in the meta-training set, rather than through the gradient-based inner-loop updates used in traditional meta-RL methods such as MAML. The authors use the off-policy algorithm TD3 to solve for
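\theta_{\text{meta}} = \arg\max_{\theta} \; \frac{1}{n} \sum_{k=1}^{n} \mathbb{E}_{x \sim D^k}\big[ q_\varphi(x, u_\theta(x)) \big]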

TD3 uses two action-value function approximators to reduce over-estimation bias, a technique often referred to as (clipped) “double Q-learning”.
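In simplified form (ignoring TD3’s target networks and target-policy smoothing), the TD target for both critics uses the smaller of the two critic estimates:

y = r + \gamma \, \min_{i \in \{1, 2\}} q_{\varphi_i}\big(x', u_\theta(x')\big)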
2. Designing Context Representation : Unlike traditional RL methodologies, here the policy u_θ(x) and the action-value function q_φ(x, u) also depend on one additional variable, z. This variable is termed the context variable and works as a hidden identifier of the task in meta-RL. The recurrent context variable zₜ depends on {(xᵢ, uᵢ, rᵢ)}_{i ≤ t}. In the implementation, zₜ is computed as the hidden state at time t of a Gated Recurrent Unit (GRU) model. So, based on the particular time-step and the task being performed at that time-step, the value of zₜ changes. zₜ is a latent representation of each trajectory; in short, it carries information about the task the agent is experiencing while solving the objective. This zₜ is later fed to a logistic classifier to calculate the probability that a new trajectory τ belongs to the new task rather than to the meta-training tasks.
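As a rough sketch of this idea in PyTorch (the module and dimension names are ours, not the authors’ implementation), the context can be produced by running a GRU over the concatenated (state, action, reward) tuples of a trajectory:

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Encodes a trajectory {(x_i, u_i, r_i)} into a context vector z_t."""

    def __init__(self, state_dim, action_dim, context_dim=32):
        super().__init__()
        # Input at each step: state, action, and scalar reward.
        self.gru = nn.GRU(state_dim + action_dim + 1, context_dim, batch_first=True)

    def forward(self, states, actions, rewards):
        # states:  (batch, T, state_dim)
        # actions: (batch, T, action_dim)
        # rewards: (batch, T, 1)
        inputs = torch.cat([states, actions, rewards], dim=-1)
        outputs, _ = self.gru(inputs)
        # z_t is the hidden state at the last time-step; the policy and
        # the Q-function receive it as an extra input alongside x_t.
        return outputs[:, -1, :]
```

The policy u_θ(x, z) and critic q_φ(x, u, z) then simply concatenate z with their usual inputs.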
3. Adaptation to a new task : We can now focus on how the agent adapts its meta-trained parameters θₘₑₜₐ to handle a new task Dⁿᵉʷ with only a few samples. MQL optimizes the adaptation objective in two phases.
I. Phase 1 :
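In the first phase, MQL maximizes the off-policy objective on the handful of new-task transitions while staying close to the meta-trained solution; written out (with λ denoting the penalty coefficient), the objective is roughly

\arg\max_{\theta} \; \Big\{ \mathbb{E}_{\tau \sim D^{\text{new}}}\big[ \ell^{\text{new}}(\theta) \big] \;-\; \lambda \, \lVert \theta - \theta_{\text{meta}} \rVert_2^2 \Big\}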

The quadratic penalty in the second term keeps the new parameters close to the previously learned parameters θₘₑₜₐ. This reduces the variance of the model while adapting with only a few samples.
II. Phase 2:
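In the second phase, transitions from the meta-training buffer are reused, reweighted by the propensity score β introduced below, while the proximal coefficient λ is now set adaptively from the ESS (also defined below); roughly,

\arg\max_{\theta} \; \Big\{ \mathbb{E}_{\tau \sim \mathcal{D}_{\text{meta}}}\big[ \beta(\tau; D^{\text{new}}, \mathcal{D}_{\text{meta}}) \, \ell^{\text{new}}(\theta) \big] \;-\; \lambda \, \lVert \theta - \theta_{\text{meta}} \rVert_2^2 \Big\}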

In phase 2, the meta-training replay buffer, which holds the trajectories of the training tasks, is exploited to get a guided direction for the new task. The meta-training tasks 𝒟ₘₑₜₐ are disjoint from Dⁿᵉʷ, but they are sampled from the same task distribution, as discussed earlier. So, through an importance-weighting technique between 𝒟ₘₑₜₐ and Dⁿᵉʷ, we can reweight the meta-training transitions and the update rules. The authors use the propensity score β(τ; Dⁿᵉʷ, 𝒟ₘₑₜₐ), where τ is sampled from the meta-training buffer. The question is how to calculate it. The hidden context variable associated with each trajectory serves as the feature vector for a logistic classifier, which is fitted on a mini-batch of transitions from the meta-training buffer and the new transitions (a sketch of this computation follows the ESS discussion below). In simple words, this propensity score measures the similarity between the new task and the meta-training replay buffer, and this similarity helps jump-start learning for the new task.
Additionally, the coefficient λ on the second term is tuned through the ESS:
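\lambda \;\propto\; 1 - \widehat{\mathrm{ESS}},
\qquad
\widehat{\mathrm{ESS}} = \frac{1}{m} \, \frac{\big( \sum_{i=1}^{m} \beta_i \big)^2}{\sum_{i=1}^{m} \beta_i^2}

where β₁, …, β_m are the propensity scores of a mini-batch of m meta-training transitions.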
This ESS term, the normalized Effective Sample Size, is defined as the relative number of samples from the target distribution p(x) required to obtain an estimator with performance (say, variance) equal to that of the importance-sampling estimator built from q(x). ESS is close to one when p(x) and q(x) are nearly identical, and close to zero when they are very different. This term relaxes the quadratic penalty when the new task is similar to the meta-training tasks.
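As a rough sketch of how β and the ESS might be computed (using scikit-learn’s logistic regression; the variable names are ours and this is not the authors’ code), one can fit a classifier that separates new-task contexts from meta-training contexts and convert its predicted probabilities into importance weights:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def propensity_weights(z_meta, z_new):
    """z_meta, z_new: arrays of GRU context features, shapes (m, d) and (n, d)."""
    X = np.vstack([z_meta, z_new])
    y = np.concatenate([np.zeros(len(z_meta)), np.ones(len(z_new))])  # 1 = new task
    clf = LogisticRegression(max_iter=1000).fit(X, y)

    # beta(tau) ~ p(new | z) / p(meta | z): the odds that a meta-training
    # transition "looks like" it came from the new task.
    p_new = clf.predict_proba(z_meta)[:, 1]
    beta = p_new / (1.0 - p_new + 1e-8)

    # Normalized effective sample size of the weighted meta-training batch.
    ess = (beta.sum() ** 2) / (len(beta) * (beta ** 2).sum() + 1e-8)
    return beta, ess
```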

Experiments and Results Analysis :
Six continuous-control meta-RL benchmark tasks, inspired by realistic robotic locomotion tasks, from [2]:

Pseudo-code 1 (in the paper) covers the meta-training phase, and Pseudo-code 2 covers adaptation to a new task with the meta-trained policy.

Average validation return comparison on 6 control benchmarks. The baselines are MAML [3], PEARL [2], RL2 [4], and ProMP [5]. In all environments except Walker-2D-Params and Ant-Goal-2D, MQL is better than or comparable to existing algorithms in terms of both sample complexity and final returns. Statistical comparisons use data from 5 random seeds.

Generalization ability on out-of-distribution tasks:

Fig. (a) shows that MQL significantly outperforms PEARL on the disjoint task set. Fig. (b) shows the evolution of λ and β(z). We see that β(z) is always small, which demonstrates that MQL automatically adjusts its adaptation to the new task.
(a) MQL vs. TD3-context comparison, (b) the effect of an adaptive β on performance, (c) the effect of ESS on performance

In the figure above, MQL is compared with TD3-context to show that MQL works better. The interesting part of the figure is the effect of β and λ on the learning performance of MQL. With an adaptive β, we see better performance compared to MQL with zero β. Since this β parameter is calculated over the hidden representation of the task variable zₜ, this hidden, recurrent context variable has a pronounced effect on MQL. Essentially, this representation helps to reweight the learning gradient so that it takes larger steps along an ascent direction in the solution space. We also observe that an adaptive λ produces better performance than a fixed, preset value of λ.

Conclusion :

This paper proposed a simple meta-RL approach based on three elementary but effective ideas:
I. Off-policy Q-learning incorporating a context variable is adequate to produce performance comparable to state-of-the-art meta-RL methods.
II. Simply maximizing the average reward over the training tasks is a promising meta-training objective that is more straightforward than existing gradient-based approaches.
III. Recycling data from the stored replay buffer through an importance-weighting technique (propensity score estimation) yields a sample-efficient adaptation algorithm.
•The paper also highlights the need to create new and more realistic robotics environments for further validation of meta-RL approaches.
•In the future, we plan to validate this work by implementing it in a real multi-agent collaborative robotic-manipulator research setup to address adaptive action planning in real time.
References:
[1] Sutton, R. S., Barto, A. G. (2018). Reinforcement Learning: An Introduction. The MIT Press.
[2] Kate Rakelly, Aurick Zhou, Deirdre Quillen, Chelsea Finn, and Sergey Levine. Efficient off-policy meta reinforcement learning via probabilistic context variables. arXiv:1903.08254, 2019
[3] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1126–1135. JMLR. org, 2017.
[4] Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. Rl2: Fast reinforcement learning via slow reinforcement learning. arXiv:1611.02779, 2016.
[5] Jonas Rothfuss, Dennis Lee, Ignasi Clavera, Tamim Asfour, and Pieter Abbeel. ProMP: Proximal meta-policy search. arXiv:1810.06784, 2018.
