Model-Based Meta Reinforcement Learning
Dive into a model-based meta-RL algorithm that enables fast adaptation
Much ink has been spilled on model-free meta-RL in the previous article. In this article, we present a model-based meta-RL framework, proposed by Nagabandi & Clavera et al., that can adapt quickly to changes in environment dynamics. Long story short, this method learns a dynamics model that can rapidly adapt to environment changes, and uses this model to perform model predictive control (MPC) to select actions. It is worth noting that the adaptive nature of the learned dynamics model is important not only for meta-learning but also for model-based RL in general, since it alleviates the requirement of a globally accurate model, which is usually hard to satisfy in model-based RL.
We define a distribution over environments ɛ as 𝜌(ɛ). We forgo the episodic framework adopted by MAML-RL, where tasks are pre-defined as different reward functions or environments and exist only at the trajectory level. Instead, we consider each timestep to potentially be a new “task”: any detail or setting of the environment could change at any timestep. For example, a real legged millirobot may unexpectedly lose a leg while moving forward, as the following figure shows
We further assume that the environment is locally consistent, in that every trajectory segment of length j−i comes from the same environment. Even though this assumption is not always correct, it allows us to learn to adapt from data without knowing when the environment has changed. Since adaptation happens quickly (within less than a second), this assumption is seldom violated.
We formulate the adaptation problem as optimizing for the parameters of the learning procedure 𝜃, 𝜓 as follows:

min_{𝜃, 𝜓} 𝔼_{𝜏_ɛ(t−M, t+K) ∼ 𝐷} [ L(𝜏_ɛ(t, t+K), 𝜃'_ɛ) ],  where 𝜃'_ɛ = u_𝜓(𝜏_ɛ(t−M, t−1), 𝜃)   (1)
where 𝜏_ɛ(t−M, t+K) corresponds to trajectory segments sampled from previous experience, u_𝜓 is the adaptation process that we will define later, and L denotes the dynamics loss function, which is the mean squared error of changes in state (see the official implementation here):

L(𝜏_ɛ(t, t+K), 𝜃') = (1/K) Σ_{k=t}^{t+K−1} ||(s_{k+1} − s_k) − f̂_{𝜃'}(s_k, a_k)||²
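As a concrete (and heavily simplified) sketch of this loss, here the model predicts changes in state from state–action pairs. The linear model and all names are illustrative assumptions, not the official implementation, which uses a neural network:

```python
import numpy as np

def dynamics_loss(theta, states, actions, next_states):
    """Mean squared error on *changes* in state over a segment.

    The model predicts s' - s from (s, a). A linear model (a single
    weight matrix `theta`) stands in for the neural network used in
    the paper; everything here is illustrative.
    """
    inputs = np.concatenate([states, actions], axis=1)   # (K, ds + da)
    pred_delta = inputs @ theta                          # f_theta(s, a)
    true_delta = next_states - states                    # ground-truth s' - s
    return float(np.mean((pred_delta - true_delta) ** 2))
```

Predicting the change in state rather than the next state itself is a common trick in model-based RL: the deltas are typically small and better conditioned for regression.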
Intuitively, by optimizing Eq.(1), we expect the agent to do well in the next K steps (i.e., L(𝜏_ɛ(t, t+K)) is minimized) after the agent adapts the model according to the past M transitions. It is essential to understand what Eq.(1) is doing in order to understand this algorithm. If it still feels unfamiliar, please read the MAML part of this article before moving on.
Also, notice that in Eq.(1) we put all data into a single dataset 𝐷 instead of maintaining one dataset per task, since we aim for fast online adaptation rather than trajectory-level meta-learning.
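To make the single-dataset point concrete, here is a minimal sketch of how one might sample a segment 𝜏_ɛ(t−M, t+K) from one rollout stored in a flat replay buffer. The function and its names are assumptions for illustration, not the paper's code:

```python
import numpy as np

def sample_segment(states, actions, M, K, rng=None):
    """Sample one segment tau(t-M, t+K) from a single rollout.

    Returns the M past transitions (used to adapt the model) and the
    K future transitions (used to evaluate the adapted model). All
    rollouts live in one dataset; no per-task bookkeeping is needed.
    """
    rng = rng or np.random.default_rng()
    T = len(actions)                       # number of transitions
    t = int(rng.integers(M, T - K + 1))    # segment must fit in the rollout
    past = (states[t - M:t], actions[t - M:t], states[t - M + 1:t + 1])
    future = (states[t:t + K], actions[t:t + K], states[t + 1:t + K + 1])
    return past, future
```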
Model-Based Meta-Reinforcement Learning
Nagabandi & Clavera et al. introduce two approaches to solve Eq.(1): one based on gradient-based meta-learning, the other on recurrent models. Both share the same framework and differ only in network architecture and optimization procedure. In fact, since they emphasize orthogonal parts of the framework, they could in principle be combined into a more powerful method.
Gradient-Based Adaptive Learner
Gradient-Based Adaptive Learner (GrBAL) uses gradient-based meta-learning to perform online adaptation; the update rule is prescribed by gradient descent:

𝜃'_ɛ = u_𝜓(𝜏_ɛ(t−M, t−1), 𝜃) = 𝜃 − 𝜓 ∇_𝜃 L(𝜏_ɛ(t−M, t−1), 𝜃)   (2)
Here the learnable parameter 𝜓 denotes the step sizes used at adaptation time. This method bears much resemblance to MAML, with the improvement that it learns the step size, as Li et al. suggested in Meta-SGD.
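A minimal sketch of this inner update, assuming a linear dynamics model that predicts state changes so the gradient can be written analytically (the paper uses a neural network and automatic differentiation; everything named here is illustrative):

```python
import numpy as np

def grbal_adapt(theta, psi, states, actions, next_states):
    """One GrBAL-style inner update: theta' = theta - psi * grad L.

    The toy model is f(s, a) = [s, a] @ theta, predicting s' - s, so the
    gradient of the squared-error loss L = (1/M) * sum ||[s, a] @ theta
    - (s' - s)||^2 is analytic. `psi` is the learned step size (a scalar
    here; a per-parameter array, as in Meta-SGD, also works).
    """
    X = np.concatenate([states, actions], axis=1)   # (M, ds + da)
    delta = next_states - states                    # targets: s' - s
    err = X @ theta - delta                         # residuals
    grad = 2.0 * X.T @ err / len(X)                 # dL/dtheta
    return theta - psi * grad
```

During meta-training, 𝜃 and 𝜓 are both updated by backpropagating the post-adaptation loss on the K future steps through this inner step, exactly as in MAML.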
Recurrence-Based Adaptive Learner
Recurrence-Based Adaptive Learner(ReBAL) utilizes a recurrent model, which learns its own update rule through its internal structure. In this case, 𝜓 and u_𝜓 correspond to the weights of the recurrent model that update its hidden state. For more information, please refer to RL².
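To see what "the update rule is the recurrence itself" means, here is a sketch with a vanilla RNN cell standing in for whatever architecture the official implementation uses; the weights and names are illustrative assumptions:

```python
import numpy as np

def rebal_update(h, transition, Wh, Wx, b):
    """One ReBAL-style adaptation step (illustrative vanilla RNN cell).

    The recurrent weights (Wh, Wx, b) collectively play the role of psi:
    feeding the latest transition (s, a, s') through the cell updates the
    hidden state h, and that hidden state conditions the dynamics
    prediction. No explicit gradient step is taken at adaptation time.
    """
    return np.tanh(h @ Wh + transition @ Wx + b)
```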
If you’re already familiar with model-agnostic meta-learning (MAML) and model predictive control (MPC), there is nothing new here. We learn an adaptive dynamics model through Algorithm 1 (lines 8–14); then, at each environment step, the agent first adapts the model (line 3 of Algorithm 2) and then performs MPC (line 4 of Algorithm 2) to select actions.
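The MPC step can be sketched with simple random shooting, which is what model-based RL methods in this family commonly use. The reward and all names here are illustrative stand-ins (e.g., forward progress along the first state dimension), not the paper's task rewards:

```python
import numpy as np

def mpc_action(model, state, horizon=5, n_candidates=64, action_dim=2, rng=None):
    """Random-shooting MPC with an (adapted) dynamics model.

    Samples candidate action sequences, rolls each out through the model
    (which predicts state changes), scores them with a toy reward, and
    returns the first action of the best sequence. The controller then
    replans from scratch at the next environment step.
    """
    rng = rng or np.random.default_rng(0)
    seqs = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon, action_dim))
    returns = np.zeros(n_candidates)
    for i, seq in enumerate(seqs):
        s = state.copy()
        for a in seq:
            s = s + model(s, a)        # model predicts s' - s
            returns[i] += s[0]         # toy reward: x-position progress
    return seqs[np.argmax(returns), 0]
```

Because only the first action of the best sequence is executed before replanning, MPC naturally tolerates model error, which complements the fast adaptation of the model itself.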
The following video demonstrates the effectiveness of the algorithm:
References
- A. Nagabandi, I. Clavera, S. Liu, R. S. Fearing, P. Abbeel, S. Levine, and C. Finn, “Learning to Adapt in Dynamic, Real-World Environments through Meta-Reinforcement Learning,” ICLR, pp. 1–17, 2019.
- C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” 34th Int. Conf. Mach. Learn. ICML 2017, vol. 3, pp. 1856–1868, 2017.
- Y. Duan, J. Schulman, X. Chen, P. L. Bartlett, I. Sutskever, and P. Abbeel, “RL²: Fast Reinforcement Learning via Slow Reinforcement Learning,” ICLR, pp. 1–14, 2017.
- Z. Li, F. Zhou, F. Chen, and H. Li, “Meta-SGD: Learning to Learn Quickly for Few-Shot Learning,” 2017.