
# Model-Based Meta Reinforcement Learning

## Dive into a model-based meta-RL algorithm that enables fast adaptation

# Introduction

Much ink was spilled on model-free meta-RL in the previous article. In this article, we present a model-based meta-RL framework, proposed by Nagabandi & Clavera et al., that can adapt quickly to changes in environment dynamics. Long story short, this method learns a dynamics model that can rapidly adapt to environment changes and uses that model to perform model predictive control (MPC) to select actions. It is worth noting that the adaptive nature of the learned dynamics model is important not only for meta-learning but also for model-based RL in general, since it alleviates the requirement of a globally accurate model, which plays an important role in model-based RL.

# Environment Assumption

We define a distribution over environments ε as ρ(ε). We forgo the episodic framework adopted by MAML-RL, where tasks are pre-defined as different rewards or environments and exist only at the trajectory level. Instead, we consider each timestep to potentially be a new "task": any detail or setting could change at any timestep. For example, a real legged millirobot may unexpectedly lose a leg while moving forward, as the following figure shows.

We further assume that the environment is locally consistent, in that every trajectory segment of length j − i comes from the same environment. Even though this assumption is not always correct, it allows us to learn to adapt from data without knowing when the environment has changed. Because adaptation is fast (less than a second), this assumption is seldom violated.

# Problem Setting

We formulate the adaptation problem as optimizing for the parameters θ of the dynamics model and the parameters ψ of the learning procedure as follows:

$$
\min_{\theta,\,\psi}\; \mathbb{E}_{\tau_\varepsilon(t-M,\,t+K)\sim\mathcal{D}}\!\left[\,L\big(\tau_\varepsilon(t,\,t+K),\,\theta'_\varepsilon\big)\right] \quad\text{s.t.}\quad \theta'_\varepsilon = u_\psi\big(\tau_\varepsilon(t-M,\,t-1),\,\theta\big) \tag{1}
$$

where τ_ε(t−M, t+K) corresponds to trajectory segments sampled from previous experience, u_ψ is the adaptation process that we will define later, and L denotes the dynamics loss function, which is the mean squared error of the changes in state (see the official implementation):

$$
L\big(\tau_\varepsilon(t,\,t+K),\,\theta'\big) = \frac{1}{K}\sum_{k=t}^{t+K-1}\left\lVert\,(s_{k+1}-s_k) - \hat{f}_{\theta'}(s_k, a_k)\,\right\rVert_2^2
$$
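As a concrete illustration, the state-change MSE can be sketched in a few lines of NumPy. This is a minimal sketch, not the paper's code: the model interface `f(params, s, a)` and the toy `linear_model` are my own assumptions.

```python
import numpy as np

def dynamics_loss(f, params, states, actions):
    """Mean squared error between predicted and actual state changes.

    states:  (K+1, state_dim)  -- s_t ... s_{t+K}
    actions: (K, action_dim)   -- a_t ... a_{t+K-1}
    f(params, s, a) predicts the change s' - s (placeholder interface).
    """
    deltas_true = states[1:] - states[:-1]           # actual changes in state
    deltas_pred = f(params, states[:-1], actions)    # predicted changes
    return np.mean(np.sum((deltas_true - deltas_pred) ** 2, axis=-1))

# Toy linear model for illustration: predicted delta = s @ W + a @ B.
def linear_model(params, s, a):
    W, B = params
    return s @ W + a @ B
```

Note that regressing state *changes* rather than next states keeps the targets small and roughly zero-centered, which the text above relies on.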

Intuitively, by optimizing Eq.(1), we expect the agent to do well over the next K steps (i.e., L(τ_ε(t, t+K)) is minimized) after adapting the model using the past M transitions. It is essential to understand what Eq.(1) is doing in order to follow this algorithm. If you are still uncomfortable with it, please read the MAML part of this article before moving on.

Also, notice that in Eq.(1) we put all data into a single dataset D instead of maintaining one dataset per task, since the goal here is fast adaptation rather than trajectory-level meta-learning.
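A sketch of how a contiguous window of M past plus K future transitions might be drawn from that single dataset (the buffer layout and names here are my own assumptions for illustration):

```python
import numpy as np

def sample_segment(buffer, M, K, rng):
    """Sample one contiguous window of M past + K future transitions
    from a single flat dataset D (no per-task separation).

    buffer: dict of arrays whose leading axis is time.
    Returns the adaptation slice (past M steps) and the
    evaluation slice (next K steps).
    """
    T = buffer['actions'].shape[0]
    t = rng.integers(M, T - K + 1)                 # window [t-M, t+K) must fit
    past = {k: v[t - M:t] for k, v in buffer.items()}
    future = {k: v[t:t + K] for k, v in buffer.items()}
    return past, future
```

Sampling contiguous windows is what makes the local-consistency assumption above usable: each window is treated as if it came from a single environment.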

# Model-Based Meta-Reinforcement Learning

Nagabandi & Clavera et al. introduce two approaches to solve Eq.(1): one based on gradient-based meta-learning, the other on recurrent models. Both share the same framework and differ only in network architecture and optimization procedure. In fact, since they emphasize orthogonal parts of the framework, they could be combined into a more powerful method.

## Gradient-Based Adaptive Learner

Gradient-Based Adaptive Learner (GrBAL) uses gradient-based meta-learning to perform online adaptation; the update rule is prescribed by gradient descent:

$$
\theta'_\varepsilon = u_\psi\big(\tau_\varepsilon(t-M,\,t-1),\,\theta\big) = \theta - \psi \odot \nabla_\theta L\big(\tau_\varepsilon(t-M,\,t-1),\,\theta\big)
$$

Here the learnable parameter ψ denotes the per-parameter step sizes used at adaptation time. This method bears much resemblance to MAML, with the improvement that it learns the step size, as Li et al. [4] suggested.
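The GrBAL-style inner update can be sketched as a single gradient step with learned per-parameter step sizes. This is a minimal sketch: the quadratic loss and its gradient are illustrative stand-ins, not the dynamics loss itself.

```python
import numpy as np

def grbal_adapt(theta, psi, loss_grad, past_segment):
    """One adaptation step: gradient descent on the past-M-step loss,
    with learned per-parameter step sizes psi (same shape as theta)."""
    return theta - psi * loss_grad(theta, past_segment)

# Illustrative loss L(theta) = 0.5 * ||theta - target||^2, whose gradient
# is (theta - target); 'past_segment' here just supplies the target.
def quad_grad(theta, target):
    return theta - target
```

Because ψ is itself meta-learned (the outer loop of Eq.(1) backpropagates through this update), each parameter can end up with its own effective learning rate.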

## Recurrence-Based Adaptive Learner

Recurrence-Based Adaptive Learner (ReBAL) utilizes a recurrent model that learns its own update rule through its internal structure. In this case, ψ and u_ψ correspond to the weights of the recurrent model that update its hidden state. For more information, please refer to RL².
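The idea can be sketched as follows: adaptation happens entirely in the hidden state, while the weights (playing the role of ψ) stay fixed at adaptation time. A plain tanh RNN cell stands in for whatever recurrent architecture is actually used; all names and shapes are assumptions.

```python
import numpy as np

def rebal_step(h, x, params):
    """One recurrent update: the fixed weights define the update rule;
    adaptation lives entirely in the hidden state h."""
    Wh, Wx, b = params
    return np.tanh(h @ Wh + x @ Wx + b)

def rebal_adapt(h0, transitions, params):
    """Feed the past M transitions through the RNN; the final hidden
    state is the 'adapted' context conditioning the dynamics model."""
    h = h0
    for x in transitions:
        h = rebal_step(h, x, params)
    return h
```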

# Algorithm

If you're already familiar with model-agnostic meta-learning (MAML) and model predictive control (MPC), there is nothing new here. We learn an adaptive dynamics model through Algorithm 1 (lines 8–14). Then, at each environment step, the agent first adapts the model (line 3 in Algorithm 2) and then performs MPC (line 4 in Algorithm 2) to select actions, as shown in Algorithm 2.
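The MPC step can be sketched with the simplest planner, random shooting: sample candidate action sequences, roll each out through the adapted dynamics model, and execute only the first action of the best sequence. The reward function and model interface below are illustrative assumptions.

```python
import numpy as np

def mpc_random_shooting(model, state, horizon, n_candidates, action_dim,
                        reward_fn, rng):
    """Random-shooting MPC with an (adapted) dynamics model.

    model(s, a) predicts the state change; reward_fn(s, a) scores a step.
    Returns the first action of the highest-return candidate sequence.
    """
    seqs = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon, action_dim))
    returns = np.zeros(n_candidates)
    for i in range(n_candidates):
        s = state
        for a in seqs[i]:
            s = s + model(s, a)          # roll out with predicted deltas
            returns[i] += reward_fn(s, a)
    return seqs[np.argmax(returns)][0]   # execute only the first action
```

Replanning at every step is what lets a merely locally accurate, freshly adapted model still produce reasonable behavior.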

# Experimental Results

The following video demonstrates the effectiveness of the algorithm.

# References

- A. Nagabandi, I. Clavera, S. Liu, R. S. Fearing, P. Abbeel, S. Levine, and C. Finn, "Learning to Adapt in Dynamic, Real-World Environments through Meta-Reinforcement Learning," *ICLR*, pp. 1–17, 2019.
- C. Finn, P. Abbeel, and S. Levine, "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks," *34th Int. Conf. Mach. Learn. ICML 2017*, vol. 3, pp. 1856–1868, 2017.
- Y. Duan, J. Schulman, X. Chen, P. L. Bartlett, I. Sutskever, and P. Abbeel, "RL²: Fast Reinforcement Learning via Slow Reinforcement Learning," *ICLR*, pp. 1–14, 2017.
- Z. Li, F. Zhou, F. Chen, and H. Li, "Meta-SGD: Learning to Learn Quickly for Few-Shot Learning," 2017.