Published in Geek Culture

Towards Meta RL

A machine learning model generally requires a large number of samples to train.

Humans, on the other hand, are able to learn skills and concepts far more efficiently. For example, a kid who has seen cats and dogs only a few times is able to quickly tell them apart.

Even after seeing only a single sample, you can recognize the same machine again, even when the new image is completely different pixel by pixel.

People who know how to play tennis can learn how to play ping pong fairly easily. Meta-learning puts forward the idea that machine learning models could be designed to have similar capabilities, such as learning from just a few training samples.

Meta-Learning addresses two key challenges with building machine-learning models today:

A good meta-learning model is capable of adapting or generalizing well to new tasks that it hasn’t seen before. This is why meta-learning is also known as learning to learn.

the meta-learning problem

An optimal meta-learning model is trained over a variety of learning tasks and optimized for the best performance across these tasks. Each task is associated with a dataset D, which contains feature vectors and true labels.

In this case, the optimal model parameters would be:
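A common way to write this objective, assuming a distribution p(D) over task datasets and a per-dataset loss L_θ(D) for model parameters θ (this notation is an assumption, not taken from the text), is:

```latex
\theta^{*} = \arg\min_{\theta} \; \mathbb{E}_{\mathcal{D} \sim p(\mathcal{D})} \left[ \mathcal{L}_{\theta}(\mathcal{D}) \right]
```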

It’s the same as a normal learning task except that one dataset is used as one data sample.

Meta-supervised learning:

To train such a learning procedure (f can be viewed as a learning procedure), you need data in the form of many different datasets. Each of these datasets contains a number of input-output pairs.

For each of these tasks, you assume you have at least k examples in D_train and at least one held-out example for checking whether your learner generalizes.

This view of meta-learning reduces the problem to the design and optimization of f. If we can design a good function that takes some data as input, and we can also optimize that function, then we have a learning procedure we can apply to new tasks.
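To make the idea of f as a learning procedure a bit more concrete, here is a minimal Python sketch. It assumes a hand-written nearest-centroid rule standing in for f; the name `learner_f` and the toy cat/dog features are made up for illustration, and nothing here is meta-learned yet. The only point is the interface: f takes a small training set plus a new input and returns a prediction.

```python
import math


def learner_f(d_train, x_query):
    """Predict a label for x_query using only the small dataset d_train.

    d_train is a list of (feature_vector, label) pairs. This hypothetical
    learner averages the examples of each class and returns the label of
    the closest class mean (a nearest-centroid rule).
    """
    sums, counts = {}, {}
    for x, y in d_train:
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + xi for s, xi in zip(sums[y], x)]
        counts[y] += 1
    centroids = {y: [s / counts[y] for s in sums[y]] for y in sums}
    return min(centroids, key=lambda y: math.dist(centroids[y], x_query))


# Toy 2-D features (illustrative values only).
d_train = [([0.0, 0.1], "cat"), ([0.2, 0.0], "cat"),
           ([1.0, 0.9], "dog"), ([0.9, 1.1], "dog")]
prediction = learner_f(d_train, [0.8, 1.0])
print(prediction)
```

In meta-learning, the hand-written rule inside `learner_f` would be replaced by something with trainable parameters that is optimized across many such datasets.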

A popular example of this is few-shot classification. Here you’re given 1 example of each of 5 classes, which acts as your training data D_train, and the goal is to classify new examples into one of these 5 classes.

The dataset D is split into two parts, a support set S for learning and a prediction set B for training or testing: D = <S, B>.

Let’s say you take many other sets of 5 image classes and split each one into a train set and a test set. You then train your meta-learner across these sets, so that it takes a training dataset as input and produces correct labels on each task’s test set.
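One way to picture how such tasks are built is the sketch below. It assumes your data is a dictionary from class name to a list of examples; the function name `sample_episode` and the parameters `n_way`, `k_shot`, and `n_query` are illustrative choices, not part of any particular library.

```python
import random


def sample_episode(data_by_class, n_way=5, k_shot=1, n_query=2, rng=random):
    """Build one few-shot task D = <S, B> from a pool of labelled examples.

    data_by_class maps a class name to a list of examples. The returned
    support set S has k_shot examples per class and the prediction set B
    has n_query different examples per class.
    """
    classes = rng.sample(sorted(data_by_class), n_way)
    support, prediction = [], []
    for label in classes:
        examples = rng.sample(data_by_class[label], k_shot + n_query)
        support += [(x, label) for x in examples[:k_shot]]
        prediction += [(x, label) for x in examples[k_shot:]]
    return support, prediction


# Toy pool: 8 hypothetical classes with 10 placeholder "images" each.
pool = {f"class_{c}": [f"img_{c}_{i}" for i in range(10)] for c in range(8)}
support_set, prediction_set = sample_episode(pool, rng=random.Random(0))
print(len(support_set), len(prediction_set))
```

Calling `sample_episode` repeatedly gives a stream of different 5-class tasks, which is the "dataset of datasets" a meta-learner would train on.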

Essentially, you’re learning how to learn the bottom two sets so you can generalize to data and classes that have been unseen.

You could be doing other learning problems. They don’t have to be classification problems. You could be trying to learn some regression problems or trying to predict the dynamics of robots for different tasks.

… isn’t everything meta-learning?

Our notions of meta-learning are evolving as we solve more complex tasks.

  • ImageNet
  • Few Shot Learning
  • Captioning

These are all technically examples of meta-learning. Even image captioning, which maps an image to a sentence that explains the image, is a form of meta-learning.

The question we need to ask ourselves is: can we explicitly learn from previous experiences that lead to efficient downstream learning?

That’s where reinforcement learning comes in.

RL has received a lot of attention in the AI space because it doesn’t require labelled data and has proven remarkably effective in simulated environments and at imitating the process of “human learning.” However, RL struggles with sample efficiency, generalization, and transfer learning. To address these drawbacks, researchers have been exploring ideas in meta-RL, in which learning strategies can quickly adapt to novel tasks by using experience gained on a large set of tasks that have a shared structure.

a concrete meta-rl problem

The goal of Meta-RL is to design agents that can adapt quickly and improve with additional experience.

Consider a classical control task like HalfCheetah, which is simulated with the MuJoCo physics engine. In this environment, you control a 2D cheetah robot and the goal is to make it run forward as fast as possible.

Brax HalfCheetah trained with PPO [6].

Now, let’s consider a slight modification of the environment that includes a goal speed. The goal of the agent is to run forward and reach this speed. This goal can only be inferred from the rewards that are received. Because the agent starts without any knowledge about the movement, environment, or objective, it acts like an untrained network that observes a state vector and sometimes (if it’s lucky) gets rewards.
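As a rough illustration, the modified task could be expressed with a reward like the one below. The exact reward isn't given here, so the negative absolute difference between the current and target speed is an assumption, as is the function name `goal_speed_reward`.

```python
def goal_speed_reward(forward_velocity, goal_speed):
    """Hypothetical reward: largest (zero) when the cheetah runs at goal_speed."""
    return -abs(forward_velocity - goal_speed)


# The agent never observes goal_speed directly; it only sees rewards like these.
rewards = [goal_speed_reward(v, goal_speed=2.0) for v in (0.0, 1.0, 2.0, 3.0)]
print(rewards)
```

Each choice of `goal_speed` defines a different task, and the only way to tell the tasks apart is through the rewards.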

With a goal speed v and using an actor-critic method with continuous control like PPO [6], the agent can learn to run forward at speed v. However, imagine we now want to consider a new goal speed v’.

The normal thing to do would be to send the agent back into the environment and retrain it. The problem is that the learning process is incredibly unpredictable and there’s little assurance about how quickly the agent would be able to learn a new policy.

In theory, though, the agent has information about movement and speed. Given a good and informative signal from the environment, we would like the agent to adapt quickly and update its policy to reach the new target speed. Designing agents like this is the main goal of meta-reinforcement learning.

defining and formulating the meta rl problem

A meta-RL model is trained over a distribution of MDPs. At test time, it should be able to learn to solve a new task quickly. The goal of meta-RL is ambitious, taking one step further toward general algorithms: instead of supervised tasks, we meta-learn over reinforcement learning tasks.

First, I think it’s important to define the key problems and differences.

Reinforcement Learning:

For reinforcement learning, you replace each of the rows with a reinforcement learning problem instead of a supervised learning problem. The goal is to learn a policy π that maps from states to actions using (state, action, reward, next state) transitions. You want to learn π such that you maximize your expected rewards.
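Written out, and assuming a trajectory τ generated by running the policy π for T steps with reward r(s_t, a_t) (this notation is an assumption, not taken from the text), that objective might look like:

```latex
\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t=0}^{T} r(s_t, a_t) \right]
```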

Meta reinforcement learning:

We can turn this into a meta-reinforcement learning problem by trying to learn from a small amount of data, for example k rollouts from some policy. We want to learn from this such that when given a new state, we can effectively predict a good action to take from that state.

Again, you can express this as some function or learner f that takes as input the training dataset it’s trying to learn from. It draws on its experience in the current task, as well as the current state, to predict the corresponding action. The data that we’ll have is a dataset of datasets, with each dataset corresponding to a different task.

The problem is again the design and optimization of this function f. We need to figure out how it’s actually going to learn from this data and how we’re going to optimize it to learn from this data. Another important part of the meta-reinforcement learning problem is how to collect appropriate data (D_train), since in a reinforcement learning setting this is under the agent’s direct control.

This part of the problem is learning to explore — figuring out how you should collect this train data and explore for a new task such that the data gives effective information for solving the task.

As a concrete example, one thing we could do is figure out how to solve a maze from a few examples.

By learning how to learn many other tasks:

You learn how to learn many other mazes. You expose the model to different experiences in mazes and then train it to solve this maze with a small amount of data.

method classes

black-box or context-based methods

We already talked about how we have this function that takes in a training dataset and a state and outputs the corresponding action.

One really simple way to do this is to essentially parameterize this learner with just a big neural network.

So for example, it could be a recurrent neural network that takes as input the data sequentially and outputs an action.

You might wonder how this differs from just doing reinforcement learning with a recurrent policy.

One change is that you pass rewards into this policy. Typically, if you just apply a recurrent policy, you may not be passing in rewards. The biggest difference is that the hidden state of your recurrent network is actually maintained across episodes within a task.

This allows you to actually adapt to a task across multiple episodes and maintain memory about that task and learn about that task across multiple episodes.

This function can be optimized with standard reinforcement learning methods (policy gradients, for example).
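Here is a minimal sketch of that idea, assuming a tiny hand-rolled RNN with random, untrained weights and a toy 1-D task where the agent tries to reach a hidden goal position. The class `RecurrentPolicy`, the function `run_task`, and the toy environment are all hypothetical. The training loop is left out; what the sketch shows is the input (observation, previous action, previous reward) and where the hidden state is and isn't reset.

```python
import math
import random


class RecurrentPolicy:
    """Tiny RNN policy whose input includes the previous action and reward."""

    def __init__(self, obs_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        in_dim = obs_dim + 2  # observation + previous action + previous reward
        self.w_in = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)]
                     for _ in range(hidden_dim)]
        self.w_rec = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)]
                      for _ in range(hidden_dim)]
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)]
        self.hidden = [0.0] * hidden_dim

    def reset_hidden(self):
        """Call once per task, not once per episode."""
        self.hidden = [0.0] * len(self.hidden)

    def act(self, obs, prev_action, prev_reward):
        inputs = list(obs) + [prev_action, prev_reward]
        # New hidden state from the inputs and the previous hidden state.
        self.hidden = [
            math.tanh(sum(w * x for w, x in zip(row_in, inputs)) +
                      sum(w * h for w, h in zip(row_rec, self.hidden)))
            for row_in, row_rec in zip(self.w_in, self.w_rec)
        ]
        return math.tanh(sum(w * h for w, h in zip(self.w_out, self.hidden)))


def run_task(policy, goal, episodes=3, steps=4):
    """Run several episodes of one toy task and return the final hidden state."""
    policy.reset_hidden()  # memory starts empty for a new task
    for _ in range(episodes):
        position, action, reward = 0.0, 0.0, 0.0  # the episode resets...
        for _ in range(steps):
            action = policy.act([position], action, reward)
            position += action
            reward = -abs(position - goal)
        # ...but policy.hidden is deliberately left untouched here.
    return list(policy.hidden)


policy = RecurrentPolicy(obs_dim=1, hidden_dim=4)
hidden_after_task = run_task(policy, goal=1.0)
print(hidden_after_task)
```

Because the hidden state survives episode boundaries, whatever the network has picked up about the goal from earlier rewards is still available in later episodes of the same task.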

In the black-box approach, the learner is parameterized entirely by the neural network, which adapts its behaviour as it receives new information. The main advantages of this approach are that it’s very general and expressive. A recurrent neural network can approximate any function under fairly mild assumptions, which lets us represent many types of learning procedures. There’s also a variety of design choices in the architecture, which gives some flexibility in how the data is processed.

The downside to this approach is that it’s a complex model that has to learn a very complex thing, which is essentially how to learn from data. It has to process this data and translate it into information about the task that informs its actions in the future.

Because it has to learn this complex task completely from scratch, it can sometimes be difficult to train these models. Sometimes they also require an impractical amount of data to optimize effectively.

optimization-based methods

The starting point for optimization-based meta-learning is the idea of fine-tuning. In fine-tuning, you have a set of pre-trained parameters θ, and you run gradient descent on those parameters using training data for a new task. For example, you might pre-train some neural network parameters on a large dataset like ImageNet, and then fine-tune them on a more modestly sized dataset for some new task that you care about, to get parameters φ that work well for that new task.

We’d like to be able to extend this to the few-shot learning setting. Unfortunately, if you pre-train on something like ImageNet and then fine-tune on only a few examples, it likely won’t work very well because the model may overfit to the small number of samples. Instead, we should explicitly optimize for a set of pre-trained parameters such that fine-tuning works well.

Let’s assume we have a set of pre-trained parameters θ. You can run one or a few steps of gradient descent from those parameters on your new task and then optimize for the performance of those fine-tuned parameters on held-out data. You can do this for not just one task, but for all the tasks in your meta-training set, so that you can hopefully transfer to a new task.

Essentially, when using gradient descent on this meta-learning process, the model specifically optimizes the initial pre-trained parameters. This idea is key to learning an initial set of parameters, or initial features, that transfer effectively with a small number of gradient steps and a small amount of data.
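A toy sketch can make this two-level optimization more concrete. It assumes each task is a 1-D quadratic loss (theta - target)², so all gradients can be written by hand, and it reuses the same loss for a task's fine-tuning data and its held-out data. The task targets, learning rates, and function names are arbitrary choices for illustration.

```python
ALPHA = 0.1   # inner (fine-tuning) learning rate, an arbitrary choice
BETA = 0.05   # outer (meta) learning rate, an arbitrary choice


def loss(theta, target):
    return (theta - target) ** 2


def grad(theta, target):
    return 2.0 * (theta - target)


def inner_step(theta, target):
    """One step of fine-tuning on a single task."""
    return theta - ALPHA * grad(theta, target)


def meta_gradient(theta, target):
    """Gradient of the post-fine-tuning loss with respect to the initial theta."""
    phi = inner_step(theta, target)
    d_phi_d_theta = 1.0 - 2.0 * ALPHA  # derivative of inner_step w.r.t. theta
    return grad(phi, target) * d_phi_d_theta  # chain rule through the inner step


task_targets = [-2.0, 0.0, 1.0, 5.0]  # four toy tasks
theta = 10.0                          # initial "pre-trained" parameter
for _ in range(500):
    # Average the meta-gradient over all tasks in the meta-training set.
    g = sum(meta_gradient(theta, t) for t in task_targets) / len(task_targets)
    theta -= BETA * g

print(theta)
```

In this toy setting the learned initialization settles at the mean of the task targets, the point from which a single gradient step helps the most on average across tasks. With a neural network the chain rule through the inner step would be handled by automatic differentiation rather than by hand.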

Visually, what does this look like?

You can view the meta-learning process as a thin black line.

If you’re at some point partway through the meta-training process and you take a gradient step with respect to task three, you would still be quite far from the optimum for task three, and the same would apply to all other tasks.

However, at the end of the meta-training process, if you take a gradient step with respect to task three, you end up close to its optimum. The same holds if you take a gradient step for task one or for task two.

In summary, you’re trying to find a point in parameter space from which you can quickly fine-tune to the optimal parameter vector for different tasks.

This algorithm is called Model-Agnostic Meta-Learning (MAML), and it’s one example of an optimization-based meta-learning algorithm.

There are other ways that you can twist this such that you can learn the learning rate here, for example, or you could use other inner optimizations, not just gradient descent. In the context of reinforcement learning, what this might look like is trying to learn a set of initial features or representation space under which reinforcement learning is very quick.

Is there a way we can learn a representation under which RL is fast?

For example, you may want a quadrupedal ant agent to run in different directions. So, you meta-train it to be able to run in those different directions. At the end of the meta-training procedure, but before you take a gradient step, you get a policy under which the ant is essentially running in place, ready to run in any direction.

If you then take one gradient step with respect to the task of running backward, you get a policy that runs backward. If you take one gradient step with respect to the task of running forward, you get a policy that runs forward.

Hopefully, that gives better intuition into why and how exactly this optimization-based approach works.

We can combine this approach with a model-based reinforcement learning setting, where we want to be able to adapt to different dynamics in the environment. In particular, we want to be able to adapt completely online.

We can store a recent history of transitions and then adapt to those last few time steps using just one step of gradient descent on a learned dynamics model.

Then, we get our updated model parameters and run planning, here model predictive control (MPC), which means planning at each time step using this model to decide which action to take in the environment.
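The sketch below shows the shape of this adapt-then-plan loop under heavy simplifying assumptions: a 1-D state, a dynamics model with a single parameter `k_hat`, and a one-step lookahead standing in for full MPC. All names and numbers are hypothetical, and the meta-training that would choose a good starting value for the model is left out.

```python
def predict(k_hat, state, action):
    """Learned dynamics model: next_state is approximately state + k_hat * action."""
    return state + k_hat * action


def adapt(k_hat, history, lr=0.5):
    """One gradient step on the mean squared prediction error over recent steps."""
    g = sum(2.0 * (predict(k_hat, s, a) - s_next) * a
            for s, a, s_next in history) / len(history)
    return k_hat - lr * g


def plan(k_hat, state, goal, candidates):
    """One-step lookahead: pick the action whose predicted next state is nearest the goal."""
    return min(candidates, key=lambda a: abs(predict(k_hat, state, a) - goal))


TRUE_K = 0.5   # the real environment, e.g. after the dynamics have changed
k_prior = 1.0  # model parameter before adaptation
candidates = [i / 10 for i in range(-10, 11)]

# Recent (state, action, next_state) transitions observed in the real environment.
history = [(0.0, 1.0, 0.5), (0.5, -1.0, 0.0), (0.0, 0.5, 0.25)]

k_adapted = adapt(k_prior, history)
action_prior = plan(k_prior, 0.0, 0.5, candidates)
action_adapted = plan(k_adapted, 0.0, 0.5, candidates)
print(k_adapted, action_prior, action_adapted)
```

After a single gradient step the model parameter has moved toward the true dynamics, so the planned action gets the agent closer to the goal than planning with the unadapted model would.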

If you start from scratch and adapt your model with one gradient step, you won’t do very well. Instead, we meta-train the agent to adapt to different dynamics using the Model-Agnostic Meta-Learning (MAML) algorithm, so that it can adapt with just a single gradient step.

An example of dynamics variation might be crippling one of the agent’s legs, or requiring the agent to run across some platforms. If you run standard model-based RL in these environments without any adaptation, you get behaviour that is far from optimal. If we instead combine model-based RL with a meta-learning approach and train the agent to adapt online to changes in dynamics, it can figure out how to adapt quickly to many types of changes, like varying buoyancy or other variations within the environment.

This model-based meta-RL approach is quite efficient in comparison to other methods designed for the same purpose, which means it’s practical to run on a real robot.


There are two different kinds of meta-reinforcement learning methods. The first is black-box methods, where a neural network implements the learning-to-learn procedure. The second is optimization-based meta-learning methods, where gradient descent is embedded into the model’s learning procedure.

They’re very similar approaches but the key difference with the optimization-based meta learner is that inside this function f, there is a gradient operator that is updating your parameters as a function of the training data set, rather than just ingesting that training data into a neural network.

Some of the benefits of the black-box approach are that it’s quite general and expressive, and that there’s a variety of design choices in the architecture. However, it can be challenging to train, whereas an optimization-based method is typically significantly easier to train.

The optimization-based approach is also model-agnostic: if you already have an architecture, it’s fairly easy to apply the method to it. The downside is that the gradient you get from a reinforcement learning method, like a policy gradient, is often not very informative. As a result, it can be challenging to combine an optimization-based approach with policy gradients (especially if the gradients are noisy or have high variance).

open challenges

The field of meta-learning, and specifically meta-RL, presents many unique challenges that limit our understanding of the field and slow progress on implementing meta-learning algorithms.

adapting to entirely new tasks

We can adapt relatively easily to different dynamics, like different terrains, or adapt to running in different directions or solving different mazes. In many situations, though, we may want to adapt to something that’s more distinctly different from the meta-training set.

We should follow a general principle that applies in machine learning more broadly: the task distribution for meta-training and meta-testing should be the same. This means that if a model is expected to adapt to something new, the task distribution used for meta-training needs to be quite broad.

Where might we actually get this broad distribution of tasks?

We could look to existing RL benchmarks, but things like OpenAI Gym or the Atari benchmark aren’t exactly suitable for these types of problems because the tasks are too different from one another, or there just aren’t enough tasks to support generalization to new tasks.

With this in mind, many benchmarks have been developed for the study of meta-reinforcement learning like the Meta-World benchmark and Alchemy.

Meta-World has 50 tasks from very distinct manipulation settings where the goal is to manipulate an object in a new way which makes it effective for studying meta-learning algorithms.

The benchmark is interesting because researchers don’t need to worry as much about RL failing on these tasks; they can instead focus on the transfer of learning between different tasks. However, if you apply existing meta-learning algorithms to this benchmark, they don’t do very well at all, as it has so far been challenging to scale these approaches.

Alchemy is another popular benchmark for meta-learning. It’s a 3D video game played in a series of trials that fit together into episodes.

Alchemy: Structured Task Distribution for Reinforcement Learning

Researchers evaluated two powerful deep RL agents (IMPALA and V-MPO) on the Alchemy environment. Although these agents are quite good in single-task RL environments, they performed poorly on Alchemy even after extensive training.

sourcing: where do the tasks come from?

There needs to be a way to define these tasks for meta-learning. We meta-learn the algorithms and assume the tasks are given, but this means that a human needs to provide them to the agent: distributions of tasks and their corresponding rewards have to be engineered manually.

Ideally, we would want to have mechanisms for the algorithm itself to come up with tasks for it to solve.

The key takeaway is that we need to start moving away from reinforcement learning approaches that learn from scratch. Meta-RL provides an effective way of doing this.

There are three components to Meta-RL:

  • A model with memory: maintains a hidden state, and can acquire and memorize knowledge about the current task by updating that hidden state.
  • A meta-learning algorithm: decides how to update the model weights so that an unseen task can be solved fast at test time.
  • A distribution of MDPs (Markov Decision Processes): through exposure, the agent has to learn how to adapt to different MDPs.

on the future of meta-rl

The broader implications of innovations in meta-RL are exciting because they help us improve and better understand meta-learning as it applies to machine learning in general. The general trend in research has been to avoid fine-tuning models and instead use a meta-learning algorithm tasked with finding the best architecture and hyperparameters.

Even within meta-RL, where different models are still being tested to find the best results, we may begin to see research in the space applied to quickly changing environments.

For example, one of the biggest applications for reinforcement learning remains drug discovery and molecular simulation. What is interesting about the environments we simulate here is how varied human biology is; even beginning to understand the complexity of the environment is a challenge. Being able to fine-tune parameters and tweak the environment presents huge opportunities to cut down on training time and compute power, as well as to create more accurate predictions and simulations.

On the more general front, looking towards a better understanding of human learning, we may begin to see and even monitor the exact points that a learner improves through experiences which can help us build better models and solve harder problems.

“We should stop trying to find simple ways to think about the contents of minds, such as simple ways to think about space, objects, multiple agents, or symmetries. All these are part of the arbitrary, intrinsically-complex, outside world. They are not what should be built in, as their complexity is endless; instead we should build in only the meta-methods that can find and capture this arbitrary complexity.” — Richard Sutton, The Bitter Lesson (March 13, 2019)


👇 more resources
Oriol Vinyals: Perspective and Frontiers of Meta-Learning


