Getting Familiar with Bootstrapped Meta Learning — An Overview

Đồng Nguyễn Minh ANH
13 min read · Feb 10, 2023

Anh’s Take on Research Summaries Ep. 1: Bootstrapped Meta-Learning (2022) — Sebastian Flennerhag et al.

As of February, one of my goals is to summarize and share my insights on one research paper every week. The first paper is about an algorithm that enables the meta-learner to teach itself, effectively overcoming the challenge of meta-optimization. By utilizing gradients for meta-learning, the algorithm comes with guaranteed performance improvements. Additionally, the paper explores the potential of bootstrapping in meta-learning.

Before we get into the summary: there are a few terms that would make this article quite long (and boring to read) if explained inline, so I made a separate document covering the bolded terminology (the bolded terms themselves, not full bolded sentences).

INTRODUCTION

Acquiring the skill of learning is a common strength among us: we use our past experiences to speed up learning new tasks. Endowing AI systems with the same capability, on the other hand, is challenging, since machine learning algorithms rely on update rules that are usually customized for each task through manual adjustments.

The field of meta-learning looks at how to enable machine learners to learn how to learn. This is a crucial aspect of advancing AI efficiency.

One approach involves the machine learner learning an update rule by applying it for some number of steps and evaluating the resulting performance. Realizing the full potential of meta-learning requires tackling both the problem of meta-optimization itself and the limitations of myopic meta-objectives.

To address these, researchers from DeepMind have proposed an algorithm that empowers meta-learners to self-teach.

LET’S LOOK AT AN ANALOGY

Again, a meta-learner is trained to improve its learning performance over time by using experience from past tasks.

The goal is to reduce the amount of labeled data required to learn new tasks, and to improve the speed and effectiveness of learning.

This is achieved by allowing the model to leverage its prior knowledge and experience to better understand the structure of new tasks, and to identify the most effective learning strategies to use.

Ok, so imagine the meta-learner as a student named Anna.

Anna, as a student (a.k.a. the meta-learner), is not only learning the material, but also learning how to learn the material more effectively. Each time Anna completes a new assignment, she reflects on her learning process and makes adjustments to improve her overall understanding. Over time, she becomes better and faster at learning, not just in one subject, but across a variety of subjects.

Similarly, a meta-learner is not only learning to perform a specific task, but is also learning how to learn tasks more effectively, so that it can quickly adapt to new tasks with limited data.

To learn more, read the full paper here.

The authors of the research paper are researchers at DeepMind.

In order for a meta-learner to learn its update rule, it must first evaluate that rule, and evaluating the rule means applying it for a number of steps first, which can result in excessive computational expense.

Past research has relied on the assumption that optimizing performance after a certain number (K) of update rule applications will result in improved performance for the remainder of the learning process.

However, if this assumption proves to be incorrect, meta-learners may suffer from a short-horizon (myopic) bias. Additionally, optimizing performance only after K updates may overlook the learning process itself.

As pointed out in the paper: “[When a machine tries to implement the above process], it must be able to learn its update rule. Meta-learning is one approach that learns (parts of) an update rule by applying it for some number of steps and then evaluating the resulting performance.”

Learning an update rule requires applying it, and doing so creates high computational costs. This is usually handled by applying the update rule K times and then optimizing performance at that point, under the assumption that this choice does not hurt performance over the remaining life of the learner.
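To make the K-step setup concrete, here is a minimal sketch in Python using JAX autodiff of plain meta-gradients (MG) on a toy problem. Everything in it (the quadratic loss, a learned log learning rate as the only meta-parameter, K = 5) is an illustrative assumption of mine, not the paper’s actual setup; the point is just that the update rule is applied K times and the resulting performance is back-propagated into the meta-parameters.

import jax
import jax.numpy as jnp

def loss(x):
    # Toy learner objective (stand-in for the real task loss).
    return jnp.sum((x - 3.0) ** 2)

def inner_update(x, w):
    # Update rule phi(x; w): gradient descent whose step size
    # is controlled by the meta-parameter w (a learned log learning rate).
    return x - jnp.exp(w) * jax.grad(loss)(x)

def meta_objective(w, x, K=5):
    # Apply the update rule K times, then evaluate the resulting performance.
    for _ in range(K):
        x = inner_update(x, w)
    return loss(x)  # "performance after K updates"

x0 = jnp.zeros(4)
w = jnp.log(0.1)                              # initial meta-parameter
meta_grad = jax.grad(meta_objective)(w, x0)   # back-prop through all K steps
w = w - 0.01 * meta_grad                      # one meta-gradient (MG) step

Note that the gradient has to flow back through all K applications of the update rule, which is exactly where the computational cost and the curvature/myopia issues discussed next come from.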

The process of meta-optimization creates 2 key challenges:

  1. Curvature, where the meta-objective is limited to the same type of geometry as the learner.
  2. Myopia, where the meta-objective only evaluates performance within a limited horizon, ignoring future learning dynamics.

The proposed algorithm developed by the research team from DeepMind addresses these challenges with two key features.

→ To prevent myopia, the algorithm uses bootstrapping methods to integrate information about future learning dynamics into the objective.

→ To control curvature, the meta-objective is made to minimize the distance to the bootstrapped target.

The algorithm operates in two steps: it bootstraps a target from the learner’s updated parameters, and it projects the updated parameters and the target onto a matching space (for example, Euclidean space, or a distribution space measured with KL divergence).

The meta-learner’s objective is to minimize the distance to this bootstrapped target.

The algorithm uses a novel Bootstrapped Meta-Gradient (BMG) to optimize the meta-objective, incorporating information about future learning dynamics without increasing the number of steps required for back-propagation. The BMG accelerates the optimization process and, as demonstrated in the paper, leads to improved performance.
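Below is a minimal, hedged sketch of the two-step BMG idea, in the same toy Python/JAX setting as the earlier MG sketch (again, the quadratic loss, the learned learning rate, and the step counts are my own illustrative assumptions, not the paper’s configuration). The target is produced by running a few more updates from the learner’s K-step parameters, gradients are blocked from flowing through the target, and the meta-gradient comes from minimizing the distance to that target.

import jax
import jax.numpy as jnp

def loss(x):
    return jnp.sum((x - 3.0) ** 2)

def inner_update(x, w):
    # Update rule phi(x; w) with a meta-learned log step size w.
    return x - jnp.exp(w) * jax.grad(loss)(x)

def unroll(x, w, steps):
    for _ in range(steps):
        x = inner_update(x, w)
    return x

def bmg_objective(w, x, K=1, extra=4):
    x_k = unroll(x, w, K)                    # learner's updated parameters
    target = unroll(x_k, w, extra)           # target bootstrap: keep updating
    target = jax.lax.stop_gradient(target)   # no back-prop through the target
    return jnp.sum((x_k - target) ** 2)      # Euclidean matching function

x0 = jnp.zeros(4)
w = jnp.log(0.1)
w = w - 0.01 * jax.grad(bmg_objective)(w, x0)  # bootstrapped meta-gradient step

Only the K learner steps are differentiated through; the extra steps that build the target add compute but no back-propagation, which is how the effective horizon gets longer without extra back-prop cost.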

The paper also highlights a “new state-of-the-art for model-free agents on the Atari ALE benchmark and demonstrate[s] that it yields both performance and efficiency gains in multi-task meta-learning”, as well as how bootstrapping opens up new avenues and “can meta-learn efficient exploration in an epsilon-greedy Q-learning agent, without backpropagating through the update rule.”

To summarize BMG (bootstrapped meta-gradients):

The article describes a meta-learning algorithm that works by bootstrapping a target from the meta-learner and then optimizing by minimizing the distance to that target. The paper establishes conditions under which this gradient-based scheme guarantees improved performance, and shows that the matching metric can be used to control meta-optimization.

BOOTSTRAP META-LEARNING

Tying back to the definitions of curvature and myopia that we explored before: a target is generated by continuing to update the learner’s parameters for some number of steps.

The target bootstrap is a key feature that has been introduced to improve the performance of the system. The guarantee of this improvement is supported by the equations 1 and 2 in Section 3 of the paper, which provide a mathematical explanation of the process.

The target bootstrap extends the learning by repeating the process a specified number of times (L-1 times in the case of this paper).

The target is a fixed quantity and there is no back-propagation through it, which helps to reduce the computational cost of the process.

The meta-learner’s task is then simply to match this target. The target bootstrap offers a way to extend the learning horizon without increasing computational costs by too much. The mathematical details of the process can be found in Section 4 of the paper.

MINIMIZING DISTANCE/DIVERGENCE TO TARGET IN MATCHING SPACE

This is what helps control the landscape for meta-learning. The learner’s updated parameters and the target are projected into a matching space, and the meta-learner will be optimized within this space.

A matching function 𝜇 is introduced, measuring the similarity, or lack thereof, between the meta-learner’s output and the target. The paper mentions and makes use of KL divergence as one option for this measure.
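As a hedged illustration of what a matching function might look like when the learner outputs a policy (a distribution over actions), here is a small KL-based matching function in Python with JAX; the logits-based parameterization and the direction of the KL are my own assumptions for this sketch, not a statement of which variant the paper uses.

import jax
import jax.numpy as jnp

def kl_matching(target_logits, logits):
    # mu(target, x): KL divergence between the (fixed) target policy and
    # the learner's current policy, both given as logits over actions.
    p = jax.nn.softmax(target_logits)           # target distribution (held fixed)
    log_p = jax.nn.log_softmax(target_logits)
    log_q = jax.nn.log_softmax(logits)
    return jnp.sum(p * (log_p - log_q))         # KL(target || learner)

# Example: identical policies give a divergence of (approximately) zero.
logits = jnp.array([0.2, 1.5, -0.3])
print(kl_matching(logits, logits))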

The bootstrapped meta-gradient update is then as follows (Equation 2 in the paper). This update is used to optimize the meta-parameters of a learning algorithm so that the base learner better matches the target.
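In the article’s own notation, and as my paraphrase rather than the paper’s exact statement, the update described in the next paragraph can be written in one line (the gradient flows only through the second argument; the target x˜ is held fixed):

w˜ = w − β ∇w 𝜇(x˜, x^(K)(w))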

The formula above updates the meta-parameters (w) by computing the gradient (∇w) of the matching function (𝜇) with respect to its second slot, and then adjusting the meta-parameters by a step whose size is set by the chosen constant (β). The updated meta-parameters (w˜) are used to produce the target (x˜) by applying the learner (𝜑) to the parameters (x^(K)(w)) a certain number of times.

This formula is part of meta-learning. In this context, the meta-parameters (w) represent the parameters of a learning algorithm that is used to update the parameters of another, base learning algorithm (𝜑).

The purpose is to adjust the meta-parameters (w) so that the base learning algorithm (𝜑) produces better results on the target (x˜). The gradient of the matching function (∇w) indicates how the meta-parameters should be adjusted to improve the performance of the base learning algorithm, and the constant (β) determines the step size of the update, controlling how much the meta-parameters change at each iteration.

The bootstrapping mechanism allows the algorithm to extend the effective meta-learning horizon without additional back-propagation. On the Atari ALE benchmark this results in improved performance, and it also yields efficiency gains in multi-task meta-learning. Additionally, the mechanism opens up new possibilities, such as meta-learning efficient exploration in an ε-greedy Q-learning agent without back-propagating through the update rule.

To understand the bootstrapping mechanism better, think of a carpenter who uses a measuring tool (the meta-learner) to build toward a target (the goal), and then refines the measuring tool itself (meta-optimization) to make it as accurate as possible.

The bootstrapping mechanism allows the carpenter to extend their reach, making it simpler to build bigger and better structures without having to recalibrate their measuring tool.

The goal of the algorithm is to overcome two bottlenecks in meta-optimization: curvature and myopia.

→ The algorithm achieves this by bootstrapping and formulating the meta-objective in terms of minimizing distance to the bootstrapped target, thereby controlling the meta-loss landscape.

It also optimizes the meta-learner by minimizing the distance between the learner’s new parameters and the bootstrapped target. The paper explains in more depth how the algorithm controls curvature by projecting the parameters and the target onto a matching space.

To conclude this part: a meta-learner learns about an update rule by generating future targets from it. The approach is related to methods in multi-task meta-learning, as well as to the concept of self-referential meta-learning. Target matching under KL divergences is a form of distillation, and the idea behind BMG (the approach in this article) is informed by trust-region methods, which introduce a distance function to regulate gradient updates.

Bootstrapped meta-learning has a goal it wants to reach, and it tries to get closer to that goal by changing some of its settings. The computer has a plan for how to change itself, but sometimes that plan might not work as well as hoped.

So, it uses something called a “Target Bootstrap” to help it know if it is going in the right direction. It also uses a “matching function” to see how close it is to its goal. These things help the computer make better decisions on how to change itself so it can get closer to its goal.

PERFORMANCE GUARANTEES

The analysis focuses on two algorithms, Meta-Gradients (MG) and Bootstrapped Meta-Gradients (BMG), and how they perform when there is no noise.

The MG algorithm can work well if the problem is simple, but if the problem is more complex, the performance may suffer.

On the other hand, the BMG algorithm has a better chance of success as it ensures that the learning direction is always moving in the direction of the steepest improvement and provides a stronger learning signal.

The results show that the BMG algorithm performs better than the MG algorithm and provides a larger improvement.

Thus, BMG > MG.

REINFORCEMENT LEARNING

In RL, an agent learns to make decisions by taking actions in an environment to maximize a reward. It’s like teaching a child to play a game by giving them rewards for making good moves and penalizing them for making bad moves. Over time, the child (or the agent in the case of reinforcement learning) will learn to make the best decisions to maximize their rewards. It’s also applied in psychology in conditioning!

Reinforcement learning is a way for a computer to learn how to solve problems by trying things and receiving rewards or punishments. It uses a framework called a Markov Decision Process to keep track of the different situations it can be in (called states), what actions can be taken in each state, and what happens as a result (rewards and transitions). The goal of RL is to find the best series of actions to take in each state to get the most reward.
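As a toy illustration of the MDP bookkeeping described above (states, actions, rewards, transitions), here is a tiny hand-made example in Python; the two-state chain and its numbers are made up purely for illustration.

# A tiny, made-up MDP with two states and two actions.
# transitions[state][action] = (next_state, reward)
transitions = {
    "A": {"stay": ("A", 0.0), "go": ("B", 1.0)},
    "B": {"stay": ("B", 2.0), "go": ("A", 0.0)},
}

def rollout(policy, start="A", steps=5):
    """Follow a policy (state -> action) and accumulate the reward."""
    state, total = start, 0.0
    for _ in range(steps):
        action = policy(state)
        state, reward = transitions[state][action]
        total += reward
    return total

# A simple policy that moves toward state B and then stays there.
print(rollout(lambda s: "go" if s == "A" else "stay"))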

To improve the policy, a new one is constructed that has a higher expected reward than the current policy. Most meta-RL algorithms use actor-critic approaches, in which the agent learns with two parts: an “actor” and a “critic”. The actor is the decision-maker: it decides what to do in a given situation. The critic is the score-keeper: it tells the actor how well it is doing.

The computer tries to find the best way to solve the problem by adjusting the actor and critic based on their mistakes. This is called “optimizing” the policy.

The goal is to get the best score, which is a combination of how well the actor is doing, how much it’s exploring new options, and how well the critic is estimating the scores.
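To make that “score” concrete, here is a hedged sketch of a typical actor-critic loss with an entropy bonus, written in Python with jax.numpy; the particular weights, the advantage estimate, and the exact form of each term are generic actor-critic conventions I am assuming, not the specific objective used in the paper.

import jax
import jax.numpy as jnp

def actor_critic_loss(logits, value, ret, action, entropy_weight=0.01):
    # logits: the actor's action preferences for one state
    # value:  the critic's estimate of the return from that state
    # ret:    the observed (bootstrapped) return; action: the action taken
    log_probs = jax.nn.log_softmax(logits)
    probs = jax.nn.softmax(logits)
    advantage = ret - value
    policy_term = -log_probs[action] * jax.lax.stop_gradient(advantage)  # actor
    critic_term = 0.5 * (ret - value) ** 2                               # score-keeper
    entropy = -jnp.sum(probs * log_probs)                                # exploration bonus
    return policy_term + critic_term - entropy_weight * entropy

print(actor_critic_loss(jnp.array([0.1, 0.5, -0.2]), 0.3, 1.0, action=1))

In the grid-world experiment described below, it is an entropy-rate weight of exactly this kind that the agents meta-learn online.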

The team conducted experiments to compare the performance of BMG with standard meta-gradients. These experiments were performed using a typical reinforcement learning task modeled as a Markov Decision Process, where the goal was to learn a policy that maximizes the expected cumulative reward (the value).

Figure 2 in Section 5.1 compares the performance of an actor-critic agent in a changing virtual environment. The results are based on 50 runs. The right part of the figure shows how the agent learns to balance exploration and exploitation over 4 task cycles after 6 million steps in the environment.

This is a study on meta-learning in a grid-world environment where 2 items with different rewards are collected and re-spawned randomly.

The study compares the performance of a memory-less actor-critic agent with fixed entropy-rate weight against agents that meta-learn the weight online.

The main focus is on the effect of bootstrapping, with the computational complexity of the method held constant.

The results show that both meta-learned methods (MG and BMG) outperform the baseline, but BMG has greater adaptive capacity and can meta-learn without back-propagating through the update rule. The paper also explores a new form of meta-learning by learning ε-greedy exploration in a Q(λ)-agent, where BMG outperforms the best fixed ε found by hyper-parameter tuning. BMG also responds positively to longer meta-learning horizons.
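For the ε-greedy experiment, the quantity being meta-learned is the exploration rate itself. A minimal sketch in Python of ε-greedy action selection with a tunable ε (the part BMG would be adjusting) might look like this; the Q-values and sizes are placeholders of mine, not the paper’s agent.

import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon explore uniformly, otherwise act greedily."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore
    return int(np.argmax(q_values))               # exploit

q_values = np.array([0.1, 0.7, 0.3])
epsilon = 0.05   # in the experiment above, this is the quantity that gets meta-learned
print(epsilon_greedy(q_values, epsilon))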

Figure 3: BMG ε-greedy exploration under a Q(λ)-agent.

Figure 4 compares the performance of two AI methods, BMG and STACX*, in a virtual gaming environment, the Atari ALE. The comparison is based on how well each agent performs relative to human scores across 57 different games. The left part of the figure shows the difference in scores between the two methods, and the right part shows the average scores of the two methods over time compared to previously published results, with shaded areas showing how much the scores vary across three separate runs.

ATARI

The paper compares the effectiveness of two reinforcement learning algorithms, BMG and STACX, on the Atari Arcade Learning Environment (ALE), which is a platform for evaluating and comparing AI algorithms in playing Atari games. The goal is to evaluate the performance of BMG in comparison to another meta-RL agent, Self-Tuning Actor-Critic (STACX).

The comparison between BMG and STACX is done by evaluating their median Human Normalized Score (HNS). The HNS measures how well an AI agent performs relative to a human player: the agent’s score is normalized against the scores of a professional human tester and of a random policy, so that 0% corresponds to random play and 100% to human-level play.

The median HNS is used because it provides a better evaluation of the agent’s performance, as it is less sensitive to outliers.
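Here is a small illustration in Python of how a human-normalized score and its median could be computed. The numbers are made up, and I am assuming the common normalization against both human and random baselines rather than a bare ratio.

import numpy as np

def human_normalized_score(agent, human, random):
    # 0.0 corresponds to random play, 1.0 (i.e. 100%) to human-level play.
    return (agent - random) / (human - random)

# Made-up scores for three games (these are NOT the paper's numbers).
agent  = np.array([12000.0,  900.0,  50.0])
human  = np.array([ 7000.0, 1200.0,  40.0])
random = np.array([  300.0,  100.0,   5.0])

hns = human_normalized_score(agent, human, random)
print(np.round(hns * 100, 1))   # per-game HNS, in percent
print(np.median(hns) * 100)     # the median HNS, the summary statistic reported here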

The results show that BMG outperforms STACX by obtaining a median HNS of ~500% compared to ~350% for STACX. BMG’s performance is further improved by using both policy matching and value matching, obtaining a median HNS of ~520%.

The results also show that extending the meta-learning horizon to L = 4 results in an even higher score of ~610%. These results were obtained without hyperparameter tuning for BMG.

The figure shows how BMG improves over another agent called STACX. They tested BMG on the game Ms Pacman and looked at different ways to make it better.

The researchers discovered that BMG's success was attributed to three factors:

45% was from adjusting the learning speed

25% from focusing on learning the right information

30% from ensuring that the algorithm considered future outcomes.

Further experimentation with BMG showed that the best results were obtained by combining policy and value matching. The researchers also found that when BMG was made to consider future outcomes more, it performed faster than STACX.

MULTI-TASK FEW-SHOT LEARNING

Multi-task meta-learning is when a computer tries to learn different things at the same time.

The goal is to see how well the computer can do many tasks with only a few examples of each task. The computer is given a set of rules to follow, tries to find the best way to apply those rules to each task, and learns to do the tasks better by adjusting the rules based on how well it does on each one. The aim is to see whether the computer can keep improving at different tasks with only a few examples each time.
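To pin down what “5-way-5-shot” means in practice, here is a small Python sketch of how one such episode could be sampled from a labeled dataset; the dataset dictionary, class names, and counts are placeholders of mine, not the MiniImagenet pipeline used in the paper.

import random

def sample_episode(dataset, n_way=5, k_shot=5, n_query=15):
    """dataset: dict mapping class name -> list of examples."""
    classes = random.sample(sorted(dataset), n_way)          # pick 5 classes
    support, query = [], []
    for label, cls in enumerate(classes):
        examples = random.sample(dataset[cls], k_shot + n_query)
        support += [(x, label) for x in examples[:k_shot]]   # 5 labeled "shots" per class
        query += [(x, label) for x in examples[k_shot:]]     # held-out queries to evaluate on
    return support, query

# Placeholder "dataset" with 20 fake classes of 30 items each.
fake = {f"class_{i}": list(range(30)) for i in range(20)}
support, query = sample_episode(fake)
print(len(support), len(query))   # 25 support examples, 75 query examples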

The paper presents the results of a meta-learning algorithm on a MiniImagenet dataset. The performance of the algorithm is evaluated in terms of 5-way-5-shot classification accuracy and is reported in three different ways: as a function of the number of meta-training batches, wall-clock time, and the best reported performance for each value of K (a hyperparameter). Error bars are provided to show the standard deviation across 3 runs of the experiment.

BMG and MG are ways to help computers learn new things. They used something called MiniImagenet to test how well they work. They looked at how well they work with different amounts of training, and how long it takes to get good results.

They also tried different settings to see what works best. In the end, they found that BMG > MG: it works better with less training, gets there faster, and achieves better final results.

CONCLUSION

The paper is about a new way to do meta-learning, which is when a machine learns how to learn. Instead of using the usual way of figuring out how to learn, this new way matches what the machine wants to learn with what it should learn.

This new way is better because it makes the machine learn faster and better, and can even help the machine learn how to explore things on its own. The researchers tested this new way on some games and found that it works really well, even better than the old way.

The paper talks about ML algorithms and uses a lot of mathematical notation (of which, I’ll be honest, I have no clue what the majority of the Greek symbols mean), but the main idea is to describe a new method for updating the parameters of an ML model.

The method is bootstrapped meta-gradients (BMG), and it is designed to enhance the convergence of the learning process. The authors prove that their method leads to a decrease in the value of the objective function (the function being optimized) for small values of the step-size parameters. To simplify, BMG is a good way to update the model’s parameters and helps the learning process converge faster toward the optimal solution.

In the evaluations, BMG demonstrated performance improvements on the Atari ALE benchmark and enhanced model-agnostic meta-learning (MAML) in the few-shot setting, indicating the study’s potential to open up new possibilities for efficient meta-learning exploration.

For more details, please refer back to the research paper.
