## Diving Into MAML, Fast Adaptation of Deep Neural Networks | Towards AI

# How to Train MAML (Model-Agnostic Meta-Learning)

## An elaborate explanation for MAML and more

# Introduction

**M**odel-**A**gnostic **M**eta-**L**earning (MAML) has been growing more and more popular in the field of meta-learning since it was first introduced by Finn et al. in 2017. It is a simple, general, and effective optimization algorithm that places no constraints on the model architecture or the loss function. As a result, it can be combined with arbitrary networks and different types of loss functions, which makes it applicable to a wide variety of learning problems.

This article consists of two parts: we first explain MAML, presenting a detailed discussion and visualizing the learning process; then we describe some potential problems of the original MAML and address them following the work of Antoniou et al.[2].

# MAML

The idea behind MAML is simple: it optimizes for a set of parameters *θ* such that when a gradient step is taken with respect to a particular task *i*, the updated parameters *θᵢ’* are close to the optimal parameters for task *i*. The objective of this approach is therefore to learn an internal representation that is broadly applicable to all tasks in a task distribution *p(T)*, rather than to a single task. This is achieved by minimizing the total loss across tasks sampled from the task distribution *p(T)*:

*min_θ Σ_{Tᵢ∼p(T)} L_Tᵢ(f(θᵢ’)), where θᵢ’ = θ − α∇_θ L_Tᵢ(f(θ))*  (1)

Note that we do not actually define an additional set of variables *θᵢ’* here; *θᵢ’* is simply computed by taking one (or several) gradient step(s) from *θ* w.r.t. task *i* — this step is generally called inner-loop learning, in contrast to outer-loop learning, in which we optimize Eq.(1). For a better understanding, if we view the inner-loop learning as fine-tuning *θ* with respect to task *i*, then Eq.(1) says that we optimize *θ* in the expectation that the model does well on each task after the respective fine-tuning.

Another thing worth attention is that when optimizing Eq.(1), we eventually end up computing Hessian-vector products, which is costly even if one gets by with the conjugate gradient method. Finn et al. conducted experiments with a first-order approximation of MAML on supervised learning problems, in which these second derivatives are omitted (programmatically, this can be achieved by stopping gradients from flowing through *∇_θ L_Tᵢ(f(θ))*). Note that the resulting method still computes the meta-gradient at the post-update parameter values *θᵢ’*, which provides for effective meta-learning. Experiments demonstrated that the performance of this method is nearly the same as that obtained with full second derivatives, suggesting that most of the improvement in MAML comes from the gradients of the objective at the post-update parameter values, rather than from the second-order terms obtained by differentiating through the gradient update.
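To make the omitted term concrete, here is a toy sketch (our own 1-D construction, not from the paper) with task loss *L_c(θ) = ½(θ − c)²*, where both the full meta-gradient and its first-order approximation have closed forms:

```python
# Toy 1-D illustration (our own construction): task loss
# L_c(theta) = 0.5 * (theta - c)^2, so grad = theta - c and the Hessian = 1.
def task_grad(theta, c):
    return theta - c

def meta_grads(theta, c, alpha):
    theta_prime = theta - alpha * task_grad(theta, c)  # inner-loop step
    g_post = task_grad(theta_prime, c)                 # gradient at post-update params
    full = g_post * (1.0 - alpha)                      # chain rule: d(theta')/d(theta) = 1 - alpha * Hessian
    first_order = g_post                               # first-order MAML drops that factor
    return full, first_order

full, fo = meta_grads(theta=0.0, c=1.0, alpha=0.1)
# full ≈ -0.81 and first-order ≈ -0.9: same direction, slightly different scale
```

Both variants point the same way here; the first-order version merely ignores the *(1 − α·Hessian)* rescaling, which is exactly the second-derivative term MAML's approximation drops.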

## MAML Visualization

Note that if the inner loop learning is repeated *N* times, MAML only uses the final weights for outer loop learning. As we will see later, this could be troublesome, causing unstable learning when *N* is large.

## Algorithm

It should now be straightforward to follow the algorithm.
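As a minimal runnable sketch of the full training loop (using our own toy 1-D quadratic task family *L_c(θ) = ½(θ − c)²*, chosen so the meta-gradient through the inner step has a closed form and no autograd is needed):

```python
import random

# Minimal MAML loop on toy 1-D quadratic tasks L_c(theta) = 0.5*(theta - c)^2
# (our own illustrative setup: grad = theta - c and Hessian = 1).
def maml_train(steps=2000, alpha=0.4, beta=0.05, task_batch=4, seed=0):
    rng = random.Random(seed)
    theta = 5.0                                          # meta-initialization
    for _ in range(steps):
        tasks = [rng.uniform(-1.0, 3.0) for _ in range(task_batch)]  # sample from p(T)
        meta_grad = 0.0
        for c in tasks:
            theta_i = theta - alpha * (theta - c)        # inner loop: adapt to task c
            meta_grad += (theta_i - c) * (1.0 - alpha)   # differentiate through the inner step
        theta -= beta * meta_grad / task_batch           # outer loop: meta-update
    return theta

theta = maml_train()  # converges near the task mean E[c] = 1.0
```

The learned initialization settles near the mean of the sampled task optima, i.e. the point from which one inner gradient step gets closest to every task's optimum on average.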

# MAML++

In this section, we focus on several issues of the original MAML and present the corresponding potential solutions and the final algorithm, MAML++. All of these contributions were originally proposed by Antoniou et al.[2] at ICLR 2019.

## Training Instability

MAML training can be unstable depending on the neural network architecture and the overall hyperparameter setup. For example, Antoniou et al. found that simply replacing max-pooling layers with strided convolutional layers rendered training unstable, as Figure 1 suggests. They conjectured that the instability was caused by gradient degradation (either exploding or vanishing gradients), which in turn was caused by the depth of the unrolled network. To see this, take another look at the visualization of MAML. Assuming the network is a standard 4-layer convolutional network followed by a single linear layer, if we repeat the inner-loop learning *N* times, the unrolled inference graph comprises *5N* layers in total, without any skip-connections. Since the original MAML only uses the final weights for the outer-loop learning, backpropagation has to pass through all of these layers, which explains the gradient degradation.

**Solution: Multi-Step Loss Optimization (MSL)**

We can adopt an idea similar to GoogLeNet's auxiliary classifiers to ease the gradient degradation problem by computing the outer loss after every inner step. Specifically, we have the outer-loop update

*θ = θ − β∇_θ Σ_{Tᵢ∼p(T)} Σⱼ wⱼ·L_Tᵢ(f(θ_j^i))*

where *β* is the learning rate, *L_Tᵢ(f(θ_j^i))* denotes the outer loss of task *i* when using the base-network weights after *j* inner-step updates, and *wⱼ* denotes the importance weight of the outer loss at step *j*. We also visualize this process for better comparison.

In practice, we initialize all per-step losses with equal contributions to the total loss, and as training iterations increase, we decrease the contributions from earlier steps while slowly increasing the contribution of later steps. This ensures that, as training progresses, the final-step loss receives more attention from the optimizer, so that it reaches the lowest possible loss. Without this annealing, the final loss might be higher than with the original formulation.
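A sketch of one possible annealing schedule for the per-step weights *wⱼ* (the exact schedule here is our own illustrative choice; the text above only fixes its shape — uniform at first, mass shifting to the final step):

```python
# Annealed per-step importance weights w_j for MSL. Start uniform, then
# linearly shift all the mass onto the final inner step over
# `anneal_epochs` epochs (our own illustrative schedule).
def msl_weights(n_steps, epoch, anneal_epochs):
    frac = min(epoch / anneal_epochs, 1.0)
    uniform = [1.0 / n_steps] * n_steps
    final_only = [0.0] * (n_steps - 1) + [1.0]
    return [(1 - frac) * u + frac * f for u, f in zip(uniform, final_only)]

early = msl_weights(n_steps=5, epoch=0, anneal_epochs=100)   # uniform: all 0.2
late = msl_weights(n_steps=5, epoch=100, anneal_epochs=100)  # only the last step counts
```

The multi-step outer loss is then the weighted sum *Σⱼ wⱼ·L(θⱼ)* over the inner steps, so early in training every step's loss shapes the meta-gradient, while late in training the objective reduces to the original single final-step loss.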

## Second-Order Derivative Cost

MAML reduces the second-order derivative cost by ignoring it completely. This can impair the final generalization performance in some cases.

**Solution: Derivative-Order Annealing (DA)**

Antoniou et al.[2] propose using first-order gradients for the first 50 epochs of training and then switching to second-order gradients for the remainder of the training phase. An interesting observation is that this derivative-order annealing showed no incidents of exploding or vanishing gradients, in contrast to second-order-only MAML, which was more unstable. Using first-order derivatives before switching to second-order derivatives can thus serve as a strong pretraining method that learns parameters less likely to produce gradient explosion/diminishment issues.
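The switch itself is just a condition on the training epoch. Using a toy 1-D quadratic loss *L_c(θ) = ½(θ − c)²* (our own construction, so both gradient orders have closed forms), it might look like:

```python
# Derivative-order annealing on a toy quadratic loss (our own construction):
# L_c(theta) = 0.5*(theta - c)^2, so grad = theta - c and Hessian = 1.
def meta_gradient(theta, c, alpha, epoch, switch_epoch=50):
    theta_i = theta - alpha * (theta - c)  # inner-loop step
    g_post = theta_i - c                   # gradient at post-update params
    if epoch < switch_epoch:
        return g_post                      # first-order: cheap and stable early on
    return g_post * (1.0 - alpha)          # full second-order factor afterwards

g_early = meta_gradient(theta=0.0, c=1.0, alpha=0.1, epoch=10)  # first-order
g_late = meta_gradient(theta=0.0, c=1.0, alpha=0.1, epoch=60)   # second-order
```

In an autograd framework, the same switch would simply toggle whether gradients are tracked through the inner update when computing the meta-gradient.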

## Absence of Batch Normalization Statistic Accumulation

MAML does not use running statistics in batch normalization. Instead, the statistics of the current batch are used. This makes batch normalization less effective, since the learned parameters have to accommodate a variety of different means and standard deviations from different tasks.

**Solution: Per-Step Batch Normalization Running Statistics and Per-Step Batch Normalization Weights and Biases (BNRS + BNWB)**

A naive implementation of batch normalization in the context of MAML would accumulate running batch statistics across all update steps of the inner-loop learning. Unfortunately, this causes optimization issues and can slow down or altogether halt optimization. The problem stems from a misplaced assumption: by maintaining running statistics shared across all inner-loop updates of the network, we assume that the initial model and all of its updated iterations have similar feature distributions, which is far from correct. A better alternative is to store per-step running statistics and to learn per-step batch normalization parameters for each of the inner-loop iterations.
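A minimal NumPy sketch of per-step running statistics and per-step weights/biases (class and parameter names are our own; a real implementation would live inside each batch-norm layer of the base-network):

```python
import numpy as np

# Sketch of per-step BN running statistics (BNRS) plus per-step learnable
# weights/biases (BNWB): one statistics accumulator and one (gamma, beta)
# pair per inner-loop step, instead of a single shared set.
class PerStepBatchNorm:
    def __init__(self, n_steps, n_features, momentum=0.1, eps=1e-5):
        self.momentum, self.eps = momentum, eps
        self.running_mean = np.zeros((n_steps, n_features))
        self.running_var = np.ones((n_steps, n_features))
        self.gamma = np.ones((n_steps, n_features))  # per-step scale (learnable)
        self.beta = np.zeros((n_steps, n_features))  # per-step shift (learnable)

    def __call__(self, x, step, training=True):
        if training:  # use batch stats; update only this step's accumulators
            mean, var = x.mean(axis=0), x.var(axis=0)
            m = self.momentum
            self.running_mean[step] = (1 - m) * self.running_mean[step] + m * mean
            self.running_var[step] = (1 - m) * self.running_var[step] + m * var
        else:         # at test time, use this step's own running statistics
            mean, var = self.running_mean[step], self.running_var[step]
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma[step] * x_hat + self.beta[step]

bn = PerStepBatchNorm(n_steps=5, n_features=4)
y = bn(np.random.RandomState(0).randn(8, 4), step=1)  # normalize at inner step 1
```

Because each inner step indexes its own statistics and affine parameters, step 0's feature distribution no longer has to share accumulators with the distributions produced after adaptation.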

## Shared Inner Loop Learning Rate

One issue that affects both generalization and convergence speed is the use of a shared learning rate for all parameters and all update steps. A fixed learning rate requires multiple hyperparameter searches to find the right value for a specific dataset, which can be computationally costly depending on how the search is done. Moreover, while the gradient is an effective direction for fitting the data, a fixed learning rate can easily lead to overfitting in the few-shot regime.

**Solution: Learning Per-Layer Per-Step Learning Rates (LSLR)**

To avoid potential overfitting, one approach is to learn the learning rates themselves in a way that maximizes generalization rather than pure data fitting. Li et al.[3] propose learning a learning rate for each parameter of the base-network. The inner-loop update then becomes

*θᵢ’ = θ − α ∘ ∇_θ L_Tᵢ(f(θ))*

where *α* is a vector of learnable parameters with the same shape as *θ*, and ∘ denotes the element-wise product. The resulting method, Meta-SGD, has been demonstrated to achieve better generalization performance than MAML, at the cost of additional learnable parameters and computational overhead. Note that no positivity constraint is placed on the learning rate *α*; therefore, we should not expect the inner-update direction to always follow the gradient direction.
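The Meta-SGD inner update in a minimal NumPy sketch (the numbers are arbitrary; in Meta-SGD, *α* is itself meta-learned in the outer loop rather than fixed):

```python
import numpy as np

# Meta-SGD inner update: theta' = theta - alpha * grad, element-wise,
# with one learnable learning rate per parameter (values here are arbitrary).
theta = np.array([0.5, -1.0, 2.0])
grad = np.array([0.2, 0.4, -0.1])
alpha = np.array([0.1, 0.05, -0.02])  # may be negative: no positivity constraint

theta_prime = theta - alpha * grad    # element-wise product
# theta_prime[2] moves *along* the gradient because alpha[2] < 0
```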

Considering the cost induced by Meta-SGD, Antoniou et al.[2] propose learning one learning rate per layer of the network, as well as different learning rates for each adaptation step of the base-network. For example, assuming the base-network has *L* layers and the inner-loop learning consists of *N* update steps, we introduce *LN* additional learnable parameters for the inner-loop learning rates.
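A sketch of how little bookkeeping LSLR adds (layer names are illustrative, assuming the 4-conv + 1-linear base-network and *N = 5* inner steps):

```python
# LSLR bookkeeping sketch: one learnable learning rate per (layer, step)
# pair, i.e. L*N scalars, versus Meta-SGD's one per parameter.
layers = ["conv1", "conv2", "conv3", "conv4", "linear"]  # L = 5 (illustrative names)
n_steps = 5                                              # N = 5 inner steps
init_lr = 0.01                                           # shared initial value

lslr = {(layer, step): init_lr for layer in layers for step in range(n_steps)}
n_extra = len(lslr)  # L*N = 25 extra learnable scalars
```

During the inner loop, step *j*'s update for a given layer would use `lslr[(layer, j)]` in place of the single shared *α*; the outer loop then updates these scalars alongside *θ*.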

## Fixed Outer Loop Learning Rate

MAML uses Adam with a fixed learning rate to optimize the meta-objective. It has been shown in the literature that an annealed learning rate is crucial for generalization performance. Furthermore, a fixed learning rate means one may have to spend more time tuning it.

**Solution: Cosine Annealing of the Meta-Optimizer Learning Rate (CA)**

Antoniou et al.[2] propose applying cosine annealing scheduling (Loshchilov & Hutter[4]) to the meta-optimizer's learning rate. The schedule is defined as

*β = β_min + ½(β_max − β_min)(1 + cos(Tπ/T_max))*

where *β_min* denotes the minimum learning rate, *β_max* denotes the initial learning rate, *T* is the current iteration, and *T_max* is the maximum number of iterations. When *T = 0*, the learning rate is *β = β_max*; once *T = T_max*, *β = β_min*. In practice, we may want to clamp *T* at *T_max* to avoid a warm restart.
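A direct implementation of this schedule, with *T* clamped at *T_max* to avoid a restart (a minimal sketch):

```python
import math

# Cosine annealing schedule for the meta-optimizer learning rate,
# with T clamped at T_max so the schedule never restarts.
def cosine_anneal(t, t_max, beta_max, beta_min=0.0):
    t = min(t, t_max)  # clamp: stay at beta_min after t_max
    return beta_min + 0.5 * (beta_max - beta_min) * (1 + math.cos(math.pi * t / t_max))

start = cosine_anneal(0, 100, beta_max=0.001)  # = beta_max
end = cosine_anneal(100, 100, beta_max=0.001)  # = beta_min
```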

# Experimental Results of MAML++

Finally, we present some experimental results for completeness.

We first show individual improvement on the MAML in the 20-way Omniglot tasks.

We can see that per-step batch normalization weights and running statistics (BNWB + BNRS) and learning per-layer per-step learning rates (LSLR) yield the largest gains.

We also show results on Mini-ImageNet tasks.

We can see that as the number of inner steps increases, performance improves a bit. Noticeably, even 1-step MAML++ outperforms the original 5-step MAML.

As shown in Figure 1 (at the beginning of our discussion of MAML++), MAML++ also converges much faster to its best generalization performance than MAML does.

# References

1. Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks.
2. Antreas Antoniou, Harrison Edwards, and Amos Storkey. How To Train Your MAML.
3. Zhenguo Li, Fengwei Zhou, Fei Chen, and Hang Li. Meta-SGD: Learning to Learn Quickly for Few-Shot Learning.
4. Ilya Loshchilov and Frank Hutter. SGDR: Stochastic Gradient Descent with Warm Restarts.