Paper Explained — MAML Is a Noisy Contrastive Learner in Classification

Chia-Hsiang Kao (高家祥)
12 min read · Mar 22, 2022

--

We prove that model-agnostic meta-learning (MAML) is a noisy contrastive learning algorithm and propose a zeroing trick to mitigate the noise.

Paper: arxiv.org — [2106.15367] MAML is a Noisy Contrastive Learner in Classification

Github: GitHub — IandRover/MAML_noisy_contrasive_learner

Since Chelsea Finn first proposed MAML (model-agnostic meta-learning) in 2017, we have witnessed a large body of variants across a wide spectrum of tasks (e.g., regression, classification, reinforcement learning, …) and scenarios (e.g., adversarial learning, self-supervised learning, …). However, despite its immense popularity, I found that little work has been devoted to explaining why MAML works and where its power comes from.

It is well-known that MAML enables models to learn general-purpose representations. But why? And how? One related work, “Recasting Gradient-Based Meta-Learning as Hierarchical Bayes,” reformulated MAML as a Hierarchical Bayesian model and offered a high-level generalization of MAML. Unfortunately, for me, the explanation does not provide an intuitive and straightforward interpretation of the inner working mechanism of MAML.

In particular, I wondered:

  1. Why is MAML effective in learning general-purpose representations?
  2. What is the role of support and query data in MAML?
  3. What is the role of the inner loop and outer loop in MAML?

Thankfully, after much investigation, we found that, under a mild assumption, MAML is a noisy version of a supervised contrastive learning algorithm. Our work was later accepted to ICLR 2022 as a poster.

Our contribution to the community is threefold:

  1. We attribute the success of MAML in representation learning to the implicit contrastiveness inherent in its bi-level optimization scheme.
  2. We explain why second-order MAML is more powerful than first-order MAML from a contrastive learning perspective.
  3. We propose the “zeroing trick”, which mitigates the noise originating from vanilla MAML, and we conduct extensive experiments on its effectiveness.
(During the rebuttal, the reviewers requested that we add “in classification”, and we agreed.)

A Motivating Example

Here, I present a motivating example. I presume the readers are familiar with the few-shot learning classification setting and the bi-level algorithmic procedures of MAML.

In this motivating example, we use a model composed of an encoder (ϕ) and a linear classifier (parameterized by w) and consider the following conditions:

  1. MAML with one inner-loop iteration.
  2. Using MSE as the inner-loop and outer-loop loss.
  3. Setting the weights of the linear classifier to be zero (i.e., w=0) at the start of an outer-loop iteration (i.e., at the beginning of the first iteration of an inner loop).
  4. Five support samples {s1, s2, …, s5}, each belonging to a different class; in other words, si belongs to the ith class.
  5. One query sample {q1} belonging to the third class.
  6. Setting the inner-loop learning rate η to 1.

In the gif below, we illustrate what happens in one outer-loop iteration.

A step-by-step illustration showing the SCL objective underlying MAML. Assuming the parameters of the linear classifier w0 are zero, we find that, during the inner loop, the ith column of w0 is added with the support features of the ith class. In other words, the support features are memorized by the linear layer during the inner loop. In the outer loop, the output for a query sample is the inner product of ϕ(q1) and w1, which is essentially the set of inner products between the query features and all the support features. The outer-loop loss minimizes the MSE between these inner products and the one-hot label. Thus, MAML displays the characteristic of supervised contrastiveness. Moreover, a support sample acts as a positive sample when its label matches that of the query data, and as a negative sample otherwise.
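For concreteness, here is the arithmetic behind the illustration, written out under the conditions listed above (w0 = 0, η = 1, and assuming the MSE loss carries the usual 1/2 factor). This is a sketch of the single-inner-step case; the paper carries out the general derivation.

```latex
% Inner-loop loss on the five support samples, with one-hot labels e_i:
%   L_in(w) = (1/2) * sum_i || w^T phi(s_i) - e_i ||^2
\begin{align*}
\nabla_w L_{\text{in}}\big|_{w = w^0 = 0}
  &= \sum_{i=1}^{5} \phi(s_i)\big(w^{0\top}\phi(s_i) - e_i\big)^{\!\top}
   = -\sum_{i=1}^{5} \phi(s_i)\, e_i^{\top}, \\
w^1 &= w^0 - \eta\,\nabla_w L_{\text{in}} = \eta \sum_{i=1}^{5} \phi(s_i)\, e_i^{\top}
  \quad\Rightarrow\quad w^1_i = \eta\,\phi(s_i) \quad \text{(support data memorization)}, \\
\text{logit}_i(q_1) &= \phi(q_1)^{\top} w^1_i = \eta\,\phi(q_1)^{\top}\phi(s_i), \\
L_{\text{out}} &= \tfrac{1}{2}\sum_{i=1}^{5}\big(\eta\,\phi(q_1)^{\top}\phi(s_i) - \mathbf{1}[i=3]\big)^2 .
\end{align*}
```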

Below, we write down the exact procedures.

  1. In the inner-loop iteration, the outputs of the model are all zeros.
  2. In the inner-loop iteration, gradient descent updates the linear classifier by adding the embedding of the ith class's support data (scaled by η) to its ith column. We name this phenomenon “support data memorization”.
  3. In the outer-loop iteration, after forwarding q1, we discover that the output is the set of inner products between the embedding of q1 and the embeddings of {s1, s2, …, s5}. To be specific, as η = 1, the logit at the first channel is the inner product of ϕ(q1) and ϕ(s1).
  4. In the outer-loop iteration, at the first, second, fourth, and fifth channels, the outer-loop loss drives the logits toward zero, i.e., it minimizes (ϕ(q1) ‧ ϕ(si))² for i = 1, 2, 4, 5. At the third channel, the MSE objective drives the logit toward one, i.e., it minimizes (ϕ(q1) ‧ ϕ(s3) − 1)². (A small numerical sketch of these steps follows this list.)
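Here is a minimal NumPy sketch of the four steps above (illustrative only, not the authors' code; the feature dimension and variable names are assumptions). It checks numerically that, with a zeroed classifier and one inner step, the classifier columns become the support embeddings and the query logits become inner products.

```python
# Minimal NumPy sketch of the motivating example (illustrative, not the authors' code).
import numpy as np

rng = np.random.default_rng(0)
n_way, feat_dim, eta = 5, 16, 1.0                  # 5-way 1-shot, inner-loop lr η = 1

support_feat = rng.normal(size=(n_way, feat_dim))  # stand-ins for ϕ(s_1), ..., ϕ(s_5)
query_feat = rng.normal(size=feat_dim)             # stand-in for ϕ(q_1); true class = 3
labels = np.eye(n_way)                             # one-hot labels e_1, ..., e_5

# Steps 1-2: one inner-loop step of gradient descent on L_in = 1/2 Σ_i ||wᵀϕ(s_i) − e_i||²,
# starting from a zeroed classifier w0.
w0 = np.zeros((feat_dim, n_way))
inner_logits = support_feat @ w0                   # all zeros, because w0 = 0
grad_w = support_feat.T @ (inner_logits - labels)  # = −Σ_i ϕ(s_i) e_iᵀ
w1 = w0 - eta * grad_w                             # i-th column becomes η·ϕ(s_i)
assert np.allclose(w1, eta * support_feat.T)       # "support data memorization"

# Step 3: the query logits are inner products between ϕ(q_1) and each ϕ(s_i).
query_logits = query_feat @ w1
assert np.allclose(query_logits, eta * support_feat @ query_feat)

# Step 4: the outer-loop MSE loss pushes logit 3 toward 1 and the others toward 0.
outer_loss = 0.5 * np.sum((query_logits - labels[2]) ** 2)
print(query_logits, outer_loss)
```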

Amazingly, we have the following observations.

First, during the inner loop, the linear classifier memorizes the support embeddings. To be more specific, each column of the weights of the linear classifier is updated by adding the embedding of the corresponding support data.

  • Note that if we instead consider a 5-way 2-shot scenario with {s1, s2, …, s10}, where {s1, s2} belong to the first class, {s3, s4} belong to the second class, and so on, then each column of the linear classifier is updated by adding the sum of the embeddings of the corresponding class's support data. Yes, up to a scaling factor, this is the class centroid (written out after this list).
  • Also, note that the memorization phenomenon is ubiquitous as long as gradient descent is adopted. But the form of memorization can differ depending on the objective function one uses.
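Written out for the N-way K-shot case under the same conditions (zeroed classifier, MSE loss, one inner step), the update of the ith classifier column becomes a scaled class centroid. A sketch, following the same algebra as above:

```latex
% N-way K-shot, w^0 = 0, one inner step of (1/2)-MSE gradient descent:
w^1_i \;=\; w^0_i + \eta \!\!\sum_{j \,:\, y(s_j) = i} \!\!\phi(s_j)
       \;=\; \eta K \cdot \underbrace{\frac{1}{K}\sum_{j \,:\, y(s_j) = i} \phi(s_j)}_{\text{centroid of class } i\text{'s support features}} .
```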

Second, during the outer-loop iteration, the output of the model (which takes q1 as input) is the set of inner products between the query embedding and the support embeddings.

  • Now, we see how the support data and query data interact.
  • Please note that the inner product here should be understood not only as a multiplicative interaction but also as a measurement of similarity. Moreover, one can readily replace the inner product with other similarity measures or metrics, such as cosine similarity or negative Euclidean distance (a small illustration follows this list).
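To illustrate that last point, here are the drop-in alternatives mentioned above written as plain functions. This is a hypothetical sketch of the swap; the analysis in the blog and paper is carried out with the inner product.

```python
# Hypothetical drop-in similarity measures for the query/support interaction.
import numpy as np

def inner_product(q, s):
    return q @ s                                       # what MAML's linear head computes

def cosine_similarity(q, s, eps=1e-12):
    return (q @ s) / (np.linalg.norm(q) * np.linalg.norm(s) + eps)

def neg_euclidean_distance(q, s):
    return -np.linalg.norm(q - s)                      # larger = more similar
```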

Third, the outer loop loss actually performs supervised contrastive learning.

  • Generally speaking, the loss in supervised contrastive learning comprises two parts: a positive-sample loss and a negative-sample loss. The positive part aims to increase the similarity between (the embeddings of) two samples from the same class, whereas the negative part seeks to decrease the similarity between (the embeddings of) two samples from different classes.
  • In our case, the positive sample of q1 is s3, and the negative counterparts are s1, s2, s4, and s5.
  • Clearly, in the motivating example, the overall loss that MAML essentially adopts is a supervised contrastive loss (a sketch of this loss follows this list).
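Here is a small sketch (NumPy, my own naming) of the supervised-contrastive-style loss that the outer loop reduces to in the motivating example: the positive term pulls the query/support similarity toward one, and the negative terms push it toward zero.

```python
# Sketch of the supervised-contrastive (MSE-form) loss behind MAML's outer loop
# in the motivating example (zeroed classifier, one inner step, η = 1).
import numpy as np

def scl_mse_loss(query_feat, query_label, support_feats, support_labels):
    """query_feat: (d,); support_feats: (n, d); labels: integer class indices."""
    loss = 0.0
    for s_feat, s_label in zip(support_feats, support_labels):
        sim = query_feat @ s_feat              # inner-product similarity
        if s_label == query_label:
            loss += (sim - 1.0) ** 2           # positive pair: pull similarity toward 1
        else:
            loss += sim ** 2                   # negative pair: push similarity toward 0
    return 0.5 * loss
```

In the 5-way 1-shot example, s3 contributes the positive term and s1, s2, s4, s5 contribute the negative terms, which is exactly the outer-loop MSE loss written earlier.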

Extending to a Real Few-Shot Setting

Illustration of few-shot learning. From Understanding Few-Shot Learning in Computer Vision: What You Need to Know.

Two weeks before the deadline, we discovered that simply (1) zeroing the linear classifier and (2) considering the case of one inner-loop iteration makes MAML a supervised contrastive learner.

Based on this preliminary example, we moved on and asked: what if the classifier is not zeroed at the beginning?

Well, looking back at the motivating example, when the classifier is not “purified” at the start, the measurement of similarity between samples is no longer accurate. For example, suppose the weights of the classifier are random at the beginning. Then, during the inner-loop update, the columns of the linear classifier accumulate both this “noise” and the support features. Consequently, in the outer loop, the output of the model (which takes q1 as input) is contaminated by the noise, making the supervised contrastiveness inaccurate. Things get worse when multiple tasks are sampled in one outer-loop iteration, because the same noise is added to the support features of different classes, which further interferes with representation learning.
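Here is a sketch of the algebra behind this paragraph, assuming the encoder features are unchanged during the single inner step (the assumption formalized in the next section) and the same (1/2)-MSE convention as above. With a non-zero w0, the query logit at channel i splits into the contrastive term plus a noise term that depends on w0:

```latex
% One inner step with a non-zero initial classifier w^0 (sketch):
w^1_i \;=\; \Big(I - \eta \sum_{j} \phi(s_j)\phi(s_j)^{\top}\Big) w^0_i \;+\; \eta\,\phi(s_i),
\qquad\text{so}\qquad
\underbrace{\phi(q_1)^{\top} w^1_i}_{\text{logit } i}
 \;=\; \underbrace{\eta\,\phi(q_1)^{\top}\phi(s_i)}_{\text{contrastive term}}
 \;+\; \underbrace{\phi(q_1)^{\top}\Big(I - \eta \sum_{j}\phi(s_j)\phi(s_j)^{\top}\Big) w^0_i}_{\text{noise from } w^0} .
```

The noise term vanishes when w0 = 0, recovering the clean contrastive form of the motivating example.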

Still, more questions naturally arise from the motivating example. For example,

  1. What about the case of more inner-loop updates?
  2. What about using a softmax output?

For these questions, we refer the reader to the complete analysis in our paper.

EFIL Assumption

In the motivating example above, we could directly write down ϕ(q1) as the embedding of the query data because the encoder ϕ is not updated during the inner loop (the classifier is zeroed, so the error backpropagated to the encoder is zero). However, to analyze vanilla MAML thoroughly, we cannot zero the linear classifier and must deal with the fact that the encoder is updated during the inner loop.

Thankfully, the ICLR 2020 paper “Rapid Learning or Feature Reuse? Towards Understanding the Effectiveness of MAML” gave us great inspiration. In that work, the authors postulate that, since MAML enables the model to learn a generalizable encoder, the encoder may not need to be updated during the inner loop to finetune to a task. As a result, they propose the ANIL (Almost No Inner Loop) algorithm, in which the encoder is not updated during the inner loop.

They demonstrate that updating the encoder during the inner loop is largely unnecessary by showing that:

  1. Features of all layers except the linear head are highly similar before and after the inner loop. (The model here consists of a four-layer convolutional encoder and a linear classifier.)
Inner loop updates have little effect on learned representations from early on in learning. From Rapid Learning or Feature Reuse? Towards Understanding the Effectiveness of MAML.

2. The training trajectories and the performance of models trained with MAML and with ANIL are both similar.

MAML and ANIL learn very similarly. Loss and accuracy curves for MAML and ANIL on MiniImageNet-5way-5shot, illustrating how MAML and ANIL behave similarly through the training process. From Rapid Learning or Feature Reuse? Towards Understanding the Effectiveness of MAML.

Besides, we are aware of several concurrent works that only update the parameters of the linear classifier during the inner loop, such as R2-D2 and MetaOptNet.

Consequently, we decided to formulate the procedure that “the encoder is fixed during the inner loop” as an assumption, which we call the EFIL (encoder is frozen during the inner loop) assumption, and based on it we derive our primary theoretical results.
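As a concrete picture of what “the encoder is fixed during the inner loop” means in code, here is a minimal PyTorch-style sketch of a head-only inner loop (illustrative; the function and argument names are assumptions, not the authors' implementation). Only the linear head is adapted; the encoder parameters receive gradients only through the outer-loop loss.

```python
# Head-only (ANIL/EFIL-style) inner loop: only the linear classifier is adapted.
# Illustrative sketch; names are assumptions, not the paper's code.
import torch
import torch.nn.functional as F

def head_only_inner_loop(encoder, head_weight, support_x, support_y,
                         n_steps=1, lr_inner=1.0):
    """head_weight: (feat_dim, n_way) tensor with requires_grad=True."""
    feats = encoder(support_x)                       # ϕ(s); encoder params not updated here
    one_hot = F.one_hot(support_y, head_weight.shape[1]).float()
    w = head_weight
    for _ in range(n_steps):
        inner_loss = 0.5 * ((feats @ w - one_hot) ** 2).sum()   # MSE inner-loop loss
        # create_graph=True keeps the inner-loop path so the outer loop can take
        # second-order (SOMAML-style) meta-gradients through it.
        (grad_w,) = torch.autograd.grad(inner_loss, w, create_graph=True)
        w = w - lr_inner * grad_w                    # functional update of the head only
    return w                                         # adapted head; encoder untouched
```

The outer loop would then evaluate the query loss with the returned head and backpropagate into the encoder (and, for second-order MAML, also through the inner-loop path).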

Below, I illustrate the difference between MAML and ANIL in parameter space. In MAML, the parameters of both the encoder and the classifier can be updated freely in the inner loop (green curve: the trajectory of the parameters during the inner loop) and the outer loop (blue arrow: the gradient evaluated on query data during the outer loop). In ANIL, the encoder parameters are fixed in the inner loop (right figure: the green curve is horizontal, meaning that the encoder parameters do not change during the inner loop).

Complete Analysis

We show how the weight of the linear layer is updated during the inner loop and outer loop under the EFIL assumption. Then, we describe the loss that MAML uses as a noisy supervised contrastive objective function. Moreover, we show that if we zero the linear classifier at the first iteration of the inner loop, MAML is using a supervised contrastive objective function.

Again, we refer the reader to the complete analysis in our paper.

Why does SOMAML converge faster than FOMAML? A contrastive learning perspective.

Empirically speaking, we know that models trained using second-order MAML (SOMAML) converge faster than those trained using first-order MAML (FOMAML). Again, we ask why. From the update equation, the answer seems straightforward: because SOMAML exploits second-order information (the Hessian matrix) during the outer-loop update, the meta-gradient in SOMAML is more accurate than that in FOMAML.

Nevertheless, we would like to offer a novel viewpoint, where SOMAML is actually a more robust supervised contrastive learning algorithm. We adopt the motivating example above to illustrate the difference.

Here, we use φ to denote the parameters of the encoder. As shown below, the main distinction between the FOMAML loss and the SOMAML loss is an additional gs(‧) function in the FOMAML loss. Here, gs(‧) denotes the gradient-stopping operation, which treats its argument as a constant so that no gradient flows through it.

Here, we derive the exact loss and gradient that first-order MAML and second-order MAML adopt.

The distinction tells us that

  1. The goal of FOMAML is to update the features of the query data according to prototypes built from the support data.
  2. The goal of SOMAML is to simultaneously update the features of the query data and of the support data, using prototypes built from the support data and the query data, respectively.

In SOMAML, the encoder update considers not only how close the query features are to the support features, but also how close the support features are to the query features. In other words, the encoder does not treat the support features as fixed prototypes, but as features that can themselves be updated according to their distance to the prototypes built from the query features.
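In code, the gs(‧) distinction boils down to a single detach on the support features. Below is a hedged PyTorch sketch for the zeroed-classifier, one-inner-step setting (my notation, not the paper's code): with first_order=True, gradients reach the encoder only through ϕ(q1); with first_order=False, they also flow through the ϕ(si).

```python
# Sketch of the gs(·) distinction between FOMAML and SOMAML in the zeroed-classifier,
# one-inner-step setting (illustrative only).
import torch
import torch.nn.functional as F

def outer_loss(query_feat, support_feats, query_label, first_order, eta=1.0):
    """query_feat: (d,), support_feats: (n_way, d), both produced by the encoder."""
    s = support_feats.detach() if first_order else support_feats  # gs(·) in FOMAML
    logits = eta * (s @ query_feat)                  # inner products η·ϕ(q)·ϕ(s_i)
    target = F.one_hot(query_label, s.shape[0]).float()
    return 0.5 * ((logits - target) ** 2).sum()
```

Calling .backward() on this loss updates the encoder through the query features only in the first-order case, and through both the query and the support features in the second-order case, matching the two goals listed above.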

Below, we illustrate the difference between FOMAML and SOMAML in feature space.

Illustration of the distinction between FOMAML and SOMAML in the feature space. Conceptually speaking, the objective function of FOMAML aims to change only the features of the query data; in contrast, that of SOMAML seeks to change the features of the query data and of the support data simultaneously. In this figure, the support and query data features are plotted as circles. The different colors represent different classes. The solid and hollow arrows indicate the gradient calculated from positive and negative samples, respectively.

Results — A Simple Zeroing Trick Eliminates the Noise in MAML

In vanilla MAML, more inner-loop iterations generally yield better results.

Why?

From a contrastive learning perspective, we propose that a larger number of inner-loop updates mitigates the noise originating from a non-zero linear classifier. (Please refer to our derivation in our paper to get more insights.)

To strengthen our point, we show that when we perform the zeroing trick (i.e., zeroing the linear classifier at the first iteration of an inner loop), increasing the number of inner-loop updates no longer matters.

Why? Because there is no noise as long as we apply the zeroing trick, so the supervised contrastiveness is accurate.

As a result, we do not need to increase the number of inner-loop iterations when the zeroing trick is applied.
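In practice, the zeroing trick is a one-line reset of the linear head at the start of every inner loop (and, as argued later, at meta-testing as well). A minimal PyTorch sketch, assuming the head is an nn.Linear module:

```python
# The zeroing trick in isolation: reset the linear head before each inner loop.
# Minimal illustrative sketch; `classifier` is assumed to be an nn.Linear head.
import torch
import torch.nn as nn

def apply_zeroing_trick(classifier: nn.Linear) -> None:
    """Call once per task, right before the inner-loop adaptation."""
    with torch.no_grad():
        classifier.weight.zero_()
        if classifier.bias is not None:
            classifier.bias.zero_()

head = nn.Linear(64, 5)          # e.g., 64-dim features, 5-way classification
apply_zeroing_trick(head)
assert head.weight.abs().sum().item() == 0.0
```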

With the zeroing trick, a larger number of inner-loop update steps is not necessary. In the original first-order MAML, a larger number of inner-loop update steps is preferred, as it generally yields better testing accuracy even with the zeroing trick applied in the meta-testing stage (left figure). However, models trained with the zeroing trick do not show this trend (right figure).

Results — Effect of Initialization and the Zeroing Trick

Following the first experiment, a natural question arises: what if we zero the linear classifier not at the first iteration of every inner loop but only once at the beginning of training, i.e., what if we zero-initialize the linear classifier?

The result is surprisingly consistent.

We discovered that decreasing the norm of the linear classifier's weights at initialization increases the testing accuracy. This again reinforces our argument that a non-zeroed linear classifier is a source of interference.

Effect of initialization and the zeroing trick on testing performance. Both reducing the norm of w0 (the weight of the linear classifier) and zeroing w0 each outer loop (i.e., the zeroing trick) increase the testing accuracy. The curves in red: models with w0 randomly initialized. The curves in orange/green: reducing the value of w0 at initialization by a factor of 0.7/ 0.5. The curve in blue: w0 is zero-initialized. The curve in blue: models trained with the zeroing trick.

Results — A Call for Using the Zeroing Trick at the Testing Stage

In the meta-testing stage of few-shot learning, we want to know whether the model can quickly adapt to unseen classes of images (evaluated on the query data) after being shown a few examples of those classes (the support data). The procedure is simple. First, the support data and query data are drawn from a held-out meta-testing dataset. Then, the model is updated on the support data for a few steps (the inner loop). Finally, we compute the testing accuracy of the updated model on the query data.

In other words, the gradient descent operation is again required in the testing stage.
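A hedged sketch of that meta-testing loop is shown below (illustrative names, not the released code; cross-entropy is used for the test-time adaptation, and the encoder is kept frozen for simplicity, whereas vanilla MAML would adapt it too). The zero_head flag anticipates the question discussed next.

```python
# Sketch of one meta-testing episode (illustrative, not the authors' code).
import copy
import torch
import torch.nn.functional as F

def meta_test_episode(encoder, head, support_x, support_y, query_x, query_y,
                      zero_head=False, n_steps=10, lr_inner=0.01):
    head = copy.deepcopy(head)                    # adapt a per-episode copy of the head
    if zero_head:                                 # optional zeroing trick at test time
        with torch.no_grad():
            head.weight.zero_()
            if head.bias is not None:
                head.bias.zero_()
    opt = torch.optim.SGD(head.parameters(), lr=lr_inner)

    encoder.eval()
    with torch.no_grad():                         # encoder frozen here for simplicity
        s_feat, q_feat = encoder(support_x), encoder(query_x)

    for _ in range(n_steps):                      # inner loop on the support data
        opt.zero_grad()
        F.cross_entropy(head(s_feat), support_y).backward()
        opt.step()

    with torch.no_grad():                         # accuracy on the query data
        return (head(q_feat).argmax(dim=-1) == query_y).float().mean().item()
```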

Therefore, it is natural to ask: does the performance change if we zero the linear classifier at the testing stage?

Unfortunately, the answer is yes, and there is a significant gap.

Below, we show the testing performance of MAML following the conventional procedure (left) and the testing performance evaluated after zeroing the linear classifier at the first iteration of the inner loop in the meta-testing stage (right). We see a significant margin between the testing performance with and without the zeroing trick applied at the testing stage. This indicates that the norm of the linear classifier's weights can interfere with the testing performance.

If the weight norm can dominate the testing performance of models trained with MAML, this can be dangerous: applying existing regularization techniques during meta-training can shrink the norm of the linear classifier's weights and thereby yield better testing performance, so the reported testing performance no longer truly reflects the quality of the encoder.

To faithfully evaluate the encoder's capacity for learning generalizable representations, we urge practitioners to apply the zeroing trick during the testing stage.

Summary

This paper presents an extensive study to demystify how the seminal MAML algorithm guides the encoder to learn a general-purpose feature representation and how support and query data interact. Our analysis shows that MAML is implicitly a supervised contrastive learner, using the support features as positive and negative samples to direct the update of the encoder. Moreover, we unveil an interference term hidden in MAML, originating from random initialization or cross-task interaction, which can impede representation learning. Driven by our analysis, we remove the interference term with a simple zeroing trick, which renders the model unbiased to seen or unseen tasks. Furthermore, we show consistent improvements in the training and testing profiles with this zeroing trick, with experiments conducted on the mini-ImageNet and Omniglot datasets.

About the Author

Chia-Hsiang Kao is a medical student in Taiwan and an incoming CS Ph.D. student at Cornell University.

His goal is to develop robust and interpretable machine learning algorithms and systems that operate reliably even under challenging conditions. Along with his research goals, he is interested in model robustness, unbiased and generalized representation learning, explainable AI, and healthcare applications.
