MAE. Masked Autoencoder is all you need for any modality — Method Summary

Alexander Kovalev
the last neural cell
Sep 6, 2022
An overview of how this advanced deep learning technique works

⚡ Introduction

To solve complicated tasks, a machine learning algorithm should understand the data and extract useful features from it. Training models that generalize well usually requires a lot of annotated data, which is expensive to collect and in some cases impossible to obtain.

The Masked Autoencoder technique allows you to train a model on unlabeled data and obtain surprisingly good feature representations for all common modalities.

  • BERT - MAE for text
  • MAE - images
  • M3MAE - images + text
  • MAE that Listen - audio spectrograms
  • VideoMAE - video

🔎 Contents

  1. Explanation of the MAE approach.
  2. Recipe for all domains.
  3. Crazy experimental results for all types of data.

⚡ Briefly

Masked Autoencoder is a great and simple technique for pretraining a transformer on any modality. It produces a high-level representation of the data, which is very helpful when adapting the model to any downstream task (transfer learning, fine-tuning).

🚀 Motivation

Self-supervised learning is an approach for obtaining informative representations of data without any labels. Standard self-supervised learning techniques usually rely on advanced augmentation strategies. However, for modalities such as text, audio, brain signals, etc., choosing augmentations can be a very tricky task.

The masked autoencoder, in contrast, lets you avoid thinking about augmentations altogether. You just need data, a lot of data, and computational resources.

💡 Masked autoencoder solves a reconstruction task: predict the whole chunk of data from a masked sample.

When around 70% of the data is masked, the model is forced to learn a good high-level representation to solve this task, and that is where the profit comes from.

⚙️ How does MAE work?

Let us illustrate the working principle of MAE. Take a look at the picture below:

Figure 1. MAE for image self-supervised learning

First, during pre-training a large fraction of patches (e.g., 75%) is masked out, and the encoder is applied only to the visible patches. Then, after the encoder, mask tokens are introduced, and the full set of encoded patches and mask tokens is processed by a small decoder that reconstructs the original image. After pre-training, for recognition tasks the encoder is applied to uncorrupted images (full sets of patches).
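To make the masking step concrete, here is a minimal PyTorch sketch of splitting an image into patches and keeping a random 25% of them. The tensor shapes and helper names (`patchify`, `random_masking`) are my own illustration, not the authors' code:

```python
import torch

def patchify(images, patch_size=16):
    """Split a batch of images (B, C, H, W) into flattened patches (B, N, patch_size**2 * C)."""
    B, C, H, W = images.shape
    h, w = H // patch_size, W // patch_size
    x = images.reshape(B, C, h, patch_size, w, patch_size)
    x = x.permute(0, 2, 4, 3, 5, 1).reshape(B, h * w, patch_size * patch_size * C)
    return x

def random_masking(patches, mask_ratio=0.75):
    """Keep a random subset of patches; return them with the shuffle indices needed later."""
    B, N, D = patches.shape
    len_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                      # one random score per patch
    ids_shuffle = torch.argsort(noise, dim=1)     # patches with the lowest scores are kept
    ids_keep = ids_shuffle[:, :len_keep]
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return visible, ids_shuffle, len_keep

imgs = torch.randn(8, 3, 224, 224)                                 # dummy batch
visible, ids_shuffle, len_keep = random_masking(patchify(imgs))
print(visible.shape)                                               # torch.Size([8, 49, 768])
```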

Recipe for any common modality:

  1. Take a data sample.
  2. Divide the sample into regions (patches for images, words for text, and so on).
  3. Apply random masking with a high ratio (75% is a good starting point).
  4. Keep only the visible parts and feed them into the transformer encoder. For an introduction to vision transformers, you can jump to our first summary.
  5. Apply the decoder to the full set of mask tokens and encoded visible tokens. Train the model to reconstruct the masked tokens.
  6. Repeat steps (1–5) many times.
  7. ???
  8. Profit! You have a good model that can extract meaningful features from your data 😍

Next, you can use the pretrained encoder, which has learned useful representations of your data, for any downstream task.

💡 Note that after training, unmasked (full) samples can be fed into the model, because the transformer architecture does not depend on the length of the input.
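Putting the recipe together, below is a rough, simplified sketch of the MAE pre-training forward pass. It reuses the `patchify`/`random_masking` helpers sketched above; `TinyMAE`, the layer sizes, and the use of stock `nn.TransformerEncoder` blocks are my own simplifications, not the official implementation:

```python
import torch
import torch.nn as nn

class TinyMAE(nn.Module):
    """Simplified MAE: encode visible patches, decode the full sequence with mask tokens."""
    def __init__(self, patch_dim=768, enc_dim=256, dec_dim=128, num_patches=196):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, enc_dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, enc_dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(enc_dim, nhead=8, batch_first=True), num_layers=4)
        self.enc_to_dec = nn.Linear(enc_dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        self.dec_pos_embed = nn.Parameter(torch.zeros(1, num_patches, dec_dim))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dec_dim, nhead=8, batch_first=True), num_layers=2)
        self.head = nn.Linear(dec_dim, patch_dim)          # predict raw patch pixels

    def forward(self, patches, ids_shuffle, len_keep):
        B, N, _ = patches.shape
        ids_restore = torch.argsort(ids_shuffle, dim=1)
        # 1) embed all patches, then keep (and encode) only the visible ones
        x = self.patch_embed(patches) + self.pos_embed
        ids_keep = ids_shuffle[:, :len_keep]
        x_visible = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, x.size(-1)))
        latent = self.encoder(x_visible)
        # 2) append mask tokens and restore the original patch order
        y = self.enc_to_dec(latent)
        mask_tokens = self.mask_token.expand(B, N - len_keep, -1)
        y = torch.cat([y, mask_tokens], dim=1)
        y = torch.gather(y, 1, ids_restore.unsqueeze(-1).expand(-1, -1, y.size(-1)))
        # 3) decode the full sequence and predict the original patches
        return self.head(self.decoder(y + self.dec_pos_embed))

model = TinyMAE()
pred = model(patchify(imgs), ids_shuffle, len_keep)        # (8, 196, 768)
```

For fine-tuning or feature extraction, only `patch_embed`, `pos_embed`, and `encoder` would be kept and fed the full, unmasked patch sequence, exactly as the note above says.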

🔬 What is MAE capable of?

MAEs can be easily adapted to different data modalities. Below you can find illustrations of how MAE is used in video and audio applications. If you are inspired by these, don't hesitate to adopt this technique for your data.

MAE for audio spectrogram
MAE for video

Here you can find the authors' comments about their technique for the audio and video modalities:

Audio modality

We have explored a simple extension of MAE to audio data. Our Audio-MAE learns to reconstruct masked spectrogram patches from audio recordings and achieves state-of-the-art performance on six audio and speech classification tasks. We have drawn four interesting observations: First, a simple MAE approach works surprisingly well for audio spectrograms. Second, we find that it is possible to learn stronger representations with local self-attention in the decoder. Third, we show that masking can be applied to both pre-training and fine-tuning, improving accuracy and reducing training computation
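To give a feel for what "reconstructing masked spectrogram patches" means in practice, here is a rough sketch of turning a waveform into a log-mel spectrogram and then treating it like an image, i.e. splitting it into patches for the same masking recipe. The parameter values (16 kHz, 128 mel bins, 16x16 patches) are illustrative assumptions, not necessarily the paper's exact configuration:

```python
import torch
import torchaudio

# waveform -> log-mel spectrogram -> image-like patches
wav = torch.randn(1, 16000 * 10)                            # dummy 10-second clip at 16 kHz
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=128)(wav)
log_mel = torch.log(mel + 1e-6)                             # (1, 128 mel bins, ~1000 time frames)

patch = 16
mels, frames = log_mel.shape[1] // patch, log_mel.shape[2] // patch
patches = (log_mel[:, :mels * patch, :frames * patch]       # crop to a whole number of patches
           .reshape(1, mels, patch, frames, patch)
           .permute(0, 1, 3, 2, 4)
           .reshape(1, mels * frames, patch * patch))
print(patches.shape)                                        # (1, 496, 256): mask most of them and feed the MAE
```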

Video modality

To do video self-supervised learning, VideoMAE uses a masked autoencoder and a plain ViT backbone. Compared to contrastive learning methods, VideoMAE has a much shorter pre-training time (3.2x speedup). In future research on self-supervised video training, VideoMAE might be a good starting point.
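The main video-specific twist in VideoMAE is how the masking is done: it uses "tube" masking, where the same spatial positions are masked in every frame, with a very high masking ratio (around 90%). Below is a rough sketch of that idea; the function and tensor layout are my own illustration, not the official code:

```python
import torch

def tube_masking(video_tokens, num_frames, patches_per_frame, mask_ratio=0.9):
    """Mask the same spatial positions in every frame ('tubes').

    video_tokens: (B, num_frames * patches_per_frame, D), ordered frame by frame.
    """
    B, N, D = video_tokens.shape
    assert N == num_frames * patches_per_frame
    len_keep = int(patches_per_frame * (1 - mask_ratio))
    noise = torch.rand(B, patches_per_frame)                 # one score per spatial position
    ids_keep_spatial = torch.argsort(noise, dim=1)[:, :len_keep]         # (B, len_keep)
    # replicate the kept spatial positions across every frame
    offsets = torch.arange(num_frames).view(1, num_frames, 1) * patches_per_frame
    ids_keep = (ids_keep_spatial.unsqueeze(1) + offsets).reshape(B, -1)  # (B, num_frames * len_keep)
    visible = torch.gather(video_tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return visible, ids_keep

tokens = torch.randn(2, 8 * 196, 128)                        # 8 frames, 14x14 patches each
visible, _ = tube_masking(tokens, num_frames=8, patches_per_frame=196)
print(visible.shape)                                         # (2, 8 * 19, 128)
```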

🔥 Experimental insights:

It is intriguing that all these MAE techniques beat the SOTA results in their corresponding domains.

Image results

Reconstruction results for image MAE

Audio results

Reconstruction results for audio spectrogram MAE

Results for video.

This figure emphasizes that MAE improves performance when access to large amounts of labeled data is limited.

Training details

Here I add some information about the training details. You can use it as a starting point in your own investigation. The loss function (MSE) is calculated only on the masked (invisible) tokens.
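A minimal sketch of that loss, assuming the predictions, targets, and a binary mask are already arranged per patch (the helper name and shapes are mine):

```python
import torch

def masked_mse_loss(pred, target, mask):
    """pred, target: (B, N, D) predicted / original patches; mask: (B, N), 1 where a patch was masked."""
    per_patch = ((pred - target) ** 2).mean(dim=-1)   # reconstruction error per patch
    return (per_patch * mask).sum() / mask.sum()      # average only over masked patches
```

In the earlier sketches, `mask` can be built by marking all patches as 1, setting the first `len_keep` positions of the shuffled order to 0, and un-shuffling with `ids_restore`.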

Table: pre-training settings for image MAE

Table: pre-training settings for VideoMAE

❓ Pros and cons

Pros

  • Easy to apply to any modality, and maybe even to several at once
  • Allows you to adapt a transformer to your specific task

Cons

  • This approach works only with transformers and does not work well with other architectures.
  • It requires a lot of data and computational resources.

📝 My notes

This is a really cool, powerful technique that can be used in any domain, and here is why I think it works. During reconstruction, the model has to capture a high-level representation of the input sample in order to reconstruct 70% of the data; this is the only way the decoder can recover the original image.

Some thoughts about further uses of this idea:

  • It is a good opportunity to start using transformers to solve neuroscience problems. In addition, we can forget about convolutions 😊
  • Using MAE, we can represent multivariate time series the same way as audio and apply the same pretraining. This might help a lot in solving downstream tasks with a small amount of labeled data.
  • The transformer generalizes well and can provide interpretation of its hidden transformations.
  • Add an additional loss function or constraint. This would allow us to train the model to reconstruct the data while separating representations for different people or groups.

🧬 Possible development & Perspectives

  • Multimodality → MultiMAE: Multi-modal Multi-task Masked Autoencoders
  • We can study how multimodal neurons influence performance, so it might be a promising area for collaboration with neuroscience.
  • Combine contrastive learning (augmentations) with MAE → Contrastive Masked Autoencoders are Stronger Vision Learners → https://arxiv.org/pdf/2207.13532.pdf

😉 Stay tuned and see you in our medium publication

Author: Alexander Kovalev

Collaborator: Alexey Timchenko

Our telegram channel: the last neural cell

If you would like to use this material, please refer to the main “the last neural cell” publication or to the author. A lot of work goes into creating these concise summaries of interesting papers on a non-commercial basis.

Alexander Kovalev: CEO of ALVI Labs | Machine learning engineer | Brain-computer interfaces researcher 🧠