Vision Transformers (ViT) for Self-Supervised Representation Learning: Masked Autoencoders
Masked Autoencoders Are Scalable Vision Learners
I am back with the series on ViTs for Self-Supervised Representation Learning. ViTs have become extremely popular, and a lot of effort is going into pushing the boundaries of neural networks in this field with them. My last article covered three papers together; this time, I decided to dedicate the piece to one particular work that I really liked, and one that is quite distinct from the previous ones. I am leaving a link to my previous article below in case you have not read it yet (:
Approach
The proposed masked autoencoder (MAE) simply reconstructs the original data given its partial observation. Like all autoencoders, it has an encoder that maps the observed signal to a latent representation and a decoder that reconstructs the original signal from that latent representation. However, the proposed architecture has an asymmetric design: the encoder operates only on the partial, observed signal (the visible patches, without mask tokens), while a lightweight decoder reconstructs the full signal from the latent representation together with mask tokens.
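To make the asymmetric idea concrete, here is a minimal sketch of what such a design could look like in PyTorch. It is not the official implementation: all names are illustrative, positional embeddings and the reconstruction loss are omitted, and it assumes the image has already been turned into a sequence of patch embeddings. The key points it shows are (1) per-sample random masking of patches and (2) an encoder that sees only the visible patches, with mask tokens introduced only at the small decoder.

```python
import torch
import torch.nn as nn

# Simplified, hypothetical sketch of MAE's asymmetric encoder-decoder (not the official code).
# Assumes `patches` is a batch of patch embeddings of shape (batch, num_patches, dim).

def random_masking(x, mask_ratio=0.75):
    """Keep a random subset of patches per sample; return kept patches and restore indices."""
    B, N, D = x.shape
    len_keep = int(N * (1 - mask_ratio))

    noise = torch.rand(B, N, device=x.device)        # one random score per patch
    ids_shuffle = torch.argsort(noise, dim=1)        # ascending: lowest-noise patches are kept
    ids_restore = torch.argsort(ids_shuffle, dim=1)  # inverse permutation, to undo the shuffle

    ids_keep = ids_shuffle[:, :len_keep]
    x_visible = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return x_visible, ids_restore, len_keep


class TinyMAE(nn.Module):
    """Asymmetric autoencoder: a larger encoder on visible patches, a small decoder on all tokens."""
    def __init__(self, dim=128, dec_dim=64, patch_pixels=16 * 16 * 3):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=4)
        self.enc_to_dec = nn.Linear(dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dec_dim, nhead=4, batch_first=True), num_layers=1)
        self.to_pixels = nn.Linear(dec_dim, patch_pixels)  # predict raw pixels for each patch

    def forward(self, patches, mask_ratio=0.75):
        x_visible, ids_restore, len_keep = random_masking(patches, mask_ratio)
        latent = self.encoder(x_visible)                   # encoder never sees mask tokens

        # Decoder input: projected visible latents + learned mask tokens, unshuffled to original order.
        dec_in = self.enc_to_dec(latent)
        B, N = ids_restore.shape
        masks = self.mask_token.expand(B, N - len_keep, -1)
        full = torch.cat([dec_in, masks], dim=1)
        full = torch.gather(full, 1, ids_restore.unsqueeze(-1).expand(-1, -1, full.shape[-1]))
        return self.to_pixels(self.decoder(full))          # reconstruction of every patch


# Usage: 8 images worth of 196 patch embeddings (a ViT-style 14x14 grid), 128-dim each.
patches = torch.randn(8, 196, 128)
recon = TinyMAE()(patches)   # shape: (8, 196, 768) pixel predictions
```

Because the encoder only processes roughly a quarter of the patches at the default 75% mask ratio, most of the compute is spent where it matters, and the decoder can stay small since it is discarded after pre-training.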