Vision Transformers (ViT) for Self-Supervised Representation Learning: Masked Autoencoders

Ching (Chingis) · Published in Deem.blogs · 5 min read · Jun 21, 2022

Masked Autoencoders Are Scalable Vision Learners (He et al., 2021)

I am back with my series on ViTs for self-supervised representation learning. ViTs are becoming extremely popular, and a lot of effort is going into pushing the boundaries of neural networks in this field through ViTs. My last article covered three papers together; this time, however, I decided to dedicate the piece to one particular work that I really liked, and one that is quite unique compared to the previous ones. I am leaving a link to my previous article below in case you have not read it yet (:

Approach

(Figure from the paper.)

The proposed masked autoencoder (MAE) simply reconstructs the original data given its partial observation. Like all autoencoders, it has an encoder that maps the observed signal to a latent representation and a decoder that reconstructs the original signal from that latent representation. However, the proposed architecture has an asymmetric design that allows the encoder to process only the partial, observed signal (without mask tokens), while a lightweight decoder reconstructs the full input from the latent representation together with mask tokens.
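
To make this concrete, here is a minimal PyTorch sketch of the two ideas above: random masking that keeps only a subset of patch embeddings, and an asymmetric encoder-decoder where the encoder sees the visible patches only. This is my own simplification, not the authors' code; the class name TinyMAE and all layer sizes are illustrative, the 75% mask ratio matches the ratio the paper favors, and positional embeddings are omitted for brevity.

```python
# Minimal sketch of MAE-style masking and asymmetric encoding in PyTorch.
# Names and sizes are illustrative, not the paper's exact configuration.
import torch
import torch.nn as nn


def random_masking(patches: torch.Tensor, mask_ratio: float = 0.75):
    """Keep a random subset of patches; return them with restore indices.

    patches: (batch, num_patches, dim) tensor of embedded image patches.
    """
    b, n, d = patches.shape
    num_keep = int(n * (1 - mask_ratio))

    # One random score per patch; the lowest-scoring patches are kept.
    noise = torch.rand(b, n, device=patches.device)
    ids_shuffle = noise.argsort(dim=1)        # random permutation per sample
    ids_restore = ids_shuffle.argsort(dim=1)  # inverse permutation

    ids_keep = ids_shuffle[:, :num_keep]
    visible = torch.gather(
        patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d)
    )

    # Binary mask over all patches in original order: 0 = visible, 1 = masked.
    mask = torch.ones(b, n, device=patches.device)
    mask[:, :num_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)
    return visible, mask, ids_restore


class TinyMAE(nn.Module):
    """Asymmetric encoder-decoder: the encoder sees only visible patches.

    Positional embeddings are omitted for brevity; the real MAE adds them
    in both the encoder and the decoder.
    """

    def __init__(self, dim=64, dec_dim=32, patch_pixels=768):
        super().__init__()
        self.patch_embed = nn.Linear(patch_pixels, dim)
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)

        # Lightweight decoder operating on the full (padded) sequence.
        self.enc_to_dec = nn.Linear(dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        dec_layer = nn.TransformerEncoderLayer(dec_dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=1)
        self.to_pixels = nn.Linear(dec_dim, patch_pixels)

    def forward(self, patches, mask_ratio=0.75):
        x = self.patch_embed(patches)
        visible, mask, ids_restore = random_masking(x, mask_ratio)

        # The encoder runs on the ~25% visible patches only.
        latent = self.encoder(visible)

        # Pad with mask tokens, unshuffle to original order, decode everything.
        y = self.enc_to_dec(latent)
        b, n = mask.shape
        mask_tokens = self.mask_token.expand(b, n - y.shape[1], -1)
        y = torch.cat([y, mask_tokens], dim=1)
        y = torch.gather(
            y, 1, ids_restore.unsqueeze(-1).expand(-1, -1, y.shape[-1])
        )
        pred = self.to_pixels(self.decoder(y))

        # Reconstruction loss computed on masked patches only, as in the paper.
        loss = ((pred - patches) ** 2).mean(dim=-1)
        loss = (loss * mask).sum() / mask.sum()
        return loss, pred, mask
```

A quick smoke test of the sketch:

```python
model = TinyMAE()
imgs = torch.randn(8, 49, 768)  # 8 images as 49 flattened 16x16x3 patches
loss, _, _ = model(imgs)
loss.backward()
```

Because the encoder only ever sees roughly a quarter of the patches, the expensive part of the network scales with the visible tokens rather than the full sequence, which is exactly what makes this design cheap to pre-train.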



I am a passionate student. I enjoy studying and sharing my knowledge. Follow or connect with me and join my journey.