Vision Transformers (ViT) for Self-Supervised Representation Learning: Masked Autoencoders
Masked Autoencoders Are Scalable Vision Learners
I am back with the series on ViTs for Self-Supervised Representation Learning. ViTs have become extremely popular, and a lot of effort is going into pushing the boundaries of neural networks in this field with them. My last article covered three papers together; this time, I decided to dedicate the piece to one particular work that I really liked, and one that is quite distinct from the previous ones. I am leaving a link to my previous article below in case you have not read it yet (:
Approach
The proposed masked autoencoder (MAE) simply reconstructs the original data given its partial observation. Like all autoencoders, it has an encoder that maps the observed signal to a latent representation and a decoder that reconstructs the original signal from that latent representation. However, the proposed architecture has an asymmetric design: the encoder operates only on the partial, observed signal (the visible patches, without mask tokens), while a lightweight decoder reconstructs the full signal from the latent representation together with mask tokens.
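To make the asymmetric idea concrete, here is a minimal sketch of what such a design could look like in PyTorch. It is not the official implementation: all names are illustrative, positional embeddings and the reconstruction loss are omitted, and it assumes the image has already been turned into a sequence of patch embeddings. The key points it shows are (1) per-sample random masking of patches and (2) an encoder that sees only the visible patches, with mask tokens introduced only at the small decoder.

```python
import torch
import torch.nn as nn

# Simplified, hypothetical sketch of MAE's asymmetric encoder-decoder (not the official code).
# Assumes `patches` is a batch of patch embeddings of shape (batch, num_patches, dim).

def random_masking(x, mask_ratio=0.75):
    """Keep a random subset of patches per sample; return kept patches and restore indices."""
    B, N, D = x.shape
    len_keep = int(N * (1 - mask_ratio))

    noise = torch.rand(B, N, device=x.device)        # one random score per patch
    ids_shuffle = torch.argsort(noise, dim=1)        # ascending: lowest-noise patches are kept
    ids_restore = torch.argsort(ids_shuffle, dim=1)  # inverse permutation, to undo the shuffle

    ids_keep = ids_shuffle[:, :len_keep]
    x_visible = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return x_visible, ids_restore, len_keep


class TinyMAE(nn.Module):
    """Asymmetric autoencoder: a larger encoder on visible patches, a small decoder on all tokens."""
    def __init__(self, dim=128, dec_dim=64, patch_pixels=16 * 16 * 3):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=4)
        self.enc_to_dec = nn.Linear(dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dec_dim, nhead=4, batch_first=True), num_layers=1)
        self.to_pixels = nn.Linear(dec_dim, patch_pixels)  # predict raw pixels for each patch

    def forward(self, patches, mask_ratio=0.75):
        x_visible, ids_restore, len_keep = random_masking(patches, mask_ratio)
        latent = self.encoder(x_visible)                   # encoder never sees mask tokens

        # Decoder input: projected visible latents + learned mask tokens, unshuffled to original order.
        dec_in = self.enc_to_dec(latent)
        B, N = ids_restore.shape
        masks = self.mask_token.expand(B, N - len_keep, -1)
        full = torch.cat([dec_in, masks], dim=1)
        full = torch.gather(full, 1, ids_restore.unsqueeze(-1).expand(-1, -1, full.shape[-1]))
        return self.to_pixels(self.decoder(full))          # reconstruction of every patch


# Usage: 8 images worth of 196 patch embeddings (a ViT-style 14x14 grid), 128-dim each.
patches = torch.randn(8, 196, 128)
recon = TinyMAE()(patches)   # shape: (8, 196, 768) pixel predictions
```

Because the encoder only processes roughly a quarter of the patches at the default 75% mask ratio, most of the compute is spent where it matters, and the decoder can stay small since it is discarded after pre-training.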