A Leap Forward in Computer Vision: Facebook AI Says Masked Autoencoders Are Scalable Vision Learners
The paper Masked Autoencoders Are Scalable Vision Learners, published this week by Kaiming He, Xinlei Chen and their Facebook AI Research (FAIR) team, has become a hot topic in the computer vision community.
Systems employing masked language modelling such as Google’s BERT and their autoregressive counterparts like OpenAI’s GPT have achieved astonishing performance across a wide range of natural language processing (NLP) tasks and enabled the training of generalizable NLP models containing over one hundred billion parameters.
The progress and performance of autoencoding methods in computer vision, however, lag behind their NLP counterparts. A question naturally arises: how does masked autoencoding differ between the vision and language domains? The FAIR paper addresses this question and demonstrates that masked autoencoders (MAE) can be scalable self-supervised learners for computer vision.
The researchers first examine the differences in masked autoencoding in the vision and language domains, explaining: 1) Until recently, the architectures were distinct; 2) Information density is different in language and vision; 3) The autoencoder’s decoder, which maps latent representations back to the input, plays a different role when reconstructing either text or images.
The team then presents a simple, effective, and scalable form of an MAE for visual representation learning. The idea behind the proposed MAE method is simple: random patches from the input image are masked, and the missing patches are then reconstructed in the pixel space. The team summarizes the two core designs of their MAE approach as:
- We develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent…
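The random patch masking described above can be illustrated with a minimal sketch. The function name, patch layout, and 75% mask ratio below are illustrative assumptions (the paper does report 75% as an effective masking ratio); this is not the authors' implementation, just a toy demonstration of how only the visible subset of patches would reach the encoder.

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, seed=0):
    """Randomly mask patches; return the visible subset plus the
    index sets needed to restore patch order (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    # Shuffle patch indices; the first n_keep become the "visible" set.
    perm = rng.permutation(n)
    visible_idx = np.sort(perm[:n_keep])
    masked_idx = np.sort(perm[n_keep:])
    return patches[visible_idx], visible_idx, masked_idx

# Toy example: a 32x32 image split into 16 non-overlapping 8x8 patches,
# each flattened to a 64-dimensional vector.
image = np.arange(32 * 32, dtype=np.float32).reshape(32, 32)
patches = image.reshape(4, 8, 4, 8).transpose(0, 2, 1, 3).reshape(16, 64)

visible, vis_idx, mask_idx = random_masking(patches)
print(visible.shape)   # (4, 64): only 25% of patches go to the encoder
print(len(mask_idx))   # 12 masked patches left for the decoder to predict
```

Because the encoder only ever processes the small visible subset, most of the computation on masked tokens is avoided, which is what makes the asymmetric design scalable.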