Microsoft’s BEiT-3 Foundation Model: A ‘Big Convergence of Language, Vision, and Multimodal Pretraining’ That Achieves SOTA Results on Popular Benchmarks

Synced
Published in SyncedReview · 3 min read · Aug 30, 2022

In recent years, the machine learning research community has turned its attention to the convergence of language, vision, and multimodal pretraining, aiming to develop general-purpose foundation models that can handle multiple modalities and be easily adapted to diverse downstream tasks.

In the new paper Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks, a Microsoft research team presents BEiT-3 (BERT Pre-training of Image Transformers), a general-purpose, state-of-the-art multimodal foundation model for vision and vision-language tasks. The work advances the "big convergence" along three axes: backbone architecture, pretraining task, and model scaling.

BEiT-3 performs masked “language” modeling on images, texts, and image-text pairs in a unified manner. It is pretrained on massive monomodal and multimodal data…
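The unified objective treats images as a "foreign language": image patches are mapped to discrete visual tokens, so text, images, and image-text pairs all become id sequences that can be corrupted and reconstructed the same way. The sketch below illustrates that idea under stated assumptions; the token ids, the `MASK_ID`, and the `mask_tokens` helper are hypothetical illustrations, not BEiT-3's actual tokenizer or training code.

```python
import random

MASK_ID = 0  # hypothetical id reserved for the [MASK] token


def mask_tokens(tokens, mask_ratio=0.15, seed=0):
    """Randomly replace a fraction of token ids with MASK_ID.

    Returns the corrupted sequence and a dict mapping each masked
    position to its original id (the model's prediction targets).
    """
    rng = random.Random(seed)
    n_mask = max(1, int(len(tokens) * mask_ratio))
    positions = rng.sample(range(len(tokens)), n_mask)
    corrupted = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = corrupted[pos]
        corrupted[pos] = MASK_ID
    return corrupted, targets


# Text tokens and discrete visual tokens get the same treatment:
# both are just id sequences, so one corruption routine serves a
# caption, an image, or a concatenated image-text pair.
text_ids = [101, 2009, 2003, 1037, 4937, 102]  # hypothetical text ids
visual_ids = [9001, 9002, 9003, 9004]          # hypothetical visual-token ids
pair = visual_ids + text_ids
corrupted, targets = mask_tokens(pair)
```

During pretraining, a model would be asked to recover the original ids at the masked positions, which is what lets one objective cover monomodal and multimodal data alike.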


Synced: AI Technology & Industry Review | syncedreview.com | Newsletter: http://bit.ly/2IYL6Y2 | Share My Research: http://bit.ly/2TrUPMI | Twitter: @Synced_Global