Microsoft’s BEiT-3 Foundation Model: A ‘Big Convergence of Language, Vision, and Multimodal Pretraining’ That Achieves SOTA Results on Popular Benchmarks
In recent years, the machine learning research community has turned its attention to the convergence of language, vision, and multimodal pretraining, aiming to develop general-purpose foundation models that handle multiple modalities and adapt easily to diverse downstream tasks.
In the new paper Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks, a Microsoft research team presents BEiT-3 (BERT Pretraining of Image Transformers), a general-purpose multimodal foundation model that achieves state-of-the-art performance on both vision and vision-language tasks and advances the big convergence of backbone architectures, pretraining tasks, and model scaling.
BEiT-3 performs masked “language” modeling on images, texts, and image-text pairs in a unified manner. It is pretrained on massive monomodal and multimodal data…
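To make the idea concrete, below is a minimal PyTorch sketch of unified masked data modeling: text tokens and discretized image tokens share a single vocabulary, a fraction of positions are replaced with a [MASK] token, and one Transformer is trained to recover them. All names and sizes here (UnifiedMaskedModel, the vocabulary sizes, the toy batch) are illustrative assumptions, not the actual BEiT-3 implementation, which uses a shared Multiway Transformer backbone and a learned visual tokenizer.

```python
# Illustrative sketch of unified masked data modeling, NOT the BEiT-3 code.
# Assumes text tokens and visual tokens are already mapped into one vocabulary.
import torch
import torch.nn as nn

TEXT_VOCAB = 30522        # assumed BERT-like text vocabulary size
VISUAL_VOCAB = 8192       # assumed visual-token vocabulary (VQ-style tokenizer)
MASK_ID = 0               # shared [MASK] id in the unified vocabulary
UNIFIED_VOCAB = 1 + TEXT_VOCAB + VISUAL_VOCAB  # [MASK] + text + visual tokens


class UnifiedMaskedModel(nn.Module):
    """One embedding table and one Transformer encoder over both modalities."""

    def __init__(self, dim=256, depth=4, heads=8):
        super().__init__()
        self.embed = nn.Embedding(UNIFIED_VOCAB, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, UNIFIED_VOCAB)  # predict the masked tokens

    def forward(self, tokens):
        return self.head(self.encoder(self.embed(tokens)))


def mask_tokens(tokens, mask_ratio=0.15):
    """Randomly replace a fraction of tokens with [MASK]; return targets."""
    mask = torch.rand(tokens.shape) < mask_ratio
    corrupted = tokens.masked_fill(mask, MASK_ID)
    targets = tokens.masked_fill(~mask, -100)  # ignore unmasked positions
    return corrupted, targets


# Toy image-text pair: text token ids followed by visual token ids,
# both already shifted into their ranges of the unified vocabulary.
text = torch.randint(1, 1 + TEXT_VOCAB, (2, 16))
image = torch.randint(1 + TEXT_VOCAB, UNIFIED_VOCAB, (2, 32))
pair = torch.cat([text, image], dim=1)

model = UnifiedMaskedModel()
corrupted, targets = mask_tokens(pair)
logits = model(corrupted)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, UNIFIED_VOCAB), targets.reshape(-1), ignore_index=-100
)
loss.backward()  # one masked-modeling objective covers both modalities
print(f"masked-prediction loss: {loss.item():.3f}")
```

The point of the sketch is the "image as a foreign language" framing from the paper's title: once image patches are tokenized, the same masked-prediction objective applies uniformly to text-only, image-only, and paired data, so no modality-specific loss is needed.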