Baidu’s 10-Billion Scale ERNIE-ViLG Unified Generative Pretraining Framework Achieves SOTA Performance on Bidirectional Vision-Language Generation Tasks

In recent years, the emergence of powerful vision-language pretraining models has significantly boosted performance on a range of image-to-text generation tasks. The development of large-scale pretraining models for text-to-image…