Aug 15 · 4 min read

Asked the scene-based question “How to chop these carrots?”, almost any human could respond “Use the knife!” A computer system, however, has never prepared a meal, and so requires vision-and-language training to interpret the scene before it can generate an answer. Most current research approaches in this field use separate language and vision models. But teaching systems to process what they see and express that understanding in natural language requires a deeper model of the relationships between visual content and language.

A team of researchers from the Georgia Institute of Technology, Facebook AI Research and Oregon State University has proposed ViLBERT (Vision-and-Language BERT), a novel model for visual grounding that can learn joint representations of image content and natural language, and leverage those connections across various vision-and-language tasks.

The current “pretrain-then-transfer” learning approach used in computer vision and natural language processing (NLP) offers ease of use and the strong representational power of large available models. However, it can also lead to myopic grounding, because the paired visiolinguistic data used to learn task-specific groundings between vision and language is often limited or biased. To avoid this, the researchers propose shifting the approach to pretraining for visual grounding itself.

The BERT language model has significantly advanced self-supervised learning on NLP tasks. Google released the model in 2018, with its large version comprising 24 Transformer blocks, a hidden size of 1024, and 340M parameters, and it quickly made its mark by setting records on 11 key NLP tasks. In the new study, researchers extend BERT into a joint visual-linguistic, task-agnostic model, linking separate streams for vision and language processing through co-attentional transformer layers. This preserves the different processing needs of each modality while also allowing the two streams to interact at various representation depths.
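To make the co-attention idea more concrete, here is a minimal sketch in plain Python. It is an illustration under simplifying assumptions, not the researchers' implementation: it leaves out the learned projections, multiple attention heads, residual connections and feed-forward sublayers of a real transformer block, and the function names and toy feature values are hypothetical. The point it shows is only the exchange itself: each stream forms the queries, while the keys and values come from the other stream.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention: each query mixes the values by key similarity."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def co_attention(visual, text):
    """One co-attention exchange (assumes both streams share a feature size)."""
    new_visual = attend(visual, text, text)    # image regions attend over words
    new_text = attend(text, visual, visual)    # words attend over image regions
    return new_visual, new_text

# Toy inputs: 3 image-region features and 2 word features, each of size 2.
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text = [[0.5, 0.5], [1.0, -1.0]]
new_visual, new_text = co_attention(visual, text)
```

Each stream keeps its own sequence length (three regions, two words), but every updated feature is now a weighted mix of the other modality's features.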

Researchers trained ViLBERT on 3.1 million image-caption pairs from the Conceptual Captions dataset under two pretraining tasks: masked multi-modal learning, in which the model reconstructs masked-out words and image regions from the remaining context, and multi-modal alignment prediction, in which it predicts whether a caption actually describes the image it is paired with. They then transferred the pretrained ViLBERT model to four common vision-and-language tasks: Visual Question Answering (VQA), Visual Commonsense Reasoning (VCR), Grounding Referring Expressions, and Caption-Based Image Retrieval.
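The two pretraining tasks can be pictured as two ways of turning image-caption pairs into training examples. The sketch below is a rough illustration, not the researchers' data pipeline: the helper names, the image identifiers, the captions and the 0.15 masking probability are all assumptions made for the example, and only the text side of masking is shown (a full version would also mask image regions).

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob, rng):
    """Hide some tokens; targets hold the original word where a mask was applied."""
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets.append(tok)
        else:
            masked.append(tok)
            targets.append(None)
    return masked, targets

def make_alignment_examples(pairs, rng):
    """For each (image_id, caption), emit the true pair (label 1) and a
    mismatched pair (label 0) using another image's caption.
    Needs at least two pairs."""
    examples = []
    for i, (image_id, caption) in enumerate(pairs):
        examples.append((image_id, caption, 1))
        j = rng.choice([k for k in range(len(pairs)) if k != i])
        examples.append((image_id, pairs[j][1], 0))
    return examples

rng = random.Random(0)
# Hypothetical image-caption pairs, for illustration only.
pairs = [
    ("img_001", "a dog on a beach"),
    ("img_002", "two people riding bikes"),
    ("img_003", "a bowl of carrots"),
]
tokens = pairs[0][1].split()
masked, targets = mask_tokens(tokens, mask_prob=0.15, rng=rng)
examples = make_alignment_examples(pairs, rng)
```

Neither task needs human labels beyond the captions themselves, which is why this kind of pretraining can scale to millions of pairs.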

The full ViLBERT model outperformed task-specific state-of-the-art models across the four tasks, with the most significant accuracy gains for VQA and Grounding Referring Expressions on the RefCOCO+ dataset. Based on the results, researchers propose that ViLBERT can learn critical visual-linguistic links that could be a helpful feature for downstream vision-and-language tasks. For example, a visual representation of dog breeds can be of greater use if the downstream model can associate it with accurate phrases like “beagle” or “shepherd.”

Researchers note that transferring the ViLBERT model to different tasks only requires adding a task-specific classifier. This simple implementation shows the enormous potential of the joint model in self-supervised learning across a wide range of vision-and-language tasks.
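As a rough picture of what “adding a task-specific classifier” can mean, the sketch below puts a small classification head on top of two pooled feature vectors, one per stream. This is an assumption-laden toy, not the researchers' code: the pooled vectors, weights and bias are made-up numbers, the element-wise product used to fuse the two streams is just one plausible choice, and a real head would be trained rather than hand-set.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def linear(x, weights, bias):
    """One dense layer: a weight row and a bias per output class."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def classify(pooled_visual, pooled_text, weights, bias):
    """Fuse the two pooled representations and score each answer class."""
    fused = [v * t for v, t in zip(pooled_visual, pooled_text)]  # assumed fusion
    return softmax(linear(fused, weights, bias))

# Hypothetical pooled outputs of a pretrained two-stream model.
pooled_visual = [0.2, 0.8, 0.5]
pooled_text = [1.0, 0.5, 0.0]
# Hypothetical head for a two-class task; these would be learned in practice.
weights = [[1.0, -1.0, 0.5],
           [-1.0, 2.0, 0.0]]
bias = [0.0, 0.0]
probs = classify(pooled_visual, pooled_text, weights, bias)
```

Only the head changes from task to task in this picture; the pretrained joint representation underneath is reused.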

For many NLP researchers, the BERT language model is a gift that keeps on giving. Potential applications for the new joint vision-and-language ViLBERT model are many, and could include, for example, helping visually impaired individuals better understand their surroundings in real time.

The paper ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks is on arXiv.

Journalist: Fangyu Cai | Editor: Michael Sarazen



We produce professional, authoritative, and thought-provoking content relating to artificial intelligence, machine intelligence, emerging technologies and industrial insights.

