12-in-1: Facebook AI’s New Framework Tackles Multiple Vision-and-Language Tasks

Synced
Published in SyncedReview
Dec 19, 2019

In recent years, researchers in the busy deep learning, computer vision, and natural language processing communities have all become increasingly interested in vision and language (V&L).

A compelling reason to study language and vision jointly is the promise of language as a universal and natural interface for visual reasoning problems — useful both for specifying a wide range of problems and for communicating AI responses. However, previous research in visually grounded language understanding has been mostly task-specific.

Researchers from Facebook AI Research, the Georgia Institute of Technology, and Oregon State University found that the skills required for different V&L tasks, such as visual question answering and caption-based image retrieval, overlap significantly, thanks largely to the rise of general-purpose V&L architectures.

The wide variety of independent V&L tasks motivated these researchers to explore ways to consolidate some of them, and the result of their efforts is an all-in-one model that learns from 12 datasets spanning four broad categories of V&L tasks. The model reduces the number of parameters from some 3 billion to 270 million while improving task performance by an average of 2.05 points.
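As a rough sanity check on those figures, the arithmetic below assumes each single-task model is roughly ViLBERT-sized at about 250 million parameters; that per-model figure is our assumption, not a number taken from the paper.

PER_TASK_PARAMS = 250_000_000        # assumed size of one ViLBERT-style single-task model
NUM_TASKS = 12
SHARED_MODEL_PARAMS = 270_000_000    # figure reported for the 12-in-1 multi-task model

separate_total = PER_TASK_PARAMS * NUM_TASKS
print(f"12 separate models: ~{separate_total / 1e9:.1f}B parameters")
print(f"one shared model:   ~{SHARED_MODEL_PARAMS / 1e6:.0f}M parameters")
print(f"reduction factor:   ~{separate_total / SHARED_MODEL_PARAMS:.0f}x")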

Based on the recently proposed ViLBERT (Vision-and-Language BERT) model for learning joint representations of image content and natural language, the new model focuses on four categories — visual question answering, caption-based image retrieval, grounding referring expressions, and multi-modal verification.

Among the 12 datasets are three for vocab-based VQA (VQAv2, GQA, and VGQA), two for caption-based image retrieval (COCO and Flickr30K), five for referring expressions (RefCOCO, RefCOCO+, RefCOCOg, Visual7W, and GuessWhat), and two for multi-modal verification (NLVR2 and SNLI-VE).
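To make the grouping concrete, here is a minimal Python sketch, not the authors' code, of how the 12 datasets might be mapped onto the four task heads and interleaved so that a shared encoder sees batches from every dataset during multi-task training; the head names and the sampling scheme are illustrative assumptions.

import random

# Illustrative grouping of the 12 datasets into the four task categories above.
TASK_GROUPS = {
    "vocab_vqa":        ["VQAv2", "GQA", "VGQA"],
    "image_retrieval":  ["COCO", "Flickr30K"],
    "referring_expr":   ["RefCOCO", "RefCOCO+", "RefCOCOg", "Visual7W", "GuessWhat"],
    "multimodal_verif": ["NLVR2", "SNLI-VE"],
}

def batch_schedule(steps, seed=0):
    """Pick one (task head, dataset) pair per training step so the shared
    encoder sees all 12 datasets interleaved over the course of training."""
    rng = random.Random(seed)
    pairs = [(task, ds) for task, datasets in TASK_GROUPS.items() for ds in datasets]
    return [rng.choice(pairs) for _ in range(steps)]

if __name__ == "__main__":
    for task, dataset in batch_schedule(steps=5):
        print(f"task head: {task:16s} dataset: {dataset}")

In practice a schedule like this would typically weight sampling by dataset size; the uniform choice here is only for brevity.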

V&L datasets are notorious for their variations in size, quality, interface, and difficulty. The new research not only shows that a single model can perform multiple tasks but also demonstrates that, even with the same architecture, training with multiple datasets can lead to improvements on task metrics compared to single-task training.

The paper further demonstrates that multi-task training can be an effective pretraining step for single-task models: it led to further gains and set new state-of-the-art results on 7 of the 12 dataset tasks.

The paper 12-in-1: Multi-Task Vision and Language Representation Learning is available on arXiv.

Journalist: Yuan Yuan | Editor: Michael Sarazen

