Paper Summary — ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

Aditya Chinchure · Published in Technonerds · May 26, 2021

ViLBERT (Lu et al. 2019) stands for Vision-and-Language BERT. It is exactly what it sounds like: a version of the BERT model (Devlin et al. 2018), which quickly became the SOTA for NLP tasks, extended to incorporate visual inputs. ViLBERT is pretrained once and then transferred to multimodal tasks like Visual Question Answering (VQA) and Referring Expressions.

A summary of the method

The model is effectively a successor to the BERT model, and many parts of BERT remain unchanged in this method.

Image and text inputs are first processed separately. Text is encoded using several transformer layers, independently of the image features. The image features are embedded into a form that can be input to a Transformer: an object detector proposes bounding boxes to select image regions, each region is encoded as a feature vector, and a small vector stores the spatial location of each encoded region. Following that, co-attentional transformer layers are introduced, where co-attention is used to learn the mapping between words in the text input and regions in the image. The model generates a hidden representation that can be used as a starting point for several multimodal tasks.
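To make the region embedding concrete, here is a minimal PyTorch sketch. The dimensions and layer names are my own assumptions, not the paper's code: each region's detector feature is projected to the model width and summed with a projection of a small spatial vector (normalized box coordinates plus the fraction of image area covered).

```python
import torch
import torch.nn as nn

class RegionEmbedding(nn.Module):
    """Embed detected image regions for input to the visual stream (a sketch)."""
    def __init__(self, feat_dim=2048, dim=768):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, dim)  # detector feature -> model width
        self.loc_proj = nn.Linear(5, dim)          # (x1, y1, x2, y2, area) -> model width
        self.norm = nn.LayerNorm(dim)

    def forward(self, region_feats, boxes):
        # region_feats: (batch, n_regions, feat_dim) pooled features per bounding box
        # boxes: (batch, n_regions, 5) normalized coordinates plus fraction of image area
        return self.norm(self.feat_proj(region_feats) + self.loc_proj(boxes))

regions = RegionEmbedding()(torch.randn(2, 36, 2048), torch.rand(2, 36, 5))
print(regions.shape)  # torch.Size([2, 36, 768])
```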

ViLBERT is first pretrained on the Conceptual Captions dataset, which contains images paired with captions describing their contents. Once this stage is done, the model can be fine-tuned to perform downstream tasks like VQA.

What I found most interesting

Many parts of this method are not novel. Co-attention between image and text had been explored before. Moreover, this is a transfer learning method: the model learns from the 3.3 million image-caption pairs in the Conceptual Captions dataset and is then fine-tuned to perform particular tasks with smaller datasets. This kind of transfer learning has already been shown to work in both Vision and NLP contexts. Even so, I found many parts exciting, as this is one of the first papers in multimodal learning that I have read.

Co-attention is quite an interesting topic. It is a simple modification to the usual attention mechanisms we see in ML models. Simply put, attention is a method where the model can look at a part of the input or a hidden representation while deriving a prediction. In co-attention, this attention is extended to the features of a different modality, i.e. the image co-transformer block sees representations from the encoded text, and vice versa. There is a lot of detail related to Transformer models that I am leaving out for now.
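As a rough illustration, here is a minimal PyTorch sketch of one co-attentional block, with made-up dimensions. It is not the authors' implementation, but it shows the key swap: each stream's queries attend to the other stream's keys and values.

```python
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """Text queries attend to image keys/values, and vice versa (a sketch)."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.txt_attends_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_attends_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_txt = nn.LayerNorm(dim)
        self.norm_img = nn.LayerNorm(dim)

    def forward(self, txt, img):
        # txt: (batch, n_tokens, dim), img: (batch, n_regions, dim)
        txt_out, _ = self.txt_attends_img(query=txt, key=img, value=img)
        img_out, _ = self.img_attends_txt(query=img, key=txt, value=txt)
        # Residual connections and layer norm, as in a standard transformer block.
        return self.norm_txt(txt + txt_out), self.norm_img(img + img_out)

txt, img = torch.randn(2, 20, 768), torch.randn(2, 36, 768)
new_txt, new_img = CoAttentionBlock()(txt, img)
```

In the full model, blocks like this are stacked and followed by the usual feed-forward layers in each stream.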

The results show that this model was SOTA on many multimodal tasks. I see this as another win for the Transformer architecture and BERT. But I also see it as a baseline for many future works: given more fine-tuning and modifications, this model should perform even better on specific tasks.

Why should you be excited (and why am I excited)?

After CNNs, it looks like Transformers are the next big step in machine learning applications.

This model is clearly good at visual grounding — matching parts of the image to words in text. I am most excited to see how such a model would perform on referring image segmentation, where the output is a full segmentation mask. A modified decoder and/or a separate segmentation pipeline could be necessary to get refined results.

I am writing a series of summaries of papers that I have been reading, mostly involving multimodal computer vision and NLP tasks. These summaries are in layman’s terms, and not detailed. You can find all the papers I have summarized here.

I am a student researcher at The University of British Columbia working on Vision and NLP tasks. If you are interested in these topics as well, let’s get in touch!
