Recent Advances in Vision-and-Language Pre-training | CVPR 2022 Tutorial

Yi Kuan
7 min read · Dec 23, 2022

Note: The full video can be found here. This post covers the CVPR 2022 tutorial “Recent Advances in Vision-and-Language Pre-training” by Lijuan Wang, Jianfeng Wang, Zhengyuan Yang, Zhe Gan, Kevin Lin, Linjie Li, Chung-Ching Lin, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, and Chenfei Wu.

Vision-and-Language Pre-training

The tutorial covers different topics, including:

(1). Region-Feature-based and End-to-End Image-Text Pre-training

(2). Unified Vision-Language Modeling

(3). Unified Vision-Language Modeling Extension to Video-Language Pre-training

(4). Learning Visual Models from Language Supervision

(5). Visual Synthesis

First, we introduce an overview of image-text pre-training.

Application

(1). Multi-Modal Retrieval

Figure 1: Image-to-text and text-to-image retrieval. [2]

The left part illustrates image-to-text retrieval: given an image query, the model retrieves the corresponding text from a text corpus.

The right part illustrates text-to-image retrieval: given a text query, the model retrieves the corresponding image from an image corpus.

(2). Image Captioning

Figure 2: Example of image captioning. [2]

The image captioning model generates a description conditioned on a given image.

(3). Image Question Answering

Figure 3: Example of image question answering. [3]

Image question answering is similar to image captioning. Specifically, the task supplies a question along with an image, and the model is expected to answer the question conditioned on the image content.

The key problem in image-text pre-training is how to enable the model to understand the latent relation between image and text.

— The common approach is to train on a large-scale dataset of (image, text) pairs.

Figure 4: Example of image-text pairs. [2]

Figure 5: Example of image-caption(text) pairs. [4]

Pre-training tasks

(1). Image-Text Contrastive (ITC) Loss

Figure 6: Example of Image-text contrastive (ITC) loss. [2]

The contrastive objective aligns the two modalities by jointly training the image encoder and text encoder to maximize the cosine similarity between the embeddings of matched image-text pairs while pushing apart unmatched pairs.

Thus, the image-text contrastive loss aims to find the best-matching image embedding in a batch of image embeddings given a text embedding, and, symmetrically, the best-matching text embedding in a batch of text embeddings given an image embedding. [5]
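As a rough illustration (not code from the tutorial), the ITC objective is commonly implemented as a symmetric InfoNCE loss over the in-batch similarity matrix. In the PyTorch sketch below, `image_emb` and `text_emb` stand for hypothetical encoder outputs, and the temperature value is an assumed default:

```python
import torch
import torch.nn.functional as F

def itc_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (B, D) tensors from the image and text encoders.
    The i-th image is paired with the i-th text; all other examples in the
    batch act as negatives.
    """
    # L2-normalize so the dot product equals cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (B, B) similarity matrix, scaled by the temperature.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Image-to-text and text-to-image cross-entropy, averaged.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```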

(2). Image-Text Matching (ITM) Loss

Figure 7: Example of image-text matching (ITM) loss. [2]

Image-text matching (ITM) is a binary classification task: given an image-text pair as input, the model has to discriminate matched pairs from unmatched ones.
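For illustration, ITM is often realized as a two-way classification head on a fused multimodal representation (e.g. a [CLS]-style token), trained on true pairs and randomly sampled negative pairs. The sketch below is an assumption about a typical setup, not the tutorial's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ITMHead(nn.Module):
    """Binary matched/unmatched classifier on a fused image-text embedding."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, 2)  # class 0: unmatched, class 1: matched

    def forward(self, fused_emb: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # fused_emb: (B, hidden_dim) multimodal representation of the pair
        # labels:    (B,) long tensor, 1 for true pairs, 0 for sampled negatives
        logits = self.classifier(fused_emb)
        return F.cross_entropy(logits, labels)
```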

(3). Masked Language Modeling (MLM) Loss

Figure 8: Example of masked language modeling (MLM) loss. [2]

In masked language modeling, a portion of the text tokens is masked and the objective is to predict the content of those masked tokens. The model recovers each masked word token from the left-context and right-context information provided by the unmasked tokens. [5]
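A simplified BERT-style masking routine might look like the following sketch; it keeps only the core idea and omits the usual 80/10/10 replacement rule, so treat it as an illustration rather than the exact recipe used in the pre-training works above:

```python
import torch

def mask_tokens(input_ids, mask_token_id, mask_prob=0.15):
    """Select ~15% of token positions as MLM targets and mask them."""
    labels = input_ids.clone()
    masked = torch.bernoulli(torch.full(input_ids.shape, mask_prob)).bool()
    labels[~masked] = -100                 # ignored by cross_entropy
    corrupted = input_ids.clone()
    corrupted[masked] = mask_token_id      # replace selected tokens with [MASK]
    return corrupted, labels               # the model predicts labels from corrupted
```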

(4). Language Modeling (LM)

Figure 9: Example of language modeling (LM) loss. [2]

Language modeling aims to predict the next token given the previous tokens. Its obvious disadvantage is that the model only has access to left context.
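As a minimal sketch, the LM objective is just next-token cross-entropy with targets shifted by one position; `logits` below are assumed to come from a causal decoder:

```python
import torch.nn.functional as F

def lm_loss(logits, input_ids):
    """Next-token prediction loss for a left-to-right decoder.

    logits:    (B, T, V) per-position vocabulary logits
    input_ids: (B, T) token ids
    """
    shift_logits = logits[:, :-1, :]   # predictions at positions 0..T-2
    shift_labels = input_ids[:, 1:]    # targets are the tokens at 1..T-1
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
```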

Masked language modeling and language modeling each have their own advantages and disadvantages; the study below shows how to benefit from both.

Recent Study

In “CM3: A Causal Masked Multimodal Model of the Internet” [6], the authors propose a hybrid of causal and masked language modeling.

Figure 10: Hybrid of causal and masked language model. [6]

In the hybrid causal-masked language model, a word span is masked and moved to the end of the sequence, and the original location of the masked span is replaced with a <mask> token. The model then works like a causal language model, generating tokens from left to right, while a small number of masked token spans are generated at the end of the string rather than at their original positions. [6]

Because the masked tokens are moved to the end of the sequence, the model sees both the left and right context of their original location before it has to generate them, so it learns bidirectional context information. [6]
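To make the rewriting concrete, here is a toy single-span version of the causal-masked transform on a list of tokens. It is a simplification of CM3's procedure (which masks a small number of spans and uses learned sentinel tokens), not the authors' implementation:

```python
import random

def causal_mask_document(tokens, max_span_len=5):
    """Cut one span out of the document, leave a sentinel in its place,
    and append the sentinel plus the span at the end of the sequence."""
    tokens = list(tokens)
    span_len = random.randint(1, min(max_span_len, len(tokens)))
    start = random.randrange(0, len(tokens) - span_len + 1)
    span = tokens[start:start + span_len]
    # A left-to-right model now sees both sides of the hole before
    # it has to generate the span at the very end.
    rewritten = tokens[:start] + ["<mask:0>"] + tokens[start + span_len:]
    return rewritten + ["<mask:0>"] + span

print(causal_mask_document("the quick brown fox jumps over the lazy dog".split()))
# e.g. ['the', 'quick', '<mask:0>', 'over', 'the', 'lazy', 'dog',
#       '<mask:0>', 'brown', 'fox', 'jumps']
```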

The authors train the models on large-scale web and Wikipedia articles, where each document contains all of the text, hypertext markup, hyperlinks, and image tokens (from a VQVAE-GAN), provided in the order they appear in the original HTML source (before masking). [6]

Figure 11: Training data of CM3 [6]

Next, we introduce the different uni-modal and cross-modal tasks that CM3 can perform.

(1). Image Modality — Unconditional Image Generation

Figure 12: Example of unconditional image generation [6]

The unconditional image generation prompt “<img” generates an image together with a corresponding text (caption). To generate only an image, use the prompt “img src=” instead.

(2). Image Modality — Conditional Image Generation

Figure 13: Example of conditional image generation [6]

Conditional image generation uses the prompt shown below.

Figure 14: conditional image generation prompt [6]

Images generated by CM3 are re-ranked by CLIP [8], and the top-k results are returned in order to provide high-quality images. However, there are some mistakes in the images above.

In the second row, there is no red car in the second image.

In the third row, the model can generate the shape of a sheep but cannot clearly depict its face.
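The CLIP re-ranking step mentioned above can be approximated with an off-the-shelf checkpoint. The sketch below uses the Hugging Face `openai/clip-vit-base-patch32` model and is meant as an illustration of the idea, not the pipeline the CM3 authors actually ran:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rerank_with_clip(prompt, images, top_k=4):
    """Score candidate images against the text prompt and keep the top-k.

    images: a list of PIL.Image candidates (e.g. samples from a generator).
    """
    inputs = processor(text=[prompt], images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_text: (1, num_images) similarity of the prompt to each image.
    scores = outputs.logits_per_text[0]
    best = scores.topk(min(top_k, len(images))).indices.tolist()
    return [images[i] for i in best]
```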

Additionally, since CM3 was trained on large-scale web pages (mostly news articles) and Wikipedia articles, it is not good at generating fictional images such as “a giraffe dragon chimera”.

DALL-E [7], on the other hand, excels at generating such fictional, creative, original images. Given the text prompt “a professional high quality illustration of a giraffe dragon chimera. a giraffe imitating a dragon. a giraffe made of dragon.”, DALL-E outputs the images illustrated below. [9]

Figure 15: A giraffe dragon chimera generated by DALL-E [9]

(3). Text-Image Modality — Captioning

Image captioning uses the prompt shown below.

Figure 16: Text-Image captioning prompt. [6]

Figure 17: Text-Image captioning result. [6]

CM3-Caption-Beam means the captions generated by CM3 are selected by beam search.

CM3-Caption-CLIP means the captions generated by CM3 are re-ranked and selected by CLIP.

Since the outputs of CM3-Caption-Beam and CM3-Caption-CLIP cannot be directly and fairly compared, BERTScore [10] is used to evaluate the results.

Figure 18: Comparison of BERTScore. [10]

BERTScore evaluates the semantic similarity between the generated captions and the ground truth. We can see that CM3-Caption-CLIP outperforms CM3-Caption-Beam.
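For reference, computing BERTScore with the `bert-score` Python package looks roughly like the snippet below; the candidate and reference strings are made-up examples:

```python
from bert_score import score  # pip install bert-score

candidates = ["a man riding a wave on a surfboard"]        # generated captions
references = ["a surfer rides a large wave in the ocean"]  # ground-truth captions

# P, R, F1 each hold one value per candidate/reference pair.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.mean().item():.3f}")
```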

Conclusion

Image-text pre-training and its corresponding tasks are a very popular research topic nowadays. A core issue is finding a proper approach to exploit the cross-modal interactions between individual elements across image and text.

As the results above show, CLIP, DALL-E, CM3, and related models demonstrate broad abilities to handle multimodal tasks. Specifically, CM3 models can generate diverse and richly structured multi-modal outputs. During training, they implicitly learn from a wide range of text, image, and cross-modal tasks. On downstream tasks, CM3 can be prompted to perform unconditional/conditional image generation (like DALL-E) and image captioning. [6]

Reference

[1] CVPR 2022 Tutorial on “Recent Advances in Vision-and-Language Pre-training”.

[2] CVPR 2022 Tutorial on “Recent Advances in Vision-and-Language Pre-training” | Overview of Image-Text Pre-training Slides.

[3] Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering, 2016.

[4] Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning, 2018.

[5] Vision-Language Pre-Training for Boosting Scene Text Detectors, 2022.

[6] CM3: A Causal Masked Multimodal Model of the Internet, 2022.

[7] Zero-Shot Text-to-Image Generation, 2021.

[8] Learning Transferable Visual Models From Natural Language Supervision, 2021.

[9] Introduction to DALL-E.

[10] BERTScore: Evaluating Text Generation with BERT, 2019.

Yi Kuan

Department of Electrical Engineering, National Taiwan University