Vision-Language Pre-Training with Triple Contrastive Learning
Think of teaching a computer to ‘see’ and ‘understand’ the way we do. That’s the realm of vision-language pre-training. Researchers made a big breakthrough with a new method called ‘Triple Contrastive Learning’ (TCL) [Paper Link]. It’s like giving the computer lessons both in how images and text connect and in how to understand the details within each. This makes the computer’s understanding of both images and text a whole lot better!
1. Introduction:
Artificial intelligence is rapidly changing how computers interact with the world, and one fascinating area is where images and language meet. This fusion fuels incredible things — imagine a computer answering complex questions about a photo or finding the perfect image just from your description. Vision-Language Pre-Training (VLP) is the key: we feed computers massive datasets of images and text so they learn to understand both.
Traditionally, VLP has focused heavily on cross-modal alignment (CMA), trying to maximize the connection between an image and its caption. But this approach has limits — sometimes important details within the image or the text itself get overlooked, especially if the training data is a little messy (noisy).
Yang and their team tackled this head-on in their paper “Vision-Language Pre-Training with Triple Contrastive Learning”. Their approach, TCL, goes beyond simple CMA with clever self-supervision. It trains the computer to understand relationships within images and text separately. This extra attention to detail aims to create stronger representations, making the whole system more robust.
2. Differences from Existing Literature
While research in Vision-Language Pre-training (VLP) has progressed considerably thanks to successes in self-supervised learning, existing approaches still have several key limitations. Here is how this research addresses them:
- Prioritizing Multi-Modal Interactions: Unlike CLIP and ALIGN, which rely purely on contrastive alignment and shine mainly on retrieval-style tasks, modeling deeper cross-modal interactions during pre-training helps support tasks that demand joint visual and language reasoning (e.g., VQA).
- Beyond Object-Based Features: OSCAR, UNIMO, VILLA, and UNITER rely on pre-trained object detectors, which creates computational bottlenecks and limits visual feature quality. Moving away from region-based features can improve both efficiency and representation quality.
- Align Before Fusion: Unlike SOHO and ViLT, aligning image and text features before fusing them can yield significant performance gains, as ALBEF demonstrates.
- Align Local and Global Representations: Maximizing mutual information between local regions and the global representation can produce richer features while preventing irrelevant aspects from dominating. Imagine an image where identifying emotion matters as much as identifying objects; capturing both local detail and global context is critical in such a case.
This proposal emphasizes interaction, alignment, and multi-level supervision within the VLP pre-training pipeline, which has the potential to set it apart from current methodologies while leading to superior performance on downstream vision-language tasks.
3. Methodology
Let’s walk through each part step-by-step to understand it better!
The model consists of three key parts (see the image above): a vision encoder to interpret images, a text encoder to process text, and a fusion encoder that combines the two. All three are built on the transformer architecture.
Momentum’s Twist: Instead of keeping a single encoder per modality, the method also maintains a slowly updated momentum copy of each encoder, an exponential moving average of its weights. These momentum encoders provide stable targets and help the model learn more robust features.
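The momentum update can be sketched in a few lines of PyTorch. This is our own minimal illustration, not the paper’s code: the function names, the `copy.deepcopy` construction, and the default coefficient are all our assumptions.

```python
import copy
import torch
import torch.nn as nn

def build_momentum_encoder(base_encoder: nn.Module) -> nn.Module:
    """Create a copy whose weights slowly track the base encoder."""
    momentum_encoder = copy.deepcopy(base_encoder)
    for p in momentum_encoder.parameters():
        p.requires_grad = False  # updated manually, not by the optimizer
    return momentum_encoder

@torch.no_grad()
def update_momentum_encoder(base: nn.Module, momentum: nn.Module, m: float = 0.995):
    """Exponential moving average: theta_m <- m * theta_m + (1 - m) * theta."""
    for p, p_m in zip(base.parameters(), momentum.parameters()):
        p_m.mul_(m).add_(p, alpha=1.0 - m)
```

Called once per training step, this keeps the momentum encoder a smoothed version of the online encoder, which is what makes its outputs usable as stable contrastive targets.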
3.1. Uni-modal Representation Learning.
- Handling Images: The model takes one image and applies data augmentation to produce two related ‘views’, forming a positive pair that lets it learn by comparing similar yet subtly different perspectives. Each view is split into patches, position information is added, and the result is fed through the vision encoder.
- Handling Text: Text is processed in a similar spirit: it is fed into the text encoder to extract linguistic features. The key design principle throughout is alignment before fusion.
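The image-side pipeline above (two augmented views, then patch extraction with position information) can be sketched as follows. This is a simplified stand-in: real augmentations are much stronger than added noise, and a ViT learns its patch projection and position embeddings rather than using raw pixels and integer indices; all names here are ours.

```python
import torch

def two_views(image: torch.Tensor, noise_std: float = 0.05):
    """Stand-in augmentation: two independently perturbed copies of one image."""
    return (image + noise_std * torch.randn_like(image),
            image + noise_std * torch.randn_like(image))

def patchify(image: torch.Tensor, patch: int = 16) -> torch.Tensor:
    """Split a (C, H, W) image into flattened (num_patches, C*patch*patch) tokens."""
    c, h, w = image.shape
    patches = image.unfold(1, patch, patch).unfold(2, patch, patch)
    # shape: (C, H//p, W//p, p, p) -> (num_patches, C*p*p)
    return patches.permute(1, 2, 0, 3, 4).reshape(-1, c * patch * patch)

img = torch.rand(3, 224, 224)
v1, v2 = two_views(img)
tokens = patchify(v1)                             # (196, 768) for 16x16 patches
pos = torch.arange(tokens.size(0)).unsqueeze(1)   # simplistic position index
```

For a 224x224 image and 16x16 patches this yields 14x14 = 196 patch tokens, matching the ViT-B/16 setup named later in the post.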
3.2. Cross-Modal Alignment (CMA)
CMA aims to align embeddings of matched image-text pairs while pushing those of unmatched pairs apart. It maximizes the mutual information between matched pairs using InfoNCE loss, which represents a lower bound of MI. The CMA loss is formulated for both image-to-text and text-to-image alignments.
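A minimal sketch of the symmetric InfoNCE-style CMA loss, assuming in-batch negatives; the temperature value 0.07 and the variable names are our placeholders, not necessarily the paper’s settings.

```python
import torch
import torch.nn.functional as F

def info_nce(img_emb: torch.Tensor, txt_emb: torch.Tensor, tau: float = 0.07):
    """Symmetric CMA loss: average of image-to-text and text-to-image terms."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / tau       # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0))    # matched pairs sit on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```

Minimizing cross-entropy against the diagonal pushes matched pairs together and unmatched in-batch pairs apart, which is exactly the lower-bound-on-MI view described above.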
3.3 Intra-Modal Contrastive (IMC)
IMC learns the semantic difference between positive and negative samples within the same modality. For both visual and textual inputs, a contrastive loss maximizes agreement between differently augmented views of the same data example. Combining CMA and IMC improves the quality of the learned representations (as shown in the figure below) and further facilitates joint multi-modal learning in the fusion encoder.
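IMC has the same contrastive form as CMA, but its two inputs are embeddings of two augmented views from the same modality rather than an image and a text. A hedged sketch, with our own names:

```python
import torch
import torch.nn.functional as F

def intra_modal_loss(view1: torch.Tensor, view2: torch.Tensor, tau: float = 0.07):
    """Contrastive loss between two augmented views of the same batch items."""
    view1 = F.normalize(view1, dim=-1)
    view2 = F.normalize(view2, dim=-1)
    logits = view1 @ view2.t() / tau           # (B, B)
    targets = torch.arange(view1.size(0))      # view pairs of the same item match
    return F.cross_entropy(logits, targets)
```

The only change from the cross-modal case is what the two embedding batches are: here, two views of the same images (or two variants of the same texts).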
3.4 Local MI Maximization (LMI)
LMI encourages high mutual information between the global representation and every local region of the input. By treating an image’s own patch embeddings as positives and patches from other images in the batch as negatives, LMI maximizes the average MI between the global representation and local regions.
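One way to realize the averaged local-MI objective, under our own assumptions about shapes (one global embedding per image plus a set of patch embeddings), is the following sketch; the contrast over all patches in the batch is our interpretation, not the paper’s exact formulation.

```python
import torch
import torch.nn.functional as F

def lmi_loss(global_emb: torch.Tensor, patch_emb: torch.Tensor, tau: float = 0.07):
    """global_emb: (B, D); patch_emb: (B, P, D). Own patches are positives."""
    B, P, D = patch_emb.shape
    g = F.normalize(global_emb, dim=-1)                    # (B, D)
    p = F.normalize(patch_emb, dim=-1).reshape(B * P, D)   # all patches in batch
    logits = g @ p.t() / tau                               # (B, B*P)
    # log-softmax over every patch in the batch; each image's own P patches
    # are its positives, patches of other images are negatives
    logp = F.log_softmax(logits, dim=-1).reshape(B, B, P)
    own = torch.arange(B)
    return -logp[own, own].mean()                          # average over own patches
```

The `.mean()` over an image’s own patches is what makes this a *local* objective: every region, not just the dominant one, is pulled toward the global representation.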
3.5 Image-Text Matching (ITM)
ITM predicts whether an image-text pair is matched or not, treated as a binary classification problem. The fusion encoder takes the visual and linguistic representations as input, with the [CLS] token serving as the joint representation; this is fed into a fully-connected layer to predict the matching probability ϕ(I, T), where (I, T) is an image-text pair. Cross-entropy loss is employed for ITM.
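The ITM head reduces to a linear classifier over the fused [CLS] token. The 768-dimensional size, the batch of random stand-in features, and all variable names below are our assumptions:

```python
import torch
import torch.nn as nn

itm_head = nn.Linear(768, 2)          # two classes: unmatched / matched
criterion = nn.CrossEntropyLoss()

cls_tokens = torch.randn(8, 768)      # fused [CLS] vectors for 8 image-text pairs
labels = torch.randint(0, 2, (8,))    # 1 = matched, 0 = unmatched
logits = itm_head(cls_tokens)
loss = criterion(logits, labels)
match_prob = logits.softmax(dim=-1)[:, 1]   # the matching probability phi(I, T)
```

In practice the unmatched (negative) pairs are constructed by pairing each image with a non-matching caption from the batch; here the labels are random purely for illustration.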
3.6 Masked Language Modeling (MLM)
MLM aims to predict ground truth labels of masked text tokens conditioned on surrounding text tokens and image representations. The loss is defined using cross-entropy loss similar to BERT.
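The input-side masking for MLM can be sketched as follows, using BERT-style conventions (15% masking rate; label -100 at unmasked positions so cross-entropy ignores them). The mask id and the simplification of always substituting [MASK] (BERT also sometimes keeps or randomizes tokens) are assumptions for clarity.

```python
import torch

MASK_ID = 103  # BERT's [MASK] token id

def mask_tokens(token_ids: torch.Tensor, mask_prob: float = 0.15):
    """Return (masked_input, labels); labels are -100 where not masked."""
    mask = torch.rand(token_ids.shape) < mask_prob
    labels = token_ids.clone()
    labels[~mask] = -100                  # ignored by cross-entropy loss
    masked = token_ids.clone()
    masked[mask] = MASK_ID
    return masked, labels
```

The model then predicts the original ids at the masked positions, conditioned (in TCL) on both the surrounding text and the image representation.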
Thus the overall training objective of the model is the sum of the five losses above: L = L_CMA + L_IMC + L_LMI + L_ITM + L_MLM.
4. Experiments:
4.1. Pre-training Datasets:
- Utilized pre-training datasets include COCO, Visual Genome (VG), Conceptual Captions (CC), and SBU Captions, totaling 4.0M unique images and 5.1M image-text pairs.
- Additionally, the CC12M dataset was used, leading to large-scale pre-training data with 14.97M unique images and 16M image-text pairs.
4.2. Downstream Tasks:
4.2.1. Image-Text Retrieval:
- Tasks include text retrieval (TR) with image queries and image retrieval (IR) with text queries.
- Evaluated on Flickr30K and COCO datasets in both fine-tuning and zero-shot settings.
4.2.2. Visual Question Answering (VQA):
- Predicting answers given images and questions.
- Implemented as a generation problem.
4.2.3. Visual Entailment (SNLI-VE):
- Predicting if an image semantically entails a given text.
- Three-class classification problem.
4.2.4. Visual Reasoning (NLVR2):
- Determines if a natural language caption is true about a pair of photographs.
- Input: text and two images.
5. Implementation Details:
- Experiments were performed on 8 NVIDIA A100 GPUs with the PyTorch framework.
- Vision encoder: ViT-B/16 with 12 layers and 85.8M parameters.
- Text encoder and fusion encoder implemented by a 6-layer transformer.
- Model trained for 30 epochs with a batch size of 512.
- AdamW optimizer with a weight decay of 0.02.
- Learning rate schedule: warm up to 1e-4 over the first 2,000 iterations, then decrease following a cosine decay strategy.
- Various data augmentation techniques were applied.
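The stated warm-up-then-cosine schedule can be sketched as a function of the iteration count. The `total_iters` value is our assumption, since the source only specifies the peak rate and the warm-up length:

```python
import math

def learning_rate(step: int, peak: float = 1e-4, warmup: int = 2000,
                  total_iters: int = 100_000) -> float:
    """Linear warm-up to `peak` over `warmup` steps, then cosine decay to 0."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / max(1, total_iters - warmup)
    return 0.5 * peak * (1.0 + math.cos(math.pi * min(progress, 1.0)))
```

In PyTorch this would typically be wrapped in a `LambdaLR` scheduler rather than called by hand.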
6. Ablation Study
The ablation study explores the effectiveness of newly proposed modules, IMC and LMI, in enhancing multi-modal representation learning. IMC with stronger data augmentation significantly improves performance. Incorporating LMI further boosts performance, highlighting the importance of localized and structural information. Scaling up pre-training datasets to 14M samples notably enhances performance, suggesting potential for further improvement with larger datasets.
7. Results:
- Outperformed existing state-of-the-art methods in image-text retrieval, VQA, VE, and NLVR2 tasks.
- Significantly improved performance compared to baselines, especially in zero-shot transfer and fine-tuning settings.
- Ablation studies demonstrated the effectiveness of newly proposed modules (IMC and LMI) in improving multi-modal representation learning.
- Larger pre-training datasets led to significant performance boosts.
- The optimal momentum coefficient was observed as m = 0.5, contrary to previous findings.
Overall, the experiments showcase the effectiveness of the proposed method across various tasks and datasets, demonstrating improvements over existing state-of-the-art approaches.
8. Critical Analysis and Conclusions
8.1. Potential Limitations
- Data Bias: Like many AI models, TCL may reflect biases present in its training data. Underrepresenting certain groups might lead to poorer performance in some instances.
- Computational Cost: While innovative, this methodology could be costly in terms of computing resources.
8.2. Conclusions
The paper introduced TCL, a novel vision-language pretraining framework that goes beyond existing approaches by incorporating intra-modal supervision and leveraging local information through local mutual information maximization. By ensuring meaningful representations within each modality, TCL facilitates improved cross-modal alignment and joint multi-modal embedding learning. Experimental results on established benchmarks showcase TCL’s superiority over state-of-the-art methods, underscoring its efficacy in vision-language tasks.
9. References
- [1] Jinyu Yang, Jiali Duan, Son Tran, Yi Xu, Sampath Chanda, Liqun Chen, Belinda Zeng, Trishul Chilimbi, and Junzhou Huang. 2022. Vision-language pre-training with triple contrastive learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 15671–15680.
- [2] https://cinnamonai.medium.com/overview-of-the-vqa-problem-f96ba63f6fdf
- [3] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.
- [4] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V Le, Yunhsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. arXiv preprint arXiv:2102.05918, 2021
- [5] Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, et al. Oscar: Object-semantics aligned pre-training for vision-language tasks. In European Conference on Computer Vision, pages 121–137. Springer, 2020.
- [6] Wei Li, Can Gao, Guocheng Niu, Xinyan Xiao, Hao Liu, Jiachen Liu, Hua Wu, and Haifeng Wang. Unimo: Towards unified-modal understanding and generation via cross-modal contrastive learning. arXiv preprint arXiv:2012.15409, 2020.
- [7] Zhe Gan, Yen-Chun Chen, Linjie Li, Chen Zhu, Yu Cheng, and Jingjing Liu. Large-scale adversarial training for vision-and-language representation learning. arXiv preprint arXiv:2006.06195, 2020.
- [8] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Universal image-text representation learning. In European conference on computer vision, pages 104–120. Springer, 2020.
- [9] Zhicheng Huang, Zhaoyang Zeng, Yupan Huang, Bei Liu, Dongmei Fu, and Jianlong Fu. Seeing out of the box: End-toend pre-training for vision-language representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12976–12985, 2021.
- [10] Wonjae Kim, Bokyung Son, and Ildoo Kim. Vilt: Vision-andlanguage transformer without convolution or region supervision. arXiv preprint arXiv:2102.03334, 2021.
- [11] Junnan Li, Ramprasaath R Selvaraju, Akhilesh Deepak Gotmare, Shafiq Joty, Caiming Xiong, and Steven Hoi. Align before fuse: Vision and language representation learning with momentum distillation. arXiv preprint arXiv:2107.07651, 2021.
- [12] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
This blog was written as a part of the coursework of GNR 638 (Machine Learning for Remote Sensing II) offered by IIT Bombay during Spring 2023–24 under the guidance of Prof. Biplab Banerjee.
Written by:
- Ayush Patil - 200070012
- Sartaj Islam - 200050128
- Margav Savsani - 200050072