Vision-Language Pre-Training with Triple Contrastive Learning

Margav Savsani · Mar 30, 2024

Think of teaching a computer to see and understand the way we do. That’s the realm of vision-language pre-training. Researchers made a big breakthrough with a method called Triple Contrastive Learning (TCL) [Paper Link]. It’s like giving the computer lessons both in how images and text connect and in how to understand the details within each, and that makes its understanding of images and text a whole lot better.


1. Introduction:

Artificial intelligence is rapidly changing how computers interact with the world, and one fascinating area is where images and language meet. This fusion fuels incredible things — imagine a computer answering complex questions about a photo or finding the perfect image just from your description. Vision-Language Pre-Training (VLP) is the key: we feed computers massive datasets of images and text so they learn to understand both.

Traditionally, VLP has focused heavily on cross-modal alignment (CMA), trying to maximize the connection between an image and its caption. But this approach has limits — sometimes important details within the image or the text itself get overlooked, especially if the training data is a little messy (noisy).

Yang and their team tackled this head-on in their paper “Vision-Language Pre-Training with Triple Contrastive Learning”. Their approach, TCL, goes beyond simple CMA with clever self-supervision. It trains the computer to understand relationships within images and text separately. This extra attention to detail aims to create stronger representations, making the whole system more robust.

2. Differences from Existing Literature

While research in Vision-Language Pre-training (VLP) has progressed considerably thanks to successes in self-supervised learning, existing approaches still have several key limitations. Here is how this work addresses them:

  • Prioritizing Multi-Modal Interactions: Unlike CLIP and ALIGN, which optimize mainly for visual and retrieval tasks with purely contrastive objectives, prioritizing cross-modal interactions during pre-training supports tasks that demand both visual and language understanding (e.g., VQA).
  • Beyond Object-Based Features: OSCAR, UNIMO, VILLA, and UNITER are vulnerable to computational bottlenecks and limited visual feature quality due to relying on pre-trained object detectors. Moving away from region-based features could increase efficiency as well as representation.
  • Align Before Fusion: As opposed to SOHO and ViLT models, prioritizing an alignment step between image and text features before fusing them may yield significant performance gains. ALBEF provides evidence of such performance improvements.
  • Align Local and Global Representations: Maximizing mutual information between local regions and the global representation yields richer features and discourages the global representation from capturing irrelevant details. Imagine an image where identifying emotion matters as much as identifying objects; such localized, fine-grained cues are critical for cross-modal reasoning in that case.

This proposal emphasizes interaction, alignment, and multi-level supervision within the VLP pre-training pipeline, which has the potential to set it apart from current methodologies while leading to superior performance on downstream vision-language tasks.

3. Methodology

Let’s walk through each part step-by-step to understand it better!

The framework consists of a vision encoder, a text encoder, and a fusion encoder for jointly understanding image and text data. Each encoder is paired with a momentum counterpart for greater training stability. Every image is augmented into two views, and three objectives (CMA, IMC, and LMI) both compare images with texts and align each modality with itself, which helps the fusion encoder learn multimodal embeddings more effectively. Reference: [1]

At first glance, the model consists of three key parts (see the image above): a vision encoder that interprets images, a text encoder that processes text, and a fusion encoder that combines the two. All three are built on the transformer architecture.

Momentum’s Twist: Rather than keeping a single encoder per modality, each encoder also maintains a momentum copy whose weights are updated slowly as an exponential moving average of the online encoder. These slowly evolving copies provide stable targets and help the model learn more robust features.
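To make this concrete, here is a minimal PyTorch-style sketch of a momentum (EMA) update, in the spirit of MoCo-style training; the coefficient value and function names are illustrative assumptions, not the paper’s exact code.

```python
import copy
import torch

def build_momentum_copy(encoder: torch.nn.Module) -> torch.nn.Module:
    """Create a frozen copy of an encoder that will only be updated by EMA."""
    momentum_encoder = copy.deepcopy(encoder)
    for p in momentum_encoder.parameters():
        p.requires_grad = False
    return momentum_encoder

@torch.no_grad()
def momentum_update(encoder, momentum_encoder, m: float = 0.995):
    """Drag the momentum weights slowly toward the online encoder's weights."""
    for p, p_m in zip(encoder.parameters(), momentum_encoder.parameters()):
        p_m.data.mul_(m).add_(p.data, alpha=1.0 - m)
```

In practice the momentum update is called once per training iteration, right after the optimizer step.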

3.1. Uni-modal Representation Learning.

  • Handling Images: The model takes one image and applies two random augmentations to produce two related ‘views’, which form a positive pair and let the model learn by comparing similar yet subtly different perspectives. Each view is then split into patches, positional information is added, and the resulting sequence is fed to the vision encoder (see the sketch after this list).
  • Handling Text: Text is tokenized and fed into the text encoder to extract linguistic features, mirroring the image pipeline. The one key design choice is alignment before fusion: image and text features are first aligned and only then combined in the fusion encoder.
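The following is a minimal sketch of how two augmented views of the same image could be produced with torchvision; the specific augmentations and their strengths are illustrative assumptions, not the paper’s exact recipe.

```python
from PIL import Image
from torchvision import transforms

# Illustrative augmentation pipeline; TCL's exact recipe may differ.
augment = transforms.Compose([
    transforms.RandomResizedCrop(256, scale=(0.5, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.ToTensor(),
])

def two_views(image: Image.Image):
    """Return two independently augmented views of the same image (a positive pair)."""
    return augment(image), augment(image)
```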

3.2. Cross-Modal Alignment (CMA)

CMA aims to align embeddings of matched image-text pairs while pushing those of unmatched pairs apart. It maximizes the mutual information between matched pairs using InfoNCE loss, which represents a lower bound of MI. The CMA loss is formulated for both image-to-text and text-to-image alignments.

InfoNCE loss for image-to-text alignment. Here τ is a temperature hyper-parameter, T̃ = {T̃₁, …, T̃ₖ} is a set of negative text examples not matched to the image I, and sim(⋅, ⋅) is the similarity function. Reference: [1]
InfoNCE loss for text-to-image alignment. Here Ĩ = {Ĩ₁, …, Ĩₖ} is a queue of negative image examples that stores the most recent K projected features. Reference: [1]
Loss of CMA. Reference: [1]
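To ground the formulas, here is a hedged PyTorch sketch of the image-to-text InfoNCE term with a queue of negative text features; the variable names, queue size, and temperature value are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, temperature=0.07):
    """
    anchor:    (B, D) projected, L2-normalized image features
    positive:  (B, D) projected, L2-normalized features of the matched texts
    negatives: (K, D) queue of projected text features from recent batches
    Returns the InfoNCE loss, a lower bound on the mutual information.
    """
    pos_logits = (anchor * positive).sum(dim=-1, keepdim=True)   # (B, 1)
    neg_logits = anchor @ negatives.t()                          # (B, K)
    logits = torch.cat([pos_logits, neg_logits], dim=1) / temperature
    labels = torch.zeros(anchor.size(0), dtype=torch.long,
                         device=anchor.device)                   # positive is index 0
    return F.cross_entropy(logits, labels)

# Swapping the roles of images and texts gives the text-to-image term;
# the CMA loss averages the two directions.
```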

3.3 Intra-Modal Contrastive (IMC)

IMC learns the semantic difference between positive and negative samples within the same modality. For both visual and textual inputs, a contrastive loss is applied to maximize agreement between differently augmented views of the same data example. Combining CMA and IMC improves the quality of the learned representations (as shown in the figure below) and further facilitates joint multi-modal learning in the fusion encoder.

Loss of IMC. Reference: [1]
Consider the pink image and its two augmented versions (green). With cross-modal alignment (CMA) alone, the image embedding is only optimized to sit close to its text description, which is not enough. Adding intra-modal supervision (IMC) also pulls the embedding toward its augmented twin (while pushing it away from unrelated images and texts), leading to a more robust and meaningful image representation (blue square). Reference: [1]
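A minimal sketch of the intra-modal term on the image side, assuming two augmented views of each image and a queue of negative image features from past batches; the names and sizes are illustrative.

```python
import torch
import torch.nn.functional as F

def imc_loss(view1_feats, view2_feats, image_queue, temperature=0.07):
    """Intra-modal InfoNCE: pull the two views of the same image together
    and push them away from queued features of other images."""
    v1 = F.normalize(view1_feats, dim=-1)                  # (B, D)
    v2 = F.normalize(view2_feats, dim=-1)                  # (B, D)
    queue = F.normalize(image_queue, dim=-1)               # (K, D)
    pos = (v1 * v2).sum(dim=-1, keepdim=True)              # (B, 1)
    neg = v1 @ queue.t()                                   # (B, K)
    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(v1.size(0), dtype=torch.long, device=v1.device)
    return F.cross_entropy(logits, labels)
```

An analogous term is applied on the text side.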

3.4 Local MI Maximization (LMI)

LMI encourages high mutual information between the global representation and every local region of the input. Patch (or token) embeddings of the same input serve as positive examples, while patches from other examples in the batch serve as negatives; LMI then maximizes the average estimated MI between the global representation and the local regions.

Loss of LMI. Ĩₗ and T̃ₗ are in-batch negative image and text patch embeddings. Reference: [1]
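A hedged sketch of the local-MI term on the image side: an InfoNCE-style loss is averaged over patch positions, pairing each global image feature with its own patches and contrasting against patches from other images in the batch. Shapes, names, and the exact estimator are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def lmi_loss(global_feat, patch_feats, temperature=0.07):
    """
    global_feat: (B, D) global image features (e.g., the [CLS] embedding)
    patch_feats: (B, P, D) local patch embeddings of the same images
    """
    B, P, D = patch_feats.shape
    g = F.normalize(global_feat, dim=-1)                    # (B, D)
    p = F.normalize(patch_feats, dim=-1).reshape(B * P, D)  # all patches in the batch
    logits = g @ p.t() / temperature                        # (B, B*P)
    loss = 0.0
    for i in range(P):
        # For patch position i, the positive for global feature b is patch (b, i);
        # every other patch in the batch acts as a negative.
        labels = torch.arange(B, device=g.device) * P + i
        loss = loss + F.cross_entropy(logits, labels)
    return loss / P
```

The text side is treated the same way, with token embeddings playing the role of patches.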

3.5 Image-Text Matching (ITM)

ITM predicts whether image-text pairs are matched or not, treated as a binary classification problem. The fusion encoder takes the visual and linguistic representations as input, with the [CLS] token serving as the joint representation, which is then fed into a fully connected layer to predict the matching probability ϕ(I, T), where (I, T) is an image-text pair. Cross-entropy loss is employed for ITM.

Loss of ITM. Here H(⋅, ⋅) is the cross-entropy and y denotes the ground-truth label. Reference: [1]
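A minimal sketch of an ITM head, assuming the fusion encoder’s [CLS] output is available; the class layout and hidden size are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ITMHead(nn.Module):
    """Binary classifier over the fusion encoder's [CLS] representation."""
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, 2)  # matched vs. not matched

    def forward(self, cls_embedding, labels):
        logits = self.classifier(cls_embedding)     # (B, 2)
        return F.cross_entropy(logits, labels)      # labels: 1 = matched, 0 = not
```

Negative pairs for ITM are typically constructed by pairing an image with a non-matching caption from the batch.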

3.6 Masked Language Modeling (MLM)

MLM aims to predict ground truth labels of masked text tokens conditioned on surrounding text tokens and image representations. The loss is defined using cross-entropy loss similar to BERT.

Loss of MLM. Here Tᵐˢᵏ represents the masked text, Φ(I, Tᵐˢᵏ) is the predicted token probability, and yᵐˢᵏ is the ground truth. Reference: [1]
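A hedged sketch of the masked-token cross-entropy, assuming BERT-style masking has already been applied and unmasked positions are marked with -100 so they are ignored; names and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def mlm_loss(token_logits, mlm_labels):
    """
    token_logits: (B, L, V) vocabulary logits from the fusion encoder,
                  conditioned on the image and the masked text.
    mlm_labels:   (B, L) original token ids at masked positions, -100 elsewhere.
    """
    return F.cross_entropy(
        token_logits.view(-1, token_logits.size(-1)),
        mlm_labels.view(-1),
        ignore_index=-100,  # only the masked positions contribute to the loss
    )
```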

Thus the overall training objective of the model is:

Overall Objective. Reference: [1]
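Reconstructed from the captions above, the overall objective is simply the sum of the five losses (the first three together form the triple contrastive part that gives TCL its name):

```latex
\mathcal{L} = \mathcal{L}_{\mathrm{CMA}} + \mathcal{L}_{\mathrm{IMC}} + \mathcal{L}_{\mathrm{LMI}} + \mathcal{L}_{\mathrm{ITM}} + \mathcal{L}_{\mathrm{MLM}}
```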

4. Experiments:

4.1. Pre-training Datasets:

Pretraining Datasets. Reference: [1]
  • Utilized pre-training datasets include COCO, Visual Genome (VG), Conceptual Captions (CC), and SBU Captions, totaling 4.0M unique images and 5.1M image-text pairs.
  • Additionally, the CC12M dataset was used, leading to large-scale pre-training data with 14.97M unique images and 16M image-text pairs.

4.2. Downstream Tasks:

4.2.1. Image-Text Retrieval:

Performance comparison on zero-shot image-text retrieval. Reference: [1]
  • Tasks include text retrieval (TR) with image queries and image retrieval (IR) with text queries.
  • Evaluated on Flickr30K and COCO datasets in both fine-tuning and zero-shot settings.

4.2.2. Visual Question Answering (VQA):

  • Predicting answers given images and questions.
  • Implemented as a generation problem.

4.2.3. Visual Entailment (SNLI-VE):

  • Predicting if an image semantically entails a given text.
  • Three-class classification problem.

4.2.4. Visual Reasoning (NLVR2):

  • Determines if a natural language caption is true about a pair of photographs.
  • Input: text and two images.
Performance comparison on vision+language tasks. Reference: [1]

5. Implementation Details:

  • Experiments were performed on 8 NVIDIA A100 GPUs with the PyTorch framework.
  • Vision encoder: ViT-B/16 with 12 layers and 85.8M parameters.
  • Text encoder and fusion encoder implemented by a 6-layer transformer.
  • Model trained for 30 epochs with a batch size of 512.
  • Mini-batch AdamW optimizer with weight decay of 0.02.
  • Learning rate schedule: warmed up to 1e-4 over the first 2,000 iterations, then decreased with a cosine decay strategy (see the sketch after this list).
  • Various data augmentation techniques were applied.
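For readers who want to reproduce the optimizer setup, here is a hedged PyTorch sketch of AdamW with weight decay 0.02, a linear warm-up to 1e-4 over 2,000 iterations, and cosine decay afterwards; the model and total step count are placeholders.

```python
import math
import torch

model = torch.nn.Linear(768, 768)  # placeholder for the actual TCL model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.02)

warmup_steps, total_steps = 2_000, 100_000  # total_steps is a placeholder

def lr_lambda(step: int) -> float:
    if step < warmup_steps:
        return step / warmup_steps                     # linear warm-up to the peak lr
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay toward zero

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Typical loop: loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```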

6. Ablation Study

Ablation study of each component on image-text retrieval tasks. Reference: [1]
Ablation study of the size of pre-training datasets. Reference: [1]

The ablation study explores the effectiveness of newly proposed modules, IMC and LMI, in enhancing multi-modal representation learning. IMC with stronger data augmentation significantly improves performance. Incorporating LMI further boosts performance, highlighting the importance of localized and structural information. Scaling up pre-training datasets to 14M samples notably enhances performance, suggesting potential for further improvement with larger datasets.

7. Results:

  • Outperformed existing state-of-the-art methods in image-text retrieval, VQA, VE, and NLVR2 tasks.
  • Significantly improved performance compared to baselines, especially in zero-shot transfer and fine-tuning settings.
  • Ablation studies demonstrated the effectiveness of newly proposed modules (IMC and LMI) in improving multi-modal representation learning.
  • Larger pre-training datasets led to significant performance boosts.
  • The optimal momentum coefficient was observed as m = 0.5, contrary to previous findings.

Overall, the experiments showcase the effectiveness of the proposed method across various tasks and datasets, demonstrating improvements over existing state-of-the-art approaches.

8. Critical Analysis and Conclusions

8.1. Potential Limitations

  • Data Bias: Like many AI models, TCL may reflect biases present in its training data. Underrepresenting certain groups might lead to poorer performance in some instances.
  • Computational Cost: While innovative, the methodology could be costly in computing resources, since each encoder keeps a momentum copy and several objectives (three contrastive losses plus ITM and MLM) are optimized jointly.

8.2. Conclusions

The paper introduced TCL, a novel vision-language pretraining framework that goes beyond existing approaches by incorporating intra-modal supervision and leveraging local information through local mutual information maximization. By ensuring meaningful representations within each modality, TCL facilitates improved cross-modal alignment and joint multi-modal embedding learning. Experimental results on established benchmarks showcase TCL’s superiority over state-of-the-art methods, underscoring its efficacy in vision-language tasks.

9. References

[1] Yang et al., “Vision-Language Pre-Training with Triple Contrastive Learning,” CVPR 2022.

This blog was written as part of the coursework of GNR 638 (Machine Learning for Remote Sensing II) offered by IIT Bombay during Spring 2023–24 under the guidance of Prof. Biplab Banerjee.

Written by:

  • Ayush Patil - 200070012
  • Sartaj Islam - 200050128
  • Margav Savsani - 200050072
