ClipCap: CLIP Prefix for Image Captioning

Uppalamukesh
7 min read · Mar 30, 2024


Paper: https://arxiv.org/abs/2111.09734

Image captioning is the task of generating a descriptive caption for a given input image. To do this well, a model must capture the intricacies of the image and learn the relationships between visual and semantic features during training.
Typical image captioning systems use an encoder-decoder framework: the encoder projects the visual cues of the image into a latent space, and a textual decoder "decodes" the latent vectors into captions. These models tend to be resource-hungry and slow to train, since they need to learn the relationship between textual and visual cues, especially when trained from scratch.

To mitigate these problems, the authors of this paper propose a new model called ClipCap, which uses a pre-trained CLIP model together with GPT-2 to generate captions. CLIP is trained on roughly 400 million image-text pairs using a contrastive loss, which enables it to build a shared representation that captures the correlation between visual and textual content. Reusing this model therefore saves training time and reduces data requirements.
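As a point of reference, extracting a frozen CLIP image embedding with the Hugging Face transformers library might look like the minimal sketch below (the openai/clip-vit-base-patch32 checkpoint and the file name are illustrative assumptions, not necessarily what the authors used):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a pre-trained CLIP model and keep it frozen (no gradient updates).
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # any input image (hypothetical file)
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    # A single vector summarizing the visual content of the image.
    image_embedding = clip.get_image_features(**inputs)  # shape: (1, 512) for ViT-B/32
```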

In ClipCap, the authors propose producing a fixed-size prefix for each caption by applying a mapping network to the image's CLIP embedding. This prefix is concatenated to the caption embeddings and fed to a language model, which is fine-tuned alongside the mapping network. The authors chose GPT-2 as the language model due to its strong generative abilities.
The main contributions of this paper are:

1) A lightweight captioning approach that utilizes pre-trained frozen models for both visual and textual processing.

2) Even when the language model is fine-tuned, the approach is simpler and faster to train, while demonstrating results comparable to the state of the art on challenging datasets.

Method

Image from the ClipCap: CLIP Prefix for Image Captioning paper

Objective
Given a dataset of image-caption pairs, the goal is to learn to generate meaningful captions for unseen input images. Each caption is represented as a sequence of tokens, padded to a maximal length ℓ. The training objective then becomes:
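Restating the paper's formulation, where x^i is the i-th image, c^i_1, …, c^i_ℓ its padded caption tokens, N the number of training pairs, and θ the trainable parameters:

\[
\max_{\theta} \; \sum_{i=1}^{N} \log p_{\theta}\left(c_{1}^{i}, \ldots, c_{\ell}^{i} \mid x^{i}\right)
\]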

Since the prefix already contains the required semantic information, the authors use an autoregressive language model that predicts the next token without considering future tokens. Thus the objective becomes:
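With p^i_1, …, p^i_k denoting the prefix embeddings produced by the mapping network, the next-token objective (again restating the paper's formulation) reads:

\[
\max_{\theta} \; \sum_{i=1}^{N} \sum_{j=1}^{\ell} \log p_{\theta}\left(c_{j}^{i} \mid p_{1}^{i}, \ldots, p_{k}^{i}, c_{1}^{i}, \ldots, c_{j-1}^{i}\right)
\]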

Overview
The visual features are extracted using the visual encoder of a pre-trained CLIP model. A mapping network then maps the CLIP embedding to k embedding vectors, each with the same dimension as a word embedding. The overall input is obtained by concatenating these prefix embeddings to the caption embeddings, and is fed into the language model, which generates the caption in an autoregressive manner.
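To make the data flow concrete, here is a minimal PyTorch sketch of this forward pass (not the authors' released code; the MLP architecture, the prefix length k = 10, and the checkpoint names are illustrative assumptions):

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Illustrative sizes: k = 10 prefix vectors, CLIP ViT-B/32 (512-d), GPT-2 small (768-d).
prefix_length, clip_dim, gpt_dim = 10, 512, 768

gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Mapping network: projects one CLIP embedding to k GPT-2-sized prefix vectors.
mapping = nn.Sequential(
    nn.Linear(clip_dim, (gpt_dim * prefix_length) // 2),
    nn.Tanh(),
    nn.Linear((gpt_dim * prefix_length) // 2, gpt_dim * prefix_length),
)

def caption_logits(clip_embedding, caption_ids):
    """clip_embedding: (batch, clip_dim); caption_ids: (batch, seq_len)."""
    prefix = mapping(clip_embedding).view(-1, prefix_length, gpt_dim)
    caption_embeds = gpt2.transformer.wte(caption_ids)   # GPT-2 word embeddings
    inputs_embeds = torch.cat([prefix, caption_embeds], dim=1)
    # GPT-2 predicts the caption tokens autoregressively from this combined sequence.
    return gpt2(inputs_embeds=inputs_embeds).logits

# Example with a dummy CLIP embedding and a tokenized caption.
dummy_clip = torch.randn(1, clip_dim)
caption_ids = tokenizer("A dog playing in the park.", return_tensors="pt").input_ids
logits = caption_logits(dummy_clip, caption_ids)          # (1, k + seq_len, vocab_size)
```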

The authors propose two variants of this architecture:

1) In the first variant, the language model is fine-tuned while training the mapping network (an MLP in this case). CLIP itself is not fine-tuned, as the authors found no benefit in caption quality while training time increases. They postulate that the CLIP space already encapsulates the required information, so adapting it to specific styles does not add flexibility.
2) In the second variant, the language model is kept frozen during training and only the mapping network is trained. Here, the authors observe that a more expressive transformer mapping network is needed to compensate for the frozen language model. The transformer has two inputs: the CLIP embeddings and a learned constant. The learned constant offers two advantages: first, it helps retrieve meaningful information from the CLIP embedding through multi-head attention; second, it helps the language model adapt to the new data, improving the generated captions (a rough sketch of such a mapping network follows below).
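Below is a minimal sketch of what such a learned-constant transformer mapping network could look like (illustrative layer counts and dimensions; it follows the idea described above rather than the authors' exact implementation):

```python
import torch
import torch.nn as nn

class TransformerMapper(nn.Module):
    def __init__(self, clip_dim=512, gpt_dim=768, prefix_length=10,
                 clip_tokens=10, num_layers=8):
        super().__init__()
        self.clip_tokens = clip_tokens
        # Project the CLIP embedding into a short sequence of tokens.
        self.linear = nn.Linear(clip_dim, clip_tokens * gpt_dim)
        # Learned constant queries that will become the prefix.
        self.prefix_const = nn.Parameter(torch.randn(prefix_length, gpt_dim))
        layer = nn.TransformerEncoderLayer(d_model=gpt_dim, nhead=8,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, clip_embedding):
        # clip_embedding: (batch, clip_dim)
        batch = clip_embedding.shape[0]
        clip_seq = self.linear(clip_embedding).view(batch, self.clip_tokens, -1)
        const = self.prefix_const.unsqueeze(0).expand(batch, -1, -1)
        # Attention lets the constant queries pull information from the CLIP tokens.
        out = self.transformer(torch.cat([clip_seq, const], dim=1))
        return out[:, self.clip_tokens:]  # keep only the prefix positions
```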

Training and Validation

Datasets: The COCO captions, nocaps, and Conceptual Captions are used.
The COCO dataset is split according to the Karpathy split into training, validation, and test sets; it contains roughly 120,000 images with 5 captions each.
The nocaps dataset is designed to test models on unseen images and classes. It consists of validation and test sets only, with COCO serving as the training set. It has three parts: "in-domain" contains images portraying only COCO classes, "near-domain" contains both COCO and novel classes, and "out-of-domain" contains only novel classes.
The Conceptual Captions dataset consists of 3M image-caption pairs; it is more challenging than COCO because it covers a much wider variety of image and caption styles.

Evaluation metrics: For the COCO dataset, BLEU, METEOR, CIDEr, and SPICE are used. For nocaps, CIDEr and SPICE are used, and for Conceptual Captions, ROUGE-L, CIDEr, and SPICE scores are reported.

Results

Images from the ClipCap: CLIP Prefix for Image Captioning paper

Observations
1) Fine-tuning the language model can be susceptible to overfitting, especially as the number of trainable parameters increases. On the highly diverse Conceptual Captions dataset, fine-tuning gave superior results, while avoiding fine-tuning achieved better results on COCO. On nocaps, both variants perform similarly, so the lighter model without fine-tuning is preferable. From these observations, the authors conclude that datasets with a unique style, which require more expressiveness, are more likely to benefit from fine-tuning.

2) Increasing the prefix length improves performance up to a certain value, consistent with the findings of Li and Liang (https://arxiv.org/abs/2101.00190).

Images from the ClipCap: CLIP Prefix for Image Captioning paper

As can be seen from the figures, increasing the prefix size while allowing fine-tuning results in overfitting to the training set.

However, without fine-tuning, there is improvement in both training and test evaluations.

3) Using an MLP as the mapping network yields better results when the language model is fine-tuned. However, when the language model is frozen, a transformer mapping network gives better results. It can therefore be concluded that when fine-tuning the language model, the extra expressive power of the transformer is unnecessary.

Critical Analysis

  1. From the results, there is a significant speedup in training, bringing the training time down to 80h from 1200h while maintaining comparable results, which demonstrates the effectiveness of pre-trained models like CLIP.
  2. The reliance on CLIP for visual feature extraction may introduce biases inherited from the training data. As a result, the model’s performance may be limited in scenarios where CLIP fails to capture relevant visual features. Improving the object detection capabilities of CLIP could help mitigate this issue and enhance the model’s robustness across diverse datasets.

Conclusions

In conclusion, ClipCap offers a promising direction for image captioning by leveraging pre-trained models and proposing a streamlined architecture. The approach essentially learns to adapt the existing semantic understanding of the pre-trained models to the style of the target dataset, instead of learning new semantic entities. However, several challenges remain, including mitigating biases and conducting thorough evaluations across diverse datasets. Addressing these issues could further enhance the effectiveness and applicability of the approach in real-world scenarios.

References

  1. Mokady, R., Hertz, A., & Bermano, A. H. (2021). ClipCap: CLIP Prefix for Image Captioning.
  2. Li, X. L., & Liang, P. (2021). Prefix-Tuning: Optimizing Continuous Prompts for Generation.
  3. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision.

Authors

  1. Uppala Mukesh (@Uppalamukesh)
  2. Rhishabh Suneeth (@Rhishabhsuneeth)
