Image and text features extraction with BLIP and BLIP-2: how to build a multimodal search engine

Connect images and text with the power of ViT and LLM to perform the image-text retrieval task

Enrico Randellini
11 min read · Sep 26, 2023
Image from https://www.italiaatavola.net/

Introduction

Images and language seem to belong to two distinct worlds, and so do the problems generally related to them. To solve an image classification or an object segmentation task we are interested in the images, while to solve a sentiment analysis or an intent recognition task we are interested only in the related texts. But if I ask you to describe an image, a relationship is established between language and vision. New branches of research were born from the interaction between images and text, such as visual question answering, image captioning, image-text retrieval and the generation of new images from text prompts.

In this story I am interested in the image-text retrieval task. Suppose you have a database where thousands of images are stored and you want to find just one of them. Some of the latest algorithms let you approach the search in several ways. If you are lucky, you have a copy of that image or you know its exact location in the database. Otherwise, you can start your search by using a similar image, by describing it, or by combining a similar image with a description of it.

Here, I want to show you how the algorithms BLIP [1] and BLIP-2 [2] work to solve the image-text retrieval task. In this regard, as a good Italian who loves cooking, I created a small toy dataset with various types of dishes. I downloaded a total of 68 images from the web, dividing them among pizzas, first courses, second courses and desserts. Then, for each image I wrote its corresponding caption in Italian. At inference time, because the models are trained in English, I automatically translated each caption into English using the deep-translator package. Figure [1] shows an example of each type of dish.

Figure 1. Example of dishes used in the toy dataset. Each image is paired with a caption first written in Italian and then translated into English
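As a reference, here is a minimal sketch of how each caption can be translated with the deep-translator package (the Italian caption below is just an illustrative example):

from deep_translator import GoogleTranslator

# Translate an Italian caption of the toy dataset into English before feeding it to the models
translator = GoogleTranslator(source="it", target="en")
caption_it = "Pizza margherita con pomodoro, mozzarella e basilico"  # illustrative caption
caption_en = translator.translate(caption_it)
print(caption_en)  # e.g. "Margherita pizza with tomato, mozzarella and basil"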

You can find the dataset and the code of this story in my GitHub repo at this link.

BLIP and BLIP-2

The attention mechanism has given considerable impetus to the development of new technologies for the analysis of text and images. Nowadays the transformer architecture is used everywhere, from large language models (LLMs) such as BERT, GPT or Llama to Vision Transformers (ViT).

BLIP, as well as BLIP-2, stands for Bootstrapping Language-Image Pre-training for unified vision-language understanding and generation. By means of LLMs and ViTs, BLIP and BLIP-2 obtain very impressive results on vision-language tasks such as image captioning, visual question answering and image-text retrieval. They are vision-language pre-training models. Strictly speaking, they are trained in two steps: first they are pre-trained in a generic way, then they are fine-tuned on specific downstream tasks.

BLIP

The BLIP architecture is made of two components, see Figure [2]: a visual transformer used as an image encoder, which divides an input image into patches and encodes them as a sequence of embeddings, and a transformer acting on the text that, depending on the case, works as an encoder or a decoder. In order to pre-train a unified model with both understanding and generation capabilities, the architecture is a multimodal mixture of encoder-decoder, namely a multi-task model which can operate in one of three functionalities:

  • unimodal encoder, which separately encodes image and text, where the text encoder is the same as BERT
  • image-grounded text encoder, which injects visual information by inserting one additional cross-attention layer between the self-attention layer and the feed-forward network in each transformer block of the text encoder
  • image-grounded text decoder, which replaces the bidirectional self-attention layers of the image-grounded text encoder with causal self-attention layers
Figure 2. Pre-training model architecture and objectives of BLIP

The pre-training phase jointly optimizes the following three objectives:

  • Image-Text Contrastive Loss (ITC) activates the unimodal encoder. It aims to align the feature space of the visual and text transformers by encouraging positive image-text pairs to have similar representations, in contrast to the negative pairs (see the sketch right after this list)
  • Image-Text Matching Loss (ITM) activates the image-grounded text encoder. It aims to learn an image-text multimodal representation that captures the fine-grained alignment between vision and language. ITM is a binary classification task, where the model uses an ITM head to predict whether an image-text pair is positive (matched) or negative (unmatched) given its multimodal feature
  • Language Modeling Loss (LM) activates the image-grounded text decoder, which aims to generate textual descriptions given an image
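To make the ITC objective concrete, here is a minimal, self-contained sketch of an image-text contrastive loss in PyTorch. It is a simplified InfoNCE-style loss over a batch of paired embeddings, not the exact BLIP implementation, which also uses a momentum encoder and soft labels:

import torch
import torch.nn.functional as F

def itc_loss(image_feats, text_feats, temperature=0.07):
    # image_feats, text_feats: (batch, dim) projected embeddings of paired images and captions
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature   # cosine similarities of all pairs
    targets = torch.arange(logits.size(0))                # the i-th image matches the i-th text
    # symmetric cross-entropy over the image-to-text and text-to-image directions
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2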

Each image-text pair only requires one forward pass through the visual transformer and three forward passes through the text transformer, where different functionalities are activated to compute the three losses. To perform efficient pre-training while leveraging multi-task learning, the text encoder and decoder share all the parameters except for the self-attention layers, because it is these layers that actually capture the differences between the encoding and decoding tasks.

The pre-training phase starts by initializing the image transformer from ViT-B pretrained on ImageNet and the text transformer from BERT base. The pre-training dataset is the union of COCO, Visual Genome, Conceptual Captions, Conceptual 12M, SBU Captions and the LAION dataset, for a total of 129M image-text pairs.

In the fine-tuning phase the model is optimized for a specific task. In particular, for the image-text retrieval task, the pretrained model is fine-tuned on COCO using the ITC and ITM losses.

BLIP-2

Because pretrained vision models offer high-quality visual representations and pretrained LLMs offer strong language generation and zero-shot transfer abilities, the BLIP-2 architecture and its training strategy try to extract the best of the two components. To reduce the computation cost and counteract the issue of catastrophic forgetting, the unimodal pre-trained models remain frozen during the pre-training phase.

BLIP-2 introduces a new component, the Querying Transformer (Q-Former), a trainable module that bridges the gap between a frozen image encoder and a frozen LLM. As shown in Figure [3], the Q-Former is pre-trained with a two-stage strategy. In the first stage, vision-language representation learning forces the Q-Former to learn the visual representations most relevant to the text. In the second stage, vision-to-language generative learning connects the output of the Q-Former to a frozen LLM and trains the Q-Former so that its output visual representations can be interpreted by the LLM.

Figure 3. BLIP-2 framework with the two stage pre-training strategy

As shown in Figure [4], the Q-Former consists of two transformer submodules sharing the same self-attention layers: an image transformer that interacts with the frozen image encoder for visual feature extraction, and a text transformer that works both as a text encoder and as a text decoder. The input to the image transformer is a set of 32 learnable query vectors, each of dimension 768, which interact with each other through self-attention layers and with the frozen image features through cross-attention layers. The Q-Former is initialized with the pre-trained weights of BERT-base, whereas the cross-attention layers are randomly initialized.

Figure 4. (Left) Model architecture of Q-Former and BLIP-2’s first-stage vision-language representation learning objectives. (Right) The self-attention masking strategy for each objective to control query-text interaction
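To fix ideas about the shapes involved, here is a purely conceptual sketch (not the actual BLIP-2 code) of how the 32 learnable queries attend to the frozen image features; the ViT-g dimensions used below are only illustrative:

import torch
from torch import nn

# The Q-Former input: 32 learnable query vectors of dimension 768, stored as a trainable parameter
num_queries, hidden_dim = 32, 768
query_tokens = nn.Parameter(torch.zeros(1, num_queries, hidden_dim))

# Frozen image features from the ViT encoder, e.g. 257 tokens of size 1408 for ViT-g/14 on a 224x224 image
image_feats = torch.randn(1, 257, 1408)

# Inside each Q-Former block the queries attend to each other (self-attention, shared with the text
# transformer) and to the frozen image features (cross-attention); here only the cross-attention shapes matter
cross_attention = nn.MultiheadAttention(embed_dim=hidden_dim, num_heads=12,
                                        kdim=1408, vdim=1408, batch_first=True)
refined_queries, _ = cross_attention(query_tokens, image_feats, image_feats)
print(refined_queries.shape)  # torch.Size([1, 32, 768]): one output embedding per query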

In the vision-language representation learning stage, the Q-Former is connected to a frozen image encoder and is pretrained using image-text pairs. The purpose of this stage is to train the Q-Former so that the queries learn to extract the visual representation that is most informative for the text. As in BLIP, three pre-training objectives are jointly optimized, namely Image-Text Contrastive Learning, Image-grounded Text Generation and Image-Text Matching. The objectives share the same input format and model parameters, but each employs a different attention masking strategy between queries and text to control their interaction (see Figure [4]).

Figure 5. BLIP-2’s second-stage vision-to-language generative pre-training, which bootstraps from frozen large language models (LLMs). (Top) Bootstrapping a decoder-based LLM (e.g. OPT). (Bottom) Bootstrapping an encoder-decoder-based LLM (e.g. FlanT5).

In the generative pre-training stage, the Q-Former is connected on one side to the frozen visual encoder and on the other side to a frozen LLM. As shown in Figure [5], a fully-connected (FC) layer projects the output query embeddings into the same dimension as the text embeddings of the LLM. The projected query embeddings are then prepended to the input text embeddings. They function as soft visual prompts that condition the LLM on the visual representation extracted by the Q-Former.

BLIP-2 has been tested with two types of LLMs: decoder-based LLMs and encoder-decoder-based LLMs. For decoder-based LLMs, unsupervised-trained models of the OPT family are used. In this case the pretraining uses the language modeling loss, where the frozen LLM is tasked to generate the text conditioned on the visual representation from the Q-Former. For encoder-decoder-based LLMs, instruction-trained models of the FlanT5 family are used. In this case the pretraining uses the prefix language modeling loss, where the text is split into two parts: the prefix text is concatenated with the visual representation as input to the LLM's encoder, and the suffix text is used as the generation target for the LLM's decoder.
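Again as a purely conceptual sketch (illustrative dimensions, not the actual BLIP-2 code), the second stage boils down to projecting the Q-Former outputs and prepending them to the LLM's token embeddings:

import torch
from torch import nn

q_former_dim, llm_dim = 768, 2560                     # 2560 is, e.g., the hidden size of OPT-2.7b
refined_queries = torch.randn(1, 32, q_former_dim)    # output of the Q-Former (32 queries)

fc = nn.Linear(q_former_dim, llm_dim)                 # the fully-connected projection layer
visual_prompts = fc(refined_queries)                  # (1, 32, llm_dim): soft visual prompts

text_embeds = torch.randn(1, 10, llm_dim)             # embeddings of a 10-token caption
llm_inputs = torch.cat([visual_prompts, text_embeds], dim=1)  # (1, 42, llm_dim), fed to the frozen LLM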

For a more accurate understanding of BLIP-2, besides the original paper, I suggest the following very well done story.

Experiments

For the image-text retrieval task I tried five models: BLIP pretrained, BLIP finetuned on COCO, BLIP-2 pretrained with ViT-g/14, BLIP-2 pretrained with ViT-L/14, and BLIP-2 finetuned on COCO. In order to see the differences, for each model I executed the same pipeline, constructed the same similarity matrices and tested them on the same images and texts.

Loading models

Both the BLIP and BLIP-2 architectures are released in the LAVIS project, whose GitHub repository is at this link. In order to use the pretrained models it is sufficient to install the LAVIS library with the pip command

pip install salesforce-lavis

Once installed, from the model_zoo we can see all the supported models

from lavis.models import model_zoo
print(model_zoo)

==================================================
Architectures                 Types
==================================================
albef_classification          ve
albef_feature_extractor       base
albef_nlvr                    nlvr
albef_pretrain                base
albef_retrieval               coco, flickr
albef_vqa                     vqav2
alpro_qa                      msrvtt, msvd
alpro_retrieval               msrvtt, didemo
blip_caption                  base_coco, large_coco
blip_classification           base
blip_feature_extractor        base
blip_image_text_matching      base, large
blip_nlvr                     nlvr
blip_pretrain                 base
blip_retrieval                coco, flickr
blip_vqa                      vqav2, okvqa, aokvqa
blip2_opt                     pretrain_opt2.7b, pretrain_opt6.7b, caption_coco_opt2.7b, caption_coco_opt6.7b
blip2_t5                      pretrain_flant5xl, pretrain_flant5xl_vitL, pretrain_flant5xxl, caption_coco_flant5xl
blip2_feature_extractor       pretrain, pretrain_vitL, coco
blip2                         pretrain, pretrain_vitL, coco
blip2_image_text_matching     pretrain, pretrain_vitL, coco
pnp_vqa                       base, large, 3b
pnp_unifiedqav2_fid
img2prompt_vqa                base
clip_feature_extractor        ViT-B-32, ViT-B-16, ViT-L-14, ViT-L-14-336, RN50
clip                          ViT-B-32, ViT-B-16, ViT-L-14, ViT-L-14-336, RN50
gpt_dialogue                  base

To perform the image-text retrieval task with the BLIP architecture, we can instantiate the base pretrained model with the ViT-B image transformer, together with the image and text processors, with the following command
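A minimal sketch of this step, based on the LAVIS helper load_model_and_preprocess:

import torch
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# BLIP base feature extractor (ViT-B image transformer) plus its image and text processors
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip_feature_extractor", model_type="base", is_eval=True, device=device)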

To instantiate an image-text retrieval model with the BLIP-2 architecture, you have to use the same command with name = "blip2_feature_extractor" and model_type = "pretrain", "pretrain_vitL" or "coco" for, respectively, the model with the ViT-g/14 image transformer from EVA-CLIP, the model with the ViT-L/14 image transformer from CLIP, and the model finetuned on the COCO dataset. As specified in the source code, the blip2_feature_extractor functionality is obtained with the first-stage model, that is the Q-Former together with the vision transformer.
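For example, a sketch of loading the BLIP-2 first-stage model with the ViT-L image transformer:

# BLIP-2 first-stage model (Q-Former + frozen ViT-L) with its processors
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip2_feature_extractor", model_type="pretrain_vitL", is_eval=True, device=device)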

Once the model is instantiated, we can load an image, translate the text from Italian to English, preprocess both of them and, finally, extract the image, text and multimodal feature embeddings as follows
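A minimal sketch of these steps, where the image path and the Italian caption are illustrative placeholders from the toy dataset:

from PIL import Image
from deep_translator import GoogleTranslator

raw_image = Image.open("images/pizza_margherita.jpg").convert("RGB")   # illustrative path
caption_it = "Pizza margherita con pomodoro, mozzarella e basilico"    # illustrative caption
caption_en = GoogleTranslator(source="it", target="en").translate(caption_it)

# Preprocess image and text with the processors returned by load_model_and_preprocess
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
text = txt_processors["eval"](caption_en)
sample = {"image": image, "text_input": [text]}

# Unimodal and multimodal embeddings; the *_proj fields are the low-dimensional
# projected embeddings used to compute similarities
features_image = model.extract_features(sample, mode="image")
features_text = model.extract_features(sample, mode="text")
features_multimodal = model.extract_features(sample, mode="multimodal")

print(features_image.image_embeds_proj.shape)
print(features_text.text_embeds_proj.shape)
print(features_multimodal.multimodal_embeds.shape)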

Unfortunately, the LAVIS package does not support the BLIP model finetuned on COCO. It is possible to test this model by using the code of the BLIP project released at this GitHub. To clone the project, instantiate the model, preprocess the image and text, and extract the image, text and multimodal features, we have to perform the following steps
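A sketch of these steps, under the assumption that the BLIP repository is cloned locally and that the COCO retrieval checkpoint URL from its README is still valid (the image path and caption are again placeholders):

git clone https://github.com/salesforce/BLIP.git
cd BLIP
pip install -r requirements.txt

import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import transforms
from torchvision.transforms.functional import InterpolationMode
from models.blip_itm import blip_itm   # module of the cloned BLIP repository

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# BLIP base checkpoint finetuned for retrieval on COCO (URL from the BLIP repository README)
model_url = "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_retrieval_coco.pth"
model = blip_itm(pretrained=model_url, image_size=384, vit="base").eval().to(device)

transform = transforms.Compose([
    transforms.Resize((384, 384), interpolation=InterpolationMode.BICUBIC),
    transforms.ToTensor(),
    transforms.Normalize((0.48145466, 0.4578275, 0.40821073),
                         (0.26862954, 0.26130258, 0.27577711)),
])

raw_image = Image.open("../images/pizza_margherita.jpg").convert("RGB")   # illustrative path
image = transform(raw_image).unsqueeze(0).to(device)
caption = "margherita pizza with tomato, mozzarella and basil"            # already translated

with torch.no_grad():
    # Image feature: [CLS] token of the visual encoder, projected and normalized
    image_embeds = model.visual_encoder(image)
    image_feat = F.normalize(model.vision_proj(image_embeds[:, 0, :]), dim=-1)

    # Text feature: [CLS] token of the text encoder in unimodal mode, projected and normalized
    text_input = model.tokenizer(caption, return_tensors="pt").to(device)
    text_output = model.text_encoder(text_input.input_ids,
                                     attention_mask=text_input.attention_mask,
                                     return_dict=True, mode="text")
    text_feat = F.normalize(model.text_proj(text_output.last_hidden_state[:, 0, :]), dim=-1)

    # (Multimodal features can be obtained by calling the text encoder with
    # encoder_hidden_states=image_embeds, as done in the BLIP_ITM forward pass)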

Similarity matrices

With the image-text pairs of the toy dataset at hand, for each model I calculated the lists of image, text and multimodal embedding vectors. I wanted to see how the various embeddings relate to each other, so I calculated their cosine similarities. For each model, I checked some combinations of cosine similarity, such as image-image, image-text, multimodal-multimodal and image-multimodal. The following example shows the pipeline for a given model supported by the LAVIS package.
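A sketch of this pipeline, assuming pairs is a list of (image path, English caption) tuples built from the toy dataset and that model, vis_processors and txt_processors come from load_model_and_preprocess as above; for simplicity only the first token of each embedding is kept:

import torch
import torch.nn.functional as F
from PIL import Image

image_embs, text_embs, image_full_embs, multi_embs = [], [], [], []

for img_path, caption_en in pairs:
    raw_image = Image.open(img_path).convert("RGB")
    image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
    text = txt_processors["eval"](caption_en)
    sample = {"image": image, "text_input": [text]}

    feat_img = model.extract_features(sample, mode="image")
    feat_txt = model.extract_features(sample, mode="text")
    feat_mm = model.extract_features(sample, mode="multimodal")

    # Projected (low-dimensional) embeddings, used for image-image and image-text similarities
    # (a simplification: BLIP-2 returns one embedding per learnable query, and its retrieval
    # code keeps the best-matching one)
    image_embs.append(feat_img.image_embeds_proj[:, 0, :])
    text_embs.append(feat_txt.text_embeds_proj[:, 0, :])
    # Full-dimensional embeddings, needed to compare images with multimodal features
    image_full_embs.append(feat_img.image_embeds[:, 0, :])
    multi_embs.append(feat_mm.multimodal_embeds[:, 0, :])

def cosine_similarity_matrix(a_list, b_list):
    """Pairwise cosine similarities between two lists of (1, d) embeddings."""
    a = F.normalize(torch.cat(a_list), dim=-1)
    b = F.normalize(torch.cat(b_list), dim=-1)
    return (a @ b.t()).detach().cpu().numpy()

sim_image_image = cosine_similarity_matrix(image_embs, image_embs)
sim_image_text = cosine_similarity_matrix(image_embs, text_embs)
sim_multi_multi = cosine_similarity_matrix(multi_embs, multi_embs)
sim_image_multi = cosine_similarity_matrix(image_full_embs, multi_embs)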

The similarity matrices show more or less the same pattern for each of the selected models. Figures [6] and [7] show the results for the BLIP-2 pretrained ViT-L model. From Figure [6] left, you can see that for the image-image case the highest values are concentrated along the diagonal, that is, the image embedding features are grouped by dish. However, especially because of the pasta with tomato sauce, there are many high similarity scores between images of pasta and pizza. The same behavior, but with lower similarity values, is obtained for the image-text (Figure [6] right) and image-multimodal cases (Figure [7] left). The multimodal-multimodal case does not show particular patterns.

Figure 6. (Left) Image-image matrix similarity. (Right) Image-text matrix similarity. Both matrices refer to the BLIP-2 ViT-L model
Figure 7. (Left) Image-multimodal matrix similarity. (Right) Multimodal-multimodal matrix similarity. Both matrices refer to the BLIP-2 ViT-L model

Search function

Once the image-text pairs are loaded and their image, text and multimodal embeddings are calculated, it's time to search for an image in the database of dishes. For each model I tried the image-image and the text-image search modalities, although you can try any type of search combination depending on the available information, as shown in the function below.
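A sketch of such a search function, reusing the embedding lists built above (the text query is just an illustrative example):

import torch
import torch.nn.functional as F
from deep_translator import GoogleTranslator

def search(query_emb, database_embs, k=5):
    """Return the indices of the k database items most similar to the query embedding."""
    q = F.normalize(query_emb, dim=-1)
    db = F.normalize(torch.cat(database_embs), dim=-1)
    scores = (q @ db.t()).squeeze(0)          # cosine similarity with every stored embedding
    return torch.topk(scores, k=k).indices.tolist()

# Image-to-image search: query with the image embedding of a target dish
top5 = search(image_embs[0], image_embs, k=5)

# Text-to-image search: embed a free-text query and compare it with the image embeddings
query_it = "pappardelle al ragù di cinghiale"          # illustrative query in Italian
query_en = GoogleTranslator(source="it", target="en").translate(query_it)
text_sample = {"text_input": [txt_processors["eval"](query_en)]}
query_emb = model.extract_features(text_sample, mode="text").text_embeds_proj[:, 0, :]
top5 = search(query_emb, image_embs, k=5)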

Below I show some results obtained with the BLIP-2 pretrained ViT-L model which, at least for these examples, seems to perform better than the other models. I give a target image on the left and I ask the model to return the five most similar images in the dataset, shown on the right.

In the following cases I give a target text on the left and I ask the model to return the five most similar images in the dataset, shown on the right. A small mistake was made for the second request, where the pappardelle with wild boar sauce appear in second place in the search, while rigatoni all'amatriciana are in first place.

Try more experiments! Furthermore, if you have a dataset with a lot of image-text pairs, you can finetune the models on it. The BLIP and BLIP-2 projects give you all the code to perform finetuning operations on your custom dataset.

Thanks for reading. If you liked it, give it a clap!

References

[1] BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation, https://arxiv.org/abs/2201.12086

[2] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models, https://arxiv.org/abs/2301.12597
