Femiloye Oyerinde
Jun 12, 2023


Example of a visual-question-answering task (from BLIP-2’s paper)


Recent years have seen the emergence of multi-modal models that leverage powerful unimodal vision and language models to perform tasks once considered out of reach, such as visual question answering, image captioning, image-to-text generation, and phrase grounding. These models extract meaningful representations from both image and text data and use them to solve the complex tasks for which they are designed. The research area concerned with training such models is referred to as Vision-Language Pre-training.

Vision-language pre-training is an interdisciplinary research field that combines methods and knowledge from computer vision and natural language processing. Several methods for training such models have been proposed (Radford et al., 2021; Alayrac et al., 2022). However, these methods incur a high computational cost because they train large vision and language models (usually Transformers) end-to-end. Consequently, researchers are exploring alternative methods that avoid end-to-end training, which can lead to cost savings and improved efficiency. One of these methods is BLIP-2.

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

BLIP-2 (Li et al., 2023) proposes a method that uses frozen, pre-trained vision and language models and bridges the modality gap between the two. It is compute-efficient because it trains far fewer parameters than previous approaches based on end-to-end training. The authors use a lightweight Querying Transformer (Q-Former) as a bottleneck between the frozen image encoder and the frozen language model. In principle, the image is passed to the image encoder to extract visual features, and those features are then passed to the language model to interpret. However, there is a challenge: since the frozen language model was never trained on image data, it cannot make good sense of the extracted visual representations without further help. To solve this problem, Q-Former uses a set of learnable query vectors and is pre-trained in two stages: (1) vision-language representation learning with a frozen image encoder and (2) vision-to-language generative learning with a frozen language model.

The Q-Former model

BLIP-2’s framework from the original paper.

Q-Former is a transformer-based architecture with two sub-modules: (1) an image transformer that interacts with the visual features from the frozen image encoder and (2) a text transformer that can encode and decode text. It uses a set of learnable query vectors to extract the visual features that are most informative of the text accompanying the image. The queries are passed as input to the image transformer, where they interact with each other through the self-attention layers and with the visual representations from the image encoder through the cross-attention layers, as shown in the figure below.
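To make the query mechanism described above concrete, here is a minimal NumPy sketch of one self-attention step followed by one cross-attention step. It is an illustration under simplifying assumptions, not the authors' implementation: it uses a single head and a single layer with random weights, and it omits layer normalization and feed-forward layers. The image feature shape (257 x 1024) and all function and variable names are assumptions chosen for the example; only the 32 x 768 query shape comes from the article.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q_in, kv_in, w_q, w_k, w_v):
    """Single-head scaled dot-product attention (simplified)."""
    q = q_in @ w_q
    k = kv_in @ w_k
    v = kv_in @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

rng = np.random.default_rng(0)

def init(shape):
    # Small random weights standing in for trained parameters.
    return rng.normal(scale=0.02, size=shape)

num_queries, hidden = 32, 768      # query shape mentioned in the article
num_patches, vis_dim = 257, 1024   # ASSUMED output shape of the frozen image encoder

queries = init((num_queries, hidden))                   # learnable query vectors
image_feats = rng.normal(size=(num_patches, vis_dim))   # stand-in for frozen encoder output

# 1) Queries interact with each other (self-attention), with a residual connection.
h = queries + attention(queries, queries,
                        init((hidden, hidden)), init((hidden, hidden)), init((hidden, hidden)))

# 2) Queries read from the image features (cross-attention), with a residual connection.
z = h + attention(h, image_feats,
                  init((hidden, hidden)), init((vis_dim, hidden)), init((vis_dim, hidden)))

print(z.shape)  # (32, 768): a fixed-size summary regardless of num_patches
```

The point of the sketch is the bottleneck: however many patch features the image encoder produces, the output always has one row per query.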

Pre-Training Stages

1. Vision-Language Representation Learning

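At this stage, the Q-Former is connected to the frozen image encoder and trained on image-text pairs so that the queries learn to extract the visual representation that is most informative of the text. This is done by jointly optimizing the three objectives discussed briefly below.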

The first stage of Q-Former representation learning pre-training (adapted from the original paper)
  • Image-Text Contrastive Learning (ITC): The objective is to align the image representation with the text representation, so that the similarity score between the two is high for a matching pair. This means the Q-Former captures visual features from the image that align with what the language model can interpret or decode. The mutual information between image and text representations is maximized by contrasting the similarity of positive image-text pairs against that of negative pairs (similar to Radford et al., 2021). Given the output query representation Z (32 x 768) of the image transformer (a sub-module of Q-Former) and the text representation t, which is the output embedding of the [CLS] token, the similarity between each query output in Z and t is computed, and the highest value is taken as the image-text similarity.
  • Image-grounded Text Generation (ITG): The objective here is to train Q-Former to generate text given an image as input. This way it learns to use the query vectors to extract relevant visual features that the language model can understand and process.
  • Image-Text Matching (ITM): The output query embeddings Z are used to perform a binary classification task in which the goal is to predict whether an image-text pair matches or not. This further helps the Q-Former output query embeddings that align with the representation of the text tokens.
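The ITC and ITM computations described above can be sketched in a few lines of NumPy. This is a toy illustration, not the authors' code: the function names, the small embedding size, and the temperature of 0.07 are assumptions made for the example. The symmetric cross-entropy form of the contrastive loss follows the CLIP-style setup the article refers to, and averaging the per-query ITM logits follows the description in the BLIP-2 paper.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def itc_similarity(z, t):
    """z: (B, Q, D) query outputs, t: (B, D) text [CLS] embeddings.
    Returns a (B, B) matrix: for each image-text pair, the highest
    cosine similarity over the Q query outputs."""
    sim = np.einsum("iqd,jd->ijq", l2_normalize(z), l2_normalize(t))
    return sim.max(axis=-1)

def itc_loss(sim, temperature=0.07):  # temperature value is an ASSUMPTION
    """Symmetric contrastive loss: matched pairs sit on the diagonal."""
    logits = sim / temperature
    idx = np.arange(sim.shape[0])
    image_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    text_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return (image_to_text + text_to_image) / 2

def itm_logits(z_pair, w, b):
    """z_pair: (Q, D) query outputs for one image-text pair.
    Each query gets a 2-class logit; the logits are averaged over queries."""
    return (z_pair @ w + b).mean(axis=0)

rng = np.random.default_rng(0)
batch, num_queries, dim = 4, 32, 16   # small toy sizes, not the real dimensions
t = rng.normal(size=(batch, dim))
z = rng.normal(size=(batch, num_queries, dim))
z[:, 0, :] = t  # toy setup: one query of each image matches its paired text exactly

sim = itc_similarity(z, t)
loss = itc_loss(sim)
match_logits = itm_logits(z[0], rng.normal(size=(dim, 2)), np.zeros(2))
print(sim.shape, match_logits.shape)
```

Because one query per image was set equal to its paired text in this toy setup, the diagonal of the similarity matrix is 1 and each image is most similar to its own caption, which is the behavior the contrastive objective pushes toward during training.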

2. Vision-to-Language Generative Learning

During the vision-to-language generative pre-training stage, two types of LLM (Large Language Model) architecture were considered: a decoder-based LLM and an encoder-decoder-based LLM, as depicted in the image below.

Generative learning stage of the Q-Former model (adapted from the original paper)

The Q-Former is connected to a frozen LLM and pre-trained so that the LLM’s generative ability can be harnessed. The output query embeddings from the image transformer are linearly projected to the same dimension as the text embeddings that the language model expects. These projected query embeddings are then prepended to the input text embeddings, where they serve as soft visual prompts that condition the language model on the visual features extracted by the Q-Former. For the decoder-based architecture, the LLM is tasked with generating the text that goes with the image from the visual features encoded by the Q-Former. For the encoder-decoder-based architecture, the text accompanying the image is split into two parts: the first part is joined with the visual representations as input to the encoder, and the LLM is tasked with generating the second part.
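The projection and prepending steps described above amount to a matrix multiply and a concatenation. The sketch below shows the shapes involved; it is a rough illustration rather than the authors' implementation. The LLM hidden size (2560), the text length, the example caption, the split position, and all names are assumptions for the example. Masking the loss so that only text positions count is also an assumption about a typical setup, not something the article states.

```python
import numpy as np

rng = np.random.default_rng(0)
num_queries, qformer_dim = 32, 768   # query output shape from the article
llm_dim = 2560                       # ASSUMED hidden size of the frozen LLM
text_len = 12                        # ASSUMED number of text tokens

# Output query embeddings from the Q-Former (random stand-in).
z = rng.normal(size=(num_queries, qformer_dim))

# Trainable linear projection into the LLM's embedding space.
w_proj = rng.normal(scale=0.02, size=(qformer_dim, llm_dim))
b_proj = np.zeros(llm_dim)
visual_prompt = z @ w_proj + b_proj                  # (32, llm_dim)

# Stand-in for the frozen LLM's own token embedding lookup.
text_embeds = rng.normal(size=(text_len, llm_dim))   # (12, llm_dim)

# Soft visual prompts go first, followed by the text embeddings.
llm_inputs = np.concatenate([visual_prompt, text_embeds], axis=0)

# Decoder-based LLM: compute the language-modeling loss on text positions only
# (ASSUMED masking scheme for illustration).
loss_mask = np.concatenate([np.zeros(num_queries, dtype=bool),
                            np.ones(text_len, dtype=bool)])

# Encoder-decoder LLM: split the caption into a prefix (encoder input, joined
# with the visual prompt) and a suffix (the generation target).
def split_prefix_suffix(tokens, split_at):
    return tokens[:split_at], tokens[split_at:]

caption = "a cat sitting on a wooden table".split()  # hypothetical caption
prefix, suffix = split_prefix_suffix(caption, 3)

print(llm_inputs.shape, prefix, suffix)
```

Only the projection (and the Q-Former that produces z) would receive gradient updates in this stage; the LLM that consumes llm_inputs stays frozen.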


In conclusion, Q-Former makes it possible to harness the capabilities of powerful pre-trained vision and language models without updating their weights, and to apply them to downstream tasks such as visual question answering and image-to-text generation. Q-Former bridges the gap between the two modalities and aligns their representations, with improved performance on benchmark datasets to show for it. BLIP-2 outperforms Flamingo, a large vision-language model, by 8.7% on zero-shot Visual Question Answering (VQA), despite having 54 times fewer trainable parameters.


Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.

Alayrac, J., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., Ring, R., Rutherford, E., Cabi, S., Han, T., Gong, Z., Samangooei, S., Monteiro, M., Menick, J., Borgeaud, S., Brock, A., Nematzadeh, A., Sharifzadeh, S., Binkowski, M., Barreira, R., Vinyals, O., Zisserman, A., and Simonyan, K. Flamingo: a visual language model for few-shot learning. arXiv preprint arXiv:2204.14198, 2022.

Li, J., Li, D., Savarese, S., and Hoi, S. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. arXiv preprint arXiv:2301.12597, 2023.

Hugging Face. (n.d.). A Dive into Vision-Language Models. Retrieved from https://huggingface.co/blog/vision_language_pretraining


