Multimodality with OpenVINO — BLIP

Paula Ramos · Published in OpenVINO-toolkit · 6 min read · Mar 3, 2023
What is Paula doing with BLIP?

Picture it: a not-too-distant future where you can chat with AI as you do with your best friends. Maybe you are thinking about ChatGPT and Bing Chat, but those are just text conversations! No, imagine you could also send an audio clip, a video, or an image, and the AI could take a look and chat with you about what’s happening.

Imagine the possibilities. You can ask the AI to describe what’s happening in a picture as if you were on a TV show. Or have it tell you a story based on a video you sent it. Feeling sad? Don’t worry; the AI can recommend a song to get you dancing again. And if you’re in a museum taking photos, the AI can tell you what’s coming up.

Maybe the technology is here, and we haven’t explored it yet. Imagine what it will be like to have a futuristic companion always ready to chat and fully aware of all your multimedia adventures. 🔮

For those multimedia adventures, I would like to explain the concept of multimodality. But before talking about multimodality, let me explain what unimodality means, because in Deep Learning (DL) the concepts of unimodal and multimodal models are also relevant, but with slightly different meanings.

Unimodal models in DL typically refer to models that take a single input type, such as image or text data. For example, a convolutional neural network (CNN) is a unimodal model that takes image data as input and learns to classify images based on their visual features.

On the other hand, multimodal models in DL refer to models that take multiple types of input data, such as image and text data. These models are designed to learn from multiple sources of information to make predictions or perform other tasks. For example, a multimodal model can take both an image and a textual description of that image and learn to generate a caption that describes the image.

CLIP, for instance, initiated a significant shift from the unimodal approach to multimodality, which involves using multiple input data types, such as images and text, to perform tasks or make predictions. Multimodal models can take in various data types and employ them to achieve complex tasks. Here is the flow of CLIP.

CLIP model summary, source: Learning Transferable Visual Models From Natural Language Supervision
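To make the contrastive idea behind CLIP a bit more concrete, here is a minimal zero-shot classification sketch using the CLIP port in Hugging Face transformers. The checkpoint name, image file, and candidate labels below are illustrative choices of mine, not something taken from the paper or the OpenVINO notebook:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative public checkpoint; any CLIP checkpoint works the same way.
ckpt = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(ckpt)
processor = CLIPProcessor.from_pretrained(ckpt)

image = Image.open("demo.jpg").convert("RGB")        # any local image
labels = ["a photo of a dog", "a photo of a cat"]    # free-form text prompts

# Encode both modalities and compare them in the shared embedding space.
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-to-text similarity scores, turned into probabilities over the labels.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```

Because the text side is free-form, you can classify against any set of labels without retraining; that flexibility is exactly what multimodal pre-training buys you.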

I recently wrote a blog post about the emergence of text-to-image generators. But have you ever wondered about large language models and their potential when combined with vision? Imagine training a robust model like GPT-3, but this time using both images and text so it can learn from both sources of information. The result? It can be applied to various tasks, such as uploading an image and getting a description. You can also ask natural language questions about the image’s content.

BLIP (Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation) is a pre-training framework that achieves state-of-the-art results on a wide range of vision-language tasks. Visual language processing is a branch of artificial intelligence that focuses on creating algorithms that allow computers to understand images and their contents more accurately. The most popular tasks in this area of AI are:

  • Text to Image Retrieval — a semantic task that aims to find the most relevant image for a given text description (see the small code sketch after this list).
  • Image Captioning — a semantic task that aims to provide a text description for image content.
  • Visual Question Answering — a semantic task that aims to answer questions based on image content.
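As a quick illustration of the first task, here is a minimal sketch using BLIP’s image-text matching head from Hugging Face transformers. The checkpoint, file name, and caption are hypothetical choices of mine; ranking a set of candidate images by this score is what gives you text-to-image retrieval:

```python
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForImageTextRetrieval

# Illustrative checkpoint; any BLIP image-text-matching checkpoint works similarly.
ckpt = "Salesforce/blip-itm-base-coco"
processor = BlipProcessor.from_pretrained(ckpt)
model = BlipForImageTextRetrieval.from_pretrained(ckpt)

image = Image.open("demo.jpg").convert("RGB")     # any local image
caption = "a dog playing in the park"             # candidate text description

# Score how well the text matches the image using the image-text-matching head.
inputs = processor(images=image, text=caption, return_tensors="pt")
with torch.no_grad():
    itm_logits = model(**inputs).itm_score        # shape (batch, 2): [no-match, match]
match_prob = torch.softmax(itm_logits, dim=1)[0, 1].item()
print(f"match probability: {match_prob:.3f}")
```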

But wait, it gets even more exciting! Now, you can run these optimized models using OpenVINO. This means that you can get even faster and more efficient results. Are you ready to try it and see the power of combining text and vision?

Recently, in the OpenVINO notebooks repository, our engineers created an excellent notebook that teaches us how to use BLIP with OpenVINO. In this notebook, we learn and validate how to use the models for image captioning and visual question answering. I want to share two flow diagrams that give us an idea of how each one works.

Let’s start with the simpler one: “Image Captioning,” which generates a natural language description of an image. In the BLIP algorithm, image captioning is performed by combining a pre-trained image encoder with a transformer-based language model to generate a caption for the input image. The overall inference process works as follows:

BLIP Image Captioning general inference flow.

And training and fine-tuning can be categorized into these steps (a short code sketch follows the list):

  1. Image Encoding: The input image is first fed through a pre-trained image encoder (a vision transformer in BLIP) that generates a feature representation of the image, often referred to as image features. The image features capture the crucial visual information in the image.
  2. Text Generation: The image features are then passed to a transformer-based language model that generates a caption for the input image. The language model takes the image features as input and generates a sequence of words that describe the image; it is trained to attend to the relevant parts of the image while generating the caption.
  3. Fine-Tuning: The entire image captioning model is then fine-tuned on a large-scale image captioning dataset. During fine-tuning, the model is trained to generate descriptive and natural-sounding captions.
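In practice, those pieces come pre-assembled in the released BLIP checkpoints. Here is a minimal captioning sketch using the BLIP port in Hugging Face transformers; the checkpoint and image file are illustrative choices of mine, and the OpenVINO notebook starts from the same family of models before converting them:

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Illustrative checkpoint; the notebook uses the same family of BLIP models.
ckpt = "Salesforce/blip-image-captioning-base"
processor = BlipProcessor.from_pretrained(ckpt)
model = BlipForConditionalGeneration.from_pretrained(ckpt)

image = Image.open("demo.jpg").convert("RGB")   # any local image

# Preprocess the image, run the vision encoder + text decoder, decode the caption.
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```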

The second one is Visual Question Answering (VQA). This algorithm is given an image and a natural language question about the image, and its task is to produce an accurate answer to the question. The overall inference process works as follows:

BLIP Visual Question Answering (VQA) general inference flow.

Again, training and fine-tuning can be categorized into these steps (another short code sketch follows the list):

  1. Image Encoding: The input image is fed through a pre-trained image encoder (a vision transformer in BLIP) that produces a feature representation of the image, often referred to as image features.
  2. Question Encoding: Using a transformer-based language model, the natural language question is tokenized and encoded into a fixed-length vector. The encoder also incorporates positional information to capture the order of the tokens in the question.
  3. Fusion: The image feature vector and the question encoding are combined using an attention mechanism. This allows the model to focus on the relevant parts of the image and the question while answering the question.
  4. Answer Generation: The fused representation is passed to an answer decoder that generates the answer text. BLIP frames VQA as open-ended answer generation rather than classification over a fixed set of answers.
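Here is the matching minimal VQA sketch, again using the Hugging Face transformers BLIP port; the checkpoint, image, and question are illustrative choices of mine:

```python
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

# Illustrative checkpoint; swap in the model used in the OpenVINO notebook if you prefer.
ckpt = "Salesforce/blip-vqa-base"
processor = BlipProcessor.from_pretrained(ckpt)
model = BlipForQuestionAnswering.from_pretrained(ckpt)

image = Image.open("demo.jpg").convert("RGB")
question = "How many people are in the picture?"

# The processor encodes both modalities; generate() produces the answer tokens.
inputs = processor(images=image, text=question, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
```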

I think the flowcharts are even more straightforward than the explanation in the text, but both should give you a general idea of each flow. 😊

OpenVINO BLIP Notebook

Here is the notebook where you can try caption generation and question answering on your own images: https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/blip-visual-language-processing/blip-visual-language-processing.ipynb
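The notebook splits BLIP into its vision, text encoder, and text decoder parts and converts each one to OpenVINO IR before running inference. As a rough, simplified sketch of just the conversion idea, assuming a recent OpenVINO release where openvino.convert_model accepts a PyTorch module (the real notebook handles the text decoder and its cached key/values separately, so treat this only as an outline):

```python
import torch
import openvino as ov
from transformers import BlipForConditionalGeneration

ckpt = "Salesforce/blip-image-captioning-base"    # illustrative checkpoint
model = BlipForConditionalGeneration.from_pretrained(ckpt).eval()

# Small wrapper so the traced module returns a plain tensor instead of a ModelOutput.
class VisionWrapper(torch.nn.Module):
    def __init__(self, vision_model):
        super().__init__()
        self.vision_model = vision_model

    def forward(self, pixel_values):
        return self.vision_model(pixel_values=pixel_values, return_dict=False)[0]

# Convert only the vision encoder here; the notebook also converts the text parts.
dummy_image = torch.zeros(1, 3, 384, 384)         # BLIP base expects 384x384 inputs
ov_vision = ov.convert_model(VisionWrapper(model.vision_model), example_input=dummy_image)
ov.save_model(ov_vision, "blip_vision_model.xml")

# Compile and run the converted encoder on CPU (or "GPU", "AUTO", ...).
core = ov.Core()
compiled = core.compile_model(ov_vision, "CPU")
image_features = compiled(dummy_image.numpy())[0]
print(image_features.shape)
```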

Conclusions

The BLIP algorithm leverages its pre-trained image and language components to perform accurate image captioning. By combining a pre-trained image encoder with a transformer-based language model and fine-tuning them on a large-scale dataset, BLIP can generate high-quality captions for a wide range of images.

BLIP’s VQA algorithm takes advantage of its pre-trained language and vision components to perform accurate question answering on images. It also incorporates attention mechanisms to better fuse the image and question representations and generate more accurate answers.

Enjoy the blog and enjoy the notebook! 😊

#iamintel #openvino #blip #generativeai

About me:

Hi, all! My name is Paula Ramos. I have been an AI enthusiast and have worked with Computer Vision since the early 2000s. Developing novel integrated engineering technologies is my passion. I love to deploy solutions that real people can use to solve their equally real problems. If you’d like to share your ideas on how we could improve our community content, drop me a line! 😉 I will be happy to hear your feedback.

Here is my LinkedIn profile: https://www.linkedin.com/in/paula-ramos-41097319/
