Introduction to Vision-Language Models

Navendu Brajesh
Oct 16, 2023


Vision-Language Models (VLMs) have been an active research topic since 2015, though they gained far more attention in 2020–21 with the emergence of OpenAI's CLIP and Google's ALIGN. One of the earliest well-known papers in this area, "Show and Tell: A Neural Image Caption Generator" by Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan (2015), used deep learning techniques to automatically generate captions for images. However, significant work started in 2019 with transformer-based, dual-stream vision-language architectures.

From a technology perspective, VLMs came into existence due to limitations in existing computer vision and language models. Conventional computer vision models perform very well at identifying objects but often struggle with understanding context, semantic gaps, and the significance and correlation of the objects in an image. They are limited to analyzing visual input and have no generative language capabilities. Language models, on the other hand, perform extremely well with language and text but cannot process images. VLMs bring together the best of both worlds (vision and language) and make each more versatile.

1.1 Limitations of current technology that led to VLMs

Computer Vision’s Bottlenecks: Traditional computer vision models are undoubtedly skilled when it comes to pinpointing objects within visual datasets. However, they falter when expected to grasp the broader context, comprehend semantic irregularities, and interpret object-to-object interactions. These models only focus on processing visual information, so they are unable to take into account the nuances that linguistic data can provide.

The Limitations of Language Models: Large language models, on the other hand, exhibit a flair for text analytics and generation. However, they fall short in decoding visual cues. Additionally, these models can sometimes grapple with linguistic ambiguities and are handicapped when it comes to verifying their interpretations against real-world visual references, considering they only operate within the domain of textual data.

The Obstacles in Current Approaches: The rise of deep learning has certainly reshaped the computer vision landscape. Despite this, pressing issues remain unresolved. Creating a dataset that serves the purpose of visual recognition is not only labor-intensive but also financially burdensome. These datasets, more often than not, focus narrowly on a select range of visual concepts. Moreover, while certain models may excel in benchmark tests, they often exhibit poor resilience in stress scenarios, raising questions about the universal applicability of deep learning in the realm of computer vision.

The Need for a Hybrid Solution: Vision-Language Models aim to bridge these gaps by merging the analytical and generative power of language models with the object recognition capabilities of computer vision models. The objective is to create a more holistic model that can handle both text and images at the same time, thus offering a more rounded understanding of real-world data.

1.2 Introduction

The last couple of years (2021–2022) have been the era of large language models. Now is the time for VLMs, which have the potential to change the way we interact, communicate, and engage in complex tasks using AI.

For example, in healthcare, where precision, timeliness, wider reach, and cost-effectiveness are vital, VLMs have the potential to make a measurable impact. Their capacity to interpret complex medical imagery such as X-rays or MRI scans opens the door to more sophisticated medical diagnostics and treatment planning.

Here is an example taken from Yakoub Bazi, "Vision-Language Model for Visual Question Answering in Medical Imagery", March 2023, to demonstrate the potential of VLMs in healthcare.

PathVQA images, questions, and answers (DOI:10.3390/bioengineering10030380)

Another example, demonstrating knowledge extraction and reasoning along with visual summarization, visual question answering, image captioning, and semantic search across modalities, is taken from Wenhai Wang and others, "VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks", May 2023.

OpenGVLab: VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

2. Main tasks performed by VLMs

Generation Tasks

Visual Question Answering: Visual Question Answering (VQA) consists of interpreting visual elements such as images or videos to provide textual responses to specific queries. VQA can play a great role in interactive educational platforms, virtual customer assistance, and helping visually impaired people. User experience is one of the main contributors to any successful transformation, and a VLM has the potential to enhance it substantially by understanding visual data in detail and answering contextual queries.
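
As a concrete (and hedged) illustration, the sketch below runs VQA with the Hugging Face transformers pipeline and the publicly available dandelin/vilt-b32-finetuned-vqa checkpoint. The image file name is a hypothetical placeholder, and output details can vary across library versions.

```python
from PIL import Image
from transformers import pipeline

# Load a VQA pipeline backed by a ViLT model fine-tuned for visual question answering.
vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")

# "street_scene.jpg" is a hypothetical local image file.
image = Image.open("street_scene.jpg")
answers = vqa(image=image, question="How many people are crossing the street?")

# The pipeline returns candidate answers ranked by confidence.
print(answers[0]["answer"], answers[0]["score"])
```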

Visual Captioning: Generate descriptive text captions for visual elements like images or videos. Visual captioning, along with translation services, has great potential in education, entertainment, news and media, and many other fields. This task is also valuable in various applications, including image search engines, accessibility services for the visually impaired, and various content management systems.
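
A minimal captioning sketch, assuming the Hugging Face transformers library and the Salesforce/blip-image-captioning-base checkpoint (the file name below is a placeholder):

```python
from PIL import Image
from transformers import pipeline

# BLIP generates a short natural-language caption for an image.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

caption = captioner(Image.open("holiday_photo.jpg"))  # hypothetical image file
print(caption[0]["generated_text"])
```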

Visual Commonsense Reasoning: Visual Commonsense Reasoning focuses on assessing the underlying correlations, associations, and finer details in visual content, including images and videos. It simulates human cognitive visual perception while drilling down to the finest details of the content. The model can identify objects, understand correlations among them, and maintain context. It can even predict an object's behavior, making it useful for self-driving cars, surveillance systems, and robotics that interact with their environment.

Visual Generation: The creation of new images or videos based on textual descriptions or prompts is gaining enormous popularity. This is an evolving and fast-growing industry where synthetic video and images are used in advertising, education, graphic design, virtual reality, and many other use cases. The share of AI-generated content on the internet is growing rapidly, and some projections suggest it may eventually exceed the volume of content humans have created so far.
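
A hedged text-to-image sketch, assuming the diffusers library, the runwayml/stable-diffusion-v1-5 checkpoint, and a CUDA GPU; the prompt and output file name are arbitrary:

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a pretrained text-to-image diffusion model in half precision.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")  # drop .to("cuda") and use float32 for a (slow) CPU run

image = pipe("a watercolor illustration of a lighthouse at dawn").images[0]
image.save("lighthouse.png")
```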

Visual Summarization: The task here involves condensing a large set of visual materials like images or videos into succinct and informative text-based summaries. This again involves a deep understanding of images, context, and relationships between objects, and then summarizing the image and video. The model can encapsulate the core elements, objects, and themes and provide concise or detailed textual narration as asked. The uses and applications vary, from museum and art summaries to e-commerce product listings or surveillance footage analysis.

Classification Tasks

Multimodal Affective Computing: In this aspect, the emphasis is on the interpretation of both visual and text-based inputs to discern emotional states or moods. The integration of multimodal affective computing is pivotal to human-computer interaction. This brings empathetic and context-sensitive responses to various human-centric uses. These models will understand, differentiate, and interpret human emotions and then act accordingly. This capability helps build applications like mental health tracking, empathetic grievance systems, and interactive entertainment platforms.
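
One hedged way to prototype this is zero-shot image classification with CLIP over a set of emotion labels. CLIP is not a validated emotion classifier, so this is only an illustrative sketch; the image file and label set are placeholders.

```python
from PIL import Image
from transformers import pipeline

# Zero-shot classification: CLIP scores the image against free-form text labels.
classifier = pipeline("zero-shot-image-classification", model="openai/clip-vit-base-patch32")

labels = ["a happy person", "a sad person", "an angry person", "a neutral expression"]
result = classifier(Image.open("customer_selfie.jpg"), candidate_labels=labels)

# Results are ranked by score; treat them as a rough signal, not a diagnosis.
print(result[0]["label"], result[0]["score"])
```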

Natural Language for Visual Reasoning: This task evaluates the credibility of textual statements that describe visual elements like images or videos. With the emergence of generative AI, factual verification of extracted and generated information has become one of the most complex problems to solve. Natural Language for Visual Reasoning helps with fact verification and content moderation for images and videos.

Retrieval Tasks

Visual Retrieval: Visual retrieval is the task of finding and returning relevant visual content based on text queries or descriptions, and it is one of the more complex tasks for a VLM. It changes how we search for content: with a single query, one can find specific pictures in large repositories or content management systems. Businesses no longer rely only on textual information; they need related images for more detail. Online shopping, for example, is one area where visual retrieval helps more than text alone, because customers can find items by showing pictures. Browsing media collections also becomes much easier, saving many hours. VLMs connect pictures with words, making data easy for everyone to access.
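
A minimal retrieval sketch with CLIP, via Hugging Face transformers and the openai/clip-vit-base-patch32 checkpoint; the catalogue file names and the query are hypothetical.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical product catalogue already on disk.
catalogue = ["shoe.jpg", "handbag.jpg", "watch.jpg"]
images = [Image.open(path) for path in catalogue]
query = "a red leather handbag"

inputs = processor(text=[query], images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_text has shape (num_queries, num_images); higher means more similar.
scores = outputs.logits_per_text.softmax(dim=-1)
best = scores.argmax(dim=-1).item()
print(f"Best match for '{query}': {catalogue[best]} (p={scores[0, best].item():.2f})")
```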

In the present AI world, visual retrieval is used in online shopping, health services, art and museums, real estate, clothing and fashion, entertainment and media, law enforcement, automobiles, travel, and farming.

Vision-Language Navigation (VLN): Vision-Language models will change the navigation landscape in a few years. VLN bridges the gap between visual cues and linguistic instructions. Users can now guide systems using natural language combined with real-world images. This fusion ensures precise localization and enhanced understanding in complex environments. This concept is vital for applications such as autonomous vehicles, robotics, and augmented reality experiences, where understanding both visual cues and natural language is essential for navigation. VLN can help technicians locate and fix specific components in large manufacturing or industrial units quickly. One can simply describe or show a visual clue, and the system can guide them to the exact spot. This can greatly reduce the time spent searching and increase the efficiency of maintenance operations. Other uses could be assistive technologies for visually impaired people or navigating customers to desired products or stores.

Translation Tasks

Multimodal Machine Translation (MMT): Vision-language models can translate text while considering additional visual context, such as images or videos. This task enhances the accuracy and richness of translations, making them valuable for applications like international e-commerce, cross-cultural social media platforms, and global news dissemination. VLMs play a pivotal role in translation by bridging visual perception and linguistic understanding. By processing images alongside their associated text, VLMs provide contextual nuances often missed by traditional text-only models, which helps them translate ambiguous phrases accurately. MMT has applications in translating image captions, comics, or instructional materials where visuals are integral.

3. Technical Components

Often confused with text-to-image models, vision-language models process both images and natural language text to perform various tasks. These models are designed for tasks such as image captioning, image-text matching, and visual question answering. A VLM consists of three key elements:
• an image encoder
• a text encoder, and
• a strategy to fuse information from the two encoders

Unlocking the Power of Vision-Language Models: A Generic Architecture Representation
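
To make these three elements concrete, here is a deliberately tiny, illustrative PyTorch sketch of the generic architecture above (not any particular published model); the layer sizes, vocabulary size, and fusion-by-concatenation choice are all arbitrary assumptions.

```python
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    """Toy dual-encoder VLM: image encoder + text encoder + fusion head."""

    def __init__(self, vocab_size=30522, embed_dim=256):
        super().__init__()
        # 1) Image encoder: a small CNN that maps an image to one feature vector.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        # 2) Text encoder: token embeddings run through a small Transformer.
        self.token_embedding = nn.Embedding(vocab_size, embed_dim)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        # 3) Fusion strategy: concatenate both vectors and project with an MLP.
        self.fusion = nn.Sequential(
            nn.Linear(2 * embed_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, images, token_ids):
        img_vec = self.image_encoder(images)                                       # (B, D)
        txt_vec = self.text_encoder(self.token_embedding(token_ids)).mean(dim=1)   # (B, D)
        return self.fusion(torch.cat([img_vec, txt_vec], dim=-1))                  # (B, D)

# One random "image" and one random token sequence, just to check the shapes.
model = TinyVLM()
fused = model(torch.randn(1, 3, 224, 224), torch.randint(0, 30522, (1, 16)))
print(fused.shape)  # torch.Size([1, 256])
```

In practice, a task-specific head (for example, an answer classifier for VQA or a text decoder for captioning) would sit on top of the fused representation.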

Whether you’re a researcher, a developer, or simply an AI enthusiast who is keen to leverage AI capabilities in real life, the world of Vision-Language Models has something to offer everyone.

4. VLM in action

Vision-language models have shown significant practical utility in real-world applications. Google Photos, Pinterest Visual Search, Snapchat, Facebook Rosetta, Airbnb's amenities detection, and eBay's image search all use models that understand both pictures and words. VLM-led capabilities are helping these applications gain an edge over their competitors, and many of their main features are built on VLMs. However, some efforts, such as Facebook's Automatic Alt Text, Google's Cloud Vision API, and offerings from other tech giants and startups, have been less successful and, at times, have not captured the full context or nuances of images. The challenges faced by these applications emphasize the complexities of combining visual and textual understanding and the importance of rigorous testing, refinement, and consideration of ethical implications.

The best of the VLMs is yet to come. However, the uses are growing exponentially. In banks, these models help by answering customer questions, which can be in the form of words or pictures, quickly and correctly. They also help in checking the details on ID cards faster.

In shops, these models make shopping easier. They can find products in the shop from photos taken by customers. They also help in keeping track of products by reading and understanding the labels. Online, they help customers find products they like by understanding both the picture and the words in their search.

In manufacturing, these models look at the pictures of products and their details to check if they are made correctly. They also help make sure safety rules are followed by understanding signs and symbols.

Other uses include helping people who speak different languages and even checking the mood of what people are saying online in memes, cartoons, images, and videos. This makes them useful in many different areas, helping to make things safer and better for users.

5. Available Options

OpenAI's DALL-E 3 generates high-quality images from text, while CLIP understands images in the context of natural language; both showcase multimodal learning. ViLBERT and LXMERT, along with VL-T5, UNITER, and VisualBERT, are designed for various vision-language tasks like visual question answering and object recognition using Transformer-based architectures. Facebook AI's MMF serves as a framework for developing vision-language models, and ERIN focuses on generating narratives for image sequences. Microsoft's Oscar takes a unique approach by aligning object and attribute representations in both images and text, making it versatile for tasks like image captioning and text-based image retrieval.

A few open-source vision-language models exist that can be further fine-tuned. Alibaba has made significant contributions with its Qwen-VL and Qwen-VL-Chat models, which are designed for multi-round question answering in both English and Chinese. MiniGPT-4, another noteworthy model, extends Vicuna with vision capabilities akin to BLIP-2 and aims to approach GPT-4-like multimodal behavior. OpenFlamingo is a collaborative effort from multiple research institutions aiming to replicate DeepMind's Flamingo models, offering a framework for training large autoregressive vision-language models. VisionLLM provides a unique approach by treating images as a "foreign language," thereby aligning vision-centric tasks with language tasks through language instructions. CLIP has been a game-changer in the field, effectively bridging visual and linguistic data for various applications. ALIGN focuses on aligning images and text in a joint embedding space. Lastly, SimVLM specializes in image retrieval and captioning tasks, and the research paper "Learning to Prompt for Vision-Language Models" offers innovative methods for prompting these models.
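
As a hedged illustration of trying one of these open models, the sketch below loads Qwen-VL-Chat through Hugging Face transformers, roughly following its model card. The from_list_format and chat helpers come from the model's bundled remote code and may change between versions, and the image path is a placeholder.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code pulls Qwen's own tokenizer/model code from the Hub.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True
).eval()

# Build a mixed image + text query; "local_photo.jpg" is a hypothetical file.
query = tokenizer.from_list_format([
    {"image": "local_photo.jpg"},
    {"text": "What is shown in this picture?"},
])

# Multi-round chat: pass the returned history back in for follow-up questions.
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```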

6. Summary

Vision-Language Models (VLMs) represent a transformative leap in the field of artificial intelligence, merging the capabilities of natural language processing and computer vision. These models have found practical applications across diverse sectors, including banking, retail, and manufacturing, enhancing user experience, operational efficiency, and safety. However, the journey is not without its challenges, such as data bias and algorithmic limitations, underscoring the need for rigorous testing and validation. With a variety of models, from OpenAI's DALL-E 3 and CLIP to ViLBERT and Microsoft's Oscar, the landscape is rich for further exploration and innovation.

Whether you're a researcher, developer, or AI enthusiast, the world of VLMs offers exciting opportunities for impactful real-world applications.

This area is comparatively less explored yet has immense potential, and it is fast becoming one of the favorite topics among researchers.

7. References

  1. Vision-Language Models for Vision Tasks: A Survey
  2. A Dive into Vision-Language Models — Hugging Face
  3. Computer Vision and its Limitations
  4. Vision + Language Applications: A Survey
  5. Multimodal research in vision and language: A review of current …
  6. Vision-Language Models in Remote Sensing: Current Progress and Future
  7. Robotic Applications of Pre-Trained Vision-Language Models to …

8. Blog Series

This blog is part of a series aiming to provide a comprehensive guide to VLMs, covering their architecture, applications, how to train and fine-tune a VLM, and finally ongoing research and future prospects.

8.1. Published so far

Introduction to Vision-Language Models

8.2. What’s next?

Present and future uses of Vision-Language Models

Off-the-Shelf Vision-Language Models

Business considerations: Why use Vision-Language Models?

Technical considerations: What to consider for Vision-Language Models implementation?

Training & fine-tuning your own Vision-Language Models

Ongoing Research & Future of Vision-Language Models

Stay tuned for actionable insights on implementing vision-language models effectively in your organization or creating your own VLM.


Navendu Brajesh

A lifelong learner. Pursuing research on Vision-Language models and GenAI topics with a group of scholars. https://www.linkedin.com/in/navendubrajesh/