Multimodal AI: Our Approach, Explorations, and Use Cases

Image Generated by DALL-E 3

Introduction: Understanding Multimodal AI

This post kicks off a series exploring multimodal AI and its potential applications across various use cases in the public sector. First things first, what exactly is multimodal AI?

At its core, multimodal AI combines and processes different types of data — called modalities — in both input and output. This approach allows AI systems to understand and generate information across various formats, much like humans do when interpreting their environment. A prime example is OpenAI’s GPT-4o. Users can interact with it through speech, text, images, or videos, and receive responses in speech, text, or generated images.

Vision-Language Models (VLMs) are one of the most common and well-known types of multimodal AI. These models can:

  • Process images and, more recently, videos
  • Understand accompanying text
  • Generate text responses (and, for some models, images)

Think GPT-4o, but without the voice capabilities.

Another kind of multimodal model that is crucial to our work is the multimodal embedding model, a commonly used example being CLIP (Contrastive Language-Image Pre-training). The unique feature of multimodal embedding models is their ability to generate embeddings within a shared embedding space, in this case for images and text. This means:

  • They can take either images or text as input
  • They output embeddings (compressed numerical representations)
  • Embeddings for related concepts are similar across modalities. For instance, the word “apple” and an image of an apple would have very similar embedding values, enabling powerful cross-modal search and retrieval capabilities (see the sketch below).
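
To make the shared embedding space concrete, the sketch below embeds one image and two candidate captions with an off-the-shelf CLIP checkpoint and compares them by cosine similarity. The library, checkpoint, and file name are illustrative assumptions, not a description of our pipeline.

```python
# A minimal sketch of CLIP's shared embedding space, assuming the Hugging Face
# transformers library and the public openai/clip-vit-base-patch32 checkpoint.
# The image path is a placeholder; this is not our production pipeline.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("apple.jpg")  # illustrative local file
inputs = processor(
    text=["an apple", "a bicycle"], images=image, return_tensors="pt", padding=True
)

with torch.no_grad():
    outputs = model(**inputs)

# Both modalities land in the same vector space, so cosine similarity is meaningful.
image_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
similarity = image_emb @ text_emb.T

print(similarity)  # the "apple" caption should score higher than "bicycle"
```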

While these models represent significant advancements, true multimodality — where a single model can process and generate any type of data seamlessly — remains an aspirational goal.

In this article and a few upcoming ones, we’ll explore how we’re leveraging these technologies to create innovative solutions for government agencies, always with an eye toward future developments in this rapidly evolving field.

Clarifying Our Approach to Multimodal AI

To set the stage, let’s address some common misconceptions and clarify our approach to multimodal AI:

1. We’re not training models from scratch — “So, are y’all building a ‘GovTech Griffin’ or something?”

  • We leverage existing research and models developed by well-resourced organisations. Our focus is on efficient use of our limited GPU resources, not training a model from scratch for the sake of it.

2. We’re not just calling APIs — “Why not just use ChatGPT for everything?”

  • While large API models are impressive, they can sometimes be too generic for specialised tasks. Our approach is to construct custom multimodal pipelines, develop self-hosted solutions for sensitive data, and fine-tune permissively licensed open-source models when appropriate and practical. This strategy can outperform generic large API models in specific use cases.
  • However, we don’t over-engineer solutions. If a problem is simple enough to be solved with an API call and the data classification and sensitivity allows for it, we take that approach.

3. We’re focused on multimodal being vision and language (for now) — “But can your AI speak like ChatGPT?”

  • Our current scope of exploration centres on vision and language modalities as these two areas offer numerous opportunities across government agencies. We plan to expand to other modalities in the future, but the core of our multimodal work is on mastering VLMs.

4. We’re building flexible infrastructure, not chasing specific models — “Have you tried the latest model that just dropped?”

  • Our strategy is to develop best practices, build robust infrastructure, and create adaptable pipelines for various use cases so that we can easily swap in newer, better models as they emerge. This approach ensures our work remains relevant despite rapid model evolution.

Our approach to multimodal AI is pragmatic, with a focus on efficiency. Our goal is to create lasting value through the thoughtful application of technology. We are building not just for the present but also for the future, and we seek not only to be driven by use cases but also to inspire new ones.

Ongoing Explorations in Multimodal AI

Our vision is to develop versatile multimodal AI solutions adaptable to specific agency needs while exploring broader techniques. We aim to strike a balance between versatility and specificity in our work.

VLMs typically excel at task-specific applications like image captioning and visual question answering, with some models capable of OCR, open-vocabulary object detection, and region-specific captioning.

Here are some specific areas that we have been (and are currently) exploring.

Small VLM Fine-tuning

We are investigating the potential of fine-tuning small, task-specialised VLMs. This approach aims to create small expert models that can potentially outperform their larger and more general counterparts on specific tasks. Recent trends in AI have seen a resurgence of smaller LLMs and VLMs, typically in the single-digit-billion parameter range, or even in the millions like Florence-2. These compact models offer the advantage of being able to run on-premise or on-device, making them more accessible and practical given our resource constraints.

To clarify, we are not suggesting that fine-tuning these VLMs will outperform other models in classic vision tasks like object detection or segmentation. Classic vision models like YOLO and RT-DETR will still excel at these tasks. Rather, we are focusing on more general and open-ended tasks, such as tagging and captioning, comparing against larger VLMs, not classic vision models.

Our explorations so far have encompassed fine-tuning VLMs from model families such as Bunny, LLaVA, MiniCPM, and InternVL. We’re applying these models to tasks like detailed image captioning and tagging, utilising publicly available data that aligns with various agency use cases.
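
For a flavour of what this fine-tuning involves, here is a minimal parameter-efficient fine-tuning sketch using Hugging Face transformers and peft with a LLaVA-style checkpoint. The model ID, target modules, and hyperparameters are illustrative assumptions rather than our exact recipe, and the training loop itself (data loading, optimiser, and so on) is omitted.

```python
# A minimal LoRA fine-tuning sketch for a small VLM; the checkpoint and
# hyperparameters are illustrative, and the training loop is omitted.
from peft import LoraConfig, get_peft_model
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # stand-in for any small VLM checkpoint
processor = AutoProcessor.from_pretrained(model_id)  # prepares image-text pairs
model = LlavaForConditionalGeneration.from_pretrained(model_id)

# Attach low-rank adapters to the attention projection layers. The base weights
# stay frozen; only the small adapter matrices are updated during training,
# which keeps GPU memory requirements modest.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```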

Through our explorations, we’ve gained several key insights:

  • Data quality is paramount for small models — We’ve observed that smaller models are significantly more sensitive to data quality compared to their larger counterparts. The adage “garbage in, garbage out” is particularly relevant here. To address this, we are prioritising thorough data cleaning processes and strive to ensure that our ground truth data closely aligns with our desired outcomes. This attention to data quality is crucial for the successful fine-tuning of small VLMs.
  • Synthetic data proves valuable for fine-tuning — Obtaining high-quality, real-world data can be challenging, especially during exploratory phases where formal project agreements may not be in place. Even when we do have access to real-life data, it’s often insufficient in quantity or quality for effective training. To overcome this hurdle, we’ve adopted the strategy of generating synthetic data. This involves using larger, more capable models to create ideal training examples through extensive prompt engineering. This approach allows us to supplement our training data with high-quality, task-specific examples.
  • Task-tuned small VLMs can outperform larger, general VLMs in specific benchmarks — Our internal benchmarks have yielded encouraging results. We’ve found that small VLMs, when fine-tuned on high-quality, task-specific data, can perform on par with — and sometimes even outperform — generic VLMs that are several times larger. This finding underscores the potential of specialised, efficient models in tackling specific tasks, challenging the notion that bigger is always better in the world of AI.

These learnings are shaping our approach to developing practical, efficient, and effective multimodal AI solutions for whole-of-government (WOG) use. By focusing on small, specialised models, we’re aiming to create AI tools that are not only powerful but also resource-efficient and adaptable to specific use cases. We hope to share more about our fine-tuning efforts in future articles.

Multimodal Prompting

Most VLMs require text prompting, which necessitates some exploration of prompt engineering. Our investigations have yielded several interesting observations:

  • Effectiveness of simple prompts for VLMs: Open-source small to mid-sized VLMs respond best to straightforward, task-specific instructions. Unlike text-only LLMs, which often benefit from elaborate prompts, VLMs perform optimally with simple directives. For tasks such as image captioning or visual question answering, a concise prompt like “Describe this image in detail” typically outperforms more complex prompting schemes.
  • VLM training effects on prompt responsiveness: The training process of VLMs appears to influence their ability to handle varied text inputs. As these models are fine-tuned on vision-language tasks, they seem to lose some flexibility in processing diverse text prompts — a characteristic that differs from pure language models. For example, anecdotally, VLMs tend to ignore instructions in prompts more often than LLMs do. This reduced versatility may be due to the specific instruct-tuning datasets used in VLM training.
  • VLM-LLM chaining is useful for formatting prompts: Since VLMs are currently not very responsive to complex instructions, we find that chaining a VLM’s output to an LLM for further reformatting or rephrasing can help meet complex desired outcomes (a minimal sketch follows this list). The LLM need not be large, just adept enough at language tasks. This approach allows us to leverage the strengths of both model types: the VLM’s ability to interpret visual information and the LLM’s proficiency in language manipulation and formatting.
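
The sketch below illustrates this chaining pattern, assuming both models are served behind an OpenAI-compatible endpoint; the base URL, model names, and image URL are placeholders rather than a real deployment.

```python
# A minimal VLM -> LLM chaining sketch; the endpoint, model names, and the
# image URL are placeholders for self-hosted deployments.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Step 1: ask the VLM for a plain description using a simple, direct prompt.
caption = client.chat.completions.create(
    model="vlm",  # placeholder name for the vision-language model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in detail."},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
).choices[0].message.content

# Step 2: hand the caption to a small LLM to enforce the desired output format.
formatted = client.chat.completions.create(
    model="llm",  # placeholder name for the language model
    messages=[{
        "role": "user",
        "content": f"Rewrite the following caption as three concise bullet points:\n\n{caption}",
    }],
).choices[0].message.content

print(formatted)
```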

As open-source VLMs continue to advance, there may be potential for more sophisticated prompting strategies in the future.

Multimodal Search and Retrieval

Search and retrieval is a well-established concept in the vision domain, predating the era of generative AI. In the generative AI era, it has been recast as retrieval-augmented generation (RAG), in which retrieved context is passed to a generative model to ground its output.

Extending search and retrieval to multimodal AI is a natural progression. It encompasses not only image-image or text-text retrieval but also cross-modal retrieval between image and text. The key to enabling this cross-modal functionality lies in unified embedding spaces provided by models like CLIP.
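
As a concrete illustration, the sketch below builds a small text-to-image retrieval index in a shared CLIP embedding space; the checkpoint, file names, and query are illustrative, not drawn from any agency dataset.

```python
# A minimal text-to-image retrieval sketch over a shared CLIP embedding space;
# the checkpoint, file names, and query are illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Index a small image collection once.
paths = ["dam.jpg", "park.jpg", "mrt_station.jpg"]
images = [Image.open(p) for p in paths]
with torch.no_grad():
    index = model.get_image_features(**processor(images=images, return_tensors="pt"))
index = index / index.norm(dim=-1, keepdim=True)

# Embed a natural-language query into the same space and rank images by similarity.
with torch.no_grad():
    query = model.get_text_features(
        **processor(text=["flood mitigation infrastructure"], return_tensors="pt", padding=True)
    )
query = query / query.norm(dim=-1, keepdim=True)

scores = (query @ index.T).squeeze(0)
best = int(scores.argmax())
print(f"Top match: {paths[best]} (score={scores[best].item():.3f})")
```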

We also explored modality conversion, in which an image is converted into the text modality by using a VLM to generate a zero-shot caption. These captions form an additional knowledge base for retrieval, and arguably a more “grounded” one than an image knowledge base when the retrieval query is text.

For a hierarchical tagging task with a domain-specific taxonomy, multimodal search and retrieval produced significant performance boosts with both self-hosted and API models.

We’ve applied multimodal search and retrieval to two specific tasks:

  1. Hierarchical image tagging: Incorporating search and retrieval into our existing tagging pipeline has yielded significant performance improvements, in some cases exceeding 40%.
  2. Detailed image captioning: RAG has proven effective in aligning output captions with expected styles, as the model learns from retrieved context.

We hope to release an article in the coming weeks that goes into much more depth on this area, with a specific case study and use case.

Key learnings from our multimodal search and retrieval explorations:

  • Cross-modal retrieval efficacy: Multimodal embedding models have emerged as a cornerstone of multimodal operations, enabling cross-modal retrieval between images and text. This capability is crucial for effective multimodal search and retrieval.
  • Modality conversion enhances search and retrieval performance: Leveraging a VLM to generate zero-shot captions for images is an effective strategy. These captions serve as an additional knowledge base for RAG, effectively bridging the gap between visual and textual information.
  • RAG as a cost-effective performance booster: Compared to the resource-intensive process of model fine-tuning, RAG offers a relatively low-effort approach to significantly improve model performance. This makes it an attractive option for enhancing multimodal AI systems efficiently.

Multimodal Benchmarking

The challenge of benchmarking and performance evaluation in the era of LLMs extends to VLMs, presenting unique complexities. Public benchmarks and leaderboards, as well as traditional metrics such as BLEU, ROUGE, and CLIPScore, often fall short of accurately reflecting real-world task performance, particularly for natural language outputs like detailed image captions. A significant limitation of reference-based performance metrics is that the reference itself is not always a “golden standard.”

The model-as-a-judge approach, which utilises large API-based models to score responses for hard-to-quantify tasks, shows promising alignment with human preferences in our preliminary tests. However, its scalability is limited by potentially high costs and its subjective nature.
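
For reference, a model-as-a-judge call can be as simple as the sketch below; the judge model, rubric, and 1-5 scale are illustrative choices rather than a fixed methodology.

```python
# A minimal model-as-a-judge sketch; the judge model, rubric, and 1-5 scale
# are illustrative choices, not a fixed methodology.
from openai import OpenAI

client = OpenAI()

def judge_caption(reference_description: str, candidate_caption: str) -> int:
    """Ask a large model to score a candidate caption against a reference description."""
    prompt = (
        "You are evaluating an image caption.\n"
        f"Reference description: {reference_description}\n"
        f"Candidate caption: {candidate_caption}\n"
        "Score the candidate from 1 (poor) to 5 (excellent) for factual accuracy "
        "and completeness. Reply with the score only."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # any sufficiently capable judge model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    # A production pipeline would validate the reply; here we assume a bare number.
    return int(response.choices[0].message.content.strip())
```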

Where possible, we create in-house benchmarks for specific tasks that are customised for success in those applications. These tailored benchmarks allow us to evaluate models based on criteria that are directly relevant to our use cases, providing a more accurate assessment of a model’s potential real-world performance.

As for how we evaluate new models for potential adoption, we follow a three-step process:

  1. Vibes Check: We begin with an intuitive assessment to gauge the model’s initial feel and performance from a user perspective. This also sometimes involves looking at user sentiments from online sources like r/localllama or X. There’s nothing empirical about this, but sometimes intuition can be very useful.
  2. Hard Sample Testing: If the model passes the vibes check, we proceed to test it against a set of challenging examples that have proven difficult for previous models.
  3. Detailed Evaluation: Only models that show promise in the first two steps undergo our comprehensive evaluation process. This tiered approach helps us manage resources efficiently while ensuring thorough assessment of promising candidates.

Key learnings from our benchmarking efforts:

  1. Limitations of Standard Metrics: Online leaderboards, popular benchmarks, and traditional metrics are good high-level indicators, but often fail to capture the nuances of real-world, task-specific performance. This observation underscores the need for developing custom evaluation methods or adapting existing metrics to better align with our specific use cases and contexts.
  2. Efficacy of Incremental Evaluation: Our tiered approach (vibes check -> hard sample testing -> detailed evaluation) has proven effective in balancing thoroughness with efficiency. By starting with small-scale assessments and progressively moving to more comprehensive evaluations, we can identify promising models without expending unnecessary resources on less suitable candidates.
  3. Value of Custom Benchmarks: Our in-house, task-specific benchmarks have proven invaluable in assessing models for particular applications. These custom evaluations provide insights that generic benchmarks often miss, allowing us to select models that are truly better for our specific needs.
  4. Limitations of Reference-Based Metrics: We’ve recognised that reference-based performance metrics can be misleading when the reference itself isn’t a perfect standard. This is particularly true in tasks with multiple valid outputs, necessitating more flexible and nuanced evaluation approaches.

This benchmarking strategy allows us to navigate the complex landscape of VLM evaluation, ensuring that we select models that not only perform well on paper but also meet the practical needs of our specific applications.

Possible Multimodal Use Cases

After discussing our explorations and learnings in multimodal AI, it’s crucial to address how these efforts can be applied across various government sectors. Here are several examples of possible multimodal AI use cases that could be implemented in real-world applications, grouped by their primary function:

1. Image-Text Relevance

Image-text relevance refers to checking whether an image and a piece of text are relevant to each other. This use case is primarily enabled by multimodal embedding models like CLIP, which can compare natural language text with images. The capability is realised by converting both textual and visual data into a shared embedding space, assessing the similarity between the text and image representations, and applying criteria such as a similarity threshold to classify their level of relevance (see the sketch below).
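
A minimal sketch of such a relevance check, assuming the image and text embeddings have already been produced by a shared-space model like CLIP, might look like this; the threshold is an illustrative value that would need calibration on real data.

```python
# A minimal image-text relevance check over precomputed shared-space embeddings;
# the threshold is illustrative and would need calibration on real data.
import numpy as np

def is_relevant(image_emb: np.ndarray, text_emb: np.ndarray, threshold: float = 0.25) -> bool:
    """Classify an image-text pair as relevant if cosine similarity clears the threshold."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_emb = text_emb / np.linalg.norm(text_emb)
    similarity = float(image_emb @ text_emb)
    return similarity >= threshold
```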

(Image of the dog is AI-generated)

In practical use, this use case can enable and augment workflows that involve document processing, compliance verification, and public communication in government settings. For instance, it can be used to verify that images submitted with official reports or applications match the accompanying textual descriptions. This technology can also streamline administrative processes by automatically flagging discrepancies between visual and textual content, reducing the manual workload on officers and improving the accuracy of reviews.

Additionally, it can be employed in public-facing government websites and publications to ensure that images are appropriately matched with their captions or surrounding text, enhancing the clarity and effectiveness of government communications. In scenarios involving large-scale data management, this capability can assist in organising and validating extensive collections of government records, ensuring that visual data is correctly associated with its textual metadata.

2. Open Vocabulary Object Detection

Open vocabulary object detection refers to the ability to identify and locate objects in images or video without being limited to a predefined set of object categories. This use case leverages AI models that can recognise a wide range of objects based on their general understanding of the visual world. Open vocabulary object detection is currently available on the Video Analytics System.
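
To illustrate the idea with an open-source model (rather than the Video Analytics System itself), here is a sketch using the OWL-ViT checkpoint available on the Hugging Face Hub; the image path and the queried labels are illustrative.

```python
# A minimal open vocabulary detection sketch using OWL-ViT; the image path and
# text queries are illustrative.
import torch
from PIL import Image
from transformers import OwlViTForObjectDetection, OwlViTProcessor

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("camera_trap.jpg")  # illustrative nature-camera frame
labels = [["a pangolin", "a wild boar", "a monitor lizard"]]  # free-form queries

inputs = processor(text=labels, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw model outputs into scored boxes in the original image coordinates.
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(
    outputs=outputs, threshold=0.2, target_sizes=target_sizes
)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(f"{labels[0][int(label)]}: {score.item():.2f} at {box.tolist()}")
```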

In practical applications, this technology can significantly enhance various monitoring, inspection, and analysis tasks. For instance, in environmental monitoring, it can identify diverse wildlife species in nature camera footage, even those not explicitly included in training data. This flexibility is particularly valuable in government contexts where the range of potential objects of interest is vast or constantly evolving. It reduces the need for frequent model updates and allows for more comprehensive and adaptable visual analysis across different departments and scenarios.

(Image of the animals is AI-generated)

3. Media Tagging, Captioning, and Search

Media tagging, captioning, and search involve automatically assigning descriptive labels, generating detailed textual descriptions for images and videos, and enabling efficient retrieval of visual content. This use case leverages advanced AI models to understand, describe, and index visual content with high specificity.

(Image of the cityscape is AI-generated)

This technology can significantly enhance content management, accessibility, and information retrieval across various government applications:

  • Digital Archives: It can automatically tag and describe historical photographs, making vast collections searchable. Researchers and the public can then easily find specific images using natural language queries, such as “find photos of urban development in the 1960s.”
  • Public Communications: For government publications and websites, it can generate accurate captions for images, ensuring clarity and context for all citizens, including those with visual impairments. The search capability allows staff to quickly locate appropriate images for specific communication needs.
  • Cross-Modal Search: By leveraging models like CLIP, the system enables powerful cross-modal search capabilities. Users can find images using text descriptions or even use an image to find similar visual content across large databases.

This capability is particularly valuable in contexts where large volumes of visual data need to be processed, organised, and made accessible. It can dramatically reduce the manual effort required for media management while improving the discoverability and usability of visual information. The addition of advanced search functionality transforms static visual archives into dynamic, easily navigable resources, enhancing decision-making processes and public service delivery.

This capability could be implemented using a combination of vision-language embedding and generative models for content analysis and tagging. The system might incorporate techniques such as RAG to enhance tagging and captioning accuracy by leveraging relevant contextual information. It could be designed to generate hierarchical tags, from broad categories to specific descriptors, facilitating multi-level organisation and retrieval. Additionally, the system could include captioning models capable of generating detailed, contextually relevant descriptions of visual content.
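
A rough sketch of how retrieved reference captions might be folded into a captioning prompt is shown below; the example captions are hard-coded placeholders standing in for the output of a real nearest-neighbour lookup.

```python
# A minimal retrieval-augmented captioning sketch: retrieved reference captions
# steer the style of the generated caption. The example captions below are
# placeholders standing in for real retrieved neighbours.
from typing import List

def build_caption_prompt(retrieved_captions: List[str]) -> str:
    """Assemble a prompt that shows the VLM examples of the expected caption style."""
    examples = "\n".join(f"- {caption}" for caption in retrieved_captions)
    return (
        "Here are examples of captions written in our preferred style:\n"
        f"{examples}\n\n"
        "Describe the attached image in detail, following the same style."
    )

# Usage: pair this prompt with the image when calling the VLM.
prompt = build_caption_prompt([
    "Aerial view of a housing estate with rooftop gardens, taken at midday.",
    "Street-level photo of a hawker centre entrance during the evening rush.",
])
print(prompt)
```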

Conclusion: The Future of Multimodal AI in Government

As we’ve explored throughout this article, GovTech’s journey into multimodal AI is driven by a commitment to practical, efficient solutions that address real government needs. Our approach balances innovation with pragmatism, focusing on creating lasting value rather than chasing the latest trends.

Key takeaways from our multimodal AI initiatives thus far include:

  • The power of small, specialised models: Our initial explorations suggest that carefully tuned small VLMs can often match or exceed the performance of larger, more resource-intensive models for specific tasks.
  • The importance of adaptable infrastructure: Our focus on building flexible pipelines allows us to quickly incorporate new models and technologies as they emerge.
  • The value of multimodal search and retrieval: By extending search and retrieval, together with retrieval-augmented generation, to the visual domain, we’ve significantly improved performance in tasks like image tagging and captioning.
  • The need for custom benchmarking: We’ve developed task-specific evaluation methods that more accurately reflect real-world performance than generic benchmarks.

These insights are shaping our approach to developing practical, efficient, and effective multimodal AI solutions for whole-of-government use. The potential use cases we’ve outlined — including image-text relevance checking, open vocabulary object detection, and advanced media tagging and captioning — demonstrate the wide-ranging impact that multimodal AI could have on government operations.

As multimodal AI technologies evolve, GovTech remains committed to harnessing their potential to create more responsive, efficient, and citizen-centric government services. By integrating cutting-edge AI research with practical government applications, we’re not just improving existing processes — we’re reimagining what’s possible in public sector technology.
