Exploring Multimodal Large Language Models: A Step Forward in AI

Shubham Karwa
13 min read · Nov 16, 2023

In the dynamic realm of artificial intelligence, the advent of Multimodal Large Language Models (MLLMs) is revolutionizing how we interact with technology. These cutting-edge models extend beyond the conventional text-based interfaces, heralding a new era where AI understands and generates content across a spectrum of formats including text, images, audio, and video. This article aims to unravel the intricacies of multimodal LLMs, illustrating how they are not just transforming the AI landscape but are also redefining the boundaries of human-computer interaction. We will explore their unique ability to integrate and interpret diverse forms of data, offering unprecedented levels of contextual understanding and interaction.

Multimodal can mean one or more of the following:

  1. Input and output are of different modalities (e.g. text-to-image, image-to-text)
  2. Inputs are multimodal (e.g. a system that can process both text and images)
  3. Outputs are multimodal (e.g. a system that can generate both text and images)

Let’s look at GPT-4V, the benchmark when it comes to vision LLMs.

GPT-4V: The industry leader

Features:

  • Closed Source
  • Architecture undisclosed by OpenAI
  • Believed to pair a vision transformer (as the image encoder) with a large language model
  • Excellent performance over a wide range of use cases
  • Reportedly trained at massive scale (figures such as 1.7 trillion parameters circulate, but OpenAI has not confirmed them)
  • State of the Art Optical Character Recognition (OCR)

Multimodal Large Language Models (MLLMs) are designed to handle and generate content across multiple modalities, combining text with other forms of data such as images, audio, or video. Here are a few use cases for multimodal LLMs, showcased using GPT-4V:

  • Digitizing note taking:

The left side of the image shows handwritten notes in a notebook, and the right side is the digitized text that GPT-4V extracted after the image was provided as a prompt.

  • Understanding complex things:

Understanding complex parking signs or deciphering ancient handwriting can be easily done using GPT-4V.

  • Convert screenshots to usable code (an API sketch follows this list):
    In the image given below, the first component is a screenshot of a design that was provided to GPT-4V, and the second is the output it produced: a React component written with MUI components.
  • Turning whiteboards into usable code, diagrams or reports:
  • Turning designs into websites:
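If you want to try use cases like these programmatically, here is a minimal sketch of sending an image to GPT-4V and asking for React code. It assumes the openai Python SDK (v1+), an OPENAI_API_KEY environment variable, and the gpt-4-vision-preview model name that was current at the time of writing; the screenshot filename is just a placeholder.

```python
# A minimal sketch: send a UI screenshot to GPT-4V and ask for React/MUI code.
# Assumes the openai Python SDK (v1+) and an OPENAI_API_KEY environment variable;
# "design_screenshot.png" is a placeholder file name.
import base64
from openai import OpenAI

client = OpenAI()

with open("design_screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # vision-capable model name at the time of writing
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Convert this design into a React component using MUI."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
    max_tokens=1024,
)

print(response.choices[0].message.content)  # the generated React/MUI code
```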

Multimodal Large Language Models (MLLMs) combine the capabilities of natural language processing (NLP) with other modalities such as images, audio, or video. The architecture and functioning of multimodal LLMs can vary, but they generally follow a similar pattern. Here’s a high-level overview of how they work:

  1. An encoder for each data modality to generate the embeddings for data of that modality.
  2. A way to align embeddings of different modalities into the same multimodal embedding space.
  3. [Generative models only] A language model to generate text responses. Since inputs can contain both text and visuals, new techniques need to be developed to allow the language model to condition its responses on not just text, but also visuals.
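To make this pattern concrete, here is a minimal PyTorch sketch of the generative setup: an image encoder, a projection layer that aligns image features with the LLM’s embedding space, and a language model that conditions on the concatenated image and text tokens. Every module and dimension here is an illustrative stand-in, not any particular model’s architecture.

```python
# Illustrative PyTorch sketch of the three-part pattern above; every module and
# dimension is a stand-in, not any specific model's architecture.
import torch
import torch.nn as nn

class ToyMultimodalLM(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=512, vocab_size=32000):
        super().__init__()
        self.vision_encoder = nn.Linear(vision_dim, vision_dim)  # stand-in for a ViT/CLIP encoder
        self.projector = nn.Linear(vision_dim, llm_dim)          # aligns image features with the LLM space
        self.text_embed = nn.Embedding(vocab_size, llm_dim)
        self.llm = nn.TransformerEncoder(                        # stand-in for the language model
            nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True), num_layers=2)
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, image_feats, text_ids):
        # 1) encode the image, 2) project it into the LLM's embedding space,
        # 3) prepend the image tokens to the text tokens so the LLM conditions on both.
        img_tokens = self.projector(self.vision_encoder(image_feats))  # (B, N_img, llm_dim)
        txt_tokens = self.text_embed(text_ids)                         # (B, N_txt, llm_dim)
        hidden = self.llm(torch.cat([img_tokens, txt_tokens], dim=1))
        return self.lm_head(hidden)                                    # next-token logits

model = ToyMultimodalLM()
logits = model(torch.randn(1, 16, 1024), torch.randint(0, 32000, (1, 8)))
print(logits.shape)  # torch.Size([1, 24, 32000])
```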

Now that we have seen some use cases of GPT-4V’s visual prowess, let’s go over the architecture of the Vision Transformer (ViT), which is the encoder for these visual large language models.

Vision Transformer (ViT)

Vision Transformers (ViTs) are a type of artificial intelligence model designed specifically for processing images. They represent a significant shift in the way images are handled in machine learning, diverging from the more traditional Convolutional Neural Networks (CNNs) that have been dominant in image processing for many years. Here’s a breakdown of their key aspects:

1. Transformer Architecture: Originally developed for natural language processing tasks, transformers are models that excel in capturing relationships and dependencies within data. Vision Transformers adapt this architecture to handle image data.

2. Image Processing Method: Unlike CNNs, which process images through a series of localized filters (convolutions), ViTs divide an image into a sequence of smaller fixed-size patches. Each patch is then flattened and linearly transformed into an embedding. These embeddings are fed into the transformer model.

3. Attention Mechanism: The core of a transformer is the attention mechanism, which lets the model weigh every image patch against every other patch, understanding the context and relationships between different parts of the image. This mechanism is particularly adept at capturing global dependencies within the image.

4. Scalability and Efficiency: ViTs are highly scalable and can benefit significantly from increased data and computational power. They have shown remarkable efficiency and accuracy, particularly in large-scale image recognition tasks, sometimes outperforming traditional CNNs.

5. Applications: Vision Transformers have been applied in various domains, such as image classification, object detection, and even in areas beyond traditional image processing, like medical image analysis.

6. Data-Intensive Nature: One of the challenges with ViTs is that they typically require large amounts of data to achieve their best performance. This data-intensity can make them less accessible for tasks with limited data available.

7. Adaptable and Generalizable: ViTs have demonstrated an ability to generalize well across different tasks and datasets, making them a versatile tool in the machine learning toolkit.

The architecture:

ViTs work by splitting the image into patches and generating an embedding for each flattened patch. These embeddings (together with position information and a classification token) are then passed through a standard transformer encoder, and a classification head on top produces the prediction.
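Here is a small PyTorch sketch of that front end: splitting an image into fixed-size patches, flattening them, and projecting each one with a single linear layer. The sizes (224×224 image, 16×16 patches, 768-dimensional embeddings) follow the ViT-Base configuration.

```python
# Minimal sketch of the ViT front end: split an image into 16x16 patches,
# flatten each patch, and project it with one linear layer (sizes follow ViT-Base).
import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)            # (batch, channels, height, width)
patch_size, embed_dim = 16, 768

# Cut the image into non-overlapping 16x16 patches and flatten each one.
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.contiguous().view(1, 3, -1, patch_size, patch_size)   # (1, 3, 196, 16, 16)
patches = patches.permute(0, 2, 1, 3, 4).flatten(2)                     # (1, 196, 768)

# Linear projection of flattened patches -> patch embeddings.
to_embedding = nn.Linear(3 * patch_size * patch_size, embed_dim)
patch_embeddings = to_embedding(patches)                                 # (1, 196, 768)

# A learnable [CLS] token is prepended (and position embeddings added) before the
# sequence goes through a standard transformer encoder with a classification head.
cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
tokens = torch.cat([cls_token.expand(1, -1, -1), patch_embeddings], dim=1)
print(tokens.shape)  # torch.Size([1, 197, 768])
```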

Now that we have a fair idea about Vision Transformers, let’s discuss some of the alternative free-to-use Large Multimodal Models (LMMs) available today:

Macaw LLM

Macaw-LLM is an exploratory endeavor that pioneers multi-modal language modeling by seamlessly combining image 🖼️, video 📹, audio 🎵, and text 📝 data, built upon the foundations of CLIP, Whisper, and LLaMA.

Features:

  • Open Source, Creative Commons Attribution-NonCommercial (CC BY-NC) 4.0 License
  • Image, Audio, Video and Text Integration
  • Uses CLIP for encoding images and video frames, Whisper for encoding audio, and LLaMA/Vicuna/Bloom for text and response generation

Architecture:

Performance: The model is under evaluation by the authors.

Examples: (provided by authors, check references)

ImageBind by Meta

ImageBind is the inaugural AI model capable of simultaneously binding data from six modalities without the need for explicit supervision. The model is designed to create a unified feature space that accommodates diverse modalities, including images and video, audio, text, depth, thermal, and inertial measurement units (IMUs), without the need for fine-tuning. This breakthrough has contributed to advancing AI by empowering machines to more effectively analyze various forms of information collectively.

Features:

  • Open Source, Creative Commons Attribution-NonCommercial-ShareAlike (CC BY-NC-SA) 4.0 License
  • Six modalities: Images and video, audio, text, depth, thermal and inertial measurement units (IMUs)
  • Learns a single embedding space that binds multiple sensory inputs together
  • Enables audio-based search, cross-modal search, multimodal arithmetic, and cross-modal generation (see the usage sketch after this list)
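Here is a short usage sketch, adapted from the ImageBind repository’s README at the time of writing, that embeds text, images, and audio into the shared space and compares them with a simple dot product; the file paths are placeholders, and import paths may differ by version.

```python
# Adapted from the ImageBind repository's README (facebookresearch/ImageBind) as of
# late 2023; the image and audio file paths are placeholders.
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model = imagebind_model.imagebind_huge(pretrained=True)
model.eval()
model.to(device)

inputs = {
    ModalityType.TEXT: data.load_and_transform_text(["a dog", "a car"], device),
    ModalityType.VISION: data.load_and_transform_vision_data(["dog.jpg", "car.jpg"], device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(["dog.wav", "car.wav"], device),
}

with torch.no_grad():
    embeddings = model(inputs)

# Cross-modal similarity: compare every image against every text in the shared space.
print(torch.softmax(embeddings[ModalityType.VISION] @ embeddings[ModalityType.TEXT].T, dim=-1))
```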

Architecture:

ImageBind builds on the CLIP architecture. The image shown below can be seen as an expanded, multi-modality version of the diagram in the CLIP paper. Like CLIP, ImageBind is trained with the InfoNCE contrastive loss.
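For reference, here is a small sketch of that InfoNCE objective: paired embeddings from two modalities are pulled together, while all other pairs in the batch act as negatives. The dimensions and temperature value are illustrative.

```python
# A small sketch of the InfoNCE / CLIP-style contrastive objective that ImageBind reuses:
# embeddings of paired samples from two modalities are pulled together, other pairs in
# the batch act as negatives. Dimensions and temperature are illustrative.
import torch
import torch.nn.functional as F

def info_nce(emb_a: torch.Tensor, emb_b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    # emb_a, emb_b: (batch, dim) embeddings of the same items in two modalities
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    logits = emb_a @ emb_b.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(len(emb_a))                # matching pairs lie on the diagonal
    # symmetric loss: modality A -> B and B -> A
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

loss = info_nce(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```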

Performance:

ImageBind outperforms specialist models like AudioMAE and MultiMAE for a few modalities.

NOTE: The following three models (LLaVA, NExT-GPT, and CogVLM) were tested on the same three examples: extreme ironing, complex parking signs, and a whiteboard-to-website flow. I have provided my views on each model’s performance on each example.

LLaVA: Large Language and Vision Assistant

LLaVA is a novel end-to-end trained large multimodal model that combines a vision encoder and Vicuna for general-purpose visual and language understanding, achieving impressive chat capabilities in the spirit of the multimodal GPT-4 and setting a new state-of-the-art accuracy on ScienceQA.

Features:

  • Open Source, Apache 2.0 License
  • Connects OpenAI’s CLIP ViT-L/14 vision encoder to a Vicuna/LLaMA LLM
  • Works by passing the image features through a projection matrix and feeding the resulting tokens to the LLM
  • LLaVA achieves an 85.1% relative score compared to GPT-4

Architecture:

Performance:

The chart below shows the performance of LLaVA on ScienceQA compared to other LLMs.

Examples:

Example 1: Extreme Ironing

In this example, we can see that LLaVA provides a good explanation of the unusual nature of the image. Although it does not describe the scene as humorous, there are no hallucinations or irrelevant details.

Example 2: Complex Parking Signs

LLaVA’s OCR reads the first big parking sign, which says "NO PARKING" in red, and concludes that it is not advisable to park here. However, there is another sign in the image that says "1 Hour Parking Mon-Fri 4PM-6PM". LLaVA’s OCR capabilities are not good enough to extract all the information from this image, so the output is wrong and does not answer the prompt.

Example 3: Whiteboard Website

Unlike GPT-4V, which can produce a basic website from this whiteboard, LLaVA is unable to produce any code. However, it understands the flow completely and describes what it is about. If we use something like LangChain and pass this output into another prompt template (see the sketch below), we can build a tool for this specific task. So the output is not completely unusable, but it does not fulfil the ask in the prompt either.
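As a rough sketch of that idea, the snippet below feeds LLaVA’s textual description into a second, code-generation prompt. The LangChain classes shown (PromptTemplate, LLMChain) reflect the API current at the time of writing, and the description string and LLM choice are placeholders.

```python
# Rough sketch: chain LLaVA's description of the whiteboard into a code-generation prompt.
# Assumes the LangChain API current at the time of writing and an OPENAI_API_KEY env var;
# `llava_description` is a placeholder for LLaVA's actual output.
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.llms import OpenAI

llava_description = "A login page that leads to a dashboard with three charts..."  # LLaVA's output

prompt = PromptTemplate(
    input_variables=["description"],
    template=(
        "You are a frontend developer. Generate a React component (using MUI) "
        "that implements the following website flow:\n\n{description}"
    ),
)

chain = LLMChain(llm=OpenAI(temperature=0), prompt=prompt)
react_code = chain.run(description=llava_description)
print(react_code)
```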

NExT-GPT

NExT-GPT is the first end-to-end Multimodal-LLM that perceives input and generates output in arbitrary combinations (any-to-any) of text, image, video, and audio and beyond.

Features:

  • Open Source, BSD 3-Clause “New” or “Revised” License
  • Uses ImageBind for multimodal embedding and creates projection layers
  • Uses Stable Diffusion, AudioLDM, and Zeroscope for the image, audio, and video output modalities, respectively
  • Can be easily fine-tuned further to reduce known hallucinations
  • Gives better results as compared to LLaVA in some cases

Architecture:

  • Multimodal Encoding Stage: Existing well-established models encode the inputs of the various modalities. NExT-GPT uses ImageBind, a unified high-performance encoder across six modalities. Linear projection layers then map the different input representations into language-like representations that are comprehensible to the LLM.
  • LLM Understanding and Reasoning Stage: An LLM (Vicuna) serves as the core agent of NExT-GPT. It takes the representations from the different modalities as input, carries out semantic understanding and reasoning over them, and outputs 1) textual responses directly, and 2) signal tokens for each modality that instruct the decoding layers whether to generate multimodal content, and what content to produce if so.
  • Multimodal Generation Stage: Given the multimodal signal tokens and their instructions from the LLM (if any), Transformer-based output projection layers map the signal-token representations into ones that are understandable to the downstream multimodal decoders. These are current off-the-shelf latent-conditioned diffusion models: Stable Diffusion (SD) for image synthesis, Zeroscope for video synthesis, and AudioLDM for audio synthesis.
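To illustrate the signal-token idea in the generation stage, here is a toy, runnable example of routing tagged spans in the LLM’s output to the right decoder. The tag names and the stand-in decoders are invented for illustration; they are not NExT-GPT’s actual tokens or models.

```python
# Toy illustration of "signal token" routing: scan the LLM's output for modality tags
# and hand each tagged span to the matching decoder. Tags and decoders are invented
# stand-ins, not NExT-GPT's actual tokens or models.
import re

def fake_image_decoder(caption): return f"<image generated from: {caption}>"
def fake_video_decoder(caption): return f"<video generated from: {caption}>"
def fake_audio_decoder(caption): return f"<audio generated from: {caption}>"

DECODERS = {"IMG": fake_image_decoder, "VID": fake_video_decoder, "AUD": fake_audio_decoder}

def route_signal_tokens(llm_output: str):
    """Extract <IMG>...</IMG>-style spans and send each one to its decoder."""
    results = []
    for modality, caption in re.findall(r"<(IMG|VID|AUD)>(.*?)</\1>", llm_output):
        results.append(DECODERS[modality](caption.strip()))
    return results

llm_output = "Here is what you asked for. <IMG> a red bicycle at sunset </IMG> <AUD> soft rain sounds </AUD>"
print(route_signal_tokens(llm_output))
```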

Performance:

There is no clear evaluation given by the authors on their GitHub. NExT-GPT performs best on text-plus-audio input producing image output, followed by text, audio, and image input producing image output. The weakest combination is text-plus-video input producing video output.

An example of the NExT-GPT capability is shown in the image below:

Examples:

Example 1: Extreme Ironing

NExT-GPT’s output for this example does not capture the fact that the man is ironing clothes. It thinks the man is performing some stunt or “trying to reach a higher location”, which is irrelevant. This explanation is simply a hallucination.

Example 2: Complex Parking Signs

Due to incorrect OCR, NExT-GPT refuses to give a yes-or-no answer. It does suggest some next steps one could take to find out, but that is not much help to someone standing in the middle of the road, trying to figure out whether it’s okay to park.

Example 3: Whiteboard Website

NExT-GPT is unable to output the code for this whiteboard website flow. It produces a Python-style comment consisting of a list of UI elements from a generic website, calls that the code, and claims it has followed best practices.

CogVLM

CogVLM is a powerful open-source visual language model (VLM). CogVLM-17B has 10 billion vision parameters and 7 billion language parameters. CogVLM can understand and answer various types of questions, and has a visual grounding version.

Features:

  • Open Source, Apache 2.0 License
  • A vision transformer (ViT) encoder, an MLP adapter, a pretrained large language model (GPT-style), and a visual expert module
  • Can be further finetuned easily to reduce hallucinations even further
  • Gives better results as compared to LLaVA and NExT-GPT in some cases.

Architecture:

The CogVLM model comprises four fundamental components: a vision transformer (ViT) encoder, an MLP adapter, a pretrained large language model (GPT-style), and a visual expert module.
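As a conceptual sketch (not the official CogVLM code), the snippet below shows the visual-expert idea inside a single attention layer: image tokens go through their own trainable projections while text tokens keep the original language-model projections. It uses single-head attention and small sizes for brevity.

```python
# Conceptual sketch (not the official CogVLM code) of a visual expert inside one
# attention layer: image tokens use their own trainable Q/K/V projections, text
# tokens keep the original LLM projections. Single-head attention, small sizes.
import torch
import torch.nn as nn

class VisualExpertAttention(nn.Module):
    def __init__(self, hidden: int = 512):
        super().__init__()
        self.qkv_text = nn.Linear(hidden, 3 * hidden)   # frozen LLM weights in practice
        self.qkv_image = nn.Linear(hidden, 3 * hidden)  # trainable visual-expert weights
        self.out = nn.Linear(hidden, hidden)

    def forward(self, x: torch.Tensor, image_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, hidden); image_mask: (batch, seq), True where the token is an image token
        qkv = torch.where(image_mask.unsqueeze(-1), self.qkv_image(x), self.qkv_text(x))
        q, k, v = qkv.chunk(3, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return self.out(attn @ v)

x = torch.randn(1, 10, 512)
image_mask = torch.zeros(1, 10, dtype=torch.bool)
image_mask[:, :4] = True                               # pretend the first 4 tokens come from the image
print(VisualExpertAttention()(x, image_mask).shape)    # torch.Size([1, 10, 512])
```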

Performance:

CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal benchmarks, including NoCaps, Flickr30K captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA, and TDIUC, and ranks 2nd on VQAv2, OKVQA, TextVQA, COCO captioning, etc., surpassing or matching PaLI-X 55B. CogVLM can also chat with you about images.

Examples:

Example 1: Extreme Ironing

CogVLM describes the unusual aspect of this image quite aptly. However, at the end it mentions that the man is wearing a yellow shirt, claiming that this emphasizes the unexpected nature of the scene in some way. This is irrelevant information and is not required in the description.

Example 2: Complex Parking Signs

When asked multiple times whether parking is allowed at the spot at a specific time, CogVLM provides a different result each time. Its OCR is not comparable to something like GPT-4V, as it does not read through each sign to determine whether it is possible to park at this spot. In the second output, it says parking is allowed at this spot Monday to Friday from 8AM to 6PM, when in fact it is only allowed from 4PM to 6PM. In the third prompt, it discourages parking at this spot on Wednesday at 5PM, which contradicts the second output.

Example 3: Whiteboard Website

For this example as well, CogVLM gives a generic outline of how to implement the provided flow, followed by an incorrect code snippet. Hence, the output is wrong and cannot be used.

Here’s a table to quickly compare the characteristics of each of the models discussed above.

In conclusion, all of the openly available models can serve as a good starting point for your application’s use case if they are tweaked to match it. Fine-tuning these models would also reduce hallucinations and other unwanted output.

The out-of-the-box performance of these models cannot be compared to the behemoth that GPT-4V has become in this space. But then again, open-source model authors have limited resources to train and fine-tune these models, unlike OpenAI, which can throw millions of dollars into training, funded in part by API monetization and tie-ups with big companies like Microsoft.

On the bright side, there are companies like Meta and Microsoft Research empowering open-source collaboration by funding projects and making their own models and innovations open-source or free-to-use. This is the way forward to AI excellence.

Thanks for reading through this enormous article. Do check out other articles and subscribe for the latest ones.

References: I highly recommend these resources if you want to delve deeper into the world of LMMs:

https://www.kdnuggets.com/introduction-to-nextgpt-anytoany-multimodal-large-language-model#:~:text=NExT-GPT%27s%20best%20performance%20is,shown%20in%20the%20image%20below.
