Multimodal LLMs: OpenAI’s GPT-4o, Google’s Gemini, and Meta’s Chameleon.

Examining the Latest Multimodal AI Models from OpenAI, Google, and Meta

Manuel
7 min read · May 21, 2024

In the past week, significant advancements have occurred in the field of artificial intelligence (AI), marked by the introduction of OpenAI’s GPT-4o, improvements to Google’s Gemini, and the unveiling of Meta’s latest multimodal model, Chameleon. These developments all center around one central theme: Multimodal Large Language Models (LLMs).

Multimodal deep learning is a critical step towards achieving Artificial General Intelligence (AGI), since humans interact with multiple streams of data simultaneously (images, scents, tactile sensations, tastes, and sounds). Just imagine how different your capabilities would be without access to some of your senses. To strive towards AGI, it’s crucial to leverage all available modalities. This isn’t limited to generative AI: research across many domains has shown that fusing multiple modalities yields more accurate models. And as multimodal architectures continue to evolve, a vast reservoir of untapped data from different sources awaits exploration, holding immense potential for further refining these models.

Therefore, Multimodal LLMs play a crucial role in pushing the boundaries of AI. They address the diverse ways in which humans interact with different types of data — whether it’s audio, text, images, or videos. Despite computational power remaining a significant limitation, recent advancements in hardware and optimization techniques have allowed significant progress in the multimodal field. Consequently, a considerable portion of future advancements in LLMs will likely be concentrated in this area.

Multimodal Deep Learning

In deep learning, multimodal models that can process several data types (text, video, audio, and more) are widely used. Combining modalities often leads to more accurate results, because each source contributes complementary information to the prediction. For instance, state-of-the-art action classifiers for video combine audio, RGB frames, and optical flow. Audio provides crucial context that images alone cannot capture, such as environmental sounds, speech, or music within a scene. Consider identifying what a person in a video is doing while they listen to music: relying solely on image frames, the subject might be misclassified as staring at a wall or engaged in some other activity. With both the audio and the visual streams, the model can make an accurate, detailed prediction, even specifying the type of music being listened to.

Here’s an example of multimodal video classification that I created: https://github.com/manuelescobar-dev/Multimodal-Egocentric-Action-Recognition

Types of fusion

Various approaches exist for fusing different modalities. Early fusion combines the modalities at the input (or low-level feature) stage and trains a single model end-to-end on the joint representation. Late fusion, in contrast, trains one model per modality and combines their predictions at the end. In between sit middle (or mid-level) fusion approaches, which are especially popular in computer vision: they allow practitioners to use the best architecture for each modality to extract features and then build a multimodal classifier on top. For example, with audio and image frames, an architecture such as ResNet can extract visual features, which are then concatenated with audio features and fed to the final classifier.

Illustration of early fusion, late fusion, and middle fusion methods used by multimodal fusion networks (https://www.researchgate.net/publication/362028535/figure/fig2/AS:11431281126156761@1678559245845/Illustration-of-early-fusion-late-fusion-and-middle-fusion-methods-used-by-multimodal.jpg)
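As a concrete illustration of the mid-level fusion just described, here is a minimal PyTorch sketch (separate from the repository linked above): a ResNet-18 backbone extracts visual features, a small MLP encodes a precomputed audio embedding, and the concatenated features feed a shared classifier. The dimensions, including the 128-dimensional audio embedding, are illustrative assumptions.

```python
# Minimal mid-level fusion sketch (PyTorch): per-modality encoders, fused classifier.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class MidFusionClassifier(nn.Module):
    def __init__(self, num_classes: int, audio_dim: int = 128):
        super().__init__()
        # Visual branch: ResNet-18 backbone with the final fc removed, yields 512-d features.
        backbone = resnet18(weights=None)
        backbone.fc = nn.Identity()
        self.visual_encoder = backbone
        # Audio branch: small MLP over a precomputed audio embedding (e.g. log-mel statistics).
        self.audio_encoder = nn.Sequential(
            nn.Linear(audio_dim, 256), nn.ReLU(), nn.Linear(256, 256)
        )
        # Fusion happens here: concatenate per-modality features, then classify.
        self.classifier = nn.Linear(512 + 256, num_classes)

    def forward(self, frames: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        v = self.visual_encoder(frames)   # (B, 512)
        a = self.audio_encoder(audio)     # (B, 256)
        return self.classifier(torch.cat([v, a], dim=-1))

# Usage: one RGB frame per clip plus a 128-d audio embedding per clip.
logits = MidFusionClassifier(num_classes=10)(torch.randn(2, 3, 224, 224), torch.randn(2, 128))
```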

Multimodal LLMs

Although widely used, most current multimodal LLMs treat each modality separately, which limits their ability to integrate information across modalities and to generate multimodal documents containing arbitrary sequences of images and text. This separate processing also increases latency and hinders reasoning. For instance, audio responses previously required multiple models working in sequence: one to convert audio to text, another to process the text, and a third to convert the processed text back to audio. In contrast, a single model trained end-to-end across modalities significantly reduces latency and improves reasoning capabilities.
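To make the latency point concrete, here is a purely conceptual sketch contrasting a cascaded audio pipeline with a single end-to-end model. Every function and method below is a hypothetical stand-in, not a real API.

```python
# Purely illustrative: cascaded audio pipeline vs. a single end-to-end model.
# Every function below is a hypothetical stand-in, not a real API.

def speech_to_text(audio: bytes) -> str:       # model 1: speech recognition
    raise NotImplementedError

def text_llm(prompt: str) -> str:              # model 2: text-only LLM
    raise NotImplementedError

def text_to_speech(text: str) -> bytes:        # model 3: speech synthesis
    raise NotImplementedError

def cascaded_audio_chat(audio: bytes) -> bytes:
    # Three models run in sequence: their latencies add up, and information such as
    # tone, emotion, and background sound is lost at the intermediate text step.
    return text_to_speech(text_llm(speech_to_text(audio)))

def end_to_end_audio_chat(audio_tokens: list[int], model) -> list[int]:
    # A single multimodal model consumes and produces audio tokens directly,
    # avoiding the intermediate conversions entirely.
    return model.generate(audio_tokens)
```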

Chameleon

Developed by Meta, Chameleon is an autoregressive, transformer-based model that operates on both text and images. It is trained much like other LLMs, using a transformer to predict the next token over unified sequences of image and text tokens. Unlike traditional models that process each data type separately or rely on a dedicated encoder per modality, Chameleon learns to reason about images and text jointly. It also introduces techniques that make large-scale training of early-fusion models stable, overcoming limitations that previously constrained this approach.

Architecture

Chameleon builds upon the Llama 2 architecture, using RMSNorm for normalization, the SwiGLU activation function, and rotary positional embeddings (RoPE).
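For reference, here is a minimal, self-contained sketch of two of those building blocks, RMSNorm and a SwiGLU feed-forward, wired in the pre-norm residual pattern used by Llama-style decoder layers. The dimensions are illustrative, and attention (where the rotary embeddings are applied) is omitted for brevity.

```python
# Minimal sketch of Llama-2-style components that Chameleon reuses (illustrative dims).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square normalization: no mean subtraction, no bias."""
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLU(nn.Module):
    """Gated feed-forward: silu(x W1) * (x W3), projected back with W2."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)
        self.w3 = nn.Linear(dim, hidden, bias=False)
        self.w2 = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

# Pre-norm residual pattern of a decoder layer (attention omitted; rotary positional
# embeddings are applied to queries and keys inside the attention block).
x = torch.randn(1, 16, 512)
x = x + SwiGLU(512, 1376)(RMSNorm(512)(x))
```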

Training data

Chameleon utilizes a variety of publicly available and licensed data sources for training. Here’s a breakdown:

  • Text-Only: 2.9 trillion tokens (also used for training Llama 2 and Code Llama)
  • Text-Image: 1.5 trillion tokens from 1.4 billion text-image pairs
  • Text-Image Interleaved: 400 billion tokens of interleaved text and image data

Tokenization

Chameleon is fully token-based for both the image and text modalities (a sketch of how the two token streams can be interleaved follows the list below):

  • Images: A specially trained image tokenizer encodes each 512x512 image into 1,024 discrete tokens, analogous to the words of a text.
  • Text: A Byte-Pair Encoding (BPE) tokenizer.
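The sketch below shows one way text tokens and image tokens can share a single vocabulary and a single sequence, which is the essence of a fully token-based model. The vocabulary sizes, offsets, and sentinel tokens are hypothetical, not Chameleon’s actual layout.

```python
# Illustrative only: one way text and image tokens can share a single vocabulary
# and a single sequence. Vocabulary sizes, offsets, and sentinels are hypothetical.
TEXT_VOCAB = 65_536          # assumed BPE vocabulary size
IMAGE_CODEBOOK = 8_192       # assumed number of discrete image codes
BOI, EOI = 0, 1              # hypothetical "begin/end of image" sentinel tokens

def image_token(code: int) -> int:
    # Offset image codes past the text vocabulary so both live in one token space.
    return TEXT_VOCAB + 2 + code

def interleave(text_tokens: list[int], image_codes: list[int]) -> list[int]:
    assert len(image_codes) == 1024, "one 512x512 image -> 1024 discrete codes"
    assert all(0 <= c < IMAGE_CODEBOOK for c in image_codes)
    return text_tokens + [BOI] + [image_token(c) for c in image_codes] + [EOI]

# A toy mixed-modal sequence: three text tokens followed by one tokenized image.
sequence = interleave(text_tokens=[42, 7, 99], image_codes=list(range(1024)))
```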

Stability

Models exceeding 8 billion parameters and 1 trillion tokens faced stability issues due to normalization problems. To address this, Chameleon incorporates query-key normalization (QK-Norm), applying layer normalization to the query and key vectors within the attention mechanism.
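A minimal sketch of what QK-Norm looks like inside a standard multi-head attention layer is shown below (dimensions are illustrative; rotary embeddings are omitted). The only change relative to ordinary attention is the LayerNorm applied to the query and key heads before the dot product.

```python
# Sketch of query-key normalization (QK-Norm): LayerNorm on queries and keys
# before the dot product, which bounds attention logits and stabilizes training.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        self.q_norm = nn.LayerNorm(self.head_dim)   # the QK-Norm part
        self.k_norm = nn.LayerNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        shape = (B, T, self.n_heads, self.head_dim)
        q = self.q_norm(q.reshape(shape)).transpose(1, 2)   # (B, H, T, D)
        k = self.k_norm(k.reshape(shape)).transpose(1, 2)
        v = v.reshape(shape).transpose(1, 2)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(attn.transpose(1, 2).reshape(B, T, C))

y = QKNormAttention(dim=512, n_heads=8)(torch.randn(2, 16, 512))
```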

Challenges

Chameleon’s unique mixed-modal generation also creates some challenges:

  • Data Dependencies per Step: The decoding procedure changes depending on whether text or image tokens are being generated, so the sampled tokens must be inspected at every step.
  • Masking for Modality-Constrained Generation: Tokens that do not belong to the modality currently being generated (e.g., text tokens in the middle of an image block) must be masked out and ignored (see the sketch after this list).
  • Fixed-Size Text Units: Text generation is variable-length, while image generation always produces a fixed-size block of 1,024 tokens per image.
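To make the masking idea concrete, here is an illustrative sketch of modality-constrained decoding: logits for the modality that is not currently being generated are set to negative infinity, and an image block is forced to span exactly 1,024 tokens. The token ranges and sentinels reuse the hypothetical layout from the tokenization sketch above, not Chameleon’s actual vocabulary.

```python
# Illustrative sketch of modality-constrained decoding with logit masking.
# Token ranges and sentinels follow the hypothetical layout from the tokenization sketch.
import torch

TEXT_VOCAB, IMAGE_CODEBOOK, BOI, EOI = 65_536, 8_192, 0, 1
IMAGE_TOKENS_PER_IMAGE = 1024
VOCAB = TEXT_VOCAB + 2 + IMAGE_CODEBOOK

def mask_logits(logits: torch.Tensor, in_image: bool, image_pos: int) -> torch.Tensor:
    masked = logits.clone()
    if in_image:
        if image_pos < IMAGE_TOKENS_PER_IMAGE:
            # Inside an image block: only image-code tokens are allowed.
            masked[: TEXT_VOCAB + 2] = float("-inf")
        else:
            # The fixed-size block is complete: force the end-of-image sentinel.
            masked[:] = float("-inf")
            masked[EOI] = 0.0
    else:
        # In text mode: image-code tokens are disallowed (BOI may still start an image).
        masked[TEXT_VOCAB + 2 :] = float("-inf")
    return masked

next_token = mask_logits(torch.randn(VOCAB), in_image=True, image_pos=3).argmax()
```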

Alignment (Fine-Tuning)

A lightweight alignment stage was implemented using a high-quality supervised fine-tuning dataset derived from Llama 2 and Code Llama, covering a variety of data categories. This stage refines the model’s capabilities and improves its safety.

[1] Supervised Fine-Tuning Dataset Statistics

Evaluation

Human Evaluations

Chameleon’s ability to understand and generate mixed-modal content was assessed through human evaluations with prompts designed around everyday use cases. Its performance was compared to GPT-4V and Gemini Pro, as well as to augmented versions of both whose text responses were supplemented with images (denoted GPT-4V+ and Gemini+). In absolute evaluations, Chameleon’s responses were more likely to fully complete the task than those of its competitors. In relative evaluations, Chameleon’s outputs were generally preferred, with win rates of 60.4% over Gemini+ and 51.6% over GPT-4V+. Safety testing also showed that Chameleon’s responses were overwhelmingly safe.

Benchmark Evaluations

Evaluating mixed-modal models is still difficult because standardized benchmarks for them are scarce, so Chameleon was benchmarked only on text-only and image-to-text tasks. The text-only tasks covered commonsense reasoning, reading comprehension, math problems, and world knowledge; the image-to-text tasks covered image captioning and visual question answering. The results, shown below, demonstrate that Chameleon is competitive with other state-of-the-art models on many of these tasks.

[1] Overall performance comparison.

Gemini

Similar to Chameleon, Gemini uses an early-fusion, token-based approach. However, Gemini relies on separate image decoders, whereas Chameleon is a single, fully integrated model, which makes Chameleon more versatile for both understanding and generating multimodal data. Nevertheless, Gemini has some impressive key features:

  • End-to-end multimodal across text, audio, and image.
  • Longest context window of any large-scale foundation model: Gemini 1.5 Pro and 1.5 Flash support a context window of up to one million tokens.
  • Upcoming integration across all Google services.

Complete family of models:

  • Ultra: designed for complex tasks (longer inference time but better results).
  • Pro: suitable for general use (good trade-off between performance and latency).
  • Flash: a lightweight model optimized for speed and efficiency.
  • Nano: the smallest model for on-device tasks.

For more information, watch Google’s I/O ’24:

https://www.youtube.com/watch?v=XEzRZ35urlk
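For readers who want to try Gemini programmatically, here is a small, hedged sketch using the google-generativeai Python SDK. The model name, the environment-variable setup, and the exact SDK surface are assumptions based on the SDK’s documented quickstart, not something covered in the talk above.

```python
# Hedged sketch: calling a Gemini model via the google-generativeai SDK.
# Model name and SDK details are assumptions; check the current documentation.
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

# Pick a family member based on the latency/quality trade-off discussed above,
# e.g. "gemini-1.5-flash" for speed or "gemini-1.5-pro" for harder tasks.
model = genai.GenerativeModel("gemini-1.5-flash")

response = model.generate_content("Summarize early fusion vs. late fusion in two sentences.")
print(response.text)
```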

GPT-4o

  • End-to-end multimodal across text, audio, and image (a brief API usage sketch follows at the end of this section).
  • Fast audio response time, similar to human response time in a conversation (around 320 milliseconds on average).
  • Faster and 50% cheaper than GPT-4 Turbo, with the same level of performance on text.
  • Improved tokenization.

For more information, watch OpenAI’s introduction to GPT-4o: https://www.youtube.com/watch?v=DQacCB9tDaw
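As a quick illustration of the multimodal input side, here is a hedged sketch using the OpenAI Python SDK’s chat-completions interface with an image URL. The prompt and image URL are placeholders, and the exact request shape should be checked against the current API documentation.

```python
# Hedged sketch: sending text plus an image to GPT-4o via the OpenAI Python SDK (v1).
# The prompt and image URL are placeholders; verify details against the API docs.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What action is taking place in this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/frame.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```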

Conclusion

In conclusion, the recent advancements in multimodal foundation models such as Chameleon, Gemini, and GPT-4o mark significant progress in the field of artificial intelligence. These models represent a crucial step forward in generative AI by integrating multiple modalities like text, audio, and images. With Chameleon’s end-to-end approach, Meta’s open research has deepened our understanding of these models and how we might use them in the near future. Meanwhile, Gemini provides a comprehensive solution, offering a range of models tailored to different needs, from complex tasks to lightweight on-device applications. OpenAI, for its part, maintains its leading position with GPT-4o, a fast and performant multimodal model that remains the top choice overall. Together, these advancements pave the way for more sophisticated AI systems capable of understanding and interacting with the world in a manner closer to human intelligence.

If you enjoyed this article, be sure to explore the rest of my LLM series for more insights and information!

References

[1] Chameleon Team, Meta AI. “Chameleon: Mixed-Modal Early-Fusion Foundation Models.” arXiv preprint arXiv:2405.09818, 2024.
