Apple’s MM1

Exploring the frontier of Multimodal Large Language Models (MLLMs)

Simeon Emanuilov
4 min read · Mar 16, 2024

Apple’s recent unveiling of MM1, a family of Multimodal Large Language Models (MLLMs) detailed in the research paper “MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training,” marks a significant milestone in the company’s AI endeavors. While the models themselves remain exclusive to Apple, the insights shared through their research offer valuable lessons for the AI community at large.

Architectural innovations and scalability

The MM1 family spans an impressive range of scales, with models from 3B to 30B parameters, including both dense and mixture-of-experts (MoE) variants. Apple’s researchers conducted extensive ablations to identify crucial design elements, such as image resolution, visual encoder capacity, and the composition of pre-training data. For instance, increasing image resolution from 224 to 336 pixels yielded a roughly 3% performance boost across all architectures.

Figure: model ablations (from the MM1 paper)
Figure: model and data ablations (from the MM1 paper)
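
To make those ablation axes concrete, here is a minimal sketch of how such a design space could be parameterized. The class and field names are hypothetical illustrations, not taken from Apple’s implementation:

```python
from dataclasses import dataclass

@dataclass
class MM1StyleConfig:
    """Hypothetical knobs for the ablation axes discussed above.

    Names are illustrative; this is not Apple's actual code.
    """
    llm_size: str = "3B"              # dense variants span 3B to 30B
    use_moe: bool = False             # mixture-of-experts vs. dense decoder
    image_resolution: int = 336       # 224 -> 336 gave roughly a 3% gain
    vision_encoder: str = "ViT-L/14"  # encoder capacity was a key factor
    vl_connector: str = "avg-pool"    # connector choice mattered little

# An ablation grid sweeps one axis at a time around a base configuration:
base = MM1StyleConfig()
low_res = MM1StyleConfig(image_resolution=224)
moe_variant = MM1StyleConfig(use_moe=True)
```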

Interestingly, the choice of vision-language connector had a negligible impact on performance; what mattered most was image resolution and the number of visual tokens fed to the LLM. This challenges the findings of previous works like Honeybee and highlights the need for continued research to validate and refine our understanding of MLLM architectures.
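
For intuition, a connector is simply the module that maps visual tokens into the LLM’s embedding space. Below is a minimal sketch of one simple family (average pooling plus a linear projection); the module structure and dimensions are assumptions for illustration, not MM1’s actual connector:

```python
import torch
import torch.nn as nn

class AvgPoolConnector(nn.Module):
    """Pool visual tokens to a fixed count, then project to the LLM width."""

    def __init__(self, vis_dim: int, llm_dim: int, num_tokens: int = 64):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(num_tokens)
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, vis_tokens: torch.Tensor) -> torch.Tensor:
        # vis_tokens: (batch, num_patches, vis_dim)
        pooled = self.pool(vis_tokens.transpose(1, 2)).transpose(1, 2)
        return self.proj(pooled)  # (batch, num_tokens, llm_dim)

connector = AvgPoolConnector(vis_dim=1024, llm_dim=4096)
out = connector(torch.randn(2, 576, 1024))  # -> torch.Size([2, 64, 4096])
```

Per the paper’s findings, swapping this for a more elaborate design changes little once resolution and token count are fixed.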

Harnessing the power of diverse data

One of the key takeaways from MM1’s development is the significance of pre-training data composition. Apple’s researchers demonstrated that a carefully curated mix of image-caption pairs (45%), interleaved image-text documents (45%), and text-only data (10%) was crucial for achieving state-of-the-art (SOTA) few-shot results. This finding underscores the importance of diverse data in enabling MLLMs to excel across a wide range of tasks.
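
As a rough illustration of what such a mixture might look like inside a data loader, here is a toy sampler using those ratios; the source names are placeholders, not Apple’s pipeline:

```python
import random

# Mixture ratios reported in the paper; source names are placeholders.
MIXTURE = [
    ("image_caption_pairs", 0.45),
    ("interleaved_image_text", 0.45),
    ("text_only", 0.10),
]

def sample_source(rng: random.Random) -> str:
    """Pick the data source for the next pre-training batch by mixture weight."""
    names, weights = zip(*MIXTURE)
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name, _ in MIXTURE}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
print(counts)  # roughly 4500 / 4500 / 1000
```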

Furthermore, the inclusion of synthetic data like VeCap-300M provided a notable boost to few-shot performance, showcasing the potential of data augmentation techniques in enhancing MLLM capabilities.

Pushing the boundaries of MLLM performance

Apple’s MM1 achieves impressive results on various benchmarks, often outperforming existing models in both pre-training and fine-tuned evaluations. For example, MM1-30B surpasses models like IDEFICS-80B and Flamingo-80B on captioning tasks (COCO Captions, NoCaps, TextCaps) and on VizWiz-QA in few-shot settings.

However, it’s important to note that while MM1 demonstrates strong performance, other models like Emu2-37B and LLaVA-NeXT-34B remain competitive on certain benchmarks such as VQAv2 and OKVQA. This highlights the rapid pace of progress in the field and the need for continued innovation.

Enabling few-shot learning and multi-image reasoning

One of the standout features of MM1 is its ability to perform few-shot learning and multi-image reasoning, thanks to its pre-training on interleaved image-text data. Even after fine-tuning on single-image examples, MM1-30B maintains its capacity for multi-image reasoning, as demonstrated on the MathVista benchmark.
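
Conceptually, a few-shot multi-image prompt is just an interleaved sequence of image and text segments, the same shape as the interleaved documents seen in pre-training. A toy sketch, with segment types and file names invented for illustration:

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class ImageSegment:
    path: str  # placeholder; a real pipeline holds pixels or encoder features

@dataclass
class TextSegment:
    text: str

# Two in-context examples followed by the query, interleaving images and text.
prompt: list[Union[ImageSegment, TextSegment]] = [
    ImageSegment("example1.jpg"),
    TextSegment("Q: How many apples are on the table? A: 3"),
    ImageSegment("example2.jpg"),
    TextSegment("Q: How many apples are on the table? A: 1"),
    ImageSegment("query.jpg"),
    TextSegment("Q: How many apples are on the table? A:"),
]
```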

This capability opens up exciting possibilities for MLLMs to tackle more complex, real-world tasks that require integrating information from multiple sources. However, the computational challenges associated with processing multiple high-resolution images remain an area for further research and optimization.
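
A quick back-of-the-envelope shows why: a ViT-style encoder with 14-pixel patches emits (336 / 14)² = 576 tokens per image before any pooling, so a prompt with several high-resolution images quickly consumes thousands of sequence positions. The patch size here is an assumption; MM1’s exact token budget may differ:

```python
def visual_tokens(resolution: int, patch_size: int = 14) -> int:
    """Patch count for a square image in a ViT-style encoder."""
    return (resolution // patch_size) ** 2

for res in (224, 336):
    per_image = visual_tokens(res)
    print(f"{res}px: {per_image} tokens/image, "
          f"5 images -> {5 * per_image} visual tokens")
# 224px: 256 tokens/image, 5 images -> 1280 visual tokens
# 336px: 576 tokens/image, 5 images -> 2880 visual tokens
```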

Balancing progress and transparency

While Apple’s MM1 represents a significant advancement in MLLM technology, it’s important to acknowledge that the models themselves are not publicly available for testing or demonstration. This contrasts with the approach taken by some other prominent AI labs, which have released open-source models or provided access to their models through APIs.

As the AI community continues to push the boundaries of what’s possible with MLLMs, it’s crucial to strike a balance between making progress and promoting transparency. Sharing insights, methodologies, and, where feasible, model access can accelerate collective progress and ensure that the benefits of these technologies are widely distributed.

Looking ahead

Apple’s MM1 is a testament to the company’s dedication to advancing the field of multimodal AI. The insights gleaned from their research offer valuable guidance for the broader AI community, highlighting the importance of architectural design, data composition, and scalability in building high-performing MLLMs.

As we look to the future, it’s clear that MLLMs will play an increasingly crucial role in enabling machines to understand and interact with the world in more natural, intuitive ways. The lessons learned from MM1’s development will undoubtedly inform and inspire further innovations in this rapidly evolving field.

While Apple’s decision to keep MM1 models proprietary may limit their immediate impact, the knowledge shared through their research has the potential to catalyze progress across the AI landscape. As other researchers and organizations build upon these insights, we can anticipate a future where MLLMs become even more capable, versatile, and accessible.

In conclusion, Apple’s MM1 represents a significant milestone in the journey towards more advanced and comprehensive multimodal AI. As the field continues to evolve, it will be crucial for researchers and practitioners to collaborate, share insights, and push the boundaries of what’s possible, ensuring that the transformative potential of these technologies is realized for the benefit of all.

Check out other interesting articles on my blog, UnfoldAI.

Thanks for reading; if you liked my content and want to support me, the best way is to:

  • Connect with me on LinkedIn and GitHub, where I share free content to help you become more productive at building ML systems.
  • Follow me on X (Twitter) and Medium to get instant notifications for everything new.
  • Join my YouTube channel for upcoming insightful content.

