From OCR to Multi-Image Insight: Apple’s MM1.5 with Enhanced Text-Rich Image Understanding and Visual Reasoning
Multimodal Large Language Models (MLLMs) have rapidly become a focal point in AI research. Closed-source models such as GPT-4o, GPT-4V, Gemini-1.5, and Claude-3.5 demonstrate impressive capabilities in advanced multimodal understanding.
This April, Apple introduced MM1, a family of multimodal models with up to 30 billion parameters that set new benchmarks in multimodal performance, featuring enhanced in-context learning and multi-image reasoning that enable advanced few-shot chain-of-thought prompting.
Building on MM1’s success, Apple’s new paper, MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning, introduces an improved model family aimed at enhancing capabilities in text-rich image understanding, visual grounding, and multi-image reasoning.
MM1.5 leverages a data-centric training approach, examining the effects of diverse data combinations…