Apple’s MM1 Large Language Model: The Secret Weapon Redefining the Future of AI

Akshay S B
Published in Mac O’Clock
9 min read · Apr 6, 2024
Photo by Medhat Dawoud on Unsplash

The other day, my sister came to me in a bit of a pickle. See, she’s studying for her MSc in Chemistry, but as part of an elective in computer science, her professor asked her and her group to research and explain this new language model called MM1 that Apple just unveiled.

Now, my sister’s no slouch when it comes to tech — she’s always been the curious one in the family, always eager to learn about the latest advancements. But when she started digging into the blog posts and research papers on MM1, she just couldn’t seem to wrap her head around all the technical jargon. That’s where I come in.

As the family’s engineer and resident tech aficionado, I knew I had to lend a hand. You see, the plan was to take this dense, tech-heavy content and present it in a way that everyone could understand — no more of that dry, academic stuff. After all, if my sister was struggling with it, I knew there had to be plenty of others out there in the same boat.

So, I sat her down, pulled up all the information I could find on MM1, and started breaking it down in a way that would make sense to her — and to all of you, my dear readers.

Thinking about it now, this is the reason why I started this blog in the first place — to take these complex tech topics and make them accessible to everyone. After all, what good is all this incredible innovation if we can’t share it in a way that everyone can understand and get excited about?

So let me tell you, this MM1 model is nothing short of a game-changer in the world of AI, and I’m thrilled to be the one to guide you through it.

Introduced just last month on March 14th, 2024, MM1 is Apple’s groundbreaking foray into the realm of large language models. And let me tell you, this isn’t just any old language model — it’s a true marvel of engineering, blending innovative architecture with a diverse array of data sets to redefine what’s possible in the world of multimodal AI.
Now, I know what you’re thinking — if this model is so revolutionary, why haven’t we heard more about it? Well, the truth is, it’s not quite publicly available yet. But trust me, it won’t be long before Apple joins the likes of OpenAI and Google in the race for large language model supremacy. And when they do, the competition better watch out, because MM1 is poised to leave them in the dust.

Screenshot by author from the research paper

At the heart of MM1’s brilliance is its unique way of blending data sets and architectural design. This model doesn’t just rely on a single data source — oh no, it leverages a delightful mix of image captions, interleaved image-text, and text-only data. And let me tell you, the results are nothing short of mind-blowing. By properly combining text and image data, MM1 has set a new benchmark in AI performance that’s going to have its rivals green with envy.

Ah, the nitty-gritty of model development — now this is where the real magic happens! Let me break down how the MM1 team went about crafting this technological marvel.

On the right side of that figure, we see the data ablations they explored. This involved carefully selecting the data sources and fine-tuning the mixing ratios of those four key data types — image captions, interleaved image-text, text-only, and even some synthetic data generated by models like GPT-4. And as they scaled up the model, they also dialed in the training hyperparameters to squeeze every last drop of performance out of it.

The process itself unfolded in four distinct steps:

1. First, they started with a small base configuration of the model and got the foundations in place.

2. Then, it was time to start tinkering — they’d change one component at a time and evaluate the impact on the model’s performance. This allowed them to really understand which pieces of the puzzle were the most crucial (you’ll find a rough sketch of this loop right after the list).

3. Armed with those insights, they were able to derive the optimal final data configuration for the MM1 model.

4. And finally, they scaled that sucker up to the multi-billion parameter range, ensuring it could handle even the most demanding real-world tasks.
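
To make that “change one thing at a time” workflow a bit more concrete, here is a minimal sketch of what such an ablation loop might look like in code. Everything in it is illustrative: the component names, the option lists, and the train_and_eval stub are stand-ins I invented, not anything from Apple’s actual pipeline.

```python
import random

# Illustrative ablation loop: start from a small base configuration, swap one
# component at a time, and keep whichever option scores best. All names and
# option values below are hypothetical stand-ins, not Apple's real setup.

base_config = {
    "image_encoder": "vit_large",
    "connector": "average_pooling",
    "image_resolution": 224,
    "data_mix": "captions+interleaved+text",
}

ablation_options = {
    "image_resolution": [224, 336, 448],
    "connector": ["average_pooling", "attention_pooling", "conv_abstractor"],
    "data_mix": ["captions_only", "captions+interleaved", "captions+interleaved+text"],
}

def train_and_eval(config):
    """Stand-in for training a small model with this config and scoring it
    on held-out benchmarks. Here it just returns a dummy number."""
    return random.random()

best_config = dict(base_config)
for component, options in ablation_options.items():
    scores = {}
    for option in options:
        trial = dict(best_config)
        trial[component] = option          # change exactly one component
        scores[option] = train_and_eval(trial)
    best_config[component] = max(scores, key=scores.get)  # keep the winner

print("Recipe to scale up:", best_config)
```

The real process is of course far more involved, but the shape is the same: a small base model, one knob turned at a time, and the winning settings carried forward into the final scaled-up run.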

Now, the base model itself was built using some of the best available components — an image encoder powered by a massive vision transformer, a vision-language connector, and a capable language model, all fed by a diverse blend of pre-training data. But it was those ablation experiments that really unlocked the true potential.
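
To help picture how those three pieces fit together, here is a highly simplified, PyTorch-style sketch of the generic pattern the paper describes: a vision-transformer image encoder, a connector that turns image features into “visual tokens”, and a language model that consumes them alongside text. This is my own toy illustration with made-up dimensions, not Apple’s implementation.

```python
import torch
import torch.nn as nn

class ToyMultimodalModel(nn.Module):
    """Toy sketch of the generic pipeline: ViT-style image encoder ->
    vision-language connector -> language model. Sizes are deliberately tiny
    and illustrative, nothing like the real MM1 configuration."""

    def __init__(self, vision_dim=256, llm_dim=512, num_visual_tokens=64, vocab_size=32000):
        super().__init__()
        # Stand-in for a large pre-trained vision transformer.
        self.image_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=vision_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Vision-language connector: pools image features down to a fixed number
        # of "visual tokens" and projects them into the LLM's embedding space.
        self.pool = nn.AdaptiveAvgPool1d(num_visual_tokens)
        self.proj = nn.Linear(vision_dim, llm_dim)
        # Stand-in for the language model and its output head.
        self.text_embed = nn.Embedding(vocab_size, llm_dim)
        self.language_model = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, image_patches, text_tokens):
        # image_patches: (batch, num_patches, vision_dim); text_tokens: (batch, seq_len)
        vision_feats = self.image_encoder(image_patches)
        pooled = self.pool(vision_feats.transpose(1, 2)).transpose(1, 2)
        visual_tokens = self.proj(pooled)
        text_embeds = self.text_embed(text_tokens)
        # Here visual tokens are simply prepended to the text; real models
        # interleave them wherever images appear in the input.
        sequence = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.lm_head(self.language_model(sequence))

model = ToyMultimodalModel()
images = torch.randn(1, 196, 256)          # 196 image patches
text = torch.randint(0, 32000, (1, 12))    # 12 text tokens
print(model(images, text).shape)           # torch.Size([1, 76, 32000])
```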

The key lessons they learned? Image resolution is king, followed closely by model size and training data composition. And when it came to the vision-language connector, it was all about the visual token count and image resolution — the specific connector type didn’t seem to matter as much.
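
To see why image resolution and visual token count are so tightly coupled, here is a quick back-of-the-envelope calculation. The patch size and pooling factor below are common example values I picked for illustration, not MM1’s actual settings.

```python
def visual_token_count(image_resolution, patch_size=14, pool_factor=2):
    """Rough count of visual tokens a ViT-style encoder hands to the LLM.

    The encoder splits the image into (resolution / patch_size)^2 patches,
    and the connector may then pool them down by pool_factor along each axis.
    Values are illustrative, not MM1's real configuration.
    """
    patches_per_side = image_resolution // patch_size
    tokens_per_side = patches_per_side // pool_factor
    return tokens_per_side ** 2

for res in (224, 336, 448):
    print(f"{res}px -> ~{visual_token_count(res)} visual tokens")
# Higher resolution means more patches and therefore more visual tokens,
# which is why those two knobs move performance (and compute cost) together.
```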

As for the data lessons, they found that interleaved image-text data was crucial for few-shot and text-only performance, while captioning data boosted zero-shot capabilities. Text-only data helped with those same few-shot and text-only tasks, and a carefully curated mix of image and text data proved to be the sweet spot for optimal multimodal performance.
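
Here is a tiny sketch of what mixing those sources during pre-training can look like in practice: a weighted sampler that decides, example by example, whether to draw from captioned images, interleaved image-text documents, or text-only data. The ratios below are placeholders I chose for illustration; the paper reports its own carefully tuned mix.

```python
import random

# Illustrative mixing weights -- placeholders, not the ratios Apple settled on.
DATA_MIX = {
    "captioned_images": 0.45,
    "interleaved_image_text": 0.45,
    "text_only": 0.10,
}

def sample_batch(datasets, batch_size=8, seed=None):
    """Draw a pre-training batch where each example's source is chosen
    according to the mixing weights above."""
    rng = random.Random(seed)
    sources = list(DATA_MIX.keys())
    weights = list(DATA_MIX.values())
    batch = []
    for _ in range(batch_size):
        source = rng.choices(sources, weights=weights, k=1)[0]
        batch.append(rng.choice(datasets[source]))
    return batch

# Toy usage with dummy examples standing in for real documents.
datasets = {
    "captioned_images": ["<img#1> a dog on a beach", "<img#2> two mugs of beer"],
    "interleaved_image_text": ["web page with <img#3> embedded mid-article"],
    "text_only": ["plain paragraph of web text"],
}
print(sample_batch(datasets, batch_size=4, seed=0))
```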

Oh, and let’s not forget the synthetic data — that stuff was a real game-changer when it came to few-shot learning. The culmination of all these insights is the awe-inspiring MM1 model we see today, with its state-of-the-art architecture and unparalleled capabilities. Talk about leaving the competition in the dust!

But the MM1 team’s exploration didn’t stop there. They delved deep into the model’s architecture, examining the crucial role of the image encoder and token count. And let me tell you, the lessons they learned are going to pave the way for a whole new era of AI that can understand and interpret the world like never before.

In fact, MM1 has been presented as a family of multimodal models, each with its own unique set of parameters and capabilities. And let me tell you, the largest 30 billion-parameter version is something to behold. I can’t wait to show you how it goes toe-to-toe with the likes of GPT-4V and Gemini — it’s a true testament to the leaps and bounds AI has made in its ability to reason and communicate.

Screenshot by author from the research paper

Let’s start by diving into some of MM1’s jaw-dropping capabilities, shall we? First up, we’ve got its uncanny knack for making predictions about the context of images. I mean, this model can detect the number of objects in an image, read text, and even estimate the weight of an object — all just by looking at a picture. Talk about a visual wizard!

Screenshot by author from the research paper

But MM1’s talents don’t stop there. Oh no, this model can also follow instructions and reason across multiple images, as we saw in that example where it calculated the price of beers based on the image and menu information. I’ve got to hand it to the MM1 team — they’ve really outdone themselves with this one.
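
Under the hood, that kind of multi-image, few-shot reasoning boils down to building a prompt that interleaves images with text. Here is a rough sketch of how such a prompt might be assembled. The message format and the “<image>” placeholder convention are assumptions on my part, since Apple has not published an API for MM1.

```python
# Hypothetical prompt assembly for few-shot, multi-image reasoning.
# The "<image>" placeholder convention is an assumption for illustration;
# MM1 has no public API, so this is not real Apple code.

def build_few_shot_prompt(examples, query):
    """Interleave worked examples (each with images, a question, and an answer)
    ahead of the final query so the model can reason in-context."""
    parts = []
    for ex in examples:
        parts.extend("<image>" for _ in ex["images"])   # image slots, in order
        parts.append(f"Question: {ex['question']}")
        parts.append(f"Answer: {ex['answer']}")
    parts.extend("<image>" for _ in query["images"])
    parts.append(f"Question: {query['question']}")
    parts.append("Answer:")
    return "\n".join(parts)

examples = [{
    "images": ["menu.jpg", "table_with_beers.jpg"],
    "question": "Based on the menu, how much would the beers on the table cost?",
    "answer": "There are two beers at $6 each, so $12 in total.",
}]
query = {"images": ["menu2.jpg", "table2.jpg"],
         "question": "How much would all the drinks in this photo cost?"}

print(build_few_shot_prompt(examples, query))
```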

Now, I know you’re probably dying to hear more about how this marvel of engineering was constructed, so let’s dive in, shall we?

The development of MM1 was a meticulous process, with the team conducting countless ablations and modifications to identify the optimal configuration. And let me tell you, the lessons they learned are pure gold. From the importance of image resolution to the role of visual token count, these insights are paving the way for a whole new era of AI that understands the world in unprecedented ways.

But the real magic of MM1 lies in its state-of-the-art architecture, which features a massive vision transformer as the image encoder and a carefully curated mix of data sources. And let me tell you, when you combine that with a family of models scaling up to 30 billion parameters, including mixture-of-experts variants, you’ve got a recipe for success that’s going to have the competition quaking in their boots.
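
Since the paper also describes mixture-of-experts variants in the MM1 family, a quick sketch of the general idea may help: instead of one giant feed-forward block, a router sends each token to a couple of specialist sub-networks. This is the textbook top-k routing pattern with made-up dimensions, not Apple’s implementation.

```python
import torch
import torch.nn as nn

class ToyMoELayer(nn.Module):
    """Generic top-k mixture-of-experts feed-forward block (illustrative only)."""

    def __init__(self, d_model=128, num_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)   # scores every expert for each token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        gates = torch.softmax(self.router(x), dim=-1)        # (batch, seq, num_experts)
        top_vals, top_idx = gates.topk(self.top_k, dim=-1)   # best experts per token
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                routed = (top_idx[..., k] == e).unsqueeze(-1)      # tokens that chose expert e
                out = out + routed * top_vals[..., k:k + 1] * expert(x)
        return out

# In a real implementation each token is dispatched only to its top_k experts,
# so total parameters grow with num_experts while per-token compute stays modest;
# this naive loop runs every expert on every token purely to keep the sketch short.
layer = ToyMoELayer()
tokens = torch.randn(2, 5, 128)
print(layer(tokens).shape)   # torch.Size([2, 5, 128])
```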

And the results speak for themselves, my friends. MM1’s pre-training performance is nothing short of jaw-dropping, outclassing even the most established models in tasks like captioning and visual question answering. And when you factor in the fine-tuning phase, where MM1 gets to strut its stuff on a diverse array of data sets — including some spicy synthetic data cooked up by GPT-4 — well, let’s just say the competition better start taking notes.

Screenshot by author from the research paper

Alright, let’s dive a little deeper into that performance comparison, shall we? In the final table, we see a side-by-side of the MM1 model with some heavy-hitters like OpenAI’s GPT-4V and Google’s Gemini. And let me tell you, the results are nothing short of jaw-dropping.

Now, I know those metrics might look like a bunch of alphabet soup at first glance, but bear with me. The one that really caught my eye was the VQA metric — that’s short for Visual Question Answering. And wouldn’t you know it, the MM1 model absolutely blows the competition out of the water on that one.
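
For context on what that metric actually measures: VQA-style benchmarks typically score a model’s free-form answer against ten human answers per question, giving full credit when the model agrees with at least three annotators. Below is a minimal sketch of that standard scoring rule, simplified from the official metric (which also normalizes answers); it is not Apple’s evaluation code.

```python
def vqa_accuracy(prediction, human_answers):
    """Standard VQA-style consensus accuracy (simplified).

    Each question comes with ~10 human answers; a prediction scores
    min(1, matches / 3), so agreeing with 3+ annotators gives full credit.
    The official metric also lowercases/normalizes answers; omitted here.
    """
    matches = sum(1 for ans in human_answers if ans == prediction)
    return min(1.0, matches / 3.0)

human_answers = ["two", "two", "2", "two", "two", "three", "two", "two", "two", "two"]
print(vqa_accuracy("two", human_answers))    # 1.0  (8 annotators agree)
print(vqa_accuracy("three", human_answers))  # ~0.33 (only 1 annotator agrees)
```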

But that’s not all — when you look at the rest of the metrics, MM1 more than holds its own against those other giants. In fact, it’s right up there with GPT-4V and Gemini in most categories. The only real outlier is the MM-Vet score, which comes in surprisingly low for MM1, though remember, at 30 billion parameters this is a comparatively modest model. But you know what they say, it’s not the size that counts, it’s how you use it. And let me tell you, this model is using every last bit of its power to deliver unparalleled performance.

So, if you’re looking for a model that can truly see the world through a whole new lens, MM1 is the one to watch. It’s like having a visual wizard at your fingertips, ready to tackle any challenge you throw its way. And with Apple’s backing, you just know this is the start of something big. Who’s ready to see what this marvel of engineering can do next?

Alright, now that we’ve covered the technical nitty-gritty, let’s dive into the really juicy stuff — the qualitative results.

Screenshot by author from the research paper
Screenshot by author from the research paper
Screenshot by author from the research paper

In these experiments, we get to see MM1 in action, answering questions about image content, reading text with pinpoint accuracy, and even interpreting human emotions based on visual cues. And let me tell you, the model’s responses are nothing short of astounding. From identifying the saltiness of water to evaluating the healthiness of different foods, MM1 is showcasing a level of nuanced understanding that’s going to revolutionize industries, from education to healthcare.

Screenshot by author from the research paper

So, there you have it, folks — a deep dive into the groundbreaking world of Apple’s MM1 large language model. I don’t know about you, but I’m absolutely giddy with excitement about what the future holds for this remarkable technology. With its innovative architecture, diverse data sets, and unparalleled performance, MM1 is poised to redefine how we interact with and understand the world around us.

And you can bet I’ll be keeping a close eye on this space, ready to share every juicy detail with you. In fact, I’m hoping we’ll see MM1 in action in the upcoming iOS 18 and macOS updates — can you imagine what kind of next-level features Apple might cook up with this technology at their fingertips?

So, who’s ready to join me on this AI adventure? I can’t wait to see where it takes us! With MM1 leading the charge, I have a feeling the future is going to be nothing short of mind-blowing. Buckle up, folks — this is going to be one wild ride.

Building a logistics company. Here I write self-help, tech, and design content, membership-free ✨ Subscribe to my newsletter: https://akshaysb.substack.com