Introducing Apple’s Ferret 7B: A Leap Forward in Multimodal Large Language Models

Sanu Oluwaseun
2 min read · Jan 7, 2024


Apple recently unveiled its new multimodal machine learning model, Ferret. It marks an important step in Apple's ongoing AI development efforts to improve services like Siri. Though still early research, Ferret demonstrates Apple's seriousness about advancing AI capabilities across its products.

At its core, Ferret is a large language model capable of "grounding": understanding visual inputs paired with text prompts. This lets users interact with images in context, for example by specifying parts of an image to condition the model's response. Ferret was open sourced in October, but its code and checkpoints only recently became fully available.
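In practice, referring to a region means embedding its coordinates into the text prompt so the model's answer is conditioned on that part of the image. As a rough illustration only (the tag format and helper name here are assumptions, not Ferret's actual API), such a prompt might be composed like this:

```python
# Sketch: composing a region-grounded prompt for a referring
# multimodal model. The bracketed coordinate tag is illustrative,
# not Ferret's real prompt syntax.

def make_region_prompt(question: str, box: tuple) -> str:
    """Embed a bounding box (x1, y1, x2, y2) into the text prompt
    so the model can condition its answer on that image region."""
    x1, y1, x2, y2 = box
    region_tag = f"[{x1}, {y1}, {x2}, {y2}]"
    return f"What is the object {region_tag} in the image? {question}"

prompt = make_region_prompt("Describe its color.", (40, 60, 200, 220))
print(prompt)
```

The key idea is that the region reference lives in the same token stream as the question, so the language model can relate "the object at these coordinates" to the rest of the conversation.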

Ferret showcases Apple's progress in multimodal AI. As Apple continues to lead in on-device ML, advancements on iOS translate to macOS too, since both leverage Apple silicon. With tools like MLX, more models can run efficiently on Apple chips. Ferret is likely intended for future iOS devices, while more powerful iterations may target Macs.

Compared to Siri's limited abilities, Ferret represents a huge leap forward. With quantized 7B-class models, Apple may soon have AI approaching the capabilities of large models like GPT-4, but optimized for mobile. This could translate to big iOS 18 improvements.
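Quantization is what makes a 7B-parameter model plausible on a phone: weights are stored in fewer bits (for example 8-bit integers instead of 16-bit floats), trading a little accuracy for a large memory saving. A minimal sketch of symmetric 8-bit quantization, purely to show the idea (not Apple's actual scheme):

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor int8 quantization: map floats into [-127, 127]."""
    scale = np.abs(weights).max() / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the int8 codes."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
# Each weight now costs 1 byte instead of 2-4; the reconstruction
# error is bounded by half a quantization step (s / 2).
print("max error:", np.abs(w - w_hat).max())
```

Production schemes are more sophisticated (per-channel scales, 4-bit grouping), but the memory arithmetic is the same: int8 cuts a 7B model from roughly 14 GB at float16 to about 7 GB.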

Ferret was trained on GRIT, a grounding-and-referring instruction-tuning dataset curated by Apple, focusing on grounding knowledge: understanding spatial and visual relationships. Apple also introduced an accompanying benchmark, Ferret-Bench, built to showcase the model's strengths, as many companies do for their models. Training leveraged Nvidia GPUs, unsurprising given Apple's current compute limitations for large-scale training.

Interestingly, Ferret builds on existing open research rather than starting from scratch: it uses the open-source Vicuna language model as its base and follows the LLaVA line of multimodal work, showing Apple's willingness to leverage the research community. This further validates the impressive capabilities of those open models.

Ferret's training allows smaller or larger versions targeting different Apple devices. The 7B-parameter model looks suited to iOS, while the 13B version may be for Macs. Apple's incremental, multi-year AI cycles could mean steady improvements across its product lines.

Under the hood, Ferret incorporates a spatial-aware visual sampler to understand relationships between regions in images. This supports segmentation-like abilities such as identifying subjects, already familiar from Apple's on-device vision APIs. Apple takes a practical approach focused on near-term utility over pure research.
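The core idea of such a sampler is to pick feature points inside a referred region and pool them into one compact region embedding the language model can attend to. A simplified NumPy sketch of that sample-then-pool step (the real sampler is learned and hierarchical; this only illustrates the concept):

```python
import numpy as np

def sample_region_features(feature_map, mask, n_points=32, seed=0):
    """Randomly sample feature vectors inside a region mask and
    average-pool them into a single region embedding.

    feature_map: (H, W, C) image features
    mask:        (H, W) boolean region of interest
    """
    rng = np.random.default_rng(seed)
    ys, xs = np.nonzero(mask)                 # pixel coords inside region
    k = min(n_points, len(ys))
    idx = rng.choice(len(ys), size=k, replace=False)
    sampled = feature_map[ys[idx], xs[idx]]   # (k, C) sampled features
    return sampled.mean(axis=0)               # (C,) pooled region embedding

feats = np.random.rand(8, 8, 16)              # toy 8x8 feature map, 16 channels
mask = np.zeros((8, 8), dtype=bool)
mask[2:6, 2:6] = True                         # a square region of interest
emb = sample_region_features(feats, mask)
print(emb.shape)                              # one vector per region
```

Because the pooled embedding has a fixed size regardless of the region's shape, points, boxes, and free-form scribbles can all be handled by the same mechanism.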

Testing shows Ferret's impressive multimodal abilities: it can ground spatial relationships, reason over the full image context, and describe specifics when conditioned on particular image regions. This showcases real progress in contextual visual understanding.

In conclusion, Ferret represents an important milestone showing Apple's commitment to leading in AI. As Apple focuses its multi-year roadmap on AI, Ferret proves its seriousness about advancing multimodal AI across its products. Combined with Apple's prowess in on-device deep learning, Ferret underscores its goal of providing the best AI platform overall.
