Ferret-UI — advancing mobile UI understanding with Multimodal Language Models

Simeon Emanuilov
Apr 10, 2024

Ferret-UI is an innovative multimodal large language model (MLLM) that pushes the boundaries of mobile user interface (UI) understanding. By combining an optimized architecture, rich training data, and strong referring and grounding capabilities, Ferret-UI demonstrates remarkable proficiency in comprehending and interacting with UI screens.

Link to original research: https://arxiv.org/abs/2404.05719

Ferret-UI performs referring tasks (e.g., widget classification, icon recognition, OCR) with flexible input formats (point, box, scribble) and grounding tasks (e.g., find widget, find icon, find text, widget listing) on mobile UI screens.

Key features of Ferret-UI

1. Architecture

Ferret-UI builds upon the Ferret MLLM architecture, with a key enhancement called “any resolution.” This modification allows the model to flexibly handle the varied aspect ratios commonly found in UI screens. The approach involves dividing each screen into sub-images based on its original aspect ratio. Portrait screens are split horizontally, while landscape screens are divided vertically. These sub-images are then encoded separately, enabling the model to capture fine visual details that might be lost in a single resized image.
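To make the idea concrete, here is a minimal sketch of how such an aspect-ratio-based split could be implemented. The paper does not release this code, so the two-sub-image split and the PIL-based helper below are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of the "any resolution" split — not the paper's code.
# Assumption: each screen yields the full image plus two sub-images.
from PIL import Image


def split_screen(img: Image.Image) -> list[Image.Image]:
    """Return the full screen plus two sub-images chosen by aspect ratio."""
    w, h = img.size
    if h >= w:
        # Portrait: split horizontally into top and bottom halves
        subs = [img.crop((0, 0, w, h // 2)), img.crop((0, h // 2, w, h))]
    else:
        # Landscape: split vertically into left and right halves
        subs = [img.crop((0, 0, w // 2, h)), img.crop((w // 2, 0, w, h))]
    return [img] + subs  # each crop is encoded separately by the image encoder


# Example: a portrait, iPhone-sized placeholder screenshot
screens = split_screen(Image.new("RGB", (750, 1334)))
print([s.size for s in screens])  # [(750, 1334), (750, 667), (750, 667)]
```

Encoding each half at (or near) its native resolution is what preserves the small text and icons that would blur if the whole screen were squeezed into a single fixed-size input.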

Overview of the Ferret-UI-anyres architecture

2. Training data

To equip Ferret-UI with comprehensive UI understanding skills, the researchers meticulously curated rich datasets for both elementary and advanced UI tasks.

Elementary tasks include:

  • Referring tasks: OCR, icon recognition, widget classification;
  • Grounding tasks: find text, find icon, find widget, widget listing.

These tasks serve to build a strong foundation of visual and spatial knowledge about UI elements.

Advanced tasks encompass:

  • Detailed description;
  • Perception/interaction conversations;
  • Function inference.

By training on this diverse range of tasks, Ferret-UI gains the ability to engage in nuanced discussions about UI screens, propose goal-oriented actions, and deduce the overall purpose of a screen.
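For intuition, here is what single training records for a grounding task (find text) and a referring task (OCR on a boxed region) might look like. The field names, coordinate convention, and file paths below are hypothetical; the paper does not prescribe this schema.

```python
# Hypothetical training records — field names and the [x1, y1, x2, y2] pixel-box
# convention are assumptions for illustration, not the paper's released format.
grounding_sample = {
    "task": "find_text",
    "image": "screens/settings_portrait.png",        # hypothetical path
    "question": 'Where is the text "Wi-Fi" on the screen?',
    "answer": '"Wi-Fi" is located at [120, 310, 248, 356].',
}

referring_sample = {
    "task": "ocr",
    "image": "screens/settings_portrait.png",
    "region": [120, 310, 248, 356],                   # box the question refers to
    "question": "What text is shown in the indicated region?",
    "answer": "Wi-Fi",
}
```

Grounding tasks ask the model to produce coordinates from a textual query, while referring tasks ask it to describe a region it is given, which is why training on both directions builds complementary spatial skills.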

3. Benchmark

To rigorously evaluate Ferret-UI’s performance, the researchers established a comprehensive test benchmark covering 11 UI tasks for both iPhone and Android screens. They also included 3 tasks from the prior Spotlight benchmark. This extensive test set allowed them to compare Ferret-UI against open-source MLLMs and the powerful GPT-4V model.

Results

Ferret-UI demonstrated superior performance across various benchmarks:

1. Spotlight benchmark

  • Surpassed open-source MLLMs on the screen2words, widget captions, and taperception tasks

2. Elementary UI tasks

  • Achieved 82.4% accuracy on both iPhone and Android elementary tasks
  • Significantly outperformed GPT-4V, which obtained 61.3% on iPhone and only 37.7% on Android tasks

3. Advanced UI tasks

  • Scored an impressive 93.9% on iPhone advanced tasks and 71.7% on Android
  • Surpassed the Fuyu and CogAgent models on these challenging tasks

Ablation studies

The researchers conducted ablation experiments to gain deeper insights into Ferret-UI’s performance. Key findings include:

1. Impact of “any resolution”

  • Adding “any resolution” improved iPhone elementary task accuracy by 2%

2. Role of elementary task training data

  • Training on elementary tasks boosted advanced task performance by 3–9%
  • This highlights the importance of building foundational UI knowledge

Conclusion

Ferret-UI represents a significant leap forward in mobile UI understanding, combining an optimized architecture, comprehensive training data, and robust referring and grounding abilities. Its strong performance across various benchmarks showcases its potential to enable exciting new applications, such as enhancing UI accessibility for users.

By exploring the intricacies of UI screens and demonstrating a keen understanding of both individual elements and overall screen functions, Ferret-UI paves the way for more intuitive and effective human-computer interaction in the mobile domain. As research in this field continues to advance, models like Ferret-UI will play a crucial role in shaping the future of user interface design and user experience.

Thanks for reading; if you liked my content and want to support me, the best way is to:

  • Connect with me on LinkedIn and GitHub, where I keep sharing free content to help you become more productive at building ML systems.
  • Follow me on X (Twitter) and Medium to get instant notifications for everything new.
  • Join my YouTube channel for upcoming insightful content.
