Voxel51

News, tutorials, tips, and big ideas in computer vision and data-centric machine learning, from the company behind open source FiftyOne. Learn more at https://voxel51.com

Visual Agents at CVPR 2025

19 min read · May 29, 2025


Why This Research Wave Matters Now

From Research to Practical Breakthroughs

The Visual Agent papers from CVPR I’m most excited about are:

How are Visual Agents Different from Vision Language Models

Output Modality: Actions vs. Text

Element Grounding

Processing High-Resolution GUI Inputs Efficiently

Managing Interleaved Vision-Language-Action History

The Action-Perception Gap

The Action Space Challenge

CLICK(x=483, y=217)
TYPE("search query")
SCROLL_DOWN(amount=0.5)
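To make the action format above concrete, here is a minimal sketch of how an agent's text output could be turned into a structured action before it is executed. The `Action` dataclass and the `parse_action` helper are hypothetical names introduced for illustration; they are not taken from any of the papers discussed here. The sketch assumes each action is written as a single Python-style call with literal arguments, as in the three examples above.

```python
import ast
from dataclasses import dataclass, field


@dataclass
class Action:
    """Hypothetical container for one parsed GUI action."""
    name: str
    args: list = field(default_factory=list)
    kwargs: dict = field(default_factory=dict)


def parse_action(text: str) -> Action:
    """Parse a string such as 'CLICK(x=483, y=217)' into an Action.

    Assumes the action is a single call with literal arguments only.
    Nothing is executed: the string is parsed into a syntax tree and
    only literal values (numbers, strings, etc.) are accepted.
    """
    try:
        tree = ast.parse(text.strip(), mode="eval")
    except SyntaxError as exc:
        raise ValueError(f"not a valid action string: {text!r}") from exc

    call = tree.body
    if not isinstance(call, ast.Call) or not isinstance(call.func, ast.Name):
        raise ValueError(f"not an action call: {text!r}")

    # literal_eval raises ValueError for anything that is not a plain literal.
    args = [ast.literal_eval(node) for node in call.args]
    kwargs = {}
    for kw in call.keywords:
        if kw.arg is None:  # rejects '**something' style arguments
            raise ValueError(f"unsupported argument in: {text!r}")
        kwargs[kw.arg] = ast.literal_eval(kw.value)

    return Action(name=call.func.id, args=args, kwargs=kwargs)


if __name__ == "__main__":
    for line in ['CLICK(x=483, y=217)',
                 'TYPE("search query")',
                 'SCROLL_DOWN(amount=0.5)']:
        print(parse_action(line))
```

Parsing into a syntax tree rather than calling `eval` keeps model output from running arbitrary code, and a malformed action surfaces as a `ValueError` that the calling loop can handle, for example by asking the model to retry.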

The Missing Embodiment

From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons

Figure 2 from the paper

Impressive Results

Key Lessons for Practitioners

ShowUI: Advanced Vision-Language-Action for GUI Interactions

ShowUI Dataset parsed into FiftyOne Format. Available on Hugging Face

The ShowUI Dataset

The ShowUI Model

Key Lessons for Practitioners

GUI-Xplore: Empowering Generalizable GUI Agents with One Exploration

Figure 2 from the paper

The GUI-Xplore Dataset

The Xplore-Agent Model

Key Lessons for Practitioners

SpiritSight Agent: Advanced GUI Agent with One Look

Figure 4 from the paper

The GUI-Lasagne Dataset

The SpiritSight Model

Key Lessons for Practitioners

ComfyBench: Benchmarking LLM-based Agents in ComfyUI for Autonomously Designing Collaborative AI Systems

The ComfyBench Framework

The ComfyBench Benchmark

Evaluation Metrics

Key Lessons for Practitioners

The Future of Visual Agents is Moving from Perception to Interaction

This transformation arrives just as digital interfaces permeate every aspect of life. The ability to automate visual interactions promises to streamline workflows, enhance accessibility, and enable entirely new capabilities.


Published in Voxel51



Written by Harpreet Sahota

🤖 Generative AI Hacker | 👨🏽‍💻 AI Engineer | Hacker-in-Residence at Voxel51
