UFO: agent-driven workflows in the Windows OS

Will G

Published in One Cool Thing

3 min read · May 21, 2024

Paper here. Code here.

Executive Summary for Managers and Leaders:

What is it?: An architecture built around Large Language Model (LLM)-driven agents that automates open-ended tasks and workflows using applications in the Windows OS. One agent plans the steps required to complete a task; another executes those steps as planned. The two agents use both visual and text context to plan and execute.

Why should you care?: I’m guessing pretty much everything you (and your team) do as a business happens in Windows. Even seemingly low-level administrative tasks are completed in a complex fashion across multiple applications. Think budget presentations (get values from different Excel spreadsheets, format in PowerPoint, E-mail to stakeholders via Outlook). UFO can also access often-hidden Windows capabilities that elude even frequent users of the applications (e.g., “Remove All Notes” in PowerPoint). More generally: text-vision enabled agents mark a substantial step toward even more robust human/AI collaboration.

What questions should you be asking your DS/ML folks?:

In what ways could adopting something like UFO help make us more efficient, reduce costs, or open up new possibilities for us?
What are some of the pitfalls or unexpected costs of implementing an agent-based system for this workflow?
What are some methods comparable to an agent-based approach that could help us achieve what we are looking for, but at a lower cost or complexity?

Summary for Data Scientists/ML Engineers/The-Technically-Curious:

What is it?: A multi-agent workflow for automating tasks within the Windows OS. The architecture consists of an AppAgent, which breaks the query prompt into tasks and subtasks, and an ActionAgent, which executes the AppAgent's plan. Once the ActionAgent’s activities are complete, the AppAgent validates the output, and either the subsequent task is performed or the same task is iterated. The core functionality rests on GPT-Vision and the pywinauto library. UFO can also execute workflows across multiple applications (e.g., “send an E-mail containing the most recent budget numbers”).
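The plan, execute, validate loop described above can be sketched in a few lines of Python. This is a minimal illustration and not UFO's actual code: `run_workflow`, `RunLog`, and the three callables are hypothetical names, with `plan` and `validate` standing in for the AppAgent and `execute` standing in for the ActionAgent. In a real system each callable would wrap an LLM call or a GUI action; here they are plain functions so the control flow is easy to see.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RunLog:
    """Bookkeeping for one workflow run (hypothetical helper)."""
    completed: List[str] = field(default_factory=list)
    attempts: int = 0


def run_workflow(
    request: str,
    plan: Callable[[str], List[str]],        # planner role: request -> ordered steps
    execute: Callable[[str], str],           # executor role: step -> observed result
    validate: Callable[[str, str], bool],    # planner role again: did the step succeed?
    max_retries: int = 2,
) -> RunLog:
    """Run each planned step, retrying a step until it validates."""
    log = RunLog()
    for step in plan(request):
        for _ in range(max_retries + 1):
            log.attempts += 1
            result = execute(step)
            if validate(step, result):
                log.completed.append(step)
                break  # move on to the next planned step
        else:
            # No attempt validated; stop rather than continue on a bad state.
            raise RuntimeError(f"step failed after retries: {step}")
    return log
```

The retry cap is a design choice added for the sketch; the source only says a task may be iterated, so how many times (and what happens on failure) is an assumption.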

What is cool about it?: As cool as the UFO workflow itself is, the pywinauto library is a really cool (and critical) piece of functionality. Without it, the GPT-Vision agents would lack the tool they need to perform tasks within the Windows OS GUI. pywinauto is useful outside the context of LLM agents, but even more useful with them.
A secondary cool thing is seeing a practical, multimodal use case with a useful decomposition between planning and action.
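To make the pywinauto point concrete, here is a rough sketch of how a model-chosen action might be turned into a GUI call. It is an assumption-laden illustration, not how UFO wires things up: `apply_action`, `ALLOWED_ACTIONS`, and `demo_notepad` are hypothetical names, the choice of the `uia` backend is a guess, and the Notepad window title pattern will vary by Windows version and language. The method names in the allow-list (`click_input`, `type_keys`, `set_edit_text`, `window_text`) are real pywinauto wrapper methods.

```python
from typing import Any

# Methods an agent is permitted to call on a control. The names are real
# pywinauto wrapper methods; the allow-list itself is an illustrative choice.
ALLOWED_ACTIONS = {"click_input", "type_keys", "set_edit_text", "window_text"}


def apply_action(control: Any, action: str, *args: Any, **kwargs: Any) -> Any:
    """Invoke one allow-listed method on a control wrapper."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"action not allowed: {action}")
    return getattr(control, action)(*args, **kwargs)


def demo_notepad() -> None:
    """Windows-only demo; requires `pip install pywinauto`."""
    # Imported lazily so this module still loads on machines without pywinauto.
    from pywinauto import Application

    app = Application(backend="uia").start("notepad.exe")
    window = app.window(title_re=".*Notepad.*")  # title pattern is an assumption
    window.wait("ready", timeout=10)
    apply_action(window.wrapper_object(), "type_keys",
                 "Hello from an agent", with_spaces=True)
```

Restricting the model to a small allow-list is one way to keep an LLM from calling arbitrary methods on a live desktop. Calling `demo_notepad()` on a Windows machine should open Notepad and type the text, though newer Windows releases may launch Notepad in a way that needs a different window lookup.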

Questions I am thinking about:

Is pywinauto tied to Microsoft applications specifically, or does it work with Windows OS applications in general?
How could a UFO-like approach be adapted and adopted for other multi-application workflows? Many practical tasks rely on users pulling information or taking action across multiple windows.
What are the development and adoption costs for new workflows with UFO relative to the benefits in terms of ease-of-interaction?

[1] Zhang, C. et al., UFO: A UI-Focused Agent for Windows OS Interaction, arXiv preprint 2402.07939, 2024.
