The Dawn of Visual Autonomous Agents

A new twist on AI agents could revolutionize the world of work

Michael Cunningham
7 min read · Dec 12, 2023
Created by the author using DALL·E 3

With the runaway success of ChatGPT and the explosion of applications powered by large language models (LLMs), there is a growing interest in using them as reasoning engines to power autonomous agents capable of performing a wide variety of economically useful tasks, from writing research reports and preparing presentations to conducting scientific research and crafting entire ad campaigns. At least one expert has described LLM-based agents as potentially worth trillions of dollars for the companies that can successfully build them. While autonomous agent technology is still in its infancy, a surge of interest from established companies, startups, and amateur programmers has driven swift advances in their capabilities.

Unlike more traditional LLM-powered applications such as ChatGPT, which accept a single prompt and return a single response, autonomous agents are designed to receive a task and then automatically perform a series of actions to accomplish it. For example, an agent instructed to prepare a research report on a company might obtain that company’s financials from a web search, retrieve its mission statement from the company’s website, and then use a text editor to write and format the actual report.

A typical autonomous agent architecture. Source here

The defining feature of autonomous agents is that this process occurs with little-to-no human input — in each step, the LLM at the core of the agent decides on an action to take, receives the data that action produces, and uses the results to inform its next step.

Many efforts to build such agents have gone viral on GitHub, with projects such as BabyAGI, AutoGPT, and AgentGPT racking up tens of thousands of stars. These programs use an LLM, such as GPT-4, as their reasoning engine and perform their actions purely by taking in and outputting raw text. Using carefully crafted prompts, memory modules that allow agents to remember and learn from their prior actions, and tools such as calculators and HTTP requests that the agents can invoke through their text outputs alone, clever developers have created agents that can perform many simple tasks through computer terminals and website APIs.
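To make the mechanics concrete, here is a minimal sketch of the kind of text-only action loop these frameworks implement: the model's raw output is parsed for a tool invocation, the tool's result is appended to a running memory of prior steps, and the cycle repeats until the model declares it is finished. The `call_llm` stub, the tool names, and the `Action: tool[input]` format are simplifications for illustration, not the actual prompt schema of any of the projects above.

```python
import re

# Hypothetical stand-in for a real LLM call.
def call_llm(prompt: str) -> str:
    """Send the prompt to an LLM (e.g., GPT-4) and return its raw text output."""
    raise NotImplementedError("wire this to your LLM API of choice")

# Toy tools the agent can invoke purely through its text output.
TOOLS = {
    "search": lambda q: f"(web results for {q!r})",   # placeholder web search
    "calculator": lambda expr: str(eval(expr)),        # toy calculator, demo only
}

def run_agent(task: str, max_steps: int = 10) -> str:
    memory = []  # running record of prior actions and observations
    for _ in range(max_steps):
        prompt = (
            f"Task: {task}\n"
            + "\n".join(memory)
            + "\nRespond with 'Action: tool[input]' or 'Final: answer'."
        )
        output = call_llm(prompt)
        if output.startswith("Final:"):
            return output.removeprefix("Final:").strip()
        match = re.match(r"Action:\s*(\w+)\[(.*)\]", output)
        if match:
            tool, arg = match.group(1), match.group(2)
            observation = TOOLS[tool](arg)  # execute the chosen tool
            memory.append(f"{output}\nObservation: {observation}")
    return "Gave up after max_steps."
```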

However, human beings don't interact with the world through command-line interfaces. We do most of our work through desktops and browsers via graphical user interfaces (GUIs), and not every application an autonomous agent might need to access has an API. How, then, can an LLM navigate the graphical world of human tools, where the ability to see and click is essential?

Enter visual agents

The past few years have seen the rise of a new type of LLM, known as multimodal large language models. Unlike traditional LLMs, which only accept raw text as inputs, multimodal LLMs can simultaneously take in — and sometimes output — other types of data, such as images, video, and sound.

Architectural example of a multimodal LLM. Source here

The introduction of multimodal LLMs holds the promise of overcoming traditional autonomous agents’ inability to interact with the human world of buttons, charts, pictures, videos, speech, and other crucial elements of our work which are beyond the reach of text-only language models. A visual agent running on a computer desktop could work much the way a human does — clicking on icons to open applications, browsing the internet using a traditional browser such as Chrome or Safari, and employing the full suite of productivity tools available to humans such as spreadsheets and word processors.
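As a rough sketch of what such a desktop loop might look like, the example below assumes a hypothetical `ask_vision_model` helper that sends a screenshot and an instruction to a multimodal LLM and returns pixel coordinates to click. The `pyautogui` calls for capturing the screen and clicking are real; everything else is illustrative.

```python
import time
import pyautogui  # cross-platform screenshot and mouse control

def ask_vision_model(screenshot, instruction: str) -> tuple[int, int] | None:
    """Hypothetical helper: send the screenshot plus the instruction to a
    multimodal LLM and return (x, y) pixel coordinates to click, or None
    when the model judges the task to be complete."""
    raise NotImplementedError

def visual_agent_loop(instruction: str, max_steps: int = 20) -> None:
    for _ in range(max_steps):
        screenshot = pyautogui.screenshot()      # capture the current desktop
        target = ask_vision_model(screenshot, instruction)
        if target is None:                       # model signals the task is done
            break
        pyautogui.click(*target)                 # act on the GUI like a human would
        time.sleep(1.0)                          # give the interface time to respond
```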

The potential impact of these systems on the economy is difficult to overstate. Consider how much of a knowledge worker’s time is spent on rote actions that require little technical skill — responding to emails, preparing charts and slides, filling out paperwork, and the like. A language model with access to the same information the worker has could easily perform many of these tasks autonomously, with only minimal input from the human to resolve ambiguities or correct errors.

As AI advances and agents become more and more intelligent, they will require less and less human supervision and approval to carry out their tasks. The final evolution of this process will likely be full-on AI employees, which will carry out the full range of functions that today's knowledge workers perform with little to no direct oversight. What this transition means for the future of work and power dynamics between employers and employees is another matter, but the economic effects, for good or ill, will unquestionably be tremendous.

Open-source offerings from AdeptAI

One startup in particular has taken a lead in the development of LLM-based visual agents. AdeptAI was founded in 2022 by David Luan, former vice president of engineering at OpenAI, and Ashish Vaswani and Niki Parmar, both of whom were authors on the legendary 2017 paper Attention Is All You Need, which introduced the transformer architecture and launched the LLM revolution.

Adept aims to build a visual language model that can act as a virtual copilot for any task a knowledge worker might perform on their computer. A demo released by the company shows their prototype agent finding real estate online, logging customer calls, and analyzing data, all with only a brief textual prompt from the user.

Although the company’s full-scale models remain proprietary, in October 2023, Adept open-sourced Fuyu-8b, an 8 billion parameter multimodal language model designed to be used by visual agents running on a user’s desktop. Among multimodal LLMs, Fuyu is unique for lacking an image encoder separate from the text decoder — the image patches are simply fed into the first layer of the transformer along with the text inputs, allowing images of arbitrary sizes to be used.

Fuyu-8b being used for image recognition. Source here

Fuyu-8b is available on the Hugging Face Hub and can be run through the Transformers library, allowing any user to download and use it for non-commercial purposes. Its capabilities include answering questions about visual data such as charts and web pages fed into the model, recognizing and localizing specific pieces of text within an image, and identifying objects found within a given region of the image.
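A minimal example of querying the model through Transformers, using the FuyuProcessor and FuyuForCausalLM classes the library provides, might look like the following; the file path, prompt, and generation settings are placeholders, and the model card is the authoritative reference for recommended usage.

```python
from PIL import Image
from transformers import FuyuProcessor, FuyuForCausalLM

# Assumes a CUDA GPU with enough memory for the 8B model.
model_id = "adept/fuyu-8b"
processor = FuyuProcessor.from_pretrained(model_id)
model = FuyuForCausalLM.from_pretrained(model_id, device_map="cuda:0")

# Ask a question about a locally saved chart or screenshot (placeholder path).
image = Image.open("chart.png")
prompt = "What is the highest value shown in this chart?\n"

inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda:0")
output_ids = model.generate(**inputs, max_new_tokens=32)

# Strip the prompt tokens and keep only the newly generated answer.
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```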

Adept has indicated that its larger, in-house models possess even more impressive capabilities; however, even the publicly available version of Fuyu is a useful starting point for independent developers looking to experiment with visual agents.

Democratizing visual agents with GPT-4V

With OpenAI’s November release of a preview API for GPT-4 Vision (GPT-4V), the barrier to entry to building visual agents is lower than ever.
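At the time of writing, a basic request to the preview model with the openai Python package (v1.x) looks roughly like this; the screenshot path and prompt are placeholders.

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Encode a local screenshot so it can be passed inline to the vision model.
with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the UI elements visible on this screen."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
    max_tokens=300,
)
print(response.choices[0].message.content)
```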

Within days of the release, the popular autonomous agent framework AutoGen, developed by Microsoft, OpenAI's largest investor, rolled out the ability to easily initialize a "MultimodalConversableAgent" capable of providing natural language responses to queries about images provided to it. AutoGen is designed to let multiple agents interact with each other and work as a team, making the addition of visual agents to its repertoire an important step toward fully autonomous systems that can perform complex tasks in the real world.
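A sketch of that setup, following the pattern in AutoGen's multimodal examples, is shown below; import paths and configuration details may vary between versions, so treat it as indicative rather than definitive.

```python
import autogen
from autogen.agentchat.contrib.multimodal_conversable_agent import (
    MultimodalConversableAgent,
)

# Assumes an OAI_CONFIG_LIST file containing a gpt-4-vision-preview entry.
config_list_4v = autogen.config_list_from_json(
    "OAI_CONFIG_LIST",
    filter_dict={"model": ["gpt-4-vision-preview"]},
)

image_agent = MultimodalConversableAgent(
    name="image-explainer",
    llm_config={"config_list": config_list_4v, "max_tokens": 300},
)

user_proxy = autogen.UserProxyAgent(
    name="user",
    human_input_mode="NEVER",
    code_execution_config=False,
    max_consecutive_auto_reply=1,
)

# Images are referenced inline with <img> tags in the message text
# (placeholder URL shown here).
user_proxy.initiate_chat(
    image_agent,
    message="What is shown in this chart? <img https://example.com/chart.png>",
)
```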

Unlike a truly autonomous agent, AutoGen's multimodal agent cannot directly interact with the visual elements it is shown. However, another Microsoft technology offers a way to close this loop.

Set-of-Mark prompting (SoM) is a technique in which a pretrained segmentation model divides an image into discrete regions, each given its own label ("mark"), and a multimodal LLM, in this case GPT-4V, is then prompted to pick the element most relevant to the query. This matters for visual agents because navigating a GUI requires identifying the exact point in an image to click. Unlike Fuyu, GPT-4V cannot precisely localize objects by outputting coordinates, but with SoM it can identify elements by their marks. That gives it the ability not just to interpret images but to act on them, a critical function for visual agents.
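A stripped-down illustration of the idea: given regions from any pretrained segmenter (represented here simply as bounding boxes), number them on the image, ask the vision model which number matches the request, and map that mark back to a clickable point. The `ask_gpt4v` helper and the region format are assumptions made for this sketch, not Microsoft's released implementation.

```python
from PIL import Image, ImageDraw

def overlay_marks(image: Image.Image,
                  regions: list[tuple[int, int, int, int]]) -> Image.Image:
    """Draw a numeric mark on each segmented region, given as (x1, y1, x2, y2)."""
    marked = image.copy()
    draw = ImageDraw.Draw(marked)
    for i, (x1, y1, x2, y2) in enumerate(regions, start=1):
        draw.rectangle((x1, y1, x2, y2), outline="red", width=2)
        draw.text((x1 + 3, y1 + 3), str(i), fill="red")
    return marked

def ask_gpt4v(marked_image: Image.Image, question: str) -> int:
    """Hypothetical helper: send the marked image and question to GPT-4V
    and parse the mark number it answers with."""
    raise NotImplementedError

def element_to_click(image, regions, request: str) -> tuple[int, int]:
    marked = overlay_marks(image, regions)
    mark = ask_gpt4v(marked, f"Which numbered element should I click to {request}?")
    x1, y1, x2, y2 = regions[mark - 1]
    return (x1 + x2) // 2, (y1 + y2) // 2   # centre of the chosen region
```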

GPT-4V using Set-of-Mark to locate objects in an image. Source here

The code powering SoM is publicly available on Microsoft’s GitHub and the steps for setting it up are relatively simple, making the process of initializing a visual agent easy for an experienced Python developer.

Using GPT-4V coupled with SoM, a team of researchers at Microsoft and several major universities created an agent called MM-Navigator designed to interact with smartphone GUIs. On a public benchmark of Android navigation tasks, MM-Navigator solidly outperformed several major traditional LLMs, such as GPT-3.5, PaLM 2, and Llama 2, which could only take in text descriptions of the screen. While still far from perfect, MM-Navigator demonstrates that, armed with the ability not just to see images of a GUI but to take actions on it, a multimodal LLM can act much the same way a human can — as an autonomous visual agent.

MM-Navigator being used to navigate a smartphone interface. Source here

Conclusion

The field of LLM-powered agents remains nascent, with few demonstrated real-world applications. However, the rapid progress being made in both the underlying language models and their integration into agent frameworks makes it highly likely that this state of affairs will not persist for long.

Multimodality, particularly vision, represents the major piece required to allow these agents to fully see and interact with the digital world as a human would. Given the transformative potential of autonomous entities that can perform human-like work, their impact, once they arrive, will be enormous, and the future will belong to those who can understand, build, and deploy these powerful systems.


Michael Cunningham

Machine learning engineer, AI enthusiast, former pharmaceutical researcher