Can AI Team Up with Itself for Our Benefit?
‘AI’ is the word of the year, but as in the films, the real revolution starts when one AI co-operates with another. ChatGPT has learnt to use tools, to self-reflect, and now to form teams with itself. You are the team director for a new world: who will you hire? What are your instructions?
This ‘multi-agent’ AI paradigm mirrors Adam Smith’s observations during the industrial revolution about the efficiency of specialised collaboration. As we stand at the threshold of the AI age, AI-driven teams are poised to redefine information worker productivity.
This under-reported transformation is already available for your purposes. You define the team composition, select a management approach, set an objective, and watch as these models co-operate to strategise, prototype, test and deliver solutions. You can be an active manager of their team, or empower them to deliver autonomously.
This is research, not a commercial product, but the tools for managing an AI team are already freely available and productive. How did we get here?
Why Should AI Co-operate with Itself?
The hyperscalers are pursuing artificial general intelligence (AGI) with ever more massive Large Language Models (LLMs); meanwhile, resource-constrained researchers focus on smaller LLMs tuned to specialist uses and to mobile and offline tasks.
Either way, arranging such models into teams is advantageous, especially for smaller models seeking to co-operate and compete with the hyperscalers.
Teams are not without drawbacks. We’re familiar with lengthy meetings, misaligned objectives and groupthink. The cost of these overheads is not always worth the benefit.
Consider recent chip and motherboard evolution. Diverse specialist chips (FPU, GPU, cryptography) have been integrated into a single CPU. This single-minded strategy is effective where tasks are clear cut, to be executed with minimal communication overhead.
However, LLMs differ from CPUs. LLMs are tasked with navigating ambiguity: planning, self-criticising and creating. These skills flourish wherever there are competing voices and co-operating specialisms. It is rumoured that GPT4 acknowledges this, adopting an ensemble ‘mixture of experts’ approach.
Individual vs team mirrors autocracy vs democracy. Collaborative dialogue drives innovation and mitigates the errors of an individual. But discussion consumes time, and individuals can be co-opted by external influences opposed to the interests of the team. Case in point: AI startup ‘Embra’ recently shifted away from agents, citing security and cost implications[1].
That much-overlooked beast of burden, the middle manager, is at the crux of coordinating specialised teams to be more productive than a generalised individual, applying waterfall, six sigma, agile and other management methodologies. It may seem peculiar to align statistical models with management theories, but such is the landscape of 2023. This summer saw the birth of LLM teams, thanks to pioneering AI team-management frameworks.
[1] https://twitter.com/zachtratar/status/1694024240880861571
Do You Know What AI Did Last Summer?
At a high level there are three strands of research over the past six months:
- Enabling technology in LLMs
  - e.g. a larger context window for long-form team communication
- Training individual agents with specialist skills
  - e.g. enabling generalised agents to plan, or fine-tuning skills for specialist use
- Frameworks for teams of agents
  - co-operating towards an objective requires processes
The graphic below summarises these strands and their key research papers.
We’ll take a moment to walk through the key developments.
What Was the First Step?
‘Agentic AI’ has been around for many years. Reinforcement learning (RL) has been the focus of research, as exemplified by the extraordinary success of DeepMind’s AlphaZero and MuZero. RL is a different technology from LLMs such as ChatGPT. LLMs might lag in gaming speed, but they shine as ‘few-shot’ learners, adapting without retraining and learning from mere prompts.
In March 2023, AutoGPT[1] emerged, quickly topping GitHub downloads. Touted as an AI agent that interprets goals in natural language and decomposes them into sub-tasks, it seamlessly integrated tools and other ML models at its disposal.
This marked a paradigm shift. LLMs weren’t solo operators anymore. In areas of shortfall, say arithmetic or factual recall, they could delegate. The LLM evolved into a strategic orchestrator, harmonizing tools and tasks.
[1] AutoGPT: Richards, 30 Mar 2023
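This orchestration pattern is easy to sketch. In the toy example below, a hard-coded planner stands in for the LLM that AutoGPT would consult, decomposing a goal and delegating each sub-task to a specialist tool. The tool names and routing logic are illustrative, not AutoGPT’s actual design:

```python
# Toy orchestrator: a stub planner stands in for the LLM, delegating
# each sub-task to a specialist tool.
from typing import Callable

TOOLS: dict[str, Callable[[str], str]] = {
    # Arithmetic the LLM itself would fumble is delegated to a calculator.
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
    "lookup": lambda key: {"capital_of_france": "Paris"}.get(key, "unknown"),
}

def plan(goal: str) -> list[tuple[str, str]]:
    """Stub planner: a real agent would ask an LLM to decompose the goal."""
    if "interest" in goal:
        return [("calculator", "1000 * 1.05 ** 3")]
    return [("lookup", "capital_of_france")]

def run(goal: str) -> str:
    results = [TOOLS[tool](arg) for tool, arg in plan(goal)]  # delegate each step
    return results[-1]

print(run("What is 1000 at 5% compound interest after 3 years?"))
```

The LLM never performs the arithmetic itself; it only decides which specialist to call and with what input.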
Step 2: From Tool Mastery to a CV of Skills
In May 2023, an NVIDIA-led research team augmented ChatGPT with a memory for skills: a table of successful and unsuccessful attempts at combining tools for a given objective.
When unleashed in Minecraft’s simulated realm, it didn’t require costly retraining like RL. Instead, leveraging GPT4’s ‘common sense’, it strategized with significantly fewer missteps than RL. Introduced as ‘Voyager’[1], this showcased how adeptly LLMs can mold their vast knowledge to diverse objectives, possibly including business processes.
[1] Voyager: Wang et al, 25 May 2023
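The core of such a skill memory can be caricatured in a few lines: a record of which plans succeeded or failed per objective, consulted before re-planning. The class and method names below are illustrative, not Voyager’s actual API:

```python
# Illustrative skill memory: record outcomes per objective, prefer proven plans.
from collections import defaultdict

class SkillMemory:
    def __init__(self):
        self.attempts = defaultdict(list)  # objective -> [(plan, succeeded)]

    def record(self, objective: str, plan: tuple[str, ...], succeeded: bool):
        self.attempts[objective].append((plan, succeeded))

    def best_plan(self, objective: str):
        """Return a plan that previously succeeded, or None to force re-planning."""
        for plan, ok in reversed(self.attempts[objective]):
            if ok:
                return plan
        return None

memory = SkillMemory()
memory.record("craft_pickaxe", ("punch_tree", "craft_planks"), False)
memory.record("craft_pickaxe", ("punch_tree", "craft_planks", "craft_sticks"), True)
print(memory.best_plan("craft_pickaxe"))
```

Because experience lives in this table rather than in model weights, no retraining is needed when the objective changes.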
Step 3: Skills Worth Money
Software code is simply language with strict syntax and logic; as such, LLMs learn to code more easily than they master the nuances of human language. In August 2021, over a year before releasing ChatGPT, OpenAI issued Codex to assist with software development: ‘a skilled individual’ who, when briefed, generated code snippets. Being skilled, it charges for its time: sold as ‘GitHub Copilot’, it costs from $4/mth/user up to $21/mth/user.
Then developments accelerated.
By March 2023, coding abilities were enhanced with the release of GPT4, which codes at an impressive level and grasps developer intent, albeit with occasional errors and overconfidence.
By June, OpenAI had incorporated ‘functions’ into their API, allowing all developers to reliably integrate GPT3.5 and GPT4 into their software. Of course, every fraction of a word from those tools must be paid for; that makes sense, if it is cheaper than people.
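The mechanism is worth a sketch: the developer declares a callable signature as a JSON schema, and the model replies with a function name plus JSON-encoded arguments for the application to execute. The model’s reply below is simulated so the sketch runs offline; only the schema shape follows OpenAI’s June 2023 announcement, and the invoice function is a made-up example:

```python
# The developer declares a function signature as JSON schema; the model
# replies with a name and JSON arguments. The reply here is simulated
# so the sketch runs without an API call.
import json

functions = [{
    "name": "get_invoice_total",
    "description": "Sum the line items of an invoice",
    "parameters": {
        "type": "object",
        "properties": {"amounts": {"type": "array", "items": {"type": "number"}}},
        "required": ["amounts"],
    },
}]

def get_invoice_total(amounts):
    return sum(amounts)

# What the model would return inside its chat completion:
simulated_reply = {"function_call": {
    "name": "get_invoice_total",
    "arguments": json.dumps({"amounts": [120.5, 30.0]}),
}}

call = simulated_reply["function_call"]
args = json.loads(call["arguments"])
result = {"get_invoice_total": get_invoice_total}[call["name"]](**args)
print(result)
```

This reliable structured hand-off is what lets frameworks wire LLM agents into ordinary software.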
By September, researchers had devised three management frameworks for coordinating teams of LLMs to write and test software. These teams are simply GPT3.5 or GPT4 prompted to role-play various specialists.
Notably, these frameworks aren’t exclusive to OpenAI. They can steer agents from any model, including leaner ‘7 billion parameter’ models like Falcon (by TII) and Llama2 (by Meta). For perspective, GPT3.5 boasts 175 billion parameters. These smaller models, fit for high-end desktops and servers, present themselves as adaptable specialists. The target isn’t far off: a useful specialist on your mobile device.
By October, LLM teams in two frameworks were self-correcting their code in secure environments (like Docker), producing operational applications.
Step 4: Skilled Individuals to a Team
A mere week post-AutoGPT, Stanford launched “Generative Agents”[1] in a Sims realm, where sociable agents emulate daily routines, from work to coffee catch-ups. These agents, essentially ChatGPT in various roles, observed, remembered, reflected, and responded, culminating in a party enterprisingly organised by agent ‘Isabella’. See at right.
Langchain developed this into GPTeam[2]. “Every agent within a GPTeam simulation has their own unique personality, memories, and directives, leading to interesting emergent behavior as they interact.”
Step 5: Managing the Team
We, the human user, become the software development manager. We specify the objective, who we want on our team and the collaboration methodology (waterfall / agile / etc).
AutoGen[1]
· “a framework that enables development of LLM applications using multiple agents that can converse with each other to solve tasks…seamlessly allowing human participation”
· Microsoft Research: https://github.com/microsoft/autogen
MetaGPT[2]
· “provides the entire process of a software company along with carefully orchestrated standard operating procedures”
· Deep Wisdom, Hong Kong: https://github.com/geekan/MetaGPT
ChatDev[3]
· “a virtual software company that operates through various intelligent agents holding different roles”.
· Tsinghua University: https://github.com/OpenBMB/ChatDev
The first task is to establish the team. All the frameworks allow control over the kind of team you need; you can even employ a team of teams:
[1] AutoGen: Wu et al, 16-August 2023
Each agent is an LLM instance, adopting roles like CTO, Programmer, or Reviewer. Each role has a description: a prompt setting out its responsibilities. Teams typically share one team memory of progress and prompts.
GPT3.5 or GPT4 are the standard choices to act as each agent; however, specialist roles might employ distinct models: Codex for programming, or Inflection AI’s pi.ai for management. When a user sets an objective, these agents collaborate to meet it. The framework’s duty is to guide this dialogue, moulding the agents’ roles for optimal output.
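Stripped to essentials, such a framework is little more than role prompts, one shared transcript and a loop deciding who speaks next. The sketch below substitutes a stub for the LLM call; a real framework would send each role prompt plus the transcript to GPT3.5/GPT4 (or a specialist model per role), and the role descriptions here are illustrative:

```python
# Skeleton of a multi-agent framework: role prompts + shared memory + turn loop.
# `stub_llm` stands in for a real chat-model call.
def stub_llm(role_prompt: str, transcript: list[str]) -> str:
    # A real framework would send role_prompt + transcript to an LLM here.
    return f"[{role_prompt.split(',')[0]}] reviewed {len(transcript)} messages"

class Agent:
    def __init__(self, name: str, role_prompt: str, llm=stub_llm):
        self.name, self.role_prompt, self.llm = name, role_prompt, llm

    def speak(self, transcript: list[str]) -> str:
        return self.llm(self.role_prompt, transcript)

team = [
    Agent("cto", "CTO, sets architecture and reviews plans"),
    Agent("dev", "Programmer, writes the code"),
    Agent("qa", "Reviewer, tests and critiques the code"),
]
transcript = ["Objective: build a PDF downloader for arxiv"]  # shared team memory
for turn in range(3):                     # simple round-robin management
    speaker = team[turn % len(team)]
    transcript.append(speaker.speak(transcript))

print(len(transcript))  # the objective plus three turns
```

The management methodology (waterfall, agile and so on) lives in two places: the role prompts and the rule for choosing the next speaker.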
For example, an AutoGen team was asked for code to download recent PDFs from arxiv:
Agents are capable of planning before they set down code, as per this example from MetaGPT:
Most diagrams and plans can easily be written by an LLM in code format and presented visually, as above, by tools such as PlantUML or GraphViz.
These team-management frameworks can accommodate a human in the loop, as if we were an agent like any other. Hence teams can iteratively make proposals for our feedback and approval.
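Treating the human as just another agent keeps that loop uniform. In this sketch the ‘human’ replies are scripted so it runs unattended; a real framework would gather them from the console, and the proposal logic is a stub for the AI team:

```python
# Iterative propose/approve loop with the human modelled as one more agent.
from collections import deque

human_replies = deque(["add error handling", "approved"])  # scripted stand-in

def ai_propose(feedback):
    # Stub for the AI team's next proposal, conditioned on prior feedback.
    return "draft v2 with error handling" if feedback else "draft v1"

feedback, proposal = None, None
while True:
    proposal = ai_propose(feedback)
    feedback = human_replies.popleft()   # a real loop would call input() here
    if feedback == "approved":
        break

print(proposal)
```

Swapping the scripted queue for `input()` turns the same loop into interactive management; leaving it scripted is full autonomy.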
Conversational Frameworks
We are free to configure any conversational pattern, so the team is optimised for the task. Critically, an agent can act as an environment, managing the rules and state of a simulation or game in which the other agents act. For example, an agent can adopt the role of a chess board on which other agents compete.
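The environment-as-agent idea can be sketched with a game simpler than chess. Below, a Nim ‘board’ agent owns the rules and state, while identical stub player agents merely choose moves; the names and the greedy policy are illustrative, not any framework’s API:

```python
# An agent as the environment: the 'board' holds state and enforces rules,
# while player agents (stubs here) only choose moves. Nim stands in for chess.
class NimBoard:
    """Environment agent: 12 counters, take 1-3 per turn, taking the last wins."""
    def __init__(self, counters=12):
        self.counters = counters

    def apply(self, take: int) -> bool:
        if not 1 <= take <= min(3, self.counters):
            raise ValueError("illegal move")
        self.counters -= take
        return self.counters == 0          # True => the mover has just won

def greedy_player(board: NimBoard) -> int:
    # Stub policy: leave a multiple of 4 when possible (Nim's winning strategy).
    return board.counters % 4 or 1

board, players, turn = NimBoard(), ["alice", "bob"], 0
while True:
    move = greedy_player(board)
    if board.apply(move):
        break
    turn += 1

print(players[turn % 2], "wins")
```

In a real framework the board and both players would each be an LLM instance; the pattern, one agent adjudicating the others, is identical.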
A New Recruitment Industry?
Fascinatingly, MetaGPT foresees opening an ‘AgentStore’: a recruitment site like UpWork.com, but for chatbots. Ideally, these will have been fine-tuned by their experience on previous projects, making them valuable additions to your own team of automated developers.
“If It Ain’t Tested, It’s Broken”
Trust in software hinges on transparent objectives and rigorous testing. Current frameworks allow objective-setting, though they fall short in generating insightful tests.
Hallucination and trustworthiness are correctly cited as serious problems with the generative pretrained transformer (GPT) architecture of today’s LLMs. Yann LeCun has been particularly vocal about this[1]; alternatives are under research.
There are many techniques[2] to uncover factual errors or inconsistencies, but all are fallible: there is no algorithm for truth. All of the frameworks feature self-reflection and a tester agent. AutoGen and ChatDev teams execute their own code in Docker, review errors and rectify them. Where user feedback is required, the LLMs can adopt the personas of various users and review their own application; see RecAgent[3].
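That execute-review-rectify cycle is itself simple to sketch: run candidate code in a sandbox, hand any traceback to a fixer, and retry. Here the sandbox is a plain subprocess rather than Docker, and the fixer is a hard-coded stub where a real team would consult the LLM:

```python
# Self-correction loop: execute candidate code, feed errors back, retry.
import subprocess, sys

def run_in_sandbox(code: str):
    """Run code in a subprocess (real frameworks isolate this in Docker)."""
    proc = subprocess.run([sys.executable, "-c", code],
                          capture_output=True, text=True, timeout=10)
    return proc.returncode == 0, proc.stderr

def stub_fixer(code: str, error: str) -> str:
    # Stand-in for an LLM repair step: a real agent reads the traceback.
    if "ZeroDivisionError" in error:
        return code.replace("1 / 0", "1 / 1")
    return code

code = "print(1 / 0)"
for attempt in range(3):
    ok, error = run_in_sandbox(code)
    if ok:
        break
    code = stub_fixer(code, error)

print("fixed on attempt", attempt + 1)
```

The retry cap matters: with a fallible fixer the loop must terminate even when no repair is found.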
These developments are encouraging, but they assume we present precise SMART objectives rather than vague aspirations. As with ChatGPT, our dialogue is only as enlightening as our inquiry, echoing Douglas Adams’ Deep Thought and its answer to life, the universe and everything.
LLM teams can extend beyond coding to counsel on any subject: business strategy, agriculture, logistics, law, accounting, medicine and more. LLMs are fallible; as with self-driving vehicles, any mission-critical application would need hard evidence of performance consistently better than humans. To achieve this, each domain will need simulation or scenario-testing environments with an API. Many already exist, but only as partial simulations, e.g. in medicine, agriculture and finance.
Not all tasks are so critical. MetaGPT highlights how its teams can be configured to create content: writers, illustrators, marketers and SEO specialists co-operating to create and promote on behalf of resource-strapped businesses.
[1] I Jepa: Assran et al, 14 Jun 2023
[2] RunGalileo.io, 19-Sep-2023. SelfCheckGPT, Manakul et al, 15-Mar-2023.
[3] RecAgent: A Novel User Simulation Paradigm. Wang et al 2023
More
For much more detail and a theoretical basis of agentic AI and teams of agents, see:
- ‘The Rise and Potential of LLM Based Agents: A Survey.’ Xi et al, 19-Sep-2023, https://arxiv.org/abs/2309.07864
- ‘A Survey on LLM Based Autonomous Agents.’ Wang et al, 22-Aug-2023, https://arxiv.org/abs/2308.11432
Next Time
This post delved into Multi-Agent research.
In our next instalment, we’ll task three LLM teams with a tangible challenge: Getting a high level view of the 8,000 AI tools presently in the market, identifying saturated niches to avoid and what’s trending.
This was Part 1. Also see Part 2, Part 3, Part 4
References
Assran, M., Duval, Q., Misra, I. et al (2023). I-JEPA: Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture. arxiv.org/abs/2301.08243
Chen, W., Y. Su, J. Zuo, et al. (2023). AgentVerse: Facilitating multi-agent collaboration and exploring emergent behaviors in agents. arxiv.org/abs/2308.10848
Deng, X., Y. Gu, B. Zheng, et al. (2023). Mind2web: Towards a generalist agent for the web. arxiv.org/abs/2306.06070
Gou, Z., Z. Shao, Y. Gong, et al. (2023). CRITIC: large language models can self-correct with tool-interactive critiquing. arxiv.org/abs/2305.11738, 2023
Gravitas, S. (2023). Auto-GPT: An Autonomous GPT-4 experiment, 2023. https://github.com/Significant-Gravitas/Auto-GPT
Gur, I., H. Furuta, A. Huang, et al. (2023). WebAgent: A real-world web agent with planning, long context understanding, and program synthesis. arxiv.org/abs/2307.12856
Langchain. GPTeam A Multi-agent Simulation. (2023). https://blog.langchain.dev/gpteam-a-multi-agent-simulation/
Manakul, P., A. Liusie, M. J. F. Gales. (2023). SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. arxiv.org/abs/2303.08896
OpenAI (2023). Function calling and other API updates. https://openai.com/blog/function-calling-and-other-api-updates
OpenAI (2023). GPT-4V(ision) system card. https://openai.com/research/gpt-4v-system-card
Packer, C., Fang, V., Patil, S., et al. (2023). MemGPT: Towards LLMs as Operating Systems. arxiv.org/abs/2310.08560
Park, J. S., J. C. O’Brien, C. J. Cai, et al. (2023). Generative agents: Interactive simulacra of human behavior. arxiv.org/abs/2304.03442
Qian, C., X. Cong, C. Yang, et al. (2023). ChatDev: Communicative agents for software development. arxiv.org/abs/2307.07924
RunGalileo.io (2023). Chainpoll: A high efficacy method for LLM hallucination detection. https://arxiv.org/abs/2310.18344
Shen, Y., K. Song, X. Tan, et al. (2023). HuggingGPT: Solving AI tasks with chatgpt and its friends in huggingface. arxiv.org/abs/2303.17580
Shinn, N., F. Cassano, B. Labash, et al. (2023). Reflexion: Language agents with verbal reinforcement learning. arxiv.org/abs/2303.11366
Touvron, H., Martin, L., Stone, K. et al. (2023). Llama 2: Open Foundation & Fine-Tuned Chat Models. arxiv.org/abs/2307.09288
Wang, G., Y. Xie, Y. Jiang, et al. (2023). Voyager: An open-ended embodied agent with large language models. arxiv.org/abs/2305.16291
Wang, L., J. Zhang, X. Chen, et al. (2023). RecAgent: When LLM based Agent Meets User Behavior Analysis. arxiv.org/abs/2306.02552
Wu, Q., G. Bansal, J. Zhang, et al. (2023). Autogen: Enabling next-gen LLM applications via multi-agent conversation framework. arxiv.org/abs/2308.08155
Xie, T., Zhou, F., Cheng, Z., et al. (2023). OpenAgents: An Open Platform for Language Agents in the Wild. arxiv.org/abs/2310.10634
Xi, Z., Chen, W., Guo, X. et al (2023). The Rise and Potential of Large Language Model Based Agents: A Survey. arxiv.org/abs/2309.07864
YeagerAI (2023). GenWorlds is the event-based communication framework for building multi-agent systems. https://genworlds.com/
Zhang, H., Du, W., Shan, J. (2023). Building Cooperative Agents Modularly with Large Language Models. arxiv.org/abs/2307.02485