Can AI Team Up with Itself for Our Benefit?

‘AI’ is the word of the year, but as in the films, the real revolution starts when one AI co-operates with another. ChatGPT has learnt to use tools, to self-reflect, and now to form teams with itself. You are the team director for a new world: who will you hire? What is your instruction?

Oliver Morris
11 min read · Nov 3, 2023

This ‘multi-agent’ AI paradigm mirrors Adam Smith’s observations during the industrial revolution about the efficiency of specialized collaboration. As we stand at the precipice of the AI age, AI-driven teams are poised to redefine information worker productivity.

This under-reported transformation is already available for your purposes. You define the team composition, select a management approach, set an objective, and watch as these models co-operate to strategize, prototype, test and deliver solutions. You can be an active manager of their team, or empower them to deliver autonomously.

This is research, not a commercial product, but the tools for managing an AI team are already freely available and productive. How did we get here?

Adam Smith’s Pin Workers Divide their Labour for Enhanced Productivity

Why Should AI Co-operate with Itself?

The hyperscalers are pursuing artificial general intelligence (AGI) with ever more massive Large Language Models (LLMs); meanwhile, resource-constrained researchers focus on smaller LLMs tuned to specialist uses and to mobile and offline tasks.

Either way, arranging such models into teams is advantageous, especially for smaller models seeking to co-operate and compete with the hyperscalers.

Teams are not without drawbacks. We’re familiar with lengthy meetings, misaligned objectives and groupthink. The cost of these overheads is not always worth the benefit.

One Revolutionary Era, Two Strategies: ‘Liberty Leading the People’ and ‘Napoleon’s Coronation’. Source: DALL·E 3

Consider recent chip and motherboard evolution. Diverse specialist chips (FPU, GPU, cryptography) have been integrated into a single CPU. This single-minded strategy is effective where tasks are clear cut, to be executed with minimal communication overhead.

However, LLMs differ from CPUs. LLMs are tasked with navigating ambiguities: planning, self-criticising and creating. These skills flourish wherever there are competing voices and co-operating specialisms. It is rumoured that GPT-4 acknowledges this, adopting an ensemble ‘mixture of experts’ approach.

Individual vs team reflects autocracy vs democracy. Collaborative dialogue leads to innovation and mitigates the errors of an individual, but discussion consumes time, and individuals can be co-opted by external influences opposed to the interests of the team. Case in point: AI startup ‘Embra’ recently shifted away from agents, citing security and cost implications[1].

That much overlooked beast of burden, the middle manager, is at the crux of coordinating specialised teams to be more productive than a generalised individual, applying waterfall, Six Sigma, agile and other management methods. It may seem peculiar to align statistical models with management theories, but such is the landscape of 2023. This summer saw the birth of LLM teams, thanks to pioneering AI team management frameworks.

[1] https://twitter.com/zachtratar/status/1694024240880861571

Teams are powerful but can fail to cohere. Individuals are single-minded but suffer overconfidence. Source: DALL·E 3

Do You Know What AI Did Last Summer?

At a high level there are three strands of research over the past six months:

  • Enabling technology in LLMs
    - e.g. a larger context window for long-form team communication
  • Training individual agents with specialist skills
    - e.g. enabling generalised agents to plan, or fine-tuning skills for specialist use
  • Frameworks for teams of agents
    - co-operating towards an objective requires processes

The graphic below summarises these strands and their key research papers.

We’ll take a moment to walk through the key developments.

What Was the First Step?

‘Agentic AI’ has been around for many years. Reinforcement learning (RL) has been the focus of research, as exemplified by the extraordinary success of DeepMind’s AlphaZero and MuZero. RL is a different technology from LLMs such as ChatGPT. LLMs might lag in gaming speed, but they shine as ‘few-shot’ learners, adapting without retraining and learning from mere prompts.

In March 2023, AutoGPT[1] emerged, quickly topping Github downloads. Touted as an AI agent that interprets goals in natural language and decomposes them into sub-tasks, it seamlessly integrated tools or other ML models at its disposal.

This marked a paradigm shift. LLMs weren’t solo operators anymore. In areas of shortfall, say arithmetic or factual recall, they could delegate. The LLM evolved into a strategic orchestrator, harmonizing tools and tasks.
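
The orchestration pattern can be sketched in a few lines of Python: a model (stubbed here) decides whether to answer directly or delegate to a named tool, and a thin harness executes its choice. The function names and the single calculator tool are illustrative assumptions, not any framework’s API.

```python
# Minimal sketch of LLM-as-orchestrator, with a stubbed "LLM" that
# delegates arithmetic (a known LLM weak spot) to a calculator tool.

def calculator(expression: str) -> str:
    """A tool the model can delegate to for exact arithmetic."""
    return str(eval(expression, {"__builtins__": {}}))

TOOLS = {"calculator": calculator}

def stub_llm(prompt: str) -> dict:
    """Stand-in for a real model: requests the calculator tool
    whenever the prompt contains digits, else answers directly."""
    if any(ch.isdigit() for ch in prompt):
        return {"tool": "calculator", "input": "127 * 49"}
    return {"answer": prompt}

def orchestrate(prompt: str) -> str:
    """The model decides; the harness executes the chosen tool."""
    decision = stub_llm(prompt)
    if "tool" in decision:
        return TOOLS[decision["tool"]](decision["input"])
    return decision["answer"]

print(orchestrate("What is 127 * 49?"))  # 6223
```

In a real system the stub is replaced by an API call, and the tool result is fed back to the model for a final answer; the division of labour is the same.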

[1] AutoGPT: Richards, 30 Mar 2023

LLMs use tools to overcome their shortcomings, just as people do. Source: DALL·E 3

Step 2: From Tool Mastery to a CV of Skills

In May 2023 a Stanford University research team augmented ChatGPT with a memory for skills, a table of successful and unsuccessful attempts at combining tools for a given objective.

When unleashed in Minecraft’s simulated realm, it didn’t require costly retraining like RL. Instead, leveraging GPT-4’s ‘common sense’, it strategized with significantly fewer missteps than RL. Introduced as ‘Voyager’[1], this showcased how adeptly LLMs can mould their vast knowledge to diverse objectives, possibly including business processes.
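
Voyager’s actual skill library stores executable code verified in-game; as a rough illustration of the idea only, a minimal sketch of such a memory might record which tool combinations succeeded or failed per objective (the class and method names here are hypothetical):

```python
# Illustrative sketch of a Voyager-style skill memory: a table of
# attempted plans per objective, so the agent can reuse what worked
# and avoid repeating what failed.

class SkillMemory:
    def __init__(self):
        self.successes = {}   # objective -> list of working plans
        self.failures = {}    # objective -> list of failed plans

    def record(self, objective, plan, succeeded):
        book = self.successes if succeeded else self.failures
        book.setdefault(objective, []).append(plan)

    def recall(self, objective):
        """Return known-good plans, plus failures to steer clear of."""
        return {
            "reuse": self.successes.get(objective, []),
            "avoid": self.failures.get(objective, []),
        }

memory = SkillMemory()
memory.record("craft pickaxe", ["mine wood", "make planks", "craft"], True)
memory.record("craft pickaxe", ["craft with bare hands"], False)
print(memory.recall("craft pickaxe"))
```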

[1] Voyager: Wang et al, 25 May 2023

Voyager learns how to craft pickaxes in Minecraft. Source: https://arxiv.org/abs/2305.16291

Step 3: Skills Worth Money

Software code is simply language with strict syntax and logic; as such, LLMs learn to code more easily than they master the nuances of human language. In August 2021, over a year before releasing ChatGPT, OpenAI issued Codex to assist with software development: a skilled individual who, when briefed, generated code snippets. Being skilled, it can charge for its time; known as ‘GitHub Copilot’, it sells for $4/mth/user up to $21/mth/user.

Then developments accelerated.

By March 2023, coding abilities were enhanced with the release of GPT-4, which codes at an impressive level and grasps developer intent, albeit with occasional error and overconfidence.

By June, OpenAI incorporated ‘functions’ into their API, allowing all developers to reliably integrate GPT-3.5 and GPT-4 into their software. Of course, every fraction of a word from those models must be paid for, which makes sense only if it is cheaper than people.
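
In outline, the June 2023 ‘functions’ feature works like this: the developer declares each function with a JSON Schema, and the model replies with a `function_call` naming the function and supplying its arguments as a JSON string, which the caller parses and dispatches. The sketch below simulates the model’s reply rather than calling the API; the weather function is OpenAI’s own stock example, and its implementation here is a stub.

```python
import json

# The shape of a 'function' declaration passed to the chat API
# (June 2023 format), plus the dispatch a caller performs when the
# model replies with a function_call. No network call is made here.

get_weather_schema = {
    "name": "get_current_weather",
    "description": "Get the current weather in a given location",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {"type": "string", "description": "City name"},
        },
        "required": ["location"],
    },
}

def get_current_weather(location: str) -> str:
    return f"Sunny in {location}"  # stub implementation

# A reply of the kind the API returns: note that the arguments arrive
# as a JSON *string* the caller must parse before dispatching.
model_reply = {"function_call": {
    "name": "get_current_weather",
    "arguments": '{"location": "London"}',
}}

call = model_reply["function_call"]
args = json.loads(call["arguments"])
result = {"get_current_weather": get_current_weather}[call["name"]](**args)
print(result)  # Sunny in London
```

The function’s return value is then sent back to the model in a follow-up message so it can compose a natural-language answer.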

By September, researchers had devised three management frameworks for coordinating teams of LLMs to write software and test it. These teams are simply GPT-3.5 or GPT-4 prompted to act as various specialists.

Notably, these frameworks aren’t exclusive to OpenAI. They can steer agents from any model, including leaner ‘7 billion parameter’ models like Falcon (by TII) and Llama 2 (by Meta). For perspective, GPT-3.5 is thought to have around 175 billion parameters. These smaller models, fit for high-end desktops and servers, present themselves as adaptable specialists. The target isn’t far off: a useful specialist on your mobile device.

By October, LLM teams in two frameworks were self-correcting their code in secure environments (like Docker), producing operational applications.

GitHub Copilot responding to a request for code. Source: GitHub Copilot.

Step 4: From Skilled Individuals to a Team

A mere week post-AutoGPT, Stanford launched “Generative Agents”[1] in a Sims-like realm, where sociable agents emulate daily routines, from work to coffee catch-ups. These agents, essentially ChatGPT in various roles, observed, remembered, reflected and responded, culminating in a party enterprisingly organised by agent ‘Isabella’.

Langchain developed this into GPTeam[2]: “Every agent within a GPTeam simulation has their own unique personality, memories, and directives, leading to interesting emergent behavior as they interact.”

[1] Generative Agents: Park et al, 7 Apr 2023

[2] GPTeam: Langchain, 16 May 2023

The ‘Generative Agents’ community hears about Isabella’s party plans. Source: https://arxiv.org/abs/2304.03442

Step 5: Managing the Team

We, the human user, become the software development manager. We specify the objective, who we want on our team, and the collaboration methodology (waterfall, agile, etc.).

AutoGen[1]

· “a framework that enables development of LLM applications using multiple agents that can converse with each other to solve tasks…seamlessly allowing human participation”

· Microsoft Research: https://github.com/microsoft/autogen

MetaGPT[2]

· “provides the entire process of a software company along with carefully orchestrated standard operating procedures”

· Deep Wisdom, Hong Kong: https://github.com/geekan/MetaGPT

ChatDev[3]

· “a virtual software company that operates through various intelligent agents holding different roles”.

· Tsinghua University: https://github.com/OpenBMB/ChatDev

The first task is to establish the team; all the frameworks allow control over what kind of team you need, or can even employ a team of teams:

[1] AutoGen: Wu et al, 16 Aug 2023

[2] MetaGPT: Hong et al, 1 Aug 2023

[3] ChatDev: Qian et al, 28 Aug 2023

ChatDev’s ‘Team of Teams’. Source: https://arxiv.org/abs/2307.07924

Each agent is an LLM instance, adopting roles like CTO, Programmer or Reviewer. Each role has a description: a prompt setting out its responsibilities. Teams typically share one team memory of progress and prompts.

GPT-3.5 or GPT-4 is the standard choice to act as each agent; however, specialist roles might employ distinct models: Codex for programming, or Inflection AI’s pi.ai for management. When a user sets an objective, these agents collaborate to meet it. The framework’s duty is to guide this dialogue, moulding the agents’ roles for optimal output.
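
A team of role-prompted agents sharing one memory can be sketched as follows; the LLM is stubbed, and all names are illustrative rather than taken from any of the three frameworks:

```python
# Minimal sketch of the role pattern: each agent is the same LLM
# behind a different system prompt, and every agent appends to one
# shared transcript that serves as the team memory.

class RoleAgent:
    def __init__(self, role, responsibilities, llm, transcript):
        self.role = role
        self.system_prompt = f"You are the {role}. {responsibilities}"
        self.llm = llm
        self.transcript = transcript   # one memory shared by the team

    def speak(self, task):
        context = "\n".join(self.transcript)  # everything said so far
        reply = self.llm(self.system_prompt, context, task)
        self.transcript.append(f"{self.role}: {reply}")
        return reply

def stub_llm(system, context, task):
    # Stand-in for a chat-completion call with a system prompt.
    return f"(responding to '{task}')"

shared = []
cto = RoleAgent("CTO", "Choose the architecture.", stub_llm, shared)
dev = RoleAgent("Programmer", "Write the code.", stub_llm, shared)
cto.speak("Design a PDF downloader")
dev.speak("Implement the design")
print(len(shared))  # 2
```

Swapping `stub_llm` for calls to different providers per role is what lets a team mix, say, a coding model with a management model.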

MetaGPT’s Default Software Company Multirole Schematic. Source: https://github.com/geekan/MetaGPT

For example, an AutoGen team was asked for code to download recent PDFs from arXiv:

AutoGen agents converse to develop software. Source: https://arxiv.org/abs/2308.08155

Agents are capable of planning before they set down code, as per this example from MetaGPT:

MetaGPT’s architect plans the development. Source: https://arxiv.org/abs/2308.00352

Most diagrams and plans can easily be written by an LLM in code format and presented visually, as above, by tools such as PlantUML or GraphViz.
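
As an illustration of diagrams-as-code (the function and its inputs are invented for this example), here is the kind of Graphviz DOT text an LLM, or a few lines of Python, can emit from a task plan:

```python
# Build a Graphviz DOT graph from a plan's task dependencies.
# Rendering (e.g. piping the text through the `dot` CLI) is left
# out; only the diagram source text is generated here.

def plan_to_dot(dependencies):
    """dependencies: list of (task, prerequisite) pairs."""
    lines = ["digraph plan {"]
    for task, prereq in dependencies:
        lines.append(f'  "{prereq}" -> "{task}";')
    lines.append("}")
    return "\n".join(lines)

dot = plan_to_dot([("write code", "design API"), ("test", "write code")])
print(dot)
```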

These team management frameworks can accommodate a human in the loop, as if we were an agent like any other. Hence teams can iteratively make proposals for our feedback and approval.

Conversational Frameworks

We are free to configure any conversational pattern so the team is optimised for the task. Critically, an agent can be an environment, managing the rules and state of a simulation or game in which the other agents act; for example, an agent can adopt the role of a chess board on which other agents compete.
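
The ‘agent as environment’ pattern can be sketched with a noughts-and-crosses board standing in for the chess example; the class and message shapes are assumptions made for illustration:

```python
# Sketch of 'agent as environment': one agent holds the rules and
# state of a game, and player agents act only through its messages.

class BoardEnvironment:
    """An agent whose role is to be the game, not to play it."""
    def __init__(self):
        self.cells = [" "] * 9   # a 3x3 board, flattened

    def handle(self, player, move):
        """Apply a player's move if legal; reply with the new state."""
        if self.cells[move] != " ":
            return {"ok": False, "state": self.cells[:]}  # illegal move
        self.cells[move] = player
        return {"ok": True, "state": self.cells[:]}

env = BoardEnvironment()
print(env.handle("X", 4))   # centre square taken by X
print(env.handle("O", 4))   # rejected: square already occupied
```

Because the environment is itself an agent in the conversation, the same frameworks that manage a software team can referee a game or run a simulation without special casing.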

Multiple conversational patterns can be adopted. Agents can be an ‘environment’, people or an LLM. Source: https://arxiv.org/abs/2308.08155

A New Recruitment Industry?

Fascinatingly, MetaGPT foresees opening an ‘AgentStore’, a recruitment site like ‘UpWork.com’ for chatbots. Ideally, they will have been fine-tuned by their experience on previous projects to become valuable additions to your own team of automated developers.

MetaGPT’s Proposed AgentStore. Source: https://twitter.com/MetaGPT_/status/1701446926195958055

“If It Ain’t Tested, It’s Broken”

Trust in software hinges on transparent objectives and rigorous testing. Current frameworks allow objective-setting, though they fall short in generating insightful tests.

Hallucination and trustworthiness are correctly cited as serious problems with the generative pretrained transformer (GPT) architecture of today’s LLMs. Yann LeCun has been particularly vocal about this[1], and alternatives are under research.

There are many techniques[2] to uncover factual errors or inconsistencies, but all are fallible; there is no algorithm for truth. All of the frameworks feature self-reflection and a tester agent. AutoGen and ChatDev teams execute their own code in Docker, review errors and rectify. Where user feedback is required, the LLMs can adopt personas of various users and review their own application; see RecAgent[3].
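
The execute-and-rectify loop can be sketched as follows. The frameworks sandbox execution in Docker; a plain subprocess stands in here, and a stubbed reviser plays the part of the LLM that receives the traceback and returns a patched draft:

```python
import os
import subprocess
import sys
import tempfile

# Sketch of the bounded self-repair loop: run the candidate, and on
# failure hand the error to a reviser for a new draft. The reviser
# here is a stub; real frameworks feed the traceback to the LLM.

def run(code: str):
    """Execute a code string in a child process; return (ok, stderr)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, text=True, timeout=10)
        return proc.returncode == 0, proc.stderr
    finally:
        os.unlink(path)

def revise(code: str, error: str) -> str:
    # Stand-in for "send the traceback to the LLM, get a patch".
    return code.replace("prnt", "print")

candidate = 'prnt("hello")'           # buggy first draft
for _ in range(3):                    # cap the repair attempts
    ok, err = run(candidate)
    if ok:
        break
    candidate = revise(candidate, err)
print(ok)  # True
```

Capping the loop matters: an agent that cannot converge should surrender the task to a human rather than burn tokens indefinitely.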

These developments are encouraging, but they assume we present precise SMART objectives, as opposed to vague aspirations. As with ChatGPT, our dialogue is only as enlightening as our inquiry, echoing Douglas Adams’ Deep Thought and its answer to life, the universe and everything.

LLM teams can extend beyond coding to counsel on any subject: business strategy, agriculture, logistics, law, accounting, medicine etc. LLMs are fallible; as with self-driving vehicles, any mission-critical application would need hard evidence of performance consistently better than humans. To do this, they will each need simulation or scenario-testing environments with an API. Many already exist but are partial simulations, e.g. in medicine, agriculture and finance.

Not all tasks are so critical. MetaGPT highlights how its teams can be configured to create content: writers, illustrators, marketers and SEO specialists co-operating to create and promote on behalf of resource-strapped businesses.

[1] I-JEPA: Assran et al, 14 Jun 2023

[2] RunGalileo.io, 19 Sep 2023; SelfCheckGPT: Manakul et al, 15 Mar 2023.

[3] RecAgent: A Novel User Simulation Paradigm. Wang et al 2023

Factual errors are not always as black and white as they first appear. Inspired by Magritte’s “Ceci n’est pas une pipe”. Source: DALL·E 3

More

For much more detail and a theoretical basis of agentic AI and teams of agents, see the survey by Xi et al (2023) in the references below.

Next Time

This post delved into Multi-Agent research.

In our next instalment, we’ll task three LLM teams with a tangible challenge: getting a high-level view of the 8,000 AI tools presently on the market, identifying saturated niches to avoid, and spotting what’s trending.

This was Part 1. Also see Part 2, Part 3, Part 4

References

Assran, M., Duval, Q., Misra, I. et al (2023). I-JEPA: Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture. arxiv.org/abs/2301.08243

Chen, W., Y. Su, J. Zuo, et al. (2023). AgentVerse: Facilitating multi-agent collaboration and exploring emergent behaviors in agents. arxiv.org/abs/2308.10848

Deng, X., Y. Gu, B. Zheng, et al. (2023). Mind2web: Towards a generalist agent for the web. arxiv.org/abs/2306.06070

Gou, Z., Z. Shao, Y. Gong, et al. (2023). CRITIC: Large language models can self-correct with tool-interactive critiquing. arxiv.org/abs/2305.11738

Gravitas, S. (2023). Auto-GPT: An Autonomous GPT-4 experiment. https://github.com/Significant-Gravitas/Auto-GPT

Gur, I., H. Furuta, A. Huang, et al. (2023). WebAgent: A real-world web agent with planning, long context understanding, and program synthesis. arxiv.org/abs/2307.12856

Langchain. GPTeam A Multi-agent Simulation. (2023). https://blog.langchain.dev/gpteam-a-multi-agent-simulation/

Manakul, P., A. Liusie, M. J. F. Gales. (2023). SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. arxiv.org/abs/2303.08896

OpenAI (2023). Function calling and other API updates. https://openai.com/blog/function-calling-and-other-api-updates

OpenAI (2023). GPT-4V(ision) system card. https://openai.com/research/gpt-4v-system-card

Packer, C., Fang, V., Patil, S., et al. (2023). MemGPT: Towards LLMs as Operating Systems. arxiv.org/abs/2310.08560

Park, J. S., J. C. O’Brien, C. J. Cai, et al. (2023). Generative agents: Interactive simulacra of human behavior. arxiv.org/abs/2304.03442

Qian, C., X. Cong, C. Yang, et al. (2023). ChatDev: Communicative agents for software development. arxiv.org/abs/2307.07924

RunGalileo.io (2023). ChainPoll: A high efficacy method for LLM hallucination detection. arxiv.org/abs/2310.18344

Shen, Y., K. Song, X. Tan, et al. (2023). HuggingGPT: Solving AI tasks with chatgpt and its friends in huggingface. arxiv.org/abs/2303.17580

Shinn, N., F. Cassano, B. Labash, et al. (2023). Reflexion: Language agents with verbal reinforcement learning. arxiv.org/abs/2303.11366

Touvron, H., Martin, L., Stone, K. et al. (2023). Llama 2: Open Foundation & Fine-Tuned Chat Models. arxiv.org/abs/2307.09288

Wang, G., Y. Xie, Y. Jiang, et al. (2023). Voyager: An open-ended embodied agent with large language models. arxiv.org/abs/2305.16291

Wang, L., J. Zhang, X. Chen, et al. (2023). RecAgent: When LLM based Agent Meets User Behavior Analysis. arxiv.org/abs/2306.02552

Wu, Q., G. Bansal, J. Zhang, et al. (2023). Autogen: Enabling next-gen LLM applications via multi-agent conversation framework. arxiv.org/abs/2308.08155

Xie, T., Zhou, F., Cheng, Z., et al. (2023). OpenAgents: An Open Platform for Language Agents in the Wild. arxiv.org/abs/2310.10634

Xi, Z., Chen, W., Guo, X. et al (2023). The Rise and Potential of Large Language Model Based Agents: A Survey. arxiv.org/abs/2309.07864

YeagerAI (2023). GenWorlds is the event-based communication framework for building multi-agent systems. https://genworlds.com/

Zhang, H., Du, W., Shan, J. (2023). Building Cooperative Agents Modularly with Large Language Models. arxiv.org/abs/2307.02485
