Which AI Team Framework Hustles Hardest?

Oliver Morris
7 min read · Nov 5, 2023


Part 2. (Part 1, Part 3, Part 4)

Starting November 1st, 2023, enterprise users of Microsoft Office have been granted the power to conjure Excel formulas and PowerPoint slides from the ether with nothing but a whisper to Office Copilot.

Never has the office worker’s daily affirmation of “I am not a robot” been more necessary. In the previous article we delved into LLM teams; this time we set three of those AI team frameworks a meaningful challenge: to analyze their own market.

Office workers goof off as the AI clicks the ‘I’m not a robot’ checkbox on their behalf. Source: DALL·E 3

The Challenge

The challenge was as follows (a minimal sketch of the pipeline in code follows the list):

a) Read data listing 8,000 AI tools currently on the market and the number of ‘likes’ they receive
b) Cluster those into a manageable number of groups (~ 20 ish)
c) Create summaries of each group (via ChatGPT)
d) Plot the clusters, preferably as a TreeMap
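
For orientation, here is a rough sketch of what that pipeline can look like in Python. The file name and column names are assumptions for illustration, not the actual dataset schema, and TF-IDF stands in for whatever text representation a real solution might use:

```python
# Minimal baseline sketch of the challenge pipeline.
# Assumption: ai_tools.csv with columns 'name', 'description', 'likes'.
import pandas as pd
import plotly.express as px
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# a) Read the data: ~8,000 tools and the 'likes' they receive
df = pd.read_csv("ai_tools.csv")

# b) Cluster the tool descriptions into ~20 groups
tfidf = TfidfVectorizer(stop_words="english", max_features=5000)
X = tfidf.fit_transform(df["description"].fillna(""))
df["cluster"] = KMeans(n_clusters=20, n_init=10, random_state=42).fit_predict(X)

# c) Summaries of each group would come from a ChatGPT call per cluster;
#    a placeholder label stands in for that step here.
df["cluster_label"] = "Cluster " + df["cluster"].astype(str)

# d) Plot the clusters as a TreeMap, sized by total likes
px.treemap(df, path=["cluster_label", "name"], values="likes").show()
```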

Teams tended to arrive at wildly different solutions, which made it difficult to compare quality. So the problem was broken down into a sequence of steps for them to complete.

AutoGen was given a second chance at an unscripted response, using GPT-4-Turbo’s new 128k context window. See the next article for the high-quality code which resulted.

To represent humanity in the competition, I created a baseline solution. Notebook visible here. Naturally, that involved some help from GPT-4 Advanced Analytics, because AI coding assistance is ubiquitous these days, like driving with SatNav.

Adam Smith’s factory workers, at the dawn of a boom in labour productivity. Does the next revolution follow from “pip install autogen”?

GPT-4 Advanced Analytics: Dissolving the ‘Burden of Knowledge’

OpenAI’s GPT-4 Advanced Analytics (GPT-4 AA) was released over the summer. It is pair programming within ChatGPT, not multi-agent software development, and most of the world’s software developers have likely already tried it.

I want to underline that GPT-4 AA proved to be an immense help with obscure Python packages, with quirks of syntax, and for exploring otherwise laborious options. Apprehension at wading through unfamiliar packages dissolves with such assistance.

This tool genuinely lifts the ‘Burden of Knowledge’ as described by Ben Jones: “If knowledge accumulates as technology progresses, then successive generations of innovators may face an increasing educational burden”. This sentiment echoes Eroom’s law, the inverse of Moore’s law, when it comes to R&D efforts.

Amidst the praise, GPT-4 AA’s shortcomings are clear:

  • Impulsive: Does not seek guidance when confronted with multiple paths to a solution.
  • Narrow-sighted: As the code grows, GPT-4 only sees the issues expressly discussed; each new problem necessitates background information. GitHub Copilot has the edge because it sees the app’s full code.
  • Constrained: OpenAI’s coding sandbox is neutered for many staple packages, notably blocking tools like Plotly for charting. GPT-4 is clearly not aware of these limits.
  • Oversights: At times, it misses the larger narrative of the data, churning out code that quietly and inadvertently shuffles or misrepresents it.

As such, its suggestions must always be tested. The coding experience is more productive overall, but slips into 20% creation and 80% testing.

Invention is getting harder. Source: https://patentlyo.com/patent/2019/01/inventors-trends-patenting.html

Setting Up the Teams

  • Microsoft AutoGen
    - “a framework that enables development of LLM applications using multiple agents that can converse with each other to solve tasks…seamlessly allowing human participation”
    - Microsoft Research: https://github.com/microsoft/autogen
  • MetaGPT
    - “provides the entire process of a software company along with carefully orchestrated standard operating procedures”
    - Deep Wisdom, Hong Kong: https://github.com/geekan/MetaGPT
  • ChatDev
    - “a virtual software company that operates through various intelligent agents holding different roles”
    - Tsinghua University: https://github.com/OpenBMB/ChatDev

All of the frameworks are easily installed following the instructions on GitHub. Installing and managing Docker for code execution makes things a little more complex, but safer.

All frameworks were provided with an API key for GPT-3.5 and GPT-4. AutoGen permits multiple API keys, which helps avoid hitting OpenAI’s rate limit (max calls per minute) with GPT-4.
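
For illustration, AutoGen takes its credentials as a list of model configurations; supplying several entries lets it fall back to another key or model when one is rate-limited. This is a sketch of that convention, with placeholder keys; the same structure can also live in an OAI_CONFIG_LIST file:

```python
# Sketch of AutoGen's config list; the API keys below are placeholders.
# With several entries, AutoGen can fall back to the next entry when
# one key or model hits OpenAI's rate limit.
config_list = [
    {"model": "gpt-4", "api_key": "sk-key-one"},
    {"model": "gpt-4", "api_key": "sk-key-two"},
    {"model": "gpt-3.5-turbo", "api_key": "sk-key-one"},
]
```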

All frameworks are configured to employ their default team. This makes execution for ChatDev and MetaGPT especially easy: we simply submit an objective, even just one sentence, and watch the team ‘crack on’, like this:

python3 run.py --task "$(cat prompt.txt)" --name "project" --model "GPT_4"

Where the prompt.txt file is simply our instructions to the team, in English.

AutoGen needs a little more setup: we must explicitly set the prompts for the team members, though the default text copied from the GitHub page works fine.
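
As a rough sketch (using the library’s documented defaults rather than the exact prompts from this experiment, and assuming prompt.txt holds the task), a minimal two-agent AutoGen run looks something like this:

```python
# Minimal two-agent AutoGen setup; prompt.txt is assumed to hold the task.
import autogen

# Load model credentials from an OAI_CONFIG_LIST file (see earlier sketch)
config_list = autogen.config_list_from_json("OAI_CONFIG_LIST")

# The assistant writes code; the user proxy executes it inside Docker
assistant = autogen.AssistantAgent(
    name="assistant",
    llm_config={"config_list": config_list},
)
user_proxy = autogen.UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",  # fully autonomous, no human in the loop
    code_execution_config={"work_dir": "project", "use_docker": True},
)

# Kick off the conversation with our task description
with open("prompt.txt") as f:
    user_proxy.initiate_chat(assistant, message=f.read())
```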

We then enter the bizarre phase of watching the team converse with… erm… itself… in English.

AutoGen uses GPT-4 to take on the personas of multiple team members… hence it converses with itself

Results

Which Framework Wins?

First Prize to…

  1. ChatDev
  • ChatDev excels in its straightforwardness and produces simple, lucid code that works, because it has been executed and refined.
  • What’s more, its cost-effectiveness is undeniable, pairing seamlessly with GPT-3.5 Turbo.
    - Code sample: https://github.com/olimoz/AI_Teams_ChatDev

2. AutoGen

  • AutoGen’s configurability and flexibility make it look like the team framework of the future.
  • However, it demands a special investment in GPT-4-32k; otherwise it often exhausts GPT-4’s 8k context window before task completion. Worse, it rarely delivers anything when using GPT-3.5.
    - Code sample: https://github.com/olimoz/AI_Teams_AutoGen

3. MetaGPT

  • MetaGPT mirrors ChatDev’s functionality but lacks code execution, needing extra developer hours to rectify basic oversights.
  • Yet, it is MetaGPT that serves up the best object-oriented code, setting a benchmark for others.
    - Code sample: https://github.com/olimoz/AI_Teams_MetaGPT

Baseline: GPT-4 Advanced Analytics

Chat Comparison: Chat with GPT4 & Claude2

  • The prompt for the multi-step challenge can also be manually issued to GPT-4 and Claude 2 in chat mode.
  • They cannot test the code, and they require three subsequent prompts to apply an object-oriented approach and error handling, but they do this quickly and with less installation fuss than the team frameworks.
    - Code sample: https://github.com/olimoz/AI_Teams

These team frameworks are all impressive, and they are research projects, not paid products. However, the context window issue does constrain their capabilities, for the moment. Right now, they do not outshine simply chatting with GPT-4 or Claude.

If that context window can open up, then the advantages of team frameworks can shine. Given that Claude 2 already codes well and has a 100k-token context window, we may simply be waiting for its API to open this field up.

How it should be: focussed and organised, in the style of the Apollo missions. DALL·E 3

What We Want:

It’s early days, so ChatDev may look like the winner right now, but things are moving fast, so let’s get our order in:

  • An LLM with
    - Qualified coding abilities in more languages than Python and Java
    - The context window of Claude 2 (100k tokens)
    - Open source, hosted on our own equipment for security and with no rate limits
  • A Team Framework with
    - Easy configurability of AutoGen
    - Code Execution & Testing of AutoGen and ChatDev
    - GitHub Integration of ChatDev
    - The coding plans and object orientation of MetaGPT
    - AgentStore of MetaGPT

How chatting for your code often feels: like herding cats, in the style of the Apollo missions. DALL·E 3

An Arms Race in the Offing

Promoting yourself to the management of an LLM team is not the end of your problems. Directing a business involves, at minimum, two deceptively straightforward steps: setting the objective and guiding staff to achieve it. Objectives are iterative; there are false peaks. Meanwhile, team members may not grasp your bigger picture; there’s an element of herding cats.

A plethora of development methodologies and team configurations are at our disposal, traditionally tailored to human traits like career aspirations and personal ambitions.

While these frameworks can manage LLMs, one ponders what peak efficiency looks like when self-awareness is absent in machine collaborators. Given the relative affordability of LLMs, should we permit them to autonomously discover the most effective teamwork strategies, however unconventional they may appear?

Who do we become if managing a team demands no emotional acumen and grants no camaraderie? We may miss chores such as approving holidays, soothing promotion fears and being on the receiving end of friendly teasing.

Inevitably, you won’t be the only self-appointed AI team manager; you’ll face competition. One response is to anticipate it, employing multiple teams to compete against each other. Such internal competition necessitates accelerating investment and team training; those who can achieve economies of scale may prevail.

Stable Diffusion reimagines ‘A Philosopher Lecturing on the Orrery’ by Enlightenment painter Joseph Wright, but for the AI revolution.

Next Time

Part 3
How professional a response can we get from AutoGen when the context window is raised to 128k tokens? It’s impressive…

Part 4
Looking at the results of the analysis, what are people actually building with AI? What patterns do we find in those 8,000 applications?

This was Part 2. Also see Part 1.
