How Professional Can Agentic AI Teams Get?

Empowered with GPT-4-Turbo’s 128k token limit, we explore the opportunity presented by teams of LLM agents. They achieve more than ChatGPT working alone, so how best to arrange such a team with Microsoft’s AutoGen? (spoiler, not like a human team).

Oliver Morris
13 min read · Nov 27, 2023

Part 3 (Part 1, Part 2, Part 4)

Over the summer of 2023, a number of packages were released for creating software development teams from agents based on LLMs, AutoGen among them.

Back in the old days of October 2023 (last month) we were permitted only ~8000 tokens in GPT-4. A team of agents could achieve little more than ChatGPT working alone before they exhausted the token limit.

Then, on Nov 6th, OpenAI released GPT-4 Turbo with a 128k context window, about 50 pages of text. This allows far greater complexity in the tasks we can tackle with agentic AI teams. AutoGen is now much more powerful and can solve tasks that ChatGPT cannot solve on its own, as we demonstrate later in the article.

This article hands a data science problem to a team of AIs. We’re going to see how professional the automated team’s solution and code can be. But why should you care? What does it mean for LLM agents to be professional?

Agentic Opportunity = LLM + Data + Tools + Environment

Professionalism entails offering value to clients or users within a paid occupation. To seek payment necessitates:

a) proof of quality, equal to or better than that of those paying for the service
b) persistent objectivity: no hallucination, proven in real environments

If AI demonstrated such professionalism, then it could be employed beyond content generation, in consequential endeavours. AutoGen is exciting because it offers a glimmer of this potential. Let’s break it down:

1. LLM + Data

OpenAI has already introduced GPT assistants, whereby we connect our business or personal data to GPT-4 and create an ‘agent’ for chatting with that knowledge. These models can participate in an AutoGen team via the GPTAssistantAgent class. We can grant them ‘functions’ for executing tasks on our behalf, usually after human approval. This makes them ‘agents’ in the traditional sense. OpenAI plans an agent store, as does MetaGPT. Many businesses are adapting LLMs with their own data to create agents. The most generic is Microsoft’s Copilot for Office 365, learning from users’ OneDrive and SharePoint data.
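For example, here is a minimal sketch of wiring an OpenAI assistant into an AutoGen conversation. The assistant name, instructions and model entry are illustrative, and the GPTAssistantAgent signature may differ between AutoGen releases:

# Sketch only: assumes the contrib GPTAssistantAgent shipped with pyautogen;
# the name, instructions and config entries below are illustrative.
from autogen import UserProxyAgent
from autogen.agentchat.contrib.gpt_assistant_agent import GPTAssistantAgent

llm_config = {"config_list": [{"model": "gpt-4-1106-preview", "api_key": "sk-..."}]}

# An OpenAI assistant, already connected to our business data, joins the team as an agent
analyst = GPTAssistantAgent(
    name="market_analyst",
    instructions="You answer questions about our AI product catalogue.",
    llm_config=llm_config,
)

# A proxy for the human, who reviews whatever the assistant proposes
admin = UserProxyAgent(name="admin", human_input_mode="ALWAYS", code_execution_config=False)
admin.initiate_chat(analyst, message="Which product categories look most crowded?")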

2. Tools + Environment

The Executor is an equally important type of team member, much less discussed yet presenting just as much opportunity. The agent, or team of agents, requires an environment in which to act, to trial its tools and implement final decisions. The Executor is a wrapper around that environment.

For example, in this experiment the Executor is a Python environment, but it could be an accounting system, a Jupyter notebook, a game such as chess or even Minecraft. The team wields tools in that environment, or makes its own, as in software development.

The Executor embodies SARSA (State, Action, Reward, State, Action) logic, using rewards (or errors) to advise the team how best to move from one state to another. In our experiment the Executor advises the team of coding errors; they cannot proceed until those are rectified. Notably, AutoGen includes a ‘Teachable Agent’ which can learn from previous actions in an environment, although it is not used in this experiment.
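In AutoGen the Executor is typically a UserProxyAgent whose only job is to run whatever code the other agents propose and feed back the output or the error. A minimal sketch, assuming the standard pyautogen code-execution configuration (the working directory name is illustrative):

import autogen

# The Executor: a wrapper around a local Python environment.
# It never calls the LLM itself; it extracts code blocks from messages,
# runs them and replies with stdout or the stack trace (the 'reward' signal).
executor = autogen.UserProxyAgent(
    name="Executor",
    human_input_mode="NEVER",      # fully automated, no human approval per run
    llm_config=False,              # not backed by a model
    code_execution_config={
        "work_dir": "coding",      # where scripts and outputs land
        "use_docker": False,       # run directly in the local Python environment
    },
)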

AutoGen is a surprisingly powerful framework, a wrapper that lets LLMs participate in human endeavours in a manner which humans can appreciate. Such teams can automate tasks and add value in novel ways, with assistance from people (note the reversal of roles).

How Best To Structure Teams of LLMs

Two behaviours become apparent when trying to use agentic teams for coding:

1. Agents’ desire to oblige* leads to rushed code

Agile data science is iterative, repeatedly combing over data and algorithms in order to incrementally pinpoint the optimal solution.

One might assume that asking the team to deconstruct the problem into a plan of action would foster a more considered approach. But no. An agent charged with planning will list the steps required, but other agents over-eagerly attempt too many of those steps in one leap.

*In AutoGen, GPT3.5 is so obliging that it can easily enter a ‘gratitude loop’ where team members thank each other rather than progressing with work!

2. LLM agents are not human: each agent has depth across many fields of expertise

Taking inspiration from our human experience, it is tempting to configure a team with many agents, each with a specialised role. But if we use GPT-4, or any other generalised model, then regardless of their assigned roles all the agents remain deeply knowledgeable in many fields. They all know each other’s jobs; ever been to a meeting like that? It soon gets confusing.

Our challenge as the PM of non-human developers is to put these abilities to work constructively.

Small Teams Engaging in Deeply Iterative Work

After much experimentation, the optimal team for this data science project was the following (a configuration sketch appears after the list):

  1. Data Scientist
    - proposes plans and code
  2. Critic
    - critiques proposed code and plans
  3. Executor
    - a local Python environment; it extracts code proposed by the Data Scientist, executes the code and reports the result or errors
  4. Admin
    - a human in the loop to terminate the project when complete
    - we could instruct the team to make frequent use of this ‘human in the loop’, but this challenge intentionally minimises the human role
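Here is a minimal configuration sketch of this four-member team; the system messages are abbreviated stand-ins, and the full prompts are in the notebook linked below:

import autogen

llm_config = {"config_list": [{"model": "gpt-4-1106-preview", "api_key": "sk-..."}]}

data_scientist = autogen.AssistantAgent(
    name="Data_Scientist",
    llm_config=llm_config,
    system_message="Propose plans and Python code in small, testable steps.",
)
critic = autogen.AssistantAgent(
    name="Critic",
    llm_config=llm_config,
    system_message="Critique the proposed plans and code; suggest concrete improvements.",
)
executor = autogen.UserProxyAgent(
    name="Executor",
    human_input_mode="NEVER",
    llm_config=False,
    code_execution_config={"work_dir": "coding", "use_docker": False},
)
admin = autogen.UserProxyAgent(      # the human in the loop, used mainly to terminate
    name="Admin",
    human_input_mode="TERMINATE",
    code_execution_config=False,
)

groupchat = autogen.GroupChat(
    agents=[admin, data_scientist, critic, executor], messages=[], max_round=50
)
manager = autogen.GroupChatManager(groupchat=groupchat, llm_config=llm_config)
admin.initiate_chat(manager, message="KICK OFF NOTES ...")  # the 80-word challenge below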

In our experiment there are two active team members and two shadow members, who take no active part in the coding or planning. One of the shadow members is us, the human in the loop; the other is a key concept, the Executor. First, a little background…

Generalised LLMs Don’t Need Large Teams

Other common team members were initially included, but found to add little value:

  • Software Engineer
  • Planner
  • Business Analyst
  • Agile Project Manager / Scrum Master

We need to relearn our intuitions: this is not a human team. In effect, we simply want GPT-4 to prompt itself as effectively as possible. Every team member is already a capable coder, a well-trained business analyst, an experienced planner.

Two active team members can productively prompt each other with less fuss than four. If we were to attempt improvements in the team structure, I’d focus on enhancing the Critic’s role, perhaps having two critics who confer with each other. They could then propose and select the most effective prompting techniques to put to the Data Scientist.

Employing techniques like ‘Rephrase and Respond’ would be a cheap and simple way to improve the work done here.
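For instance, ‘Rephrase and Respond’ simply appends a standard instruction so that the model restates the task in its own words before answering. A tiny sketch; the suffix wording follows the RaR paper, and how it is spliced into an agent’s prompt is a choice left to the developer:

# 'Rephrase and Respond' (RaR): ask the model to restate the task before solving it.
RAR_SUFFIX = "\n\nRephrase and expand the question, and respond."

def with_rar(task: str) -> str:
    """Wrap a task description with the RaR instruction before sending it to an agent."""
    return task + RAR_SUFFIX

kick_off = "Characterise the market for AI applications using the attached spreadsheet."
print(with_rar(kick_off))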

The Challenge

Verbatim as presented to the teams:

KICK OFF NOTES

The source data is at: ‘./Data/data.xlsx’

This is a list of applications which use AI. The client wants to understand and characterise the market, what sectors are likely to be over served and which are underserved.

The client needs a summary diagram or table which can comfortably fit on one page of A4.

The client is keen that:

(a) no preconceptions are imposed on the data, select algorithms which avoid unevidenced assumptions

(b) algorithm hyperparameters are optimised

Explanation

The data in the challenge is a spreadsheet of 8,000 AI products available in October 2023. Each row has the tool name, the task it addresses, a description of its use case and a measure of its popularity.

The source for the data is theresanaiforthat.com, worth a visit, but a reality check if you’re planning to release an AI tool. They’re being published at the rate of 500 new products per month!

For comparison, the above challenge was first completed manually, with occasional help from ChatGPT, resulting in this notebook

ChatGPT Can’t Handle It Alone

Loading the 8,000 rows of data to ChatGPT’s Advanced Analytics and submitting the above challenge leads to ‘There was an error generating a response’.

However, it does identify the same path through the problem that I used, i.e. using HDBSCAN on embeddings of the text and dimensionality reduction for the sake of plotting. The AutoGen teams do not suggest HDBSCAN; clearly ChatGPT is different from the GPT-4-Turbo API.
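For reference, that approach looks roughly like the sketch below: embed the free text, cluster the embeddings with HDBSCAN, then reduce to two dimensions purely for plotting. The libraries (sentence-transformers, hdbscan, umap-learn), the embedding model and the parameters are my assumptions here, not the team’s output:

import pandas as pd
import hdbscan
import umap
from sentence_transformers import SentenceTransformer

df = pd.read_excel("./Data/data.xlsx")

# Embed the free-text description of each tool's use case
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(df["Use Case"].fillna("").tolist(), show_progress_bar=True)

# Density-based clustering: no preset cluster count, no spherical-cluster assumption
clusterer = hdbscan.HDBSCAN(min_cluster_size=30, metric="euclidean")
df["cluster"] = clusterer.fit_predict(embeddings)  # -1 marks noise points

# Dimensionality reduction purely for the sake of the one-page plot
coords = umap.UMAP(n_components=2, random_state=42).fit_transform(embeddings)
df[["x", "y"]] = coords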

Agile Prompts

As a developer I want to spend my time coding software, but in truth, the detail of the prompts is equally crucial to quality output from LLMs. How do we get the team to be ‘agile’?

Two approaches are used:

  1. Team is presented the challenge for completion in one phase
  • The team is instructed to code iteratively, as for a Jupyter Notebook
    - Jupyter notebooks enforce the pattern of coding small chunks, viewing results and reacting accordingly.
    - They don’t actually have a notebook. It is a simple Python environment
    - Work has begun on an Executor which is a real Jupyter Notebook
  • The Data Scientist and Critic are instructed to work in a loop:
    - Data scientist to propose an approach to the problem or code
    - Critic to offer suggestions for improvements
    - Data Scientist enhances output accordingly
    - Data Scientist rectifies code errors

2. Team operates over a multi-phase project, as inspired by Microsoft TDSP

  • Prompts to team members are the same as above
  • Each phase’s prompt is prefixed with a detailed phase description, as inspired by Microsoft TDSP
    - in AutoGen we execute these phases as a series of ‘groupchats’ (see the sketch after this list)
  • Each phase is shut down formally, with the team compiling a summary of the steps taken (in markdown) and the code (in python)
    - Functions are provided for the team to save files to disk regularly
    - Using files enables easy inspection of outputs and permits execution of the phases whenever convenient
    - each phase takes approx. 10–20 minutes
  • The instructions for the next phase are prefixed with these summaries, briefing the team on the work to date
    - this summary approach saves tokens, hence costs
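A sketch of how the phases can be chained, assuming the agents and llm_config from the earlier sketch; the phase briefs and file names here are illustrative, and each phase runs as a fresh groupchat briefed with the TDSP-style description plus the summaries written by earlier phases:

import os
import autogen

# Illustrative phase briefs, loosely following Microsoft's TDSP stages
phases = {
    "Phase1": "Business understanding and data acquisition ...",
    "Phase2": "Exploration and modelling ...",
    "Phase3": "Evaluation of candidate approaches ...",
    "Phase4": "Final deliverable: one-page summary diagram or table ...",
}

previous_summaries = ""
for name, brief in phases.items():
    groupchat = autogen.GroupChat(
        agents=[admin, data_scientist, critic, executor], messages=[], max_round=50
    )
    manager = autogen.GroupChatManager(groupchat=groupchat, llm_config=llm_config)

    # Prefix this phase's instructions with the summaries saved by earlier phases
    admin.initiate_chat(manager, message=previous_summaries + "\n\n" + brief)

    # The team is asked to save a markdown summary to disk at the end of each phase;
    # reading it back keeps the next briefing short rather than replaying the whole transcript
    summary_path = os.path.join(name, "summary.md")
    if os.path.exists(summary_path):
        with open(summary_path) as f:
            previous_summaries += f"\n\n## {name} summary\n" + f.read()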

The Jupyter notebook to create this team, including the prompts, can be found here:

Team Experiment Jupyter Notebook

Executing the Experiment

The model is GPT-4-Turbo (1106-Preview), with no fallback onto either GPT-3.5 (because it fails) or GPT-4 (expensive, small context window). No fallback means we wait for GPT-4 Turbo to be available, regardless of rate limits and wait times. The greatest fear is hitting the context window limit, because AutoGen does not handle this gracefully: the entire conversation is lost, costing money and time.
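The ‘no fallback’ setup amounts to a one-entry config list. A minimal sketch; the timeout and temperature values are illustrative and the exact key names vary slightly between AutoGen versions:

import os

# A single-model config list: only GPT-4 Turbo, so AutoGen has nothing to fall back onto
config_list = [
    {"model": "gpt-4-1106-preview", "api_key": os.environ["OPENAI_API_KEY"]},
]

llm_config = {
    "config_list": config_list,
    "timeout": 600,     # sit out rate limits rather than switching model
    "temperature": 0,
}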

Price Warning

GPT-4-Turbo is not cheap. Executing the code for all phases in this project will cost around USD 60!

Serious Price Warning

It is tempting to allow the team full access to the previous phases’ conversation. This is done when configuring user_proxy.initiate_chat() by setting clear_history = False, so that the full conversation is prepended to all chat messages. However, this soon consumes the full 128k context window and becomes extravagantly expensive. If you leave the room and allow the team to chat, as I did, then they will happily run up a tab of over USD 100.

Results

Judge for yourself by linking through to the Jupyter notebooks of the team conversations and observing the team develop the code.

An example of the multi phase team’s work can be found at the end of this article.

One Phase Jupyter Notebook
- Incurred GPT-4-Turbo API charges of USD 9

Multi Phase Jupyter Notebook
- Incurred GPT-4-Turbo API charges of USD 60

Conclusion

With 128k tokens the single-phase solution is already impressive, given how little effort we put into phrasing the challenge:

  • The final code executes perfectly, because the team tests as they proceed
  • Their code is easy to read as it includes notes
    - Furthermore, the conversation history documents the design decisions
  • The data science approach taken by the team is valid, even attractive
    - Although both I and ChatGPT’s Advanced Analytics used a different clusterer.
  • The team recommend too few clusters, but this is because they cannot ‘see’ the elbow chart which they correctly use to optimise the cluster count
    - we could fix this with GPT-4 Vision.

Running the ‘One Phase’ project on multiple occasions may be sufficient to give a good selection of data science solutions for many tasks.

The multi-phase solution, by contrast, automatically generated multiple approaches but evaluated them poorly. This might be a prompting issue in the evaluation phase; it may also be a loss of focus on the original business requirement as the project wears on.

The multi-phase team’s final code was the most professional (see below), but its test scripts were poor. I suspect the multi-phase approach is only worth the money (USD 60) if there is substantial data pre-processing, which this project did not have.

Overall, the automated code produced is becoming concerningly professional. Furthermore, we have an audit trail for the code, a written conversation demonstrating how it was derived.

Recommendations

As mentioned before, maximising the opportunity of agentic AI requires environments; furthermore, teams are simply effective prompting techniques for that environment. So:

  1. AutoGen teams need access to a stateful, iterative environment, such as a Jupyter Notebook, not simply a basic Python environment
  2. AutoGen teams need access to scientific, iterative prompting techniques, not simply whatever the developer can muster for the task

Example by GPT-4-Turbo with AutoGen in Phases

This is one of the approaches generated in the multi-phase solution. It is included here as an example of the type of code the team will present for such a challenge. The team has tested the code; it executes without error.

Remember, to get this code we simply entered the 80 words in the above ‘Kick Off Notes’ and submitted a spreadsheet. That was our total contribution.

# Further refined implementation of the AIApplicationsAnalyzer class methods for review.

import os
from typing import Tuple, List, Any, Optional
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
from scipy.sparse import csr_matrix, hstack

###

class TextVectorizer:
    def __init__(self, max_features: Optional[int] = None, ngram_range: Tuple[int, int] = (1, 1), min_df: int = 2, max_df: float = 0.5):
        """
        Constructor for TextVectorizer with parameters for the TF-IDF vectorizer.
        """
        self._vectorizer = TfidfVectorizer(stop_words='english', max_features=max_features, ngram_range=ngram_range, min_df=min_df, max_df=max_df)
        self._feature_names: List[str] = []

    def fit_transform(self, data: pd.Series) -> csr_matrix:
        """
        Fit the vectorizer to the data and transform the text column.
        """
        tfidf_matrix = self._vectorizer.fit_transform(data)
        self._feature_names = self._vectorizer.get_feature_names_out()
        return tfidf_matrix

    def transform(self, data: pd.Series) -> csr_matrix:
        """
        Transform the text column using the fitted vectorizer.
        """
        return self._vectorizer.transform(data)

    def get_feature_names(self) -> List[str]:
        """
        Get the feature names after vectorization.
        """
        return self._feature_names


class ClusterModel:
    def __init__(self, eps: float = 0.5, min_samples: int = 5):
        """
        Constructor for ClusterModel with parameters for the DBSCAN clustering algorithm.
        """
        self._dbscan = DBSCAN(eps=eps, min_samples=min_samples)
        self._labels: List[int] = []

    def fit_predict(self, data: csr_matrix) -> List[int]:
        """
        Fit the clustering algorithm to the data and predict cluster labels.
        """
        self._labels = self._dbscan.fit_predict(data)
        return self._labels

    def core_sample_indices(self) -> List[int]:
        """
        Get the indices of core samples.
        """
        return self._dbscan.core_sample_indices_

    def components(self) -> csr_matrix:
        """
        Get the core samples of the model.
        """
        return self._dbscan.components_


class DataVisualizer:
    def __init__(self, perplexity: int = 30, n_iter: int = 1000):
        """
        Constructor for DataVisualizer with parameters for the t-SNE visualization.
        """
        self._tsne = TSNE(n_components=2, perplexity=perplexity, n_iter=n_iter)

    def visualize(self, data: csr_matrix, labels: List[int], file_path: str, figsize: Tuple[int, int] = (10, 8), title: str = 't-SNE visualization of AI applications clusters') -> None:
        """
        Create and save the t-SNE visualization plot.
        """
        tsne_results = self._tsne.fit_transform(data.toarray())
        plt.figure(figsize=figsize)
        scatter = plt.scatter(tsne_results[:, 0], tsne_results[:, 1], c=labels, cmap='viridis', alpha=0.5)
        plt.title(title)
        plt.xlabel('t-SNE feature 1')
        plt.ylabel('t-SNE feature 2')
        plt.colorbar(scatter)
        plt.savefig(file_path)


class AIApplicationsAnalyzer:

    def __init__(self, data_path: str):
        """
        Constructor for AIApplicationsAnalyzer with the path to the dataset.
        """
        self.data_path = data_path
        self.data: pd.DataFrame = pd.DataFrame()

    def preprocess_data(self) -> pd.DataFrame:
        """
        Load and preprocess the data.
        """
        try:
            self.data = pd.read_parquet(self.data_path)
        except Exception as e:
            print(f"An error occurred while reading the data: {e}")
            raise
        return self.data

    def vectorize_data(self) -> csr_matrix:
        """
        Vectorize the text columns in the data.
        If vectorization fails, an empty sparse matrix is returned as a fallback.
        """
        try:
            task_name_vectorizer = TextVectorizer(max_features=500, ngram_range=(1, 2))
            use_case_vectorizer = TextVectorizer(max_features=500, ngram_range=(1, 2))
            tags_vectorizer = TextVectorizer(max_features=500, ngram_range=(1, 2))

            task_name_tfidf = task_name_vectorizer.fit_transform(self.data['Task Name'])
            use_case_tfidf = use_case_vectorizer.fit_transform(self.data['Use Case'])
            tags_tfidf = tags_vectorizer.fit_transform(self.data['Tags'])

            combined_tfidf_matrix = hstack([task_name_tfidf, use_case_tfidf, tags_tfidf])
            return combined_tfidf_matrix
        except Exception as e:
            print(f"An error occurred during vectorization: {e}")
            # Fallback to an empty sparse matrix if vectorization fails
            return csr_matrix((0, 0))

    def cluster_data(self, combined_data: csr_matrix) -> List[int]:
        """
        Apply clustering to the vectorized data.
        If clustering fails or all points are noise, a default label is assigned as a fallback.
        """
        if combined_data.shape[0] == 0:
            # Handle the case where vectorization failed and returned an empty matrix
            print("Vectorization failed, skipping clustering.")
            return []

        try:
            cluster_model = ClusterModel(eps=0.5, min_samples=5)
            labels = cluster_model.fit_predict(combined_data)
            if len(set(labels)) == 1 and -1 in labels:
                print("Warning: Clustering algorithm identified all points as noise.")
                # Fallback to a default label if clustering fails
                labels = [0] * combined_data.shape[0]
            self.data['Cluster'] = labels
            return labels
        except Exception as e:
            print(f"An error occurred during clustering: {e}")
            # Fallback to a default label if clustering fails
            return [0] * combined_data.shape[0]

    def visualize_clusters(self, combined_data: csr_matrix, labels: List[int], file_path: str) -> None:
        """
        Visualize the clusters.
        If visualization fails, the process is logged and continued without visualization.
        """
        try:
            # Ensure the directory for the file path exists or create it
            os.makedirs(os.path.dirname(file_path), exist_ok=True)

            visualizer = DataVisualizer(perplexity=30, n_iter=1000)
            visualizer.visualize(combined_data, labels, file_path)
        except Exception as e:
            print(f"An error occurred during visualization: {e}")
            # Log the error and continue without visualization

    def save_results(self, output_path: str) -> None:
        """
        Save the processed data with cluster labels.
        """
        try:
            self.data.to_parquet(output_path)
        except Exception as e:
            print(f"An error occurred while saving the data: {e}")
            raise

# Test cases and further implementation details will be provided in the next step.

read_data = 'Phase1/Processed_Data.parquet'
output_visual = 'Phase4/Visualized.jpg'
output_data = 'Output.parquet'

def main():
    analyzer = AIApplicationsAnalyzer(data_path=read_data)
    analyzer.preprocess_data()
    combined_data = analyzer.vectorize_data()
    labels = analyzer.cluster_data(combined_data=combined_data)
    analyzer.visualize_clusters(combined_data=combined_data,
                                labels=labels,
                                file_path=output_visual)
    analyzer.save_results(output_path=output_data)

main()

This was Part 3. Also see Part 1, Part 2, Part 4
