What are GPT Agents? A deep dive into the AI interface of the future
Learn why Agents are a core part of the future of AI
Last week, I tweeted asking who was building cool products around the idea of GPT agents and was totally blown away by the response. I was also surprised that many people seemed not to grasp what agents are or why they will end up being so popular.
This post is designed to take you from “I have no idea what an autonomous GPT agent is, please stop using made-up words” to being informed enough to have a well-reasoned discussion about the topic with a friend or online.
We will cover things like what agents are, where the space is going, and tons of awesome examples of this early technology in action. As always, this post reflects my personal views and thoughts on the topic, which is why you are reading it on my personal blog. Let’s dive in!
What are GPT Agents (the basics)? 😎
Let’s get a few terms defined and then we will dive into more detail. No one likes doing vocabulary homework, but setting the stage with the right terms is critical given the weird words used in AI today.
GPT = generative pre-trained transformer. This is the core ML model architecture that powers large language models (LLMs) like ChatGPT; it is a very technical term that has made its way to the masses via ChatGPT.
Next, let’s look at what an agent is:
Agent = a large language model set up to run iteratively with some goals / tasks defined. This is different from how large language models (LLMs) are “normally” used in tools like ChatGPT, where you ask a question and get a single response as the answer. Agents have complex workflows where the model essentially talks to itself without a human forcing every part of the interaction.
The above definition is helpful for understanding that ChatGPT and agents are related but provide very different user experiences. ChatGPT takes input for a single query and returns output; it cannot do more than a single task at a time. This changed slightly with the launch of plugins in ChatGPT, where the model can make use of external tools to do up to 10 requests per step. One could argue this is the first manifestation of the “agents” idea inside of ChatGPT, given that the model is deciding what to do and whether to send additional requests.
For those who may not have tried plugins, the basic idea is that you can tell ChatGPT how the API for some external tool works, and it can then construct and send a request to that API based on the user query. So if you have a weather plugin and the user asks “what is the temperature in NYC”, the model will know it can’t answer that on its own and will look at the plugins the user has installed. Let’s say it sends the request and the API returns an error message that says “NYC is not a valid location, please use verbose city names and not abbreviations”. The model can actually read that error and send a new request to fix it. This is the simplest example of agents working in a production workflow today.
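To make that call-then-retry loop concrete, here is a minimal sketch of what it can look like in code. Everything here is illustrative: get_temperature and its error message are made-up stand-ins for a real plugin API, and the prompts are heavily simplified, but the shape of the interaction (call the tool, read the error, try again) is the point.

import openai

def get_temperature(city_name):
    # Hypothetical weather "plugin" API: it rejects abbreviations like "NYC"
    if len(city_name) <= 3:
        return {"error": "Not a valid location, please use verbose city names and not abbreviations"}
    return {"temperature_f": 72}  # canned response for this sketch

messages = [
    {"role": "system", "content": "You answer weather questions. Reply with ONLY the city name to look up."},
    {"role": "user", "content": "What is the temperature in NYC?"},
]

for attempt in range(3):  # give the model a few chances to correct itself
    reply = openai.ChatCompletion.create(model="gpt-4", messages=messages)
    city = reply["choices"][0]["message"]["content"].strip()
    result = get_temperature(city)
    if "error" not in result:
        print(f"{city}: {result['temperature_f']}F")
        break
    # Feed the API error back to the model so it can fix its own request
    messages.append({"role": "assistant", "content": city})
    messages.append({"role": "user", "content": f"The weather API returned an error: {result['error']}. Try again."})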
One last detail: I refer to agents as GPT agents simply because “agents” is a very common word and the context is often not clear. “GPT agents” reinforces the idea that this is somewhat related to ChatGPT and AI, so you should be looking at it from a different angle. But you might hear people say agents, autonomous agents, or GPT agents, which all refer to the same thing.
Quick interruption: my brother Chandler is working on a project where he creates custom hard cover AI art coffee table books for people based on the theme they want, it is so fricken cool! Check it out to support him:
Beyond the basics
Some of the projects that popularized GPT agents, like AutoGPT and BabyAGI, are among the most popular open source projects ever created. The idea of agents has truly captured the imagination of developers, and people are scrambling to create tools and companies around the idea.
As a quick note, if you are a developer and want to build agent experiences, Langchain has a great library and set of tools that help developers do this without having to build everything from the ground up:
Before we look at a detailed diagram of how systems like babyAGI work, it is worth trying to simplify the idea. If you had to boil down what agents are into a single sentence, one option might be: “the ability to give large language models objectives and the capacity to let the model prompt itself in a loop”. That is really all that is happening. Instead of an interaction being linear, it can be parallel (multiple prompts going at the same time trying to solve the same goal) and single player (no human required in the conversation).
Quick ask: first, thanks for reading this far. My brother (UPenn grad, Harvard student, former NASA contractor) is starting to write about AI, it would mean a ton to me if you would consider following him: https://medium.com/@crskilpatrick807
The content he is working on is truly super exciting, stay tuned, and thank you.
Here is the way babyAGI works. Please take a second to let this sink in; I know the diagram can be a bit off-putting, but it will make more sense as we walk through it:
The process is broken into 3 main steps after you create a goal / main task for the agent (a simplified code sketch of this loop follows the list):
- Get the first incomplete task and have the model execute it
- Enrich the result and store in a vector database (no worries if you don’t know what this means)
- Create new tasks and reprioritize the task list
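Stripped of the details, the control flow is just a loop over a task list. Here is a minimal, simplified sketch of that loop under some assumptions: it uses the OpenAI API directly, the prompts are placeholders rather than babyAGI’s actual prompts, and a plain Python list stands in for the vector database.

import openai
from collections import deque

def ask(system, prompt):
    # Single chat completion helper used by every step of the loop
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ],
    )
    return response["choices"][0]["message"]["content"]

objective = "Create a 1500 word blog post on ChatGPT and what it can do"
task_list = deque(["Make an outline for the blog post"])
completed = []  # stands in for the vector database in this simplified sketch

for _ in range(5):  # hard cap on iterations so the agent cannot run forever
    if not task_list:
        break
    # 1. get the first incomplete task and have the model execute it
    task = task_list.popleft()
    result = ask("You are an execution agent.", f"Objective: {objective}\nTask: {task}")
    # 2. store the result (babyAGI enriches it and writes it to a vector database)
    completed.append((task, result))
    # 3. create new tasks and reprioritize the outstanding list
    new_tasks = ask(
        "You are a task creation agent. Return one task per line.",
        f"Objective: {objective}\nLast task: {task}\nResult: {result}\n"
        f"Outstanding tasks: {list(task_list)}\nCreate only the new tasks that are still needed.",
    )
    task_list.extend(line.strip("- ").strip() for line in new_tasks.splitlines() if line.strip())
    reordered = ask(
        "You are a prioritization agent. Return one task per line, most important first.",
        f"Objective: {objective}\nTasks: {list(task_list)}",
    )
    task_list = deque(line.strip("- ").strip() for line in reordered.splitlines() if line.strip())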
Let us take a concrete example and work through it together. We can start with the task “create a 1500 word blog post on ChatGPT and what it can do”. As the user controlling the agent, you write that out, give as much detail as you want around the requirements, and then you are done.
The model takes those requirements, and does something like the following:
import openai

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a world class assistant designed to help people accomplish tasks"},
        {"role": "user", "content": "Create a 1500 word blog post on ChatGPT and what it can do"},
        {"role": "user", "content": "Take the user's request above and break it down into simple sub-tasks which can be easily done."}
    ]
)

# The text of the reply is the model's list of sub-tasks
sub_tasks = response["choices"][0]["message"]["content"]
In this example, we are using the OpenAI API to power the agent. The system message allows you to define your agent to a certain extent; in this example we don’t really do much with it. Then we add the user query, followed by the critical next step: an instruction on top of it asking the model to break the query up into sub-tasks.
You could then loop over the sub-tasks and make additional calls to the model to perform them, each with a different system message (think different agents: maybe a writing agent, a research agent, etc.), as sketched below. You would want an instruction like “break this task down into simpler sub-tasks until you are 100% clear what needs to be done and can perform the task with high precision” so the model does not go into an infinite loop of adding more tasks (a common issue with agents today if you don’t get the prompt engineering right).
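Here is a rough sketch of that delegation step, assuming the sub-tasks from the previous call have already been parsed into a Python list. The role names, system messages, and the naive keyword routing are all made up for illustration; they are not the prompts any particular project uses.

import openai

# Suppose the previous call returned these sub-tasks (already parsed into a list)
sub_task_list = [
    "Research what ChatGPT is and collect key facts",
    "Write a 1500 word draft based on the research notes",
]

# Hypothetical specialized "agents": each one is just a different system message
agent_roles = {
    "research": "You are a research agent. Gather concise, factual notes for the task you are given.",
    "writing": "You are a writing agent. Produce polished prose for the task you are given.",
}

results = []
for task in sub_task_list:
    # Naive routing: writing tasks go to the writing agent, everything else to research
    role = "writing" if "write" in task.lower() else "research"
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": agent_roles[role]},
            {"role": "user", "content": task + "\n\nIf anything is unclear, break it into simpler "
                                               "sub-tasks first, but do not keep adding tasks indefinitely."},
        ],
    )
    results.append(response["choices"][0]["message"]["content"])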
As an aside, you might be saying to yourself: this is going to take a lot of OpenAI API requests to make happen. You are correct, agent workflows consume a lot of usage, so be careful when playing around. With today’s limits, you probably could not run agents inside ChatGPT given the message cap, even with the recent increase to 50 messages per 3 hours (see more details in the below post):
To recap where we are: we looked at the first steps in building an agent, taking the initial task, breaking it into sub-tasks, and having the model execute the tasks in the list. A few parts of the babyAGI flow are worth calling out. The “enrich results” step can be as simple as asking the model to make a task more specific and detailed, a form of auto-prompt-engineering. The results are then stored in a vector database, which is useful for keeping track of all the steps the model has done for you throughout the process. It can be helpful to see the “work” the model did to reach some end state based on your initial goal, so you have some intuition as to how it got there.
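To make the vector database idea a bit more concrete, here is a rough sketch that embeds each stored result and does a simple similarity lookup in memory. A real setup would use an actual vector database rather than a Python list, and the helper names (store_result, recall) are made up for this example; the point is just that results get embedded so the agent can later find related past work.

import openai
import numpy as np

def embed(text):
    # text-embedding-ada-002 returns a 1536-dimensional vector
    response = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return np.array(response["data"][0]["embedding"])

memory = []  # list of (task, result, embedding) tuples standing in for a vector database

def store_result(task, result):
    memory.append((task, result, embed(f"{task}\n{result}")))

def recall(query, top_k=3):
    # Cosine similarity of the query against everything the agent has done so far
    q = embed(query)
    scored = [(np.dot(q, e) / (np.linalg.norm(q) * np.linalg.norm(e)), task, result)
              for task, result, e in memory]
    return sorted(scored, reverse=True)[:top_k]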
The last interesting thing about babyAGI’s workflow is the idea of prioritizing the list, something we all do as humans, consciously or subconsciously, in order to do a task well. The model will by default just do things in the order it is asked, so having that step ensures the relevant tasks get completed in a sequence that is conducive to actually finishing the overall goal.
Agents in action 👀
We have talked a lot about the high level and low level of agents so far. But this all becomes much more exciting as soon as you see some of these agents in action. Before we dive into a bunch of examples, check out this infographic I made with some of the companies and projects being built in this space (sorry the image is so long, there’s so much being built):
Foundation Agents are what I consider to be general purpose and designed to break any task into something that works well for the agent workflow. These would be projects like babyAGI and AutoGPT. Historically, AutoGPT was the most commonly used project but they recently took down their web app and now you have to do things locally.
To see an agent in action, let’s use this great Hugging Face space which is an environment where code runs online:
Be aware that you should be VERY cautious about pasting an API key into an external website. It is worth creating a new one for the experiment and then deleting it right after so it does not leak.
Let’s start with the goal of helping me learn how to code:
You can see the first step for babyAGI is to make a task list based on my goal; it breaks “Teach me to code in Python” up into the following tasks:
- Install Python and familiarize yourself with the Python interpreter
- Learn the basic Python syntax and data types
- Understand control flow and decision making
- Learn about functions and classes
- Learn about modules and packages
… etc.
The next step is that the model writes some text to help me learn the first item. If you try this yourself, you will likely see the results are somewhat weird. For example, babyAGI ignores the first step and does a hello world program instead. I also think the UI layer in the space may be abstracting away some of the stuff that is happening. I suggest playing around here to get a feel for what else is possible. Running your first agent today is a great way to be on the cutting edge of this technology.
The future of agents 🔮
The idea of agents is not going anywhere; these are the first entities powered by general purpose AI that can solve tasks. Over time, they will get more and more sophisticated, powered by more capable models and tools. For example, you can imagine a customer service agent that can take someone’s problem and iteratively break it down, solve it, and validate the answer. A few things are required to get there:
- Much more powerful models; GPT-4 works great, but the use cases are still limited today
- Better tooling; the space we looked at is a great example of something super simple and useful, but it is lacking for true production use cases
- Different architectures; as models evolve, breaking the goal into sub-tasks may no longer be the right design decision. There are other approaches, like starting from the end state and working backwards, which could be just as effective.
On the tooling side of things, organizations like LangChain are launching products like LangSmith to help developers take these workflows into production:
The reality is that entire new frameworks will be born to enable this next generation of agents. It is wild to think it all really started with plugins and AutoGPT. I am deeply excited for the future and the ability to leverage world class agents to help me do the work I care about.
If you have questions about agents that were not addressed, please drop them in the comments and I will add a section at the bottom addressing them!