Part 3: Peek Into the Guts of AI

Anatomy of an AI Multi-Agent

How do we build a useful AI agent?

Anton Antich
Superstring Theory

--

In the previous parts of our AI Multi-Agent series, we looked at why ChatGPT is not AI, and how to build a simple “Poor Man’s RAG” agent that uses context before interacting with you. Now it’s time to cut into the guts of a fairly general AI Agent design built upon more than 50 years of AI research (by the way, there is no better introduction to the subject than the classic “Artificial Intelligence: A Modern Approach”; I recommend any serious AI designer read it cover to cover).

In this part, split into three small subchapters, we will not only look at the anatomy of a good, useful AI Multi-Agent, but also, in subchapter 3, discuss which components you can use for each of its parts, making things quite practical.

1. Humans vs Agents

Since we are trying to design something intelligent, we have no better example to model this something after than a human being. Here is a scheme of a typical human:

Typical Human Being (Schematics)

We have eyes (and other senses) to, uhm, sense the outside world. We have a mouth to communicate intelligently with each other. We have memory and (some) knowledge in our brain, and we also have some sort of reasoning engine, which makes decisions for us based on the processed senses and our knowledge. Last but not least, we have hands (and other body parts) to enact some sort of change in the external world.

But that’s exactly what we want in the AI agent as well! Can ChatGPT do all of the above? Of course not: it can only pretend to “talk” to you. However, it can serve as a key part of our multi-agent, modeled after the “typical human” described above. Here it is:

Generic AI Multi-Agent

This is a very general scheme, but it captures the high-level design of pretty much any AI agent we can imagine — self-driving car, autonomous robot, agent on the web, etc.

2. Agent Anatomy

Just as humans have a mouth, the agent needs either a chat or a voice interface to interact with humans (not so much with other AI agents: agent-to-agent communication can be done much more efficiently without natural language, let alone the slower and more error-prone voice).

It has to have sensors to understand the environment it operates in — these may be cameras and lidars for self-driving cars or a combination of LLMs (large language models) with image-to-text models for web-based agents when they need to “understand” the websites they surf, etc.

It has to have a “processing module” that combines both the sensor and human input to “understand” what needs to be done.

It has to have a memory / knowledge base to consult depending on the inbound context: what people have started calling by yet another hype term, “RAG” (retrieval-augmented generation), which unfortunately narrows and oversimplifies this extremely important function.

It has to have a “brain” that plans the solution, critiques it, refines it, and formulates the final execution plan. In the scheme it’s just one box, but in reality this part is a pretty complex multi-agent itself, since we are trying to mimic a human brain with something much less complicated.

Finally, it has to have “hands” — an ability to interact with external software and systems to act on behalf of the human in the external world.
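
To make this anatomy a bit more concrete, here is a minimal sketch (in Python, with every interface and name invented for illustration rather than taken from any real framework) of how these parts might fit together in code:

    from dataclasses import dataclass, field
    from typing import Protocol

    @dataclass
    class TaskRequest:
        """A clear statement of what the agent should do, plus supporting context."""
        goal: str
        context: list[str] = field(default_factory=list)

    class Sensor(Protocol):
        """The agent's 'eyes': read a website, an API response, etc."""
        def observe(self, target: str) -> str: ...

    class KnowledgeBase(Protocol):
        """Memory / RAG: fetch knowledge relevant to a query."""
        def retrieve(self, query: str, k: int = 3) -> list[str]: ...

    class Planner(Protocol):
        """The 'brain': turn a task into an ordered list of actions."""
        def plan(self, task: TaskRequest) -> list[str]: ...

    class Hand(Protocol):
        """The 'hands': perform one action in the outside world."""
        def execute(self, action: str) -> str: ...

    def agent_step(user_input: str, target: str, sensor: Sensor,
                   kb: KnowledgeBase, planner: Planner, hand: Hand) -> list[str]:
        """One pass through the loop: sense, retrieve, plan, act."""
        observation = sensor.observe(target)
        knowledge = kb.retrieve(user_input)
        task = TaskRequest(goal=user_input, context=[observation, *knowledge])
        actions = planner.plan(task)
        return [hand.execute(a) for a in actions]

Each box in the scheme becomes one small interface, and the loop simply wires them together: sense, retrieve, plan, act.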

These AI multi-agents and their design have been studied for ages, e.g. in the book I recommended at the beginning of this article. The invention of LLMs and other generative AI models makes it possible to build upon this research, implement it with a new level of technology, and finally start building AI agents that are useful in a general sense rather than only for extremely narrow, specialized tasks.

3. Specific AI Agent “Anatomy Parts” Design

Let us move from our lousy human analogy to discussing what parts we can use today to build a versatile AI Multi-Agent like the one in the scheme. Let’s limit ourselves to an agent that operates on the internet rather than in the “real world”, since the latter task is quite a bit more complex.

Human and AI together

Sensors. To sense the environment our agent operates in, it needs to be able to:

  • Read the websites, preferably the way humans do (as the websites have been built for humans, not for robots)
  • Discover and be able to use various APIs available on the internet and designed for computers

To build such sensors, we will need a combination of LLMs (large language models) and models that can understand images (or convert them to textual descriptions) plus a little bit of software code that can crawl and download data at different URLs. Then with the right prompt engineering, our LLM-based sensors will convert what they “read” into formats suitable for further processing.
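
As a rough illustration (not any particular product’s implementation), here is what such a sensor could look like, assuming a generic llm callable that wraps whatever language model you use; the function names and the prompt are made up for this sketch:

    import urllib.request
    from typing import Callable

    def make_web_sensor(llm: Callable[[str], str]) -> Callable[[str], str]:
        """Build a crude 'eye': download a page and ask the LLM to turn it
        into a structured, machine-friendly description."""
        def observe(url: str) -> str:
            with urllib.request.urlopen(url, timeout=10) as resp:  # fetch raw HTML
                html = resp.read().decode("utf-8", errors="ignore")
            prompt = (
                "Describe this web page for a software agent: list its purpose, "
                "main sections, forms, and clickable actions.\n\n" + html[:20000]
            )
            return llm(prompt)  # the LLM converts raw HTML into a description
        return observe

A real sensor would also render the page and pass a screenshot to an image-to-text model, as mentioned above; this sketch only covers the raw-HTML path.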

Human Interaction. This is the part everyone is familiar with by now thanks to ChatGPT: you can type what you want, or say it out loud, and it will be processed by another LLM-based module for further reasoning.

Process Input. This module, also LLM-based, takes whatever the Sensors give it together with the current request from the Human and tries to formulate a clear Task Request for our Plan Solution module (arguably the main part of the “brain”). It is also absolutely critical for this module to consult the Knowledge Base via RAG and make the retrieved knowledge part of the context when formulating the Task Request.
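
A minimal sketch of this step might look as follows, assuming hypothetical retrieve and llm callables for the Knowledge Base lookup and the language model:

    from typing import Callable

    def process_input(user_request: str, sensor_output: str,
                      retrieve: Callable[[str], list[str]],
                      llm: Callable[[str], str]) -> str:
        """Turn raw inputs into a clear Task Request for the planner."""
        knowledge = "\n".join(retrieve(user_request))  # RAG: pull relevant facts
        prompt = (
            "You prepare tasks for a planning module.\n"
            f"User request: {user_request}\n"
            f"Current observations: {sensor_output}\n"
            f"Relevant knowledge: {knowledge}\n"
            "Restate this as one clear, self-contained task description."
        )
        return llm(prompt)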

Knowledge Base / RAG. This module stores the data and knowledge that may be relevant to our agent’s operations. This can be all kinds of publicly available data accessed via “regular” internet search, as well as so-called Vector Databases, which represent unstructured text via numerical vector embeddings. This provides a much faster search “by meaning” as opposed to simply “by keywords” and is a crucial part of our Agent.
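
To illustrate what search “by meaning” looks like in practice, here is a toy version that ranks documents by cosine similarity between embedding vectors; a real agent would use a dedicated vector database and a proper embedding model, and the embed callable here is just a stand-in:

    import math
    from typing import Callable, Sequence

    def cosine(a: Sequence[float], b: Sequence[float]) -> float:
        """Cosine similarity: how closely two embedding vectors point the same way."""
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    def semantic_search(query: str, documents: list[str],
                        embed: Callable[[str], list[float]], k: int = 3) -> list[str]:
        """Return the k documents whose meaning is closest to the query."""
        q = embed(query)
        ranked = sorted(documents, key=lambda d: cosine(embed(d), q), reverse=True)
        return ranked[:k]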

Plan Solution. This is normally a bunch of different LLMs working together. It has to take the Task Request, analyze what resources (and “hands”) are available to the agent, iteratively plan the execution using various critique and step-by-step planning approaches, design sub-agents that are missing but needed for the task, and finally orchestrate execution using the “hands” or sub-agents available to the agent. This is an extremely interesting and fast-developing area of AI research, and we at Integrail focus a lot of resources on building an “AI Brain” that uses “self-learning” and automatic “sub-agent development”. In some other, more specialized agents (e.g., in games), approaches such as Reinforcement Learning are quite useful as well.
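
As a hedged sketch of the iterative plan / critique idea (not how any specific “AI Brain” actually works, and with purely illustrative prompts), the loop can be as simple as:

    from typing import Callable

    def plan_with_critique(task: str, tools: list[str],
                           llm: Callable[[str], str], rounds: int = 2) -> str:
        """Draft a step-by-step plan, then let a 'critic' prompt refine it a few times."""
        plan = llm(f"Task: {task}\nAvailable tools: {', '.join(tools)}\n"
                   "Write a numbered step-by-step plan using only these tools.")
        for _ in range(rounds):
            critique = llm(f"Critique this plan for the task '{task}':\n{plan}\n"
                           "List concrete problems, or reply OK if there are none.")
            if critique.strip().upper() == "OK":
                break  # the critic is satisfied, stop refining
            plan = llm(f"Task: {task}\nPlan:\n{plan}\nCritique:\n{critique}\n"
                       "Rewrite the plan fixing every problem.")
        return plan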

“Hands” or Execute Actions. All of the above would be completely useless if our agent didn’t have “hands” to do something that a human asked of it. These hands are pieces of code that can call external APIs, press buttons on web pages, or in some other way interact with existing software infrastructure.
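
For example, a minimal “hands” layer can be a registry of named tool functions that the planner is allowed to invoke; the two tools below are placeholders rather than real integrations:

    from typing import Callable

    def build_hands() -> dict[str, Callable[..., str]]:
        """Register the 'hands': named pieces of code the planner may call."""
        def call_api(url: str) -> str:
            return f"GET {url}"           # stand-in for an actual HTTP call

        def click_button(selector: str) -> str:
            return f"clicked {selector}"  # stand-in for browser automation

        return {"call_api": call_api, "click_button": click_button}

    def execute(action: str, args: list[str],
                hands: dict[str, Callable[..., str]]) -> str:
        """Dispatch one planned action to the matching 'hand'."""
        if action not in hands:
            return f"unknown action: {action}"
        return hands[action](*args)

The important design choice is that the planner never touches the outside world directly: it can only name an action, and the dispatcher decides whether a matching “hand” exists.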

All of the above can be designed and built today, and the times could not be more exciting for it. We at Integrail are building not just such agents but also a platform to design and build them very easily, without any programming knowledge. If you want to participate or follow us on this journey, do subscribe to this blog or connect with me on LinkedIn; we are constantly sharing our progress and giving away GenAI Token Credits to our friends :)

--
