Autonomous Agent Building Blocks and Architecture Ideas
This blog post deals with the question of how to build an autonomous agent, explains which building blocks such an architecture needs, and illustrates the different modules with Python code samples.
It's actually super easy if you use the tools right.
For the illustration I am going to use OpenAI's ChatGPT as the LLM, LangChain and CassIO as the development framework, Python 3 as the development language, and Astra DB (Cassandra) as the vector store.
Let’s start by defining an agent as a software program that does work on behalf of a user. The fact that LLMs can mimic human-like cognition unlocks new possibilities for work that would have been difficult or impossible to implement in the past.
At the simplest level, an LLM-based agent is a program wrapping ChatGPT with a text interface that can perform tasks like summarizing documents.
What we call “agent orchestration” adds a level of sophistication. For example, two specialized agents could cooperate on your code — one on code generation and one on code review. Or you could equip an agent with a tool such as an API with access to Internet search. Or make an agent smarter and more reliable with access to additional context using a technique such as Retrieval Augmented Generation (RAG).
The most powerful agents are what we call "autonomous". They are programs that can work on chained tasks, iterate, or seek goals with limited or even no human intervention. Imagine fraud detection: an agent can adapt its behavior, recognize complex and evolving fraud patterns to significantly reduce false positives, ensure that legitimate transactions are not flagged as fraudulent, and detect and prevent fraud in real time by deciding which actions to execute, saving both time and money.
This document describes patterns for success when it comes to building autonomous agents.
Autonomous agents mimic human thinking, planning and execution
To build autonomous agents, we need to mimic human thinking patterns and proactively plan for task execution. During planning, LLM agents can break down large and complex tasks into smaller, manageable steps. They are also capable of self-reflecting and learning from past actions and mistakes, so as to optimize for future steps and improve final results.
The following graphic shows a simplified composition of an autonomous agent that receives inputs from users or events from applications and then processes them.
The autonomous agent shown above is a complex system of different agents working hand in hand. An observer agent analyzes the received information, adds relevant context to it, and then places it either in its memory or in its task store. Think of a business process doing fraud analysis on credit card transaction events: using a credit card once doesn't mean much, but using it twice within a short time on different continents, hundreds of miles apart, could mean fraud.
The first event might only lead the agent to storing the event in memory. The second event will lead the agent to create a task to analyze the observation for fraud as it has the context of the first event.
The prioritization agent will analyze and prioritize the task and optionally trigger a real-time execution of the execution agent.
The execution agent is simply responsible for doing the work (tasks and steps), in this case analyzing the observations for fraud. It has access to additional context, e.g. the customer's historical transactions and credit card usage behavior via RAG; it has access to tools to call external services such as the Google Maps API to understand travel and distance information for the places where the card was used; and it could even interact with the customer via an app, SMS, or a phone call to assist in the analysis.
How is it different from a process execution engine?
In the old, pre-generative-AI world you could build a process execution engine to address this problem. But you needed to explicitly model and define all the different steps from observation to resolution, define the process execution and transition rules, and deploy software able to handle your workload, the analysis, and the execution. This was typically a hard problem to solve, implementations were complex, and once you were done implementing it you had a fixed model and had to start from the beginning again to update the rules and processes and react to new behavior.
In the new world, using autonomous agents, you share your experience as context: you give the agent access to your knowledge so that it can identify fraud, and you allow it to learn new experiences and patterns. You give the agent access to tools to obtain additional or real-time information and to execute actions, but you do not tell the agent how to use them; the agent decides which one to use. It is still often a hard problem, but solving it and implementing it with agents has become far easier and accessible to everyone, with low entry barriers.
Functional architecture blocks for autonomous agents
The following diagram is a high-level functional architecture for autonomous agents, which includes the following building blocks that we will further discuss in this blog.
Agent and Agent Development Framework
In simple terms, an agent is software that you either buy off the shelf and customize or develop yourself. Developing the software completely from scratch requires you to build an abstraction layer over the low-level foundation model APIs for your different use cases (from chatbot up to orchestration foundation), a scalable execution layer, and all the extension capabilities for existing databases, external APIs, and new upcoming frameworks.
Alternatively, you use an existing orchestration framework that already provides many of the required features to manage and control LLMs, simplifies the development and deployment of LLM-based applications, and improves their performance and reliability.
There are a number of orchestration frameworks available, two of the most popular being LangChain and LlamaIndex. LangChain is a powerful open-source framework designed to help developers build applications powered by language models, particularly large language models (LLMs), and as of today it is the most popular of these frameworks.
It simplifies development by providing standardized interfaces to LLMs, prompt management, and external integrations with vector stores and other tools. Developers can build applications by chaining calls to LLMs and combining them with other tools, enhancing efficiency and usability. The core idea of the library is that different components can be chained together to create more advanced use cases around LLMs.
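To make the chaining idea concrete, here is a minimal sketch using LangChain's classic LLMChain interface; the prompt wording and the topic are made up for illustration, and an OPENAI_API_KEY environment variable is assumed.

```python
# Minimal sketch of the LangChain "chaining" idea: prompt template + LLM as a chain.
# Assumes an OPENAI_API_KEY environment variable is set.
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

llm = OpenAI(temperature=0)
prompt = PromptTemplate(
    input_variables=["topic"],
    template="Summarize the following topic in two sentences: {topic}",
)
chain = LLMChain(llm=llm, prompt=prompt)

# The chain fills the template with the input and sends the result to the LLM.
print(chain.run(topic="vector search in Cassandra"))
```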
Differentiation is not coming from LLMs — Differentiation is coming from the right use of LLMs
While OpenAI revolutionized application development and the evolution of IT with the release of the first LLM agent for the masses (ChatGPT), it turned out that a specific LLM's lifecycle is short because of the rapid evolution of newer, better, or simply more domain-specific models. To take innovation to the next level, the real differentiation will come from domain expertise, insight into customer needs, and the ability to craft superior end-user experiences.
Companies that focus merely on the technology elements risk building solutions lacking utility in the real world where LLMs become commodities. The companies that thrive as leaders in an era of commodity LLMs will be those that understand how to build engaging applications on top of the basic AI building blocks.
LangChain and CassIO for developer productivity
As discussed above, LangChain automates the majority of management tasks and interactions with LLMs. It provides support for memory, vector-based similarity search, advanced prompt templating abstraction, and a wealth of other features. DataStax also developed the open-source library CassIO, which integrates seamlessly with LangChain and adds Cassandra-specific tools that streamline tasks such as conversation memory persistence, LLM response caching, prompt partialing, and data injection from feature stores such as Feast.
At the time of writing, CassIO supports LangChain, with LlamaIndex support on the way. The long-term goal of this project is to support high-scale memory for autonomous AI agents.
Together, LangChain and CassIO form a framework for developing powerful and differentiated applications powered by language models, applications that not only call out to a language model via an API, but also:
- Be data-aware: connect a language model to other sources of data
- Be agentic: allow a language model to interact with its environment
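As a starting point for the samples in the rest of this post, the CassIO connection to Astra DB can be initialized once and then reused by the LangChain components. The following is a hedged sketch; the token, database id, and keyspace are placeholders for your own Astra DB values.

```python
# Hedged sketch: initialize CassIO against Astra DB so that LangChain components
# (vector stores, chat message history) can reuse this connection later on.
import cassio

cassio.init(
    token="AstraCS:...",          # Astra DB application token (placeholder)
    database_id="<your-db-id>",   # Astra DB database id (placeholder)
    keyspace="agent_data",        # assumed keyspace for these examples
)
```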
Tools and Data Integration
Agents are not limited to interacting only with LLMs and can invoke APIs and external services that allow them to exchange information and take action towards completing assigned tasks. This feature is called “Tools”.
Tools are not confined to just language processing and can range from simple things like a calculator up to invoking an API of an external or enterprise backend service.
The following code artifacts are taken from a program based on Python and LangChain/CassIO; they represent an agent that uses OpenAI to complete a task (answering a question) and a tool that accesses the Google Search API to get up-to-date information (remember, OpenAI trained ChatGPT on pre-2023 data, so you will not get answers to questions regarding the year 2023).
Why would we need that tool?
See what happens if we ask ChatGPT a simple question about current information.
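As a hedged sketch (assuming an OPENAI_API_KEY environment variable is set), calling the model directly looks like this; it can only answer from its training data, so up-to-date 2023 figures are out of reach.

```python
# Asking the LLM alone about current (post-training) information.
from langchain.llms import OpenAI

llm = OpenAI(temperature=0)
print(llm("How many people live in Canada as of 2023?"))
# The model answers only from its training data, so it typically returns an
# outdated figure or states that it cannot know 2023 numbers.
```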
Conclusion: Without access to the right data, the LLM is useless.
Solution: using an external tool like the Google Search API helps us get that information.
Step 1: we define a Tool to access the Search API.
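A minimal sketch of this step, assuming GOOGLE_API_KEY and GOOGLE_CSE_ID environment variables are configured for the Google Search API wrapper:

```python
# Step 1: a "Search" tool backed by the Google Search API.
from langchain.agents import Tool
from langchain.utilities import GoogleSearchAPIWrapper

search = GoogleSearchAPIWrapper()

tools = [
    Tool(
        name="Search",
        func=search.run,
        description="useful for when you need to answer questions about current events",
    )
]
```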
Step 2: we define a prompt template that instructs the agent what to do. In our case we define a {tools} placeholder and give the agent access to the tools. The agent will load them and, based on their descriptions, understand what to use them for (e.g. "Search": "useful for when you need to answer questions about current events").
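The exact wording below is illustrative; it follows the usual LangChain ReAct prompt convention, with {tools} and {tool_names} placeholders that will be filled from the tool list defined in Step 1.

```python
# Step 2: a ReAct-style prompt template that exposes the tools to the agent.
from langchain.prompts import PromptTemplate

template = """Answer the following question as well as you can. You have access to the following tools:

{tools}

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Question: {input}
{agent_scratchpad}"""

prompt_template = PromptTemplate.from_template(template)
```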
Step 3: we then use the tools and the prompt_template to define our agent, bringing it all together.
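A hedged sketch of this wiring, reusing the variables from the previous steps and the classic ZeroShotAgent/AgentExecutor classes from LangChain:

```python
# Step 3: combine tools and prompt_template into an executable agent.
from langchain.agents import ZeroShotAgent, AgentExecutor
from langchain.chains import LLMChain
from langchain.llms import OpenAI

# Render the tool list into the prompt, leaving {input} and {agent_scratchpad}
# to be filled in at run time by the agent.
prompt = prompt_template.partial(
    tools="\n".join(f"{t.name}: {t.description}" for t in tools),
    tool_names=", ".join(t.name for t in tools),
)

llm_chain = LLMChain(llm=OpenAI(temperature=0), prompt=prompt)
agent = ZeroShotAgent(llm_chain=llm_chain, allowed_tools=[t.name for t in tools])
agent_executor = AgentExecutor.from_agent_and_tools(agent=agent, tools=tools, verbose=True)
```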
Step 4: now we can use the agent with Google Search access to get answers that reflect current data, such as "How many people live in Canada as of 2023?"
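```python
# Step 4: run the agent; it should decide on its own to call the "Search" tool.
agent_executor.run("How many people live in Canada as of 2023?")
```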
The example shows that the LLM understood that the agent needs to use "Search" to access the current data, executed the action using the "Search" tool, and shared the observation back.
This was just a simple example to make the use of an LLM more effective and to get a correct answer. Now think about an agent that has access to your enterprise’s knowledge.
Power of Tools
Tools empower an agent to do more than just language-based processing; they give the agent the power to access data and systems to get more information or to execute an action, and the autonomy to decide which of these tools to use.
There is already a growing list of public services and APIs that can be used as Tools. You can also access your operational data stores or vector stores to select relevant domain data and expose it to your agent as a tool.
Such a tool has access to a vector store based on Astra DB / Cassandra and all the embeddings stored there. In our example it might hold all of our product documentation, so instead of asking the LLM about a specific product feature or code sample, the agent goes against our own knowledge database and a vector search query delivers the right answer.
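A hedged sketch of such a tool, extending the tools list from Step 1; the "product_docs" table name is an assumption for a table already populated with embedded documentation, and depending on your LangChain/CassIO versions you may need to pass an explicit Cassandra session and keyspace instead of relying on cassio.init().

```python
# A tool backed by an Astra DB / Cassandra vector store holding product docs.
from langchain.agents import Tool
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Cassandra

doc_store = Cassandra(
    embedding=OpenAIEmbeddings(),
    session=None,      # reuse the connection from cassio.init(); version dependent
    keyspace=None,
    table_name="product_docs",   # assumed table with pre-loaded embeddings
)

def product_docs_retriever(query: str) -> str:
    """Return the most similar documentation snippets for the query."""
    docs = doc_store.similarity_search(query, k=4)
    return "\n\n".join(d.page_content for d in docs)

tools.append(
    Tool(
        name="ProductDocs",
        func=product_docs_retriever,
        description="useful for questions about our own products and their features",
    )
)
```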
Memory and Context
Providing Additional Context
To allow the agent to act in and understand your specific domain context (your products, your industry, your enterprise's knowledge), you will not be able to use an LLM off the shelf as-is.
This does not mean that you have to train your own model, but an existing off-the-shelf pre-trained model might need to be fine-tuned on your domain context, or it needs to be given this context using RAG. Very often we see fine-tuning and RAG work well in combination, especially when you have strong data privacy requirements: you do not want to store your company IP or customer PII in the models.
When new context data is frequently added, performance (latency, throughput) needs to be optimized, or the cost of model invocation must be kept low, injecting data via RAG is the preferred way to provide context that was not present in the model's training corpus. RAG pairs a retrieval model over a knowledge bank with the LLM through its input prompt space.
The following code sketch shows how you can access a knowledge database using retrieval chains to do question completion. The data consists of podcast recordings that were translated into embeddings and already loaded into Cassandra/Astra DB. Now we can search within it using similarity search, which comes from the Astra DB vector search feature.
We make the knowledge base search available as a function called podcast_retriever(query) and add it to our agent’s tools.
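A hedged sketch of this retrieval chain and the podcast_retriever tool; the "podcast_embeddings" table name is an assumption for a table already populated in Astra DB/Cassandra, and again the session/keyspace handling depends on your library versions.

```python
# Retrieval chain over the podcast embeddings, exposed as an agent tool.
from langchain.agents import Tool
from langchain.chains import RetrievalQA
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.vectorstores import Cassandra

podcast_store = Cassandra(
    embedding=OpenAIEmbeddings(),
    session=None,      # reuse the connection from cassio.init(); version dependent
    keyspace=None,
    table_name="podcast_embeddings",   # assumed table with pre-loaded embeddings
)

# RetrievalQA combines a vector-search retriever with the LLM for completion.
podcast_qa = RetrievalQA.from_chain_type(
    llm=OpenAI(temperature=0),
    retriever=podcast_store.as_retriever(search_kwargs={"k": 4}),
)

def podcast_retriever(query: str) -> str:
    """Answer a question from the podcast knowledge base via vector search."""
    return podcast_qa.run(query)

tools.append(
    Tool(
        name="Podcasts",
        func=podcast_retriever,
        description="useful for questions that can be answered from our podcast recordings",
    )
)
```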
Using these tools our agent will now retrieve the information from the knowledge database instead of doing a search on Google.
If needed, e.g. when there is no data found in the knowledge database, the agent still can access the search tool.
Your data, both for fine-tuning and for RAG, comes from your domain expert systems: CRM, ERP, knowledge management tools, mail conversations, speech recordings of service calls, and many more. It could be stored in databases in formats such as video, image, and text. The data is loaded into your vector store via your data integration tools or by leveraging standards-based APIs and data gateways.
Giving your Agents Memory
Agents per se are stateless and therefore need a storage layer for their short-term and long-term memory. Remember our code example agent? Without memory it would not remember what it did before, so if you ask the same question it will always start from scratch and process the whole pipeline end to end. Wouldn't it be good to have a memory? We can add that to our agent:
Step 1: we add the use of history to our prompt template:
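A hedged sketch, extending the template from the earlier steps with a {chat_history} placeholder; the wording remains illustrative.

```python
# Memory step 1: a prompt template that includes the conversation history.
from langchain.prompts import PromptTemplate

template_with_history = """Answer the following question as well as you can,
taking the previous conversation into account. You have access to the following tools:

{tools}

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Previous conversation:
{chat_history}

Question: {input}
{agent_scratchpad}"""

prompt_template = PromptTemplate.from_template(template_with_history)
```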
Step 2: we define in our agent code how to handle the history, using a concept called memory:
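A sketch of this step using LangChain's ConversationBufferMemory and the history-aware prompt from Step 1; variable names carry over from the earlier blocks.

```python
# Memory step 2: give the agent a conversation memory and rebuild the executor.
from langchain.agents import ZeroShotAgent, AgentExecutor
from langchain.chains import LLMChain
from langchain.llms import OpenAI
from langchain.memory import ConversationBufferMemory

prompt = prompt_template.partial(
    tools="\n".join(f"{t.name}: {t.description}" for t in tools),
    tool_names=", ".join(t.name for t in tools),
)

# The memory fills the {chat_history} variable with previous turns.
memory = ConversationBufferMemory(memory_key="chat_history")

llm_chain = LLMChain(llm=OpenAI(temperature=0), prompt=prompt)
agent = ZeroShotAgent(llm_chain=llm_chain, allowed_tools=[t.name for t in tools])
agent_executor = AgentExecutor.from_agent_and_tools(
    agent=agent, tools=tools, memory=memory, verbose=True
)
```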
Step 3: execute the code and see what happens if we ask the same question again:
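```python
# Memory step 3: ask the same question twice. On the second run the agent sees
# the earlier exchange in {chat_history} and can answer from the conversation
# instead of calling the Search tool again.
agent_executor.run("How many people live in Canada as of 2023?")
agent_executor.run("How many people live in Canada as of 2023?")
```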
As we see from the generated response, the agent was aware that this question had been answered before.
The memory can quickly grow into a large dataset; think of it as a memory stream comprising a large number of observations that are relevant to the agent's current situation (a log of questions and responses, possibly across multi-user environments). Here, vector-search-based retrieval using a low-latency, high-performance vector store like Astra DB makes a lot of sense.
Doing this is actually just a couple of lines of code:
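A hedged sketch of persisting the conversation memory in Astra DB/Cassandra via the Cassandra-backed chat message history; the session id is an example value, the import path and the ability to omit session/keyspace can vary between LangChain/CassIO versions, and the connection from cassio.init() is assumed.

```python
# Persist the agent's conversation memory in Astra DB / Cassandra.
from langchain.memory import ConversationBufferMemory, CassandraChatMessageHistory

message_history = CassandraChatMessageHistory(
    session_id="agent-conversation-001",  # example conversation id
    session=None,      # reuse the connection from cassio.init(); version dependent
    keyspace=None,
    ttl_seconds=3600,  # optionally expire old memory entries after an hour
)

memory = ConversationBufferMemory(
    memory_key="chat_history",
    chat_memory=message_history,   # swap the in-process buffer for Cassandra storage
)
```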
With that we can manage the history that is used as memory for our agent, retain full control over the data lifecycle, and optionally define permissions and security rules on top of it.
Conclusion
Leveraging LLMs offers an enterprise a strategic advantage, but it is not the holy grail. Agents use a combination of an LLM and other tools to unlock more advanced capabilities, from basic tasks like document summarization to complex "agent orchestration" that mimics human work, such as real-time fraud detection with dynamic pattern recognition.
Agents can emulate human cognition and strategize task execution, thereby continually refining their outcomes. Integrating agents via tools with your domain experience and knowledge, and interconnecting them with your enterprise workflows, can supercharge your enterprise outcomes, from enhancing productivity and decision velocity to gaining competitive edges and opening new horizons.
In this landscape, enterprises that capitalize on autonomous agents position themselves for innovation and sustained leadership.
Thanks for reading. Please reach out to me if you want to see the whole thing working end-to-end or if you want to discuss this topic with me: link
References and Further Reading Material:
The following sources were inspirational input; thanks to the authors, and I recommend reading them.