How to Build Enterprise AI Apps with Multi-Agent RAG Systems (MARS)
Last year, I wrote about advanced Retrieval-Augmented Generation (RAG) as it made its way into enterprises and the emergence of Multi-Agent Software Engineering (MASE). These twin pillars of generative AI in software development have since evolved, intertwining in a way that promises to redefine the landscape of enterprise AI applications.
The Essence of an AI Agent
First, let’s explore what constitutes an AI agent. Some developers describe agents as autonomous entities capable of reasoning, action, and memory. A case in point is a ReAct agent, an agent that can reason and act. So what makes a ReAct agent? I think of one as consisting of three fundamental elements:
1. Intelligence: The agent’s access to Large Language Models (LLMs).
2. Knowledge: Its repository of structured and unstructured data across specific domains or topics.
3. Receptors and Effectors: The tools and APIs that enable the agent to perceive its environment and execute tasks.
Here is a simple pictorial view of a typical agent: an Anthropic LLM for intelligence, an Exa search tool and a retriever (connected to SingleStore) as tools, and access to knowledge through that retriever tool.
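To make those three elements concrete, here is a minimal sketch in Python. The Agent class and its fields are purely illustrative, not a real framework:

from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Agent:
    llm: Callable[[str], str]              # Intelligence: a call into an LLM
    retriever: Callable[[str], List[str]]  # Knowledge: retrieval over domain data
    tools: Dict[str, Callable] = field(default_factory=dict)  # Receptors/effectors

    def answer(self, question: str) -> str:
        # Ground the question in retrieved knowledge, then let the LLM reason
        context = "\n".join(self.retriever(question))
        return self.llm(f"Context:\n{context}\n\nQuestion: {question}")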
The Evolution of RAG in Enterprises
In the world of gen AI, last year feels like an epoch ago: developers had just started exploring the idea of feeding custom data into LLMs for enhanced contextual understanding. Personally, I was calling it “search and retrieval” before the better-known term, Retrieval-Augmented Generation (RAG), not only stuck but took the software development world by storm.
However, this rudimentary approach, often referred to as “naive RAG,” was insufficient for the layered, complex needs of large enterprises. These organizations typically have vast corpora of both structured and unstructured data, and the race to build enterprise-grade AI applications was stymied by three Herculean challenges, all related to the data around RAG: accuracy, relevancy, and latency.
Turns out, this is the number one reason why most enterprises did not see a boom in production-ready applications last year.
But things have been moving fast, and we are finally seeing a few principles come together that offer a glimpse of enterprise-grade software development evolving into a brand new era.
But first, let us examine the key requirements that have emerged to make RAG applications production-ready in the enterprise landscape:
1. Accuracy: In enterprises, there is zero tolerance for hallucinations. This is especially true in critical industries like finance and healthcare where the margin for error is non-existent.
2. Relevancy: At scale, resources are expensive, so information retrieval must be precise, querying only the necessary data to ensure efficiency. This also has a direct impact on the next point.
3. Latency: The RAG pipeline must operate with the swiftness of thought, completing its tasks in under a second to maintain a seamless user experience, assuming the inference call to the main model still takes about a second to respond over the network.
The Alchemy of Precision and Speed
How do we achieve this alchemical blend of precision and speed? Let’s look at each of these categories and some best practices.
1. Accuracy: A few practices have emerged here: fine-tuned embedding models with matching LLMs, effective evaluations, function calling for structured data, and live observability with a feedback loop driven by Reinforcement Learning from Human Feedback (RLHF).
2. Relevancy: Most companies have data lakes, data warehouses, or data lakehouses. All of this data is valuable for relevancy, and mixing it with real-time data is now almost table stakes for AI applications: if a response only draws on data from a single point in time, it becomes not only irrelevant but dangerously inaccurate.
3. Latency: The architecture must be optimized to ensure sub-second response times. Anything longer makes the overall experience suboptimal, to say the least, and hurts product adoption.
Now, in order to build this stack, there are several options for each layer; I have listed the ones I found most common in conversations with companies.
Embedding Creation — Here, Nvidia Inference Microservices (NIMs) outshine a number of competing solutions. With NIMs, you can either call Nvidia’s hosted APIs or rent/buy your own H100s and deploy your models. This may not be the right choice for every company, but if you are looking for speed, Nvidia is clearly the leader in this area.
What is really useful about NeMo is that you can deploy it as a server with an endpoint that can then be called as a tool from agents.
You can look at some examples and documentation for NeMo here.
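As an illustration, here is a minimal sketch of generating embeddings against a NIM-style, OpenAI-compatible endpoint. The base URL, model name, and input_type parameter reflect Nvidia’s hosted API catalog as I understand it; treat them as assumptions to verify for your own deployment:

from openai import OpenAI

# NIM endpoints speak the OpenAI API; point the client at Nvidia's catalog
# (or at your own self-hosted NIM) and use your API key
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="YOUR_NVIDIA_API_KEY",
)

response = client.embeddings.create(
    model="nvidia/nv-embedqa-e5-v5",          # example embedding model
    input=["What are our top selling products?"],
    extra_body={"input_type": "query"},       # Nvidia-specific: query vs. passage
)
vector = response.data[0].embedding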
Semantic Caching — There are a number of data stores and frameworks you can use here, but in my experience I always advocate a single platform that can handle every data use case at speed and scale. To me, that choice is SingleStore, although in full transparency my view is biased because I work at SingleStore.
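To make that concrete, here is a rough sketch of a semantic cache lookup in SingleStore. The semantic_cache table, the embed() helper, and the 0.9 threshold are all illustrative assumptions; DOT_PRODUCT and JSON_ARRAY_PACK are SingleStore’s vector functions:

import json

def check_semantic_cache(conn, question, embed, threshold=0.9):
    # Embed the incoming question and look for the closest cached question
    vec = json.dumps(embed(question))
    with conn.cursor() as cur:
        cur.execute(
            """SELECT cached_answer,
                      DOT_PRODUCT(question_vector, JSON_ARRAY_PACK(%s)) AS score
               FROM semantic_cache
               ORDER BY score DESC
               LIMIT 1""",
            (vec,),
        )
        row = cur.fetchone()
    # Serve from cache only when the best match is similar enough
    return row[0] if row and row[1] >= threshold else None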
Retrieval — Continuing with the theme above, retrieval includes both structured and unstructured data, which means doing both semantic search and keyword search across JSON, SQL, key-value data and so on. Here again, you can run all of this in one single query with SingleStore, with sub-second response times across petabytes of data.
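As an illustration, a single hybrid query could look something like the sketch below, combining a full-text keyword score with a vector similarity score in one statement. The documents table, its columns, and the (truncated) vector literal are hypothetical:

-- Keyword relevance (full-text index on body) plus semantic similarity,
-- scored and ranked in a single statement; the short vector literal
-- stands in for a full query embedding
SELECT doc_id, title,
       MATCH(body) AGAINST ('quarterly revenue') AS keyword_score,
       DOT_PRODUCT(embedding, JSON_ARRAY_PACK('[0.12, -0.08, 0.41]')) AS semantic_score
FROM documents
ORDER BY keyword_score + semantic_score DESC
LIMIT 5;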
Security and Safety — Here again, Nvidia’s NeMo Guardrails is a no-brainer. Since last year, this open source library has come a long way and can now do input and output validation, mask PII and much more, with a high degree of flexibility.
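As a quick sketch of how it plugs in, assuming a config directory containing your rails definitions (the path below is illustrative):

from nemoguardrails import LLMRails, RailsConfig

# Load rails (input/output validation, PII masking, etc.) from a config folder
config = RailsConfig.from_path("./guardrails_config")
rails = LLMRails(config)

# Responses are generated through the configured guardrails
reply = rails.generate(messages=[{"role": "user", "content": "What is our refund policy?"}])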
Evaluations — I talked about it last year and would still recommend the open source library RAGAs. It works well with the framework we will choose to put all of this together.
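A minimal evaluation run with RAGAs could look like the sketch below; the sample record is made up, the metric set is one reasonable choice, and RAGAs will need an LLM configured to act as the judge:

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# A tiny, made-up sample in the column format RAGAs expects
data = Dataset.from_dict({
    "question": ["What are our top selling products?"],
    "answer": ["The top seller last quarter was the X100 widget."],
    "contexts": [["X100 widget: 12,400 units sold in Q3, the highest of any SKU."]],
})

# Score the pipeline on two of the core RAG metrics
results = evaluate(data, metrics=[faithfulness, answer_relevancy])
print(results)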
The Monolith and the Modular
As you can imagine, last year each of these layers or steps in the pipeline took on a life of its own, and developers started constructing these systems as sequential pipelines. But this approach is fraught with limitations: rigidity, latency, and diminished accuracy and relevancy. This monolithic architecture is reminiscent of old C or C++ applications, which, though performant, were arguably cumbersome and inflexible.
Enter the era of Multi-Agent RAG Systems (MARS).
Now imagine taking all the different steps and tasks from above, using our choice of tools, and building them as specialized agents, each an expert in its own domain with its own tools. They can then be orchestrated, even run in parallel, for better performance and scalability, much like microservices.
This alternative offers several other advantages:
1. Maintainability: Specialized agents can tackle distinct tasks, allowing for independent updates and tool swaps.
2. Parallelism: Agents can operate concurrently, leveraging different GPUs or compute resources.
3. Optimized Resources: Workloads can be distributed across GPUs and/or CPU-based commodity compute depending on the task, enhancing efficiency.
4. Efficiency and Accuracy: Isolating specific tasks within different agents facilitates debugging and iterative evaluation with RLHF.
The Blueprint for a Multi-Agent RAG System
Now that we have established the different tasks and the choice of libraries, we need to take into account two more things before we put this all together. First, let’s choose a framework to create a workflow.
Here, one could use an agent framework like AutoGen or CrewAI, but instead of using a supervisor agent and a collaboration framework that produces unpredictable, non-deterministic outcomes each time, my choice would be LangChain’s LangGraph, for two reasons: ease of use and integrations with tools like NIMs and NeMo Guardrails. In addition, I like that we can construct the “graph” according to the needs of the enterprise, which may differ wildly from company to company.
Second, let’s look at some additional enhancements with other Agents to make our architecture more performant and efficient.
Embedding creation: We can add a pre-processing agent for data cleaning before embeddings are created.
Semantic caching: SingleStore is ideal, but we can also add a cache management agent for invalidation and updates. LangChain also offers this within its own APIs.
Retrieval: Once we have the basics down, we can add a query understanding agent that enriches a query with more context before it is sent for retrieval (see the sketch after this list).
Security and Safety: NeMo Guardrails is crucial here, but a separate PII detection and redaction agent, for example Protecto, would be a valuable addition as well.
Evaluations: For now, let’s stick with RAGAs, although we could use other agents to run other evaluation frameworks and compare the results.
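As one example of these enhancements, here is a rough sketch of a query understanding step. The prompt and the function are entirely illustrative; chat_model is any LangChain chat model:

def understand_query(chat_model, user_query: str) -> str:
    # Ask the LLM to rewrite the question so retrieval has more to work with
    prompt = (
        "Rewrite the following question so it is explicit and self-contained, "
        "spelling out any implied entities, metrics, or time ranges:\n\n"
        + user_query
    )
    return chat_model.invoke(prompt).content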
Given the modularity of the architecture, we can now also build additional agents that we can add, modify and remove based on the final requirements:
Query Planning Agent: Strategizes the best approach to answer queries.
Context Enrichment Agent: Enhances retrieved information with additional context.
Response Generation Agent: Crafts the final response based on enriched information.
Feedback Integration Agent: Processes user feedback for system improvement.
Logging and Monitoring Agent: Manages comprehensive logging and real-time monitoring.
Data Versioning Agent: Oversees data, embedding, and model artifact versions for reproducibility.
Orchestration Framework
For orchestrating this multi-agent RAG system, as stated above, LangGraph is going to be my choice. Its design for multi-agent systems, flexible graph structure, robust state management, integration with LangChain, visualization tools, and scalability make it an ideal choice for enterprise-grade applications.
Code Implementation
Step 1 — Build an Agent
First, let’s look at a simple agent implementation. We will build the agent itself with LangChain’s agent API and later wire it into a LangGraph workflow.
from langchain.agents import initialize_agent, AgentType

# chat_model is your LLM of choice (e.g., a ChatAnthropic instance)
enrichment_agent = initialize_agent(
    [top_selling_products_tool],
    chat_model,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,
)
Step 2 — Create tools
Notice how we are passing a tool to the agent. This is effectively another method that the agent will call when it needs to run the query, and it is defined here:
from langchain.agents import Tool

top_selling_products_tool = Tool(
    name="Top Selling Products",
    func=get_top_selling_products,
    description="Get the top selling products. You can specify a limit (default is 5).",
)
This tool calls a function, get_top_selling_products, which in turn has the code to retrieve the structured data.
def get_top_selling_products(limit=5):
    # db is assumed to be an existing database connection/wrapper
    # (e.g., a LangChain SQLDatabase pointed at SingleStore)
    query = f"""
    SELECT product_name, SUM(quantity_sold) AS total_sold
    FROM sales
    GROUP BY product_name
    ORDER BY total_sold DESC
    LIMIT {limit}
    """
    return db.run(query)
Step 3 — Create a workflow with graph, nodes and edges
We then create the workflow as a graph with nodes and edges that put this all together; we add the nodes and edges to the workflow and run it.
Here is an example of creating a workflow as a stateful graph. The AgentState is responsible for maintaining state across the overall workflow.
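AgentState can be a simple TypedDict holding the fields our agents read and write. The shape below is one plausible sketch; the field names are assumptions that match the workflow code that follows:

from typing import TypedDict

class AgentState(TypedDict, total=False):
    query: str            # the user's original question
    sql_result: str       # output of the SQL agent
    enriched_result: str  # final output of the enrichment agent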
from langgraph.graph import StateGraph, END

workflow = StateGraph(AgentState)
Finally, we put all of this together.
# Wrap each agent as a node: a function that reads the shared state and
# returns the fields it updates (sql_agent is built like enrichment_agent)
def sql_node(state: AgentState):
    return {"sql_result": sql_agent.run(state["query"])}

def enrichment_node(state: AgentState):
    return {"enriched_result": enrichment_agent.run(state["sql_result"])}

# Add the agent nodes and wire the flow: SQL first, then enrichment
workflow.add_node("sql_agent", sql_node)
workflow.add_node("enrichment_agent", enrichment_node)
workflow.set_entry_point("sql_agent")
workflow.add_edge("sql_agent", "enrichment_agent")
workflow.add_edge("enrichment_agent", END)

# Compile the graph and run the workflow end to end
app = workflow.compile()
result = app.invoke({"query": "What are the top selling products?"})
For a full agentic RAG example, refer to this article.
Conclusion
Obviously, we have only scratched the surface here and, as stated, there are different technologies and choices we can make to build custom Multi-Agent RAG Systems (MARS). Hopefully, this sets the foundation for you if you are looking to understand what goes into building enterprise-grade AI applications, and the overall framework and steps to go down a path that is not only modular but also scalable and controllable, like the microservices architecture we have come to know.
Update: Posted the next article in this series here: The Seven-Factor AI App
✌️
Enjoyed This Content?
Thanks for reading to the end of this article. My name is Madhukar; I work at the intersection of technology (and AI) and creativity. I love to build apps and write about enterprise AI, PLG, and marketing tech.
I am also building a course. Reach out to me on LinkedIn if you are interested in collaborating in any way.
Subscribe for free to get notified when I publish a new story.