Evolution of the AI Bots: Harnessing the Power of Agents, RAG, and LLM Models

This article is a product of my own research and synthesis of knowledge about various tools for AI bot development, sourced from online articles and Git repositories. It serves as a personal reference and aims to describe high-level patterns, pipelines, and architectural designs commonly used in 2024. The article is designed to balance a generalist technical level, suitable for newcomers to the topic, while delving deeper into theory and providing real-world examples and tools, with only brief, illustrative code sketches rather than full programming tutorials. It is intended to be comprehensive enough for tech-savvy individuals who are either unfamiliar with the subject or have fragmented knowledge.

Developers now have powerful tools for creating intelligent AI applications, including Large Language Models (LLMs), Retrieval-Augmented Generation (RAG) pipelines, and ReAct-designed Agents. This guide outlines the process of working with these tools, designs, and approaches to build AI bot applications that enhance response quality while reducing costs. I will provide URLs to additional resources; however, you can ignore them on your first read to maintain focus. All necessary information is included in this article, so there is no need to open the external links for now 😉.

I propose splitting the development of general-purpose AI applications into two main buckets:
A) Training a model — creating a new one or modifying an existing one.
B) Using existing models — the focus of this article, and where there is a lot we can do.

A) Training LLMs: A Cost-Benefit Analysis

Training new LLMs from scratch is complex and costly. Instead, you can achieve better results by designing applications that leverage existing Generative Pre-trained Transformer (GPT) models without modifying them. This approach is often more cost-effective and addresses the limitations of traditional LLMs, such as handling complex mathematics, recent news, commercial proprietary data, or new knowledge not initially included in the model — that’s where RAG comes in, which we will discuss a bit later in this article.

Optimizing the injection of additional context can further reduce costs, and this article will explore patterns for designing applications that effectively utilize LLM models. By following these strategies, you can improve your results without the need for fine-tuning or training new models.

Training Smaller Models, Leveraging General Models, and Fine-Tuning

Before dismissing this option as too complex and expensive, let’s explore it at a high level. Building an AI application starts with selecting a model. You can either develop a model from scratch or utilize existing ones.

Creating and fine-tuning new models is a whole other topic of discussion and involves distinct skill sets and tools. While this section provides a high-level overview, this article will not focus on model fine-tuning. Instead, we will explore how to jumpstart with existing models by building applications that wrap them to enhance quality without modifying the models.

Types of Data

To inject data into your application, you'll need to handle different types of data differently, processing them with different tools. There are many types of data out there, so first, let's understand what they are.

At a high level, data can be Structured, Unstructured, or Semi-structured. Although the name “Structured” might seem self-explanatory, it doesn’t refer to the general, human understanding of structure. For instance, a TXT file containing a book that is nicely formatted, with a table of contents, well-shaped chapters, numbered pages, and sound grammar, is NOT an example of Structured Data. In the IT world, from a computer’s perspective, the term Structured Data has a very precise and specific meaning.

Structured Data in IT refers to tabular formats: this type of data has at least one table with Records (Rows) and Fields (Columns). The table should follow some sort of schema, meaning each column should contain the same type of data rather than arbitrary values. For example, in a column with the header “Net Income”, each row should contain a number and not a smiley-face emoji. Examples:

  • Tables, Spreadsheets (e.g., XLSX, CSV, Google Sheets, LibreOffice Calc)
  • Relational (SQL) Databases (MySQL, PostgreSQL)
  • Key-value databases such as Redis (essentially one large table with a couple of columns).
  • Note: if a database contains, say, JSON data, then that data is semi-structured; if it contains chunks of free text, the text itself is unstructured data.

Unstructured Data in the IT world is also a very specific term, covering text documents and multimedia. Yes, TXT files are unstructured data.

Text Documents:

  • Word Processing Files (e.g., DOCX, RTF, TXT, ePub, etc.)
  • PDF Files
  • HTML Files / Web Pages

Multimedia

  • Images (e.g., JPEG, PNG)
  • Audio, Voice (e.g., MP3, WAV)
  • Video (e.g., MP4, AVI)

Semi-structured Data:

  • Emails have some structure: headers, body, and attachments
  • The same goes for JSON, XML, and YAML files
  • NoSQL databases: MongoDB (MQL), Cassandra (CQL), Neo4j (Cypher), NebulaGraph (nGQL, openCypher-compatible)

Types of Data & Backend Tools to use

I would like to emphasize the versatility of Knowledge Graph Databases (KGDB). They are proficient at handling various types of data, including unstructured text, such as a book; semi-structured data; and structured tables, such as CSV files. Furthermore, some KGDBs, such as Neo4j, SurrealDB, and FalkorDB, can also function as a Vector database, enhancing their adaptability and use cases with AI. Both Neo4j and FalkorDB support the Cypher query language and the Bolt access protocol.

In comparison to relational databases, a KGDB may demonstrate superior performance, particularly when numerous SQL JOIN operations are anticipated, since a KGDB does not rely on indexing the way relational DBs do. Therefore, if there is a need to traverse more than five levels of hierarchical parent-to-child joins, or if queries involve more than 500 junction IDs, a KGDB could be the more efficient choice. A KGDB is also capable of full-text index search, making it easier to find data without knowing the schema.
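
To make the comparison concrete, here is an illustrative pair of query strings for a five-level parent-to-child lookup: the SQL version needs one JOIN per level, while Cypher expresses it as a variable-length path. The table, label, and relationship names are made up for this example.

# Hypothetical "org" hierarchy, purely for illustration.
sql_query = """
SELECT c5.name
FROM org c1
JOIN org c2 ON c2.parent_id = c1.id
JOIN org c3 ON c3.parent_id = c2.id
JOIN org c4 ON c4.parent_id = c3.id
JOIN org c5 ON c5.parent_id = c4.id
WHERE c1.name = 'Head Office';
"""

# The same traversal in Cypher as a variable-length path of 1 to 5 hops.
cypher_query = """
MATCH (root:Org {name: 'Head Office'})-[:PARENT_OF*1..5]->(child:Org)
RETURN child.name
"""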

However, it is important to note that despite its versatility, KGDB does have certain limitations. These should be taken into consideration when deciding on its applicability for specific use cases, for example:

  • If you are planning to store only structured data, a relational DB might provide better performance and versatility, and most importantly, SQL may be more familiar to query with than Cypher.
  • Some KGDBs such as Neo4j, FalkorDB, and SurrealDB have a collection of essential vector operations and can serve as a Vector DB, though they might not retrieve data as quickly, and are more limited in capabilities, than specialized Vector DBs such as Qdrant, Milvus, or Weaviate. Neo4j has a dimensionality limit of 4,096, which is quite large and should suffice for almost all tasks, though in some cases it might not be enough. Neo4j currently cannot use the vector index in combination with pre-filtering; you can only apply post-filtering in combination with the vector index. SurrealDB, meanwhile, does not support the Cypher query language; its own query language closely resembles traditional SQL but is not exactly the same, and its documentation can be sparse.
  • A KGDB might not be suitable for multimedia content; a Vector DB or another specialized store is probably a better fit.

You’ll be storing all these types of data in different places and backends. For instance, YAML files can be stored in an etcd cluster and be available via the k8s API, while emails and XML are typically stored on some sort of server and also available via an API. Text files you will probably store in a KGDB, a Vector DB, or both, while tabular data you will probably store in a relational DB.

Evolution: v1.0

AI bot development techniques, designs, and architectures are rapidly evolving, increasing abstraction in data manipulation while enhancing the resulting quality. This progression is leading to more human-like responses and beyond.

The first major step was the development of Large Language Models (LLMs). The second step involved creating Retrieval-Augmented Generation (RAG) applications that wrap LLMs to improve answers by injecting relevant information, helper text, and instructions, thereby addressing the weaknesses of LLMs using various functions and tools. The third step is the expansion of RAG into Agents that can interact with the real world and perform tasks using tools, not just answer questions in a chat.

Within each of these steps, there is flexibility in building applications, allowing for experimentation, adaptation, and improvement based on individual use cases. For simplicity, I’ll refer to these steps as v1.0, v2.0, v2.5, v3.0, and v3.5. While not official terminology, this helps organize the information and structure this article.

Inference: An engine capable of executing your model. Do not confuse it with the RAG app.
Serve: The process of exposing the inference engine as a service via an API endpoint over the HTTP protocol.
LLM Model: Think of it as a parrot capable of repeating words, rearranging them, or even answering questions. It’s a neural network algorithm stored as a file and executed by the inference engine. When executed with an input prompt, the model can respond accordingly. LLMs are notable for their ability to achieve general-purpose language understanding and generation, acquired by learning statistical relationships from documents during the training process.
RAG Application: code written in a programming language, often Python or Node.js. The RAG app uses the inference engine via its API.
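
To make the inference/serve distinction concrete, here is a minimal sketch of a RAG app calling a served model over an OpenAI-compatible HTTP API. The base URL, model name, and API key are placeholders for whatever your inference engine (for example, a local Ollama or LocalAI server) exposes.

from openai import OpenAI

# Placeholder endpoint of a locally served inference engine; adjust to your setup.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed-locally")

response = client.chat.completions.create(
    model="llama3",  # whichever model your inference engine serves
    messages=[{"role": "user", "content": "Explain Retrieval-Augmented Generation in one sentence."}],
)
print(response.choices[0].message.content)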

Prompt Engineering

The context produced by searching embeddings in your Vector DB might not always be effective for a chat LLM on its own. Some models follow instructions better than others, and depending on the specifics of the model, certain prompts might work effectively while others don’t. Therefore, it’s crucial to find prompts that work best with your LLM, providing it with examples and instructions. There are specialized resources that share prompt template examples that might be effective.

Prompt Flow is the opposite of a hardcoded prompt. It involves creating multiple prompts and variants that can be plugged into LLMs for different scenarios, dynamically adapting to the user prompt and ultimately providing better results; a toy sketch follows below. An example of a framework for algorithmically optimizing LLM prompts instead of prompting manually is DSPy.
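
As a toy illustration of the difference between a hardcoded prompt and a prompt flow, the sketch below keeps several prompt variants side by side and picks one per scenario; the templates and the routing rule are invented for this example, and frameworks like DSPy automate this kind of selection and optimization.

# Invented prompt variants; a real prompt flow would test and optimize these.
PROMPT_VARIANTS = {
    "concise": "Answer in one short sentence.\nContext:\n{context}\nQuestion: {question}",
    "step_by_step": "Think step by step before answering.\nContext:\n{context}\nQuestion: {question}",
    "cite_sources": "Answer and list the document IDs you relied on.\nContext:\n{context}\nQuestion: {question}",
}

def build_prompt(question: str, context: str, scenario: str = "concise") -> str:
    template = PROMPT_VARIANTS.get(scenario, PROMPT_VARIANTS["concise"])
    return template.format(context=context, question=question)

print(build_prompt("Who wrote Pride and Prejudice?", "<retrieved chunks>", "cite_sources"))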

Examples of resources that share prompts & ratings

Infini-Attention Models: Challenges, Solutions, and Compromises

The small, fixed-size context window in vanilla models poses significant difficulties in processing large documents and texts. Google’s proposed solution, the so-called infinite attention mechanism, addresses this by expanding the context window to an impressive 1M or even 10M tokens. This advancement is beneficial for certain use cases.

This is great news for some scenarios and can address some niche issues. However, though processing 10M tokens in each request might be possible now, it can be overkill: the price of such an approach can be prohibitively high, and response latency and speed suffer. Consequently, Vector Databases and other data-management techniques for juggling your data with RAG remain relevant and essential; they are not made obsolete by Infini-Attention models. Instead, Infini-Attention is an additional tool in the RAG toolbox.

Infini-Attention Models: A Closer Look

As with any technology, it is not magic or a silver bullet, and the details matter. It’s important to find which niche issues it can solve and where to look for shortcomings. The problem with the 10M context window, as I already said, is price and speed. It’s also important to understand how the Infini-attention mechanism is built beneath the surface of this promising new tech, to see any other nuanced details and implications.

Previous attention mechanisms in vanilla models attended every token to every other token, creating so-called quadratic complexity and consuming large amounts of memory. The infini-attention mechanism aims to solve this problem by processing data linearly, which is part of the reason such models can dramatically reduce memory consumption. The other part of the solution is compressive memory. It is important to understand the type of compression and how it works: first of all, this is a kind of lossy compression, where you reduce the quality, or you could say the resolution, of the data. The memory is also limited in size, with a fixed number of parameters to store and recall information from.

Infini-Attention models handle large texts by dividing them into segments, which are processed using “local self-attention”, the same “dot-product attention” used in vanilla models. Segments are basically chunks of text, but the term applies to splitting the context for the LLM. Infini-Attention models introduce an additional memory, known as hidden global compressive memory, which serves as the glue between segments. This global memory integrates information across all segments, enhancing the model’s ability to manage extensive texts while maintaining overall global knowledge across segments.

Global compressive memory is “lossy” because it reduces resolution and quality and has a limited size. As each text segment is processed with local self-attention, the global memory is continuously updated, integrating the current segment’s information and modifying the global memory’s parameters. This process repeats for each segment, updating the global memory linearly while keeping the overall context of the document. Global compressive memory maintains a fixed number of parameters to store and recall information from, with bounded storage and computation costs.

Infini-attention models offer the ability to process each segment with the same quality as traditional attention mechanisms while considering a global memory updated with knowledge from previous sequences. The model adjusts its parameters in the global memory to best capture the overall essence of the information. However, this compression can lead to some detail loss due to its lossy nature and the fixed size of the memory similar to the vanishing or exploding gradients problem. This may introduce a recency bias, where more recent information is represented with higher fidelity than older information, potentially reducing attention to earlier parts of the sequence. This trade-off is necessary to manage vast amounts of text without overwhelming memory capacity. Despite these compromises, Infini-attention models demonstrate promising results.

Compression is an intuitive and necessary solution for managing large amounts of data. Limiting global memory is a price to pay when processing infinite or vast amounts of information. Without this limitation, even if the information from each sequence is significantly reduced, the global hidden memory would still grow infinitely, making the system unmanageable. So keeping the global hidden memory limited size is also a necessity.

The use of compression in infini-attention models may lead to some degree of detail loss due to its lossy nature and the finite size of the memory. This could result in a recency/position bias, where more recent or more often repeated information is represented with higher fidelity than older information, potentially leading to less attention being paid to earlier parts of the sequence. This is a trade-off we need to pay for processing vast amounts of text without overwhelming memory capacity.

The resolution quality of infini-attention models, which are designed to reduce memory footprint via constant updates with each new segment, inevitably leads to some detail loss and less prominent results. Unlike vanilla models, infini-attention models are trained on long data chunks; research and time will show how this affects the quality of the results produced. The exact impact of these trade-offs and of the lossy compression will become clearer over time.

While it’s difficult to quantify the exact impact of these trade-offs and the nature of lossy compression at this stage, the current advancements show promising results that were previously unattainable. Despite the inherent compromises, Infini-attention models likely represent a significant improvement over past capabilities, offering substantial potential for managing large-scale text processing tasks. Some might assume that RAG would no longer be needed if infini-attention models could get rid of their disadvantages, though that is likely not the case: you can build LongRAG, with an emphasis on retrieved chunks being much bigger, fed to (reader) models that support large context windows. LongRAG removes the need for reranking and demonstrates good performance on public information from Wikipedia.

B) RAG: Vehicle powered by LLM v2.0

LLMs often hallucinate, produce inaccurate answers, or simply do not know something. That is why the Retrieval-Augmented Generation (RAG) pipeline was invented: to direct, check, and inject new information, provide guiding text and instructions, ensure better results, and restrict and safeguard against hallucinations. RAG can be referred to as an “approach”, “technique”, or “pattern”, though I will settle on the terms “RAG Application” and “RAG Pipeline”, which I may use interchangeably in this article. An LLM on its own cannot insert links to the original documents into its response; for example, look at Bing/Copilot, which not only answers your question but also provides URL links to the documents it used. LLMs do not know proprietary information, recent news, or knowledge not initially included in the model. That’s where the RAG application helps.

General-purpose LLMs are designed to minimize average errors across their training data, which inherently leads to hallucinations. This phenomenon is not a flaw but a feature of these models. They excel at a wide range of tasks but are perfect at none. Their ability to generate fluent human language stems from extensive internet exposure. However, they often lose precise details and facts within their vast parameters and probabilities when it comes to news, revenue summaries, or function calls in an app. Consequently, companies struggled to rely on LLMs for critical and high-value use cases until the advent of RAG.

This piece is mainly influenced by Alex Honchar’s article

In the image above you can see a simple RAG-based application that takes the user prompt, pulls additional context relevant to that prompt, and feeds both to the LLM. Also, if you interact with pure models, you’ll notice that each time you ask them something they are a clean sheet, with no memory whatsoever, pure amnesia.

The whole point of the RAG pattern is to gain small, tiny improvements and accumulate as many of them as possible. Every single small detail that improves the result should be considered and tested; that’s the idea behind RAG. As of 2024, LLMs tend to respond better to simple, short questions accompanied by additional context to the initial user prompt, and this will probably stay that way for a long time, if not forever. RAG is an ever-evolving approach to improving the quality of your AI application by manipulating data on top of your models, a polishing last step. A minimal sketch of the pattern follows the list below.

  • Retrieving relevant information from external sources and enriching the context provided to the model improves its responses while using existing models without modifying them.
  • Provides various patterns to decompose a complex user prompt into smaller prompts with guiding instructions and examples.
  • Unlike Bing/Copilot, a RAG app typically has only one or a couple of tools to work with in its toolbox.
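
Here is the minimal sketch promised above: a naive RAG runtime loop that retrieves context and injects it into the prompt. The retrieval function is a stand-in for your own vector-search code, and the model name and prompt wording are only placeholders, not a prescribed implementation.

from openai import OpenAI

client = OpenAI()  # or point base_url at your own inference engine

def retrieve_chunks(question: str, top_k: int = 3) -> list[str]:
    # Stand-in: in a real app, embed `question` and query your Vector DB here.
    return ["<chunk 1>", "<chunk 2>", "<chunk 3>"][:top_k]

def answer(question: str) -> str:
    context = "\n\n".join(retrieve_chunks(question))
    prompt = (
        "Answer the question using ONLY the context below. "
        "If the context is not enough, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(answer("What is our refund policy?"))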

Advanced RAG

In the image below you can see the context retrieval step that is a basic or “Naïve RAG”. To make an “Advanced RAG” you’ll need to add optimization steps immediately before the context retrieval and after.

https://arxiv.org/html/2312.10997v5

Populating databases and indexing optimization: preparing and optimizing the data and its structure to enhance the quality of the content being indexed. Techniques include a sliding window of overlapping chunks for retrieval with a larger context window at inference time, or even a sliding window with summaries of multiple chunks; cleaning irrelevant or ambiguous content and updating outdated information; confirming factual accuracy and maintaining the information over time; and adding metadata tags to enrich chunks for filtering, such as text-classification categories or tags, date, purpose, page number, file name, authors, chapter, subsection, and timestamp. When metadata is not sufficient to logically separate different types of context, experiment with multiple indexes for different types of documents, with index routing at retrieval time. Other techniques: enhancing data granularity, optimizing index structures, alignment optimization, and mixed retrieval; combining graphs and vectors to add information from the graph structure that captures relationships between entities; and creating graphs of non-sequential but related chunks for retrieval during inference. In the case of a large number of files, create a hierarchical index of linked chunks, so that chunks retrieved from multiple unrelated but semantically similar documents do not result in a mess of data that might even contradict itself. You can also get LLM-friendly input from a URL or a web search with jina.ai/reader.
Pre-retrieval optimization: optimize the original query to make the user’s question clearer and more suitable for the retrieval task, using query routing, metadata filtering, query rewriting/transformation, and query expansion. Prepare queries to extract data from sources other than the Vector DB that require knowledge of the data and its structure and have strict query-language grammar, such as SQL and Cypher. A hybrid-search fusion of keyword-based and vector search results is ideal for a human-like search experience with exact phrase matching for specific terms, such as copy-pasted error messages, product names, or serial numbers. There is also rule-based retrieval, and the HyDE, Sub-Query, and MultiQueryRetriever techniques.
Retrieval optimization: Surrounding Window Retrieval retrieves the chunks before and after the chunk found by similarity search, to better identify similarity and keep the surrounding context. You can even use several embedding models, utilizing both dense and sparse models, and you can fine-tune embedding models to a domain-specific context to improve similarity search in the vector space.
Post-retrieval optimization: feeding in too much information can lead to overload, losing focus on key details amid irrelevant content, and semantic similarity is sometimes inaccurate. Therefore, post-retrieval compression is needed to highlight the selected, essential information: keeping critical sections based on rules or re-ranking, and shortening or removing less relevant context to summarize and compress. Re-ranking models, LLMs, the Okapi BM25 ranking function, or the FlashRank library can help recalculate the relevance score of each retrieved contextual document against the user prompt, including for new or complex domains, so that the most relevant content is relocated to the edges of the prompt. You can also check whether the context actually answers the question with a specially trained model.
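
As one example of the post-retrieval step, here is a hedged sketch of cross-encoder re-ranking with the sentence-transformers library; the model name is a commonly used public re-ranker, not a requirement, and the chunks are placeholders.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example model

query = "Which electric car brands in the US offer tax credits?"
retrieved_chunks = ["<chunk about tax credits>", "<chunk about charging>", "<chunk about leasing>"]

# Score every (query, chunk) pair, then keep the highest-scoring chunks for the prompt.
scores = reranker.predict([(query, chunk) for chunk in retrieved_chunks])
reranked = [chunk for _, chunk in sorted(zip(scores, retrieved_chunks), reverse=True)]
print(reranked[:2])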

RAG & Types of Data

Knowing the types of data and how they are stored will define how your RAG can pull this information and how this context will be fed to your models. The bottom line: you need some tool that can extract smaller pieces of information from your dataset. For instance, structured data makes sense to store in relational databases and to retrieve in smaller pieces using SQL. Unstructured data might be too large, and you might need to split it into smaller pieces and then store and index them in a Vector DB. Semi-structured data might come via APIs, or be stored in NoSQL databases and retrieved with CQL, MQL, GQL, or Cypher queries.

When building your RAG pipeline, you might need your application to utilize some of the tools that pull smaller pieces of information as context into your LLM model.

Though it may vary, generally speaking structured data and its tools will produce better results than semi-structured data, and semi-structured data will do better than unstructured data handled with generic semantic-search tools such as a Vector DB. You’ll likely have all data types, though, so you may need all of these tools.

Primarily Structured Data Example: imagine we have hospital data with the following tables: hospitals and their locations; patients, their date of birth, and blood type; insurance companies; physicians with their graduation year and the name of their school; and visits with date, location, billing amount, room number, physician, patient, type of visit, treatment description, and the chief complaint. Now, the patient Mica might have left reviews for the physician Andre after she visited the hospital in Cincinnati, and we want to store those too.
At first glance, this data should be stored in a relational DB and extracted with SQL tools. This is probably correct, though notice there are also things like chief complaints and patient reviews, which are best extracted and processed with semantic search (a Vector DB). Semantic search is useful when you need to answer questions about Mica’s experiences, feelings, perceptions, sentiments, or any other qualitative information. It’s not useful for answering objective questions that involve mathematical functions such as counting, percentages, aggregations, or listing facts, like how many times Mica visited the physician. For instance, if the prompt is “Is Mica happy with her care provider Andre?”, you probably need both: a relational DB with SQL extraction for the objective information, and semantic search to extract the qualitative data.

Wren AI is a notable and convenient Open-Source product for generating Text-to-SQL queries. It can run both locally and in Kubernetes (k8s). Wren AI features a user interface (UI) and an API that allows you to make queries and receive both the generated SQL and the extracted data. This greatly simplifies working with structured data.
www.getwren.ai

RAG Phases

RAG involves two phases. A common beginner mistake is to combine them, which is a bad idea for production, since you don’t want to load and index gigabytes of data with each user request:

  1. Processing, preparing, indexing, and upserting information to store in some sort of database. This is usually done by the developers as a one-time, on-demand, or scheduled job.
  2. Triggered by the user prompt at runtime: retrieving smaller portions of relevant info from the DB, which will accompany the user prompt as additional context for the inference model.

Embeddings and Vector Databases

Embeddings are numerical representations of any information. They allow us to determine similarity, powering quick search, classification, and recommendations. Imagine a digital library with a vast collection (our dataset). Each book is represented by coordinates: a numerical vector that captures the essence of the book’s content, genre, style, and other features, a unique ‘digital fingerprint’ for every book. When a user is searching for a book, they provide a search prompt. The library’s search system converts this prompt into vector coordinates using the same embedding method it used for all the books, and searches the library’s database. The system looks for the book vectors that are most similar to the prompt vector, and the books with the closest matching coordinates are recommended to the user in the search results.
Another simple use case: looking for a synonym of a word. Embeddings can help you find similar or “close” words, but they can do more than that. Semantic search is a very effective way to quickly find information related to your prompt, and it is part of how the Google Search Engine works.

The classic novel “Pride and Prejudice” by Jane Austen is also known by a different name: it was originally drafted as “First Impressions”, and appears under that title in some translations and adaptations. Despite the different names and languages, embedding these in a vector database would reveal their close semantic relationship, placing them near each other in the vector space.

Let me give you another example, best understood by comparing how humans look at data versus computers. Imagine you are looking for cities near Chicago, IL on a map. If the computer knows the coordinates are {41°88’18"N, -87°62’31"W}, then to find a city close to Chicago it doesn’t need a map, just the list of coordinates of all other cities! Among them, the spot {41°84’56"N, -87°75’39"W} is the closest: Cicero, IL. Notice how the latitude and longitude numbers are close. Now we can add an additional “dimension” with the size of the city by population, and if the user asks for the closest city to Chicago of a similar size, the answer to the prompt could be different. We can add more dimensions. Computers can find similarities in TV comedies, clothes, or many other types of information using this algorithm. In scientific language, this is formulated as “placing semantically similar inputs close together in the embedding space”. FYI, this coordinate space is also referred to as latent space.

Embeddings are a very powerful tool for enriching user prompts with relevant information: the user’s search prompt is placed into the categories it belongs to, and similar information is found via shared categories across other sources. A good example would be daily news that our model is not aware of yet. Instead of baking this new information into the model daily, we simply retrieve the news from other sources and provide the closest, most relevant items as additional context alongside the original user prompt.

Why do we need to encode our dataset as embeddings, convert the user prompt into embeddings, and then search vectors, instead of just searching the original dataset for the text of the prompt directly? Because this representation is fast to process and makes it easy for computers to capture relationships between pieces of information. In other words, texts whose embeddings are numerically similar are also semantically similar.

In preparation, during the first phase of our RAG application, the information in our entire dataset is split into overlapping chunks and stored in a database (a Vector DB) along with encoded numerical representations, so that later, in the second phase, you can quickly retrieve a smaller portion of relevant information as additional context for the user prompt. An embedding model encodes text from our dataset into an index of vectors in the first phase, and both are stored in the vector database. Then, in the second phase, at application runtime, the user prompt is encoded with the same embedding model, and the resulting vectors are used to search the Vector DB and retrieve chunks of text, similar to how search engines work. That’s why these are called bi-encoder models. The embedding model used to encode text into numerical vector representations is typically much smaller than an LLM. And the beauty of searching embedding similarities stored in a Vector DB is that you don’t need to know your data or any schema to make it work. Today, virtually all embedding models are some flavor of the BERT model.
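
A small sketch of the bi-encoder idea with the sentence-transformers library: the same model encodes both the corpus and the query, and cosine similarity finds the closest chunks. The model name is one common example; any embedding model works as long as you use it consistently.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example bi-encoder (a BERT flavor)

corpus = [
    "Pride and Prejudice is a novel by Jane Austen.",
    "First Impressions was the early title of Jane Austen's novel.",
    "Cicero is a town bordering Chicago, Illinois.",
]
corpus_vectors = model.encode(corpus)

query_vector = model.encode("Which Jane Austen book was once called 'First Impressions'?")
print(util.cos_sim(query_vector, corpus_vectors))  # the Austen sentences score closest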

Advantages & Disadvantages of Embeddings:

Embeddings, despite their popularity, have a notable limitation: similarity is not transitive, and embeddings do not capture summarized concepts over large bodies of data. This has implications for interpreting and responding to queries in RAG systems. In vector space, when traversing disparate chunks of information through their shared attributes in order to provide newly synthesized insights, if vector A is similar to vector B, and vector B is similar to vector C, it does not necessarily mean that vector A is similar to vector C. When a user’s query, represented as vector A, retrieves B but actually seeks information that aligns with vector C, the connection may not be apparent via vector B. The disadvantages of embeddings are also evident when trying to provide synthesized insights or to holistically understand summarized semantic concepts across a large dataset.

These limitations can lead to suboptimal situations where RAG systems return only 60%, 70%, or 90% correct answers, rather than consistently achieving 100% accuracy.

While embeddings may not always be correct, they always return something, making them reliable in that regard. You might wonder what the use of such reliability is if no quality is guaranteed, but this simplicity is often a prerequisite for working with more complex constructs such as a Semantic Layer (more about that a few chapters below). One of the key advantages is that you do not need to understand your data or have a schema to retrieve information, simplifying the initial stages of working with complex data. When implemented correctly and combined with other techniques, embeddings can have a positive compounding effect, which explains their widespread use despite their inherent limitations.

Retrieving from a Vector database is not the only option; you can retrieve data in many ways, from relational database tables or via APIs such as Google Maps or Yelp. You may want to use a Vector database when you don’t have any more convenient way of storing and retrieving your data.

Theory

Workshops

Chunking

Imagine we have a series of Sherlock Holmes books. We couldn’t insert the entire book series as context into the LLM, but we can insert smaller pieces of text from the books. When you store your data in a database, you want it split into pieces that will be injected as context into your RAG, since the context is constrained by your LLM’s prompt-size limitations. With RAG, all details matter for improving results, even how you split your information into chunks. You can simply split your text into a fixed number of characters, but then what should that number be? You should probably also make chunks “overlap” so as not to lose important information, and decide how large that overlap should be. You might want to consider semantic chunking, splitting text into words, sentences, paragraphs, and sub-sections to preserve the semantic integrity of contextually related content; some chunking algorithms can dynamically identify break points of significant deviation in meaning while splitting a large corpus into smaller pieces, and there is more than one semantic chunking method out there. Some even propose auto-generating a short “rolling” or “sliding” summary of the previous few chunks and including it in the current chunk. These are all things you’ll have to evaluate, test, and figure out on your own for a given scenario. For starters, consider a simple “naive” approach with a fixed 500-character chunking splitter and an overlap of 100 characters (see the sketch below), and go from there, as you may need much smaller or bigger chunks, or semantic chunking might be better for your particular application. Then try a recursive character splitter, for instance. Notably, there are tools for dynamic chunking based on the file type, such as unstructured.
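
Here is the naive fixed-size splitter mentioned above (500 characters with a 100-character overlap) as a minimal starting point; the file name is a placeholder, and in practice you may switch to a recursive or semantic splitter.

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    # Slide a window of `chunk_size` characters, stepping by (chunk_size - overlap).
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

book = open("sherlock_holmes.txt", encoding="utf-8").read()  # placeholder file
chunks = chunk_text(book)
print(len(chunks), "chunks, each up to 500 characters, overlapping by 100")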

Theory

Workshop

Understanding Vector Databases

A Vector Database (Vector DB) stores data in a multi-dimensional space, much like a table whose columns hold coordinates. Imagine a table with columns for longitude, latitude, and city name. Adding a third dimension, such as height, gives us four columns. In mathematics, dimensions can be numerous: imagine adding city population and area, resulting in six columns.

These dimensions can represent various characteristics of an entity, such as a city or a word. This multi-dimensional space allows us to find proximity, or similarity, between entities. For example, words may have multiple meanings or synonyms, and even typos can be accounted for, just like city proximity. A Vector DB can store these relationships and find the proximity of text chunks to user prompts, even accommodating typos.

Embedding models convert text into vectors, representing chunks numerically in the Vector DB. When an embedding model encodes text, it produces a “coordinate system” in multi-dimensional space that is unique to that embedding model. This encoding must be consistent: if you use OpenAI’s text-embedding-ada-002 model to encode and store data, you must use the same model to encode user prompts so that the search coordinates are compatible.

Imagine you encode something with one encryption algorithm; you must use the same algorithm to decode it. If you want to drive a Volvo, you cannot bring the keys from a Mercedes. Similarly, if you encoded data with an embedding model and stored it in a Vector DB, you must use the same embedding model to encode the user prompt to produce a compatible system of coordinates.

In practice, the number of dimensions can be vast, sometimes reaching hundreds or thousands. More dimensions often mean better quality, but they also require more disk space, computational power, and memory. Different Vector DBs support different numbers of dimensions and similarity-search algorithms. As of 2024, the top models on the embedding leaderboard produce up to 4,096 dimensions. Ensure your Vector DB can handle the dimensions produced by your chosen embedding model; a small sketch follows the lists below.

Examples of several embedding models and the dimensionality (coordinates) they produce. OpenAI:

  • text-embedding-ada-002 — 1,536
  • text-embedding-3-small — 1,536
  • text-embedding-3-large — 3,072

Vector DBs and the maximum dimensions they support:

  • Qdrant — no limit on the number of dimensions
  • Milvus — 32,000 dimensions maximum
  • Weaviate — 65,000 dimensions maximum
  • Neo4j — 4,096 dimensions maximum
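
As promised above, here is a small sketch of the dimensionality-matching point: the vector size configured in the Vector DB collection must equal the dimensionality your embedding model produces. It uses the Qdrant Python client; the collection name and URL are placeholders, and the exact client API may vary between versions.

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

EMBEDDING_DIM = 1536  # e.g., what text-embedding-3-small produces (see the list above)

client = QdrantClient(url="http://localhost:6333")  # placeholder local instance
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=EMBEDDING_DIM, distance=Distance.COSINE),
)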

Not long ago most LLMs had a very short context window, but recently this has improved significantly, jumping from 4,096 input tokens to an impressive 32k, 128k, or even more. Some might argue that a Vector DB might no longer be needed, though this is very debatable and controversial: injecting 32k tokens of text into an LLM might cost a lot, take too much time, and, most importantly, confuse the LLM with a vast amount of information. Models with a large context window also typically cost more per token. At least for now, LLMs tend to reply better with a shorter amount of information, and smaller chunks injected into the context simply compute faster while arguably producing better results.

Examples of VectorDB:

Read more about Vector DB here:

Similarity Search & Prompt Augmentation

Splitting the user prompt question into sub-questions might also give good results if it's too complex. Another fundamental challenge in Similarity Search is that user prompts often lack the precise wording or structure that aligns with the language of relevant documents in the Vector DB, leading to suboptimal search results. Query transformations aim to address this issue by modifying queries before the retrieval stage.

The Hypothetical Document Embeddings (HyDE) method improves retrieval by asking an LLM to generate a mock-up hypothetical response to the user prompt and then using the vectors of that hypothetical response to enhance search quality. Though the hypothetical response may contain inaccuracies, it captures relevant patterns of what the answer could look like, which helps find answer-to-answer embedding similarity. HyDE might not apply to every task, but it can be helpful in fields requiring precise information, such as medicine. It can also improve internal document searches, boosting productivity for tasks such as web search, QA, and fact verification.

The drawback of this method: if the discussed subject is entirely unfamiliar to the LLM, it could lead to generating incorrect information.
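
A rough HyDE sketch under those caveats: ask the LLM for a hypothetical answer, embed that answer with the same embedding model used for the corpus, and search with it. The model names are examples, and search_vector_db is a stub standing in for your own retrieval code.

from openai import OpenAI

client = OpenAI()

def search_vector_db(vector: list[float]) -> list[str]:
    # Stub: in a real app, run a similarity search in your Vector DB with this vector.
    return ["<chunk most similar to the hypothetical answer>"]

def hyde_retrieve(question: str) -> list[str]:
    hypothetical = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": f"Write a short, plausible answer to: {question}"}],
    ).choices[0].message.content

    vector = client.embeddings.create(
        model="text-embedding-3-small", input=hypothetical
    ).data[0].embedding

    return search_vector_db(vector)  # answer-to-answer similarity instead of question-to-answer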

Another spin on this is the MultiQueryRetriever technique, which, based on the user prompt, generates a few similar queries (say, five), fuses all the retrieved candidate chunks, then refines them with a Cross-Encoder model to remove chunks that are not relevant and re-sort the rest in order of relevancy. The drawback is a potential misinterpretation of the original intent.

Sub-Query splits the original user prompt into simpler pieces and typically involves at least a primitive planning mechanism that checks the answers to the sub-queries and composes them back into a single answer.

Google’s Step-Back Prompting technique paraphrases a question into a more generic form, creating a semantic-level abstraction for which the answer can be absolutely precise. When you are asking for specifics, first get a list of all the options: for example, converting “Which electric car brands in the US offer the highest percentage of tax credits relative to the car price?” into the step-back question “Which electric car brands in the US offer tax credits?”. Having the full list of cars offering tax credits, the LLM can then easily answer the original question precisely.

Tree of Thought RAG: Multi-Layered Reasoning

Tree of Thought in RAG

This approach involves generating multiple layers of information, reasoning, and chains of thought to obtain various answers, then re-evaluating and assessing their relevance to the user prompt. So why would you need this in your RAG? To improve the quality of the final result, make the LLM think better, combine results from multiple LLMs, and force it to reason better.

When building your RAG application pipeline and designing happy path scenarios, ensure your LLM is also equipped to handle error flows and edge cases. Inform the LLM that errors can occur, clearly describe how to identify these errors, and provide instructions on how to respond when they do.

Repo explaining the article above:

Memory: Short and Long-Term Considerations

Short-term memory refers to the immediate context in the conversation history, while long-term memory encompasses all previous interactions, used to personalize responses and to utilize external data sources as context. Autonomous Agents (also known as Agentic RAG) can decide whether they should use memory.

If you decide to give users of your application the convenience of conversation memory, you’ll start thinking about how to implement it and asking yourself how exactly this memory is going to work: does it remember the last 10 conversations, or a summary of the chat history, for short-term memory? A summary of the last few conversations can greatly help in some scenarios, but you can’t fit every conversation into a summary, so what about older messages? At this point, you’ll realize that you need to store your short-term memory somewhere. A Vector DB will probably be your best bet for both short-term and long-term chat-history memory.

Developing a robust strategy for how your AI application remembers history is crucial for improving user experience. Incorporating human confirmation for what should be remembered can significantly enhance results. Depending on your specific use case, you may also allow the agent to autonomously decide when to save information. By balancing automated memory management with human oversight, you can optimize the functionality and reliability of your AI application.

Long-term MEM: Structured Data

Beyond storing chat history in a vector database, you might want to use a relational database such as PostgreSQL, Oracle RDBMS, MySQL, or similar as a source of external knowledge for your structured data. Just as with vectors, to retrieve data from your relational database you’ll need to convert user prompts into SQL queries. This process, known as Text-to-SQL, is relatively straightforward for OpenAI GPT models, which can produce high-quality conversions.

There are models specifically trained on SQL datasets. These models can run locally on your servers and are supported by tools that help build pipelines to generate SQL queries and retrieve data from your database such as Wren AI. This approach typically provides arguably the best possible outcomes from your dataset.
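
A rough Text-to-SQL sketch: hand the LLM the schema, let it draft a query, then execute it. The schema, model, and database file are illustrative only, and a production setup (for example Wren AI, mentioned above) adds validation and guardrails around the generated SQL.

import sqlite3
from openai import OpenAI

client = OpenAI()
SCHEMA = "CREATE TABLE visits (patient TEXT, physician TEXT, visit_date TEXT, billing_amount REAL);"

def text_to_sql(question: str) -> str:
    prompt = (
        f"Given this SQLite schema:\n{SCHEMA}\n"
        f"Write a single SQL query that answers: {question}\n"
        "Return only the SQL, with no explanation."
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

db = sqlite3.connect("hospital.db")  # placeholder database containing the table above
sql = text_to_sql("How many times did Mica visit physician Andre?")
print(sql)
print(db.execute(sql).fetchall())  # in production, validate the generated SQL before executing it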

https://paperswithcode.com/task/text-to-sql
https://huggingface.co/datasets/b-mc2/sql-create-context
https://github.com/taoyds/test-suite-sql-eval
https://github.com/salesforce/WikiSQL
https://query.wikidata.org/

Long-term MEM: Knowledge Graphs and Semi- & Un-structured Data

One of the ways of facilitating retrieval of interconnected and relevant information with additional context for the user prompt is to utilize a Graph Database such as Neo4j, SurrealDB, FalkorDB, Nebula Graph, JanusGraph, CayleyGraph, or Dgraph which organizes data into networks that are naturally capable of traversing disparate pieces of information through their shared attributes and summarizing semantic concepts over large data.

According to Gartner, “Through 2025, at least 30% of GenAI projects will be abandoned after proof of concept (POC) due to poor data quality, inadequate risk controls, escalating costs or unclear business value.” Arun Chandrasekaran, Distinguished VP Analyst at Gartner

https://www.gartner.com/en/articles/highlights-from-gartner-data-analytics-summit-2024

KGs can help improve data quality, mitigate risks, and reduce costs.

Knowledge Graphs (KG), unlike a Vector DB, not only index chunks of information but also establish logical dependencies between entities within chunks of text. Unlike vector search, to query a graph database you have to understand your schema and your data better in order to get results. While KGs are often used for question answering, they can enhance the question-generation process as well.

Ontology

If you are familiar with schema terminology from relational DBs, Ontology is a similar term, but for free raw text and for a simpler data structure than SQL.

Just like an SQL schema, an ontology is in essence a form of classification or taxonomy, organizing and indexing knowledge into groups or types to make the information easier to find.

Consider this example: “Andre is driving to Chicago”. We split this into three pieces (a triplet); in this example:

1. “Andre” is a Subject (Node)
2. “is driving to” is a Relation (Edge)
3. “Chicago” is an Object (Node)

A node represents entities such as place, person, organization, etc. An edge represents a connection between these entities like events, dependencies, or owners.

Now we can practice creating categorizations for Subject, Relation, and Object to classify and organize our data for easier search. Consider another example: “Volvo V70 has a radio”. So now we may want to have:

  • Subject can have the label: Person or Machine
  • Relation of two categories: “ACTING” and “OWNS”
  • Object label: Equipment, City or Machine

This is our schema — our ontology. When you are building a Knowledge Base, you need to produce an ontology for a given text. You can build an ontology with entities (“Chicago”, “Andre”, “Volvo V70”) or concepts (“City”, “Person”, “Car”, “Radio”). Nodes are in round brackets, relations are in square brackets, and the arrow represents the direction of the relation. So, the data in the graph would look like:

(Person: Andre) -[ACTING: Drives]-> (Car: Volvo V70)
(Person: Andre) -[OWNS]-> (Car: Volvo V70)
(Person: Andre) -[OWNS]-> (Pet: A cat)
(Person: Andre) -[ACTING: Surgery]-> (City: Cincinnati)
(Person: Andre) -[ACTING: Works]-> (City: Cincinnati)
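
As an illustration, triplets like the ones above could be loaded into Neo4j with the official Python driver roughly like this; the URI, credentials, and the simplified relationship names are placeholders.

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))  # placeholders

triplets = [
    ("Person", "Andre", "OWNS", "Car", "Volvo V70"),
    ("Person", "Andre", "WORKS_IN", "City", "Cincinnati"),
]

with driver.session() as session:
    for s_label, s_name, rel, o_label, o_name in triplets:
        # MERGE creates the nodes and the relationship only if they don't exist yet.
        session.run(
            f"MERGE (s:{s_label} {{name: $s}}) "
            f"MERGE (o:{o_label} {{name: $o}}) "
            f"MERGE (s)-[:{rel}]->(o)",
            s=s_name, o=o_name,
        )
driver.close()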

https://protege.stanford.edu
https://github.com/topics/attribute-extraction

Choosing Between Existing Ontologies and Customization

Utilizing existing maintained ontologies presents several advantages, including community support, updates, and commitment to best practices. However, it may not always be suitable.

Semantic Graph Technology (RDF, OWL, SPARQL, SHACL, RDF*):

If you have world-knowledge structured data, open-domain question-answering (ODQA) tasks can leverage existing ontologies such as:
DbPedia.org
Schema.org
Wiki.goodrelations-vocabulary.org/Cookbook/Schema.org
Productontology.org
foaf-project
lov.linkeddata

Established ontologies are ideal for common data types like events, organizations, websites, Wikipedia, and people, ensuring interoperability with other platforms. Examples include schema.org for web markup, DBpedia.org for Wikipedia, and FOAF for social network data.
Building a custom ontology is necessary when existing ontologies impose constraints and lack domain-specific concepts and relationships, when you require precise modeling of unique relationships and full control over the evolution and governance of the data, or in research and experimental contexts where new concepts or relationships are being explored. Examples include proprietary or specialized data for which existing ontologies are not suitable.
A hybrid approach, combining established and custom ontologies, may be useful. By extending established ontologies and incorporating domain-specific concepts, you can strike a balance between standardization and customization, ensuring optimal data representation. Examples include data with some ground truth covered by the established ontology and a custom-built part for your commercial data.

Text-to-Graph: Automated knowledge graph extraction

LLMs trained on publicly available established ontologies such as FOAF, schema.org, and DBpedia.org can extract good-quality Knowledge Graphs about world knowledge without involving humans, turning text into data. For example, the Babelscape/rebel-large model is trained for Relation Extraction (RE) between entities. The OpenAI gpt-3.5-turbo model is pre-trained on the established public ontologies and, for example via InstaGraph, demonstrates good results; it can also successfully create fine-tuning examples given a custom ontology in RDF TTL format, and after fine-tuning on those examples it is capable of good-quality Text-to-Graph extraction for that custom ontology.
This might also be a place for a tiny improvement, and we want to accumulate as many tiny improvements as possible as they all add up into a better product, so consider at least some human touch to such automatically produced ontologies.
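
A rough prompt-based Text-to-Graph sketch in the spirit of the above: ask a general LLM to emit triplets constrained to your ontology. The prompt format and model are illustrative; dedicated models such as Babelscape/rebel-large or tools like InstaGraph handle this more robustly.

from openai import OpenAI

client = OpenAI()

def extract_triplets(text: str) -> str:
    prompt = (
        "Extract (subject, relation, object) triplets from the text below. "
        "Use only the labels Person, Car, City and the relations OWNS, ACTING.\n"
        "Return one triplet per line as: (Label: name) -[RELATION]-> (Label: name)\n\n"
        f"Text: {text}"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(extract_triplets("Andre from Cincinnati drives his Volvo V70 to Chicago every day."))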

https://docs.llamaindex.ai/examples/index_structs/knowledge_graph/KnowledgeGraphDemo/

Incomplete Graph

In reality, most of the time you’ll end up with a knowledge graph whose ontology turns out to cover your data incompletely. For example: “Sherlock Holmes, while visiting Chicago, deduced that physician Andre from Cincinnati drives his Volvo back home for 2 hours every day.” If we didn’t produce an ontology that captured Andre’s driving home, and that the home was in Chicago, then our knowledge graph is incomplete. So, in reality, you will typically assume that your graph is incomplete, and other means might still be useful to accompany the knowledge graph; that’s where a backup plan, such as a Vector DB, is a good idea. Also, if you capture user feedback in the chatbot, you may be able to automatically or semi-automatically extract additional metadata and ontology from your dataset. Fortunately, graph databases, unlike SQL/relational databases, are “schema-free”, which is not the best name if you ask me; I would call it “schema-later”, as you still want your schema/ontology in place and in as good a shape as possible to extract more meaningful information from a large dataset. The good news is that schema-later allows the flexibility of adding to your ontology as you go, improving it and extracting more and more insights from your dataset over time, enriching your KG.

Workshop

Learn Cypher query

Vectors, Graphs and LLM

Graphs are useful when discrepancies arise between retrieved information and user intent in Vector Search. In scenarios where Query Augmentation methods like HyDE fail to deliver results, incorporating graphs becomes essential. Since Vector DB retrieval relies on similarity searches, which may yield inaccuracies, graphs provide a solution for precise matching requirements. When hybrid vector search and re-ranking processes prove ineffective or fine-tuning becomes expensive, the integration of graphs emerges as a viable addition. At first glance, you might think Vector DB and Graph DB are mutually exclusive or that you use one or the other. Interestingly, combining them can yield better results.

One scenario is populating and enriching your knowledge graph at the first RAG phase. Imagine you have a large body of text, like the complete works of Sir Arthur Conan Doyle. An LLM can analyze the text and produce an ontology or schema for the books. However, this process might result in synonymous nodes and relationships, as well as unrelated entities that appear similar.

To address this, you can embed all the produced node and relationship entities into a new vector space using the embedding model. This will position similar entities close to each other in the Vector DB. Then, you can identify close pairs and evaluate them with pairwise similarity evaluation, again utilizing an LLM, to determine whether they are indeed synonymous. By repeating this process, you can identify multiple pairs of synonymous entities and merge them, increasing the accuracy and relevance of the search results.

Conversely, you can extract information from the existing knowledge graph and use it as metadata to infuse chunks in the Vector DB. This metadata can help filter chunks relevant to the user’s prompt. For example, if the user prompt includes a year and a city, and your chunks are tagged with metadata such as “year=2024, city=Chicago”, you can filter to only those chunks with the specified year and city and then search for similarities only among them, reducing noise; a sketch follows below.
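
Here is the metadata pre-filtering sketch referenced above, using the Qdrant Python client: only chunks tagged with the requested year and city are considered for similarity search. The collection name, payload keys, and the placeholder query vector are illustrative, and the client API may vary between versions.

from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

client = QdrantClient(url="http://localhost:6333")

# Placeholder: in a real app this comes from the same embedding model used for the chunks.
user_prompt_vector = [0.0] * 1536

hits = client.search(
    collection_name="documents",
    query_vector=user_prompt_vector,
    query_filter=Filter(must=[
        FieldCondition(key="year", match=MatchValue(value=2024)),
        FieldCondition(key="city", match=MatchValue(value="Chicago")),
    ]),
    limit=5,
)
print([hit.payload for hit in hits])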

Modular RAG: v2.5

Types of RAG: Naïve, Advanced & Modular

If LLM is the first step and RAG is the second, then Advanced and Modular RAG represent significant advancements beyond these initial steps. Historically, after the introduction of Advanced RAG and Agents, the concept of Modular RAG began to take shape, bridging the gap between the two. Unlike Naïve and Advanced RAGs, which follow a more rigid, sequential, and linear workflow, Modular RAG is more flexible and adaptive. It extends beyond chain-style algorithms into a multi-tiered, modular format.

Modular RAG can utilize multiple tools, deciding dynamically whether to use or skip them, iterating on a single tool as needed, and creating branched flows running in parallel and fusing later. This can involve summarization or merging different information streams, creating a more dynamic, adaptive, and versatile workflow. Unlike Agents, Modular RAG is not meant to perform tasks in the real world, and on an abstraction level, if Agents represent a worker person or even a team of coworkers, then RAGs still operate with modules and tools. More about Agents in the next chapter.

The Framework of Modular RAG. Source

Remember all those stages from the Advanced RAG: Indexing, Pre-Retrieval, Retrieval, and Post-Retrieval? Now we can pack those into modules and operate different tools dynamically within and across them. The main modules are: Indexing with Chunk optimization and Structural organization; Pre-Retrieval with Query Routing, Expansion Transformation, and Construction; Retrieval with Selection and Tuning; Generation using On-prem, Cloud inference, and Model Fine-tuning; Post-Retrieval with Re-rank, Compression, and Selection; Orchestration with Planning, Scheduling, and Fusion.

Popular RAG flow designs. Source

Agents: The Multi-Tool Ensemble v3.0

The underlying models act as the brain of your agent, driving its ability to process information and make decisions. Agents can interact with the world in various ways, while RAG primarily focuses on answering questions in a chat format. Agents are especially useful for tasks within the digital realm of the real world. A simple example: as you remember, models have amnesia, so they do not know what day it is today; you might want to give your agent a tool that can fetch that information for it. Focus your agent on its ability to “think”.

From a practitioner’s perspective, it’s important to know that agents rely on “function calling”, and your inference serving engine must be able to process such API calls. Examples of inference engines that support function calling include OpenAI, Ollama, LocalAI, and KServe. The LiteLLM library allows you to check whether your inference engine supports function calling. Additionally, the model itself must handle function-call requests. Examples of inference engines and models that support function calls are OpenAI/gpt-3.5-turbo, Mistral, LocalAI/llama3-8b-function-call-v0.2, gorilla-llm/gorilla-openfunctions, and Ollama/mistral:v0.3. A minimal sketch follows below.
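
Here is the minimal function-calling sketch referenced above, in the OpenAI-compatible tools format, giving the model a tool for “what day is it today” as in the example at the start of this chapter; the tool name and model are illustrative.

from datetime import date
from openai import OpenAI

client = OpenAI()
tools = [{
    "type": "function",
    "function": {
        "name": "get_current_date",
        "description": "Return today's date in ISO format",
        "parameters": {"type": "object", "properties": {}},
    },
}]

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "What day is it today?"}],
    tools=tools,
)

# The model does not run the tool itself; it only asks for it. The app executes it.
calls = resp.choices[0].message.tool_calls or []
for call in calls:
    if call.function.name == "get_current_date":
        print("Tool result to send back to the model:", date.today().isoformat())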

Think of an Agent as a worker that can perform one or a few tasks. What if you need more than one worker? You can rely on multi-agent orchestration frameworks such as CrewAI (built with LangChain), LangGraph, and HuggingFace Transformers Agents to distribute tasks across a team of workers collaborating on your task. In such a team, multiple agents act as your employees: they can communicate directly with each other, report to a manager agent, redo their work, and create and track a plan for the user’s task, further expanding the ideas of Advanced and Modular RAGs. Since all of these ideas were created by different people at different times, or sometimes in parallel, we often have multiple names for the same thing: Modular RAG focuses on the RAG pipeline, while pipelines with agentic capabilities are called not Modular Agents but Compound AI Systems.
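To make the idea concrete, here is a minimal two-agent sketch with CrewAI; the roles, goals, and task descriptions are illustrative assumptions, and the crew uses whatever default LLM your environment is configured for:

```python
# A hedged sketch of a two-agent crew with CrewAI.
from crewai import Agent, Task, Crew

researcher = Agent(
    role="Researcher",
    goal="Collect facts relevant to the user's question",
    backstory="A methodical analyst who cites sources.",
)
writer = Agent(
    role="Writer",
    goal="Turn the research notes into a short, clear answer",
    backstory="A technical writer focused on brevity.",
)

research = Task(
    description="Gather key facts about vector databases.",
    expected_output="A bullet list of facts with sources.",
    agent=researcher,
)
summary = Task(
    description="Write a three-sentence summary from the research notes.",
    expected_output="A three-sentence summary.",
    agent=writer,
)

crew = Crew(agents=[researcher, writer], tasks=[research, summary])
print(crew.kickoff())  # tasks run in order; the writer builds on the researcher's output
```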

Elevating AI Agents to Production-Ready Solutions

Deploying AI apps into production requires more than just robust LLM capabilities; it demands a substantial investment in essential non-AI components to ensure seamless operation and user trust.

AI agents often require real-time data from various external knowledge systems. This involves integrating with APIs and connection protocols for both internal and third-party systems, and these integrations need continuous maintenance and updates to keep them reliable at production quality.

For users to trust an AI agent, they must be able to follow and audit its actions, for example through links to the original documents that informed the answer, which they can open and verify. Allowing users to track and interact with the agent’s workflow and see each tool call made by the app via an interactive interface provides transparency and builds user confidence in the app’s reasoning and decisions.

Semantic Layers: Enhancing Text-to-Thing Conversions

You may have noticed throughout this article the concept of converting text into various forms, such as Text-to-Text, Text-to-NER (Named Entity Recognition), Text-to-Topic, Text-to-Tag, and Text-to-Sentiment. LLMs can also generate queries such as SQL and Cypher from user prompts, facilitated by libraries like LangChain with classes such as GraphCypherQAChain and create_sql_query_chain, or produce programming code. You can even find dedicated models trained to produce SQL queries or generate code, such as Python, directly from user requests. However, the quality of these outputs heavily relies on the instructions and examples provided in the prompt template. Often, the generated queries are “good enough” but may contain invalid syntax or hallucinations. On the bigcode-models-leaderboard you can see that models are very good but not 100% accurate.
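For example, a minimal Text-to-SQL sketch with LangChain’s create_sql_query_chain might look like the following; the SQLite file, model choice, and question are assumptions for illustration, and the generated SQL should still be validated before execution:

```python
# A hedged sketch: the LLM drafts a SQL query from the user prompt plus the DB schema.
from langchain_community.utilities import SQLDatabase
from langchain_openai import ChatOpenAI
from langchain.chains import create_sql_query_chain

db = SQLDatabase.from_uri("sqlite:///orders.db")          # assumed local database
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

chain = create_sql_query_chain(llm, db)
sql = chain.invoke({"question": "How many orders were placed in 2024?"})
print(sql)  # review/validate before executing against the database
```

The point of the sketch is only that the prompt-to-query step is a discrete call you can wrap with guardrails, which is exactly where the next idea picks up.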

Instead of solely relying on hardcoded prompt engineering to produce Text-to-Something outputs, we can enhance outcomes by shifting our focus to code engineering that provides safety guardrails for LLMs. By developing functions that handle invalid syntax or incomplete user information, we can ensure more accurate query generation and retrieval. Semantic Layers are typically developed for domain-specific tasks, such as coding or data retrieval from databases, breaking the silos between knowledge and data, and providing guardrails to guide LLMs toward producing better results. These layers often store prompt examples, help descriptions, and metadata in memory, such as a Vector DB.

For example, Wren AI has built its platform with a Semantic Layer specifically designed for database interactions. Similarly, the Semantic Kernel SDK is dedicated to integrating functions in your applications.

An interesting example is Semantic Router, which makes fast routing and tool-use decisions based on embeddings instead of slower LLM-generated decisions.

Semantic Kernel

Semantic Kernel is an SDK by Microsoft that interacts with functions in your own app or plugins, providing guardrails for better-quality results instead of directly asking an LLM for function calling. One of the most important parts of Semantic Kernel is the metadata that describes your existing code (functions or plugins) to the LLM as context. The model can then request the appropriate function to be called as needed to perform specific business functions, and Semantic Kernel translates the model’s response into a call to your code.

Semantic Kernel combines natural language semantic functions, traditional code native functions, and embeddings-based memory, allowing you to build applications with AI capabilities. You can enhance your applications with various advanced techniques, including prompt engineering, prompt chaining, retrieval-augmented generation, contextual and long-term vectorized memory, embeddings, summarization, zero or few-shot learning, semantic indexing, recursive reasoning, intelligent planning, and access to external knowledge stores and proprietary data.

Semantic Kernel helps developers define plugins that can be orchestrated by an LLM planner. Plugins, the building blocks of the Semantic Kernel SDK, are pieces of code that can be written in Python, C#, or Java and can perform various tasks such as calling external services, manipulating data, and generating content. Plugins should be annotated with attributes that describe their functionality and parameters to the LLM; these annotations can live in-code (fine for a PoC), but for production environments they are typically stored in a Vector DB. Functions can also be chained and executed together to create complex workflows.
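A minimal sketch of a native plugin in Python, assuming a recent version of the Semantic Kernel SDK; the decorator and kernel methods shown here have changed between releases, so treat the exact API as an assumption:

```python
# A hedged sketch of a Semantic Kernel native plugin. The decorator metadata
# (name/description) is what the planner/LLM sees when deciding what to call.
from datetime import date
from semantic_kernel import Kernel
from semantic_kernel.functions import kernel_function

class TimePlugin:
    @kernel_function(name="today", description="Return today's date in ISO format")
    def today(self) -> str:
        return date.today().isoformat()

kernel = Kernel()
kernel.add_plugin(TimePlugin(), plugin_name="time")
# The kernel can now surface time.today to the model, translate the model's
# function-call request into a call to this code, and return the result.
```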

Enhancing Text-to-Cypher

Interacting with a knowledge graph directly, just by providing its schema to an LLM, doesn’t always yield the desired results because the LLM needs sufficient context and semantics about the data. It must understand the metrics (such as Total Sales and Average Order Value), dimensions (such as Customers with Orders), entities (such as Customers and Products), and relational aspects of the data. Introducing a semantic layer as an intermediary between the LLM and the knowledge graph addresses this issue. The semantic layer organizes data into meaningful business definitions, allowing AI agents to generate accurate queries by providing the necessary context.

The querying application is as important as the definitions themselves. By forcing the LLM to query data through the semantic layer, it ensures the correctness of queries and returned data. The semantic layer abstracts complex joins and metric calculations, providing a simple interface that operates on business-level terminology rather than Cypher node and relationship names.

This simplification protects the LLM from hallucinations and makes the system more error-resistant. For instance, the semantic layer can expose multiple functions that act as tools. An AI-based application can also read the semantic layer, download all its definitions, and store them as embeddings in a vector database. These embeddings are retrieved at runtime alongside the user prompt as dynamic context for the LLM, which then sends the generated query to the semantic layer that the application executes. This process can be repeated multiple times to answer complex questions or create summary reports, enhancing accuracy.
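As a rough illustration of such a function-as-tool, the sketch below hardcodes the Cypher and only lets the model fill in business-level parameters; the metric, graph schema, and Neo4j connection details are assumptions:

```python
# A hedged sketch of a semantic-layer tool: the LLM picks a business-level
# function and supplies parameters, while the Cypher stays fixed and safe.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

TOTAL_SALES_CYPHER = """
MATCH (c:Customer {city: $city})-[:PLACED]->(o:Order)
WHERE o.year = $year
RETURN sum(o.amount) AS total_sales
"""

def total_sales(city: str, year: int) -> float:
    """Business-level metric exposed to the agent instead of raw Cypher."""
    with driver.session() as session:
        record = session.run(TOTAL_SALES_CYPHER, city=city, year=year).single()
        return record["total_sales"] or 0.0

# The agent calls total_sales(city="Chicago", year=2024); it never writes Cypher itself.
```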

Agent-Computer Interface (ACI) Layer

The scope of ACI is broader compared to the Semantic Layer discussed earlier. While the Semantic Layer is responsible for understanding and interpreting data meaning and context, the ACI covers all aspects of the interaction between AI applications and external and internal computer systems.

The quality of the generated reply when using function calling or strict syntax with LLMs heavily relies on the instructions and examples provided in the prompt template. Relying solely on LLMs to interact directly with external knowledge or data sources by providing schemas or function variables may not always yield the desired results, as LLMs require sufficient context and semantics about the tools, functions, data, and syntax. Results may contain invalid syntax, emphasizing the importance of a protection layer in defining the exact syntax and structure of the agent’s tool calls and data.

Acting as an intermediary between the LLM and external data sources, the ACI translates tasks into executable commands, and defines the methods for correct API calls. By abstracting syntax, data relations, and communication protocols, the ACI safeguards the LLM from hallucinations and enhances error resistance. Transitioning from hardcoded prompt engineering to code engineering can yield improved outcomes. Developing the ACI to handle invalid syntax or incomplete user information ensures more accurate query generation and retrieval.

The ACI layer can incorporate multiple functions acting as tools, enhancing the accuracy of complex queries through iterative processes. Enabling the AI agent to execute tasks such as running applications, accessing files, and communicating via the network, the ACI should also include feedback mechanisms to provide the agent with information about the success or failure of its actions.

Monitoring and Adjusting ACI

Small adjustments in the ACI can have significant impacts, similar to how a minor traffic incident can lead to a major pile-up. Changes in names, quantity, level of abstraction, input formats, and output responses of tools can cause considerable fluctuations in the app’s performance. For instance, some models might prefer working with JSON, while others might work better with Markdown.

It is essential to monitor how your agent handles instructions and be vigilant about potential hallucinations or errors.

ACI Optimization

Misunderstandings of argument instructions can lead agents to take shortcuts or omit required parameters in tool calls.

  1. Ensure the ACI inputs and outputs are clearly defined and correctly formatted, and watch for hallucinations and failures (a validation sketch follows this list).
  2. Evaluate the agent’s performance, especially after making any tweaks. Look for significant changes in behavior that may indicate issues.
  3. Choose a data format each model understands better.
  4. Study how the underlying models process information and instructions. Pay close attention when models misinterpret commands. If agents don’t understand the argument instructions well, they often take shortcuts or completely ignore the required parameters or syntax.
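A small sketch of the validation guardrail mentioned in point 1, using Pydantic to check the agent’s tool-call arguments before anything is executed; the tool name and fields are illustrative assumptions:

```python
# A hedged sketch of an ACI guardrail: validate tool-call arguments against a
# schema and feed errors back to the agent instead of executing a bad call.
import json
from pydantic import BaseModel, ValidationError, field_validator

class RunSqlArgs(BaseModel):
    query: str
    timeout_seconds: int = 30

    @field_validator("query")
    @classmethod
    def must_be_select(cls, value: str) -> str:
        if not value.strip().lower().startswith("select"):
            raise ValueError("only read-only SELECT statements are allowed")
        return value

def handle_tool_call(raw_arguments: str) -> RunSqlArgs | None:
    try:
        return RunSqlArgs.model_validate(json.loads(raw_arguments))
    except (ValidationError, json.JSONDecodeError) as err:
        # Return the error to the agent so it can correct its call on the next turn.
        print(f"Tool call rejected: {err}")
        return None

print(handle_tool_call('{"query": "SELECT count(*) FROM orders"}'))
```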

Fine-Tuning for AI Agents & RAG

External Knowledge & Model Adaptation Requirements. Inspired by: https://arxiv.org/html/2312.10997v5

Fine-tuning models to improve agent performance can be counterproductive: agents may “shortcut” their directions or assume that the examples they were fine-tuned on always represent the correct approach and sequence of tool calls, instead of solving the problem independently with normal reasoning. Therefore fine-tuning should be executed carefully and might not always be required. It can still be valuable for specific tool calls, for instance when the first step runs on a reasoning model without fine-tuning that decides a SQL query is needed, and the second step hands that request to a fine-tuned model specialized in SQL queries for your data. Similarly, you may rely on Domain Knowledge (DK) models. This ensures accurate tool execution while maintaining neutral reasoning.

ReAct Agents: Simplifying Complex Queries v3.5

Reason & Action (ReAct) agents enhance the capabilities of AI by combining reasoning and action in a systematic cycle: Thought ⇒ Action ⇒ Observation. This approach deconstructs complex queries into manageable sub-questions, prompting LLMs to generate verbal reasoning traces and a chain of thoughts and actions for each task. This dynamic reasoning enables creating, staying on course with, and adjusting action plans while allowing interaction with external environments to incorporate additional information. By addressing each sub-question with specialized tools, ReAct agents provide comprehensive and accurate answers, making them an effective solution for handling intricate queries.

The “magic sauce” in the ReAct framework is the prompt that encourages the LLM to follow the thought process below: Question, Thought, Action, Action Input, and Observation.

```
Answer the following questions as best you can. You have access
to the following tools:
{tools}
Use the following format:
Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
… (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question
Begin!
Question: {input}
Thought:{agent_scratchpad}
```

There are four placeholders in this prompt: {tools}, {tool_names}, {input}, and {agent_scratchpad}. These will be replaced with the appropriate text before the prompt is sent to the LLM.
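A hedged sketch of wiring a ReAct prompt into a LangChain agent: hub.pull fetches a published version of the same prompt shown above, and the date tool and model choice are illustrative assumptions.

```python
# A hedged sketch of a ReAct agent in LangChain using the standard ReAct prompt.
from datetime import date
from langchain import hub
from langchain.agents import AgentExecutor, create_react_agent
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI

@tool
def current_date(query: str) -> str:
    """Return today's date in ISO format."""
    return date.today().isoformat()

prompt = hub.pull("hwchase17/react")  # the Thought/Action/Observation template
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

agent = create_react_agent(llm, [current_date], prompt)
executor = AgentExecutor(agent=agent, tools=[current_date], verbose=True)

print(executor.invoke({"input": "What day is it today?"}))
```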

Text-to-Code Generator and Executor: Autonomous Agents

There are dedicated LLM models specifically trained to produce code. Some multi-agent systems, like OpenDevin, Devika, and Microsoft’s AutoGen, have gone a step further by executing the generated code, which on the one hand is a potential security concern but on the other can greatly improve answer quality.

LLMs might hallucinate when answering questions about news or mathematics, and when dealing with domain-specific data analytics tasks with rich data structures; they can also lack the flexibility to satisfy diverse user demands. While RAG can address the news issue, mathematics remains a challenge. For instance, if asked, “What is 1/3?”, an LLM might not always provide the correct answer unless specifically trained for such queries. Even advanced LLMs still face issues with precise numerical calculations and may approximate answers unless trained on data points similar to the user query.

Text-to-Code Generator and Executor

When dealing with tasks requiring exact calculations, such as converting Fahrenheit to Celsius, LLMs might provide approximations based on their training data. For precise computations, a better approach is to have the LLM generate a code snippet, for example in Python, to perform the calculation. The results of the generated code can then be tested against known correct values to ensure accuracy, after which your Agent executes it. This combination of LLM and generated code enables precise and reliable numerical results and opens up a new chapter for improved AI applications.
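A minimal sketch of that generate-test-execute loop; generate_code stands in for a call to your code-generation model, and in a real system the exec step must be sandboxed (container, restricted runtime) rather than run in-process like this:

```python
# A hedged sketch of generate -> test -> execute for exact calculations.
def generate_code(task: str) -> str:
    # In practice, ask a code-specialised LLM for a small, pure function.
    return "def f_to_c(f):\n    return (f - 32) * 5.0 / 9.0"

code = generate_code("Convert Fahrenheit to Celsius")
namespace: dict = {}
exec(code, namespace)          # executes model-produced code: sandbox this in production
f_to_c = namespace["f_to_c"]

# Test against known correct values before trusting the function.
assert abs(f_to_c(32) - 0.0) < 1e-9
assert abs(f_to_c(212) - 100.0) < 1e-9

print(f_to_c(98.6))            # a precise result instead of an LLM approximation
```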

Domain Knowledge Specific LLMs

In contrast to general-purpose GenAI models and services such as Mistral-7B, Mixtral-8x22B, OpenAI GPT, Google Gemini, Anthropic Claude, Llama, Groq, and others, domain- and task-specific models are trained for a narrower purpose.

Examples of domain-specific LLMs:

Examples of Task-Specific models:

Other notable open-source Models:

Why Augment Inference APIs

Whether you run your models locally using tools like Ollama, LocalAI, LM Studio, Jan, Llamafile, GPT4all, or through paid services like OpenAI, Azure OpenAI Service, Groq, Anthropic, MistralAI, Cohere, Google Vertex-AI, Replicate.com, or HuggingFace Text-Generation-Inference, you’ll soon realize that these APIs differ. These differences can range from small nuances to large discrepancies, even among APIs that claim compatibility. Even if APIs are compatible, they are rarely 100% so.

There are other aspects of compatibility, for example, you may have access to OpenAI’s text-embedding-ada-002 model and LocalAI’s text-embedding-ada-002 model, but they are entirely different and incompatible in terms of dimensionality and produced vector spaces.

Using multiple models from different services can be cost-effective. You might run a small model locally for tasks like generating Text-to-SQL queries, making it inexpensive, while opting for Groq’s LLM Chat model for its speed and HuggingFace’s for the embeddings model to achieve better performance and cost efficiency.

To manage these variations and facilitate future changes, consider implementing an abstraction layer between your AI agent app and the inference APIs, such as LiteLLM. This approach allows flexibility and scalability as your needs evolve.
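For instance, with LiteLLM the same call shape can target a local Ollama model or a hosted one, so swapping providers stays a one-line change; the model names and local endpoint below are common defaults and may differ in your setup:

```python
# A hedged sketch of LiteLLM as the abstraction layer between your app and inference APIs.
from litellm import completion

messages = [{"role": "user", "content": "Write a one-line SQL hello-world."}]

# Local model served by Ollama (default port assumed).
local = completion(model="ollama/mistral", messages=messages,
                   api_base="http://localhost:11434")

# Hosted model: only the model string changes, not the call shape.
hosted = completion(model="gpt-3.5-turbo", messages=messages)

print(local.choices[0].message.content)
print(hosted.choices[0].message.content)
```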

Savings Strategy: Multiple Models in Your AI App

When building your AI app, you’ll quickly discover that LLMs can perform numerous specialized tasks such as context verification, reranking, generating Text-to-SQL, code, and embeddings. Instead of relying on a single, expensive model from one vendor, consider integrating smaller open-source models that are free for commercial use; properly applied, they can fully or partially replace more expensive models. These open-source models don’t reason as well as proprietary services yet, but they can excel at their respective tasks and significantly reduce costs. Models will continue to evolve and improve, open source included. While enormous and impressive State-of-the-Art (SotA) models from well-known paid services can be highly effective, they may be overkill for specific tasks. Always compare performance and cost, and adopt a strategy that incorporates multiple models from different services. Keep this saving strategy in the back of your head, but don’t optimize for cost too early.

Testing your AI app, benchmarks

Building your AI application is just the beginning. You’ll face numerous configuration variables: chunk size, tool choice for different data types, vector search algorithms, the number of top results (Top-K), the quality and reranking of results returned by your Vector DB, the choice of LLM chat models, embeddings models with varying dimensions, prompt engineering, tools for structured and semi-structured data extraction, and RAG workflow steps. These variables influence the quality, speed, and cost of your results. Therefore, thorough testing, evaluation, and benchmarking are essential.

Configure variables and experiment with different configurations: Chunk size, Vector search algorithms, LLM chat models, Embeddings models, and RAG workflows.

Key Aspects of Evaluation

  1. Overall Workflow Success Score: Assess how well the agent’s workflow achieves the desired outcomes.
  2. Performance tuning: identify areas for improvement and optimize model parameters. Score individual steps, tools, and data accuracy to verify the precision of each tool call, such as information retrieval or code execution.
  3. Model comparison: For understanding how different models perform under similar conditions.
  4. Insights and analysis: Provide a deeper understanding of model functionality, highlighting both successes and areas needing improvement.

Create and Evaluate Test Scenarios

Develop test cases to assess performance and identify areas for improvement.

  • Split your text into chunks and insert these pieces into your Vector DB. During runtime, your app searches for these chunks to integrate into the context of your Chat LLM model.
  • Reverse Testing: start with the chunks and have the LLM generate questions and answers based on each chunk; Ragas and LlamaIndex tools can help generate test data from your knowledge base. Store these Q&A pairs to create test cases, then feed only the questions to your app one by one and compare the responses with the pre-generated answers. This method helps identify discrepancies and areas for improvement (a sketch follows this list).
  • Set objective/completion pairs to quantify performance. The objective is the initial task directive, and the completion is the final tool call indicating task success. Capturing intermediate tool calls and the app’s thought process helps diagnose failures or changes in the tool call sequence.
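A rough sketch of the reverse-testing idea from the list above; the prompt wording and the LiteLLM-based helper are assumptions for illustration:

```python
# A hedged sketch of "reverse testing": for each chunk, ask an LLM to invent a
# question/answer pair, then later replay only the questions through your app
# and compare its responses with the stored answers.
import json
from litellm import completion

def ask_llm(prompt: str) -> str:
    resp = completion(model="gpt-3.5-turbo",
                      messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

def generate_test_cases(chunks: list[str]) -> list[dict]:
    cases = []
    for chunk in chunks:
        raw = ask_llm(
            "Based only on the text below, produce JSON with the keys "
            f'"question" and "answer".\n\n{chunk}'
        )
        cases.append(json.loads(raw))
    return cases

# Later: feed case["question"] to your app one by one and diff the app's
# response against case["answer"] to spot regressions.
```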

Automate Evaluations

Model-based evaluations supplemented with occasional human oversight yield better results, though it’s impossible to evaluate every step manually. Utilize tools, LLMs, and frameworks to automate evaluation and testing:

  • LangChain and LlamaIndex can help with basic evaluations.
  • TruLens-Eval library can generate evaluation metrics such as: Context Relevance; Context Groundedness to compare the response with the retrieved chunks; Answer Relevance to the final response.
  • Phoenix platform sets metrics for the quality of generated embeddings and the LLM’s responses.
  • The RAGas framework focuses on Question-and-Answer evals, providing a structured approach to checking for accurate, relevant, and contextually appropriate responses. Evaluating retrieval separately and in conjunction with generation is important, and RAGas provides detailed visibility into that by dividing the retrieved-context relevancy metric into context precision, context relevance, and context recall (see the sketch after this list).
  • BeyondLLM evaluation framework has benchmarks like Context Relevance, Answer Relevance, Groundedness, and Ground Truth, evaluating information retrieval, LLM response accuracy, and factual correctness.
  • AutoRAG tool evaluates multiple RAG module combinations (Such as Embedding model, Chat model, reranking, Top K, prompts) and finds an optimal pipeline configuration for your dataset, and supports evaluation metrics.
  • Other notable framework tools: DeepEval, the OpenCompass platform, and the LangSmith service to quantify answer quality and identify unanswered questions. And there are even more: the RAG Triad of metrics, ROUGE, ARES, BLEU.
  • OpenAI Evals: a framework that provides standardized tasks and performance metrics and lets you customize evaluations to specific needs.
  • Annotation tools such as JohnSnowLabs/Spark-NLP and Label-Studio can help label your inputs and outputs, letting you evaluate your data manually or in a hybrid fashion with the help of LLMs. One way this is useful in RAG is by applying labels to your chunks in the Vector DB, which can then be used as filters.
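A minimal RAGas sketch referenced in the list above; the sample row is fabricated for illustration, and metric names can differ slightly between RAGas versions:

```python
# A hedged sketch of a RAGas evaluation run over a tiny hand-made dataset.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, context_recall, faithfulness

data = Dataset.from_dict({
    "question": ["Which city hosts the 2024 conference?"],
    "answer": ["The 2024 conference is hosted in Chicago."],
    "contexts": [["Chicago venue list for the 2024 conference..."]],
    "ground_truth": ["Chicago"],
})

scores = evaluate(
    data,
    metrics=[context_precision, context_recall, faithfulness, answer_relevancy],
)
print(scores)  # per-metric scores for retrieval and generation quality
```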

Iterate and improve with LLMOps/GenOps

Use evaluation results to identify weaknesses in your application. Continuously refine and enhance your AI application based on the insights gained from testing and evaluation.

There are sets of automation practices such as MLOps, LLMOps, and GenOps providing guidelines and blueprints on how to automate and set up workflows for continuous improvement. Based on those practices you can find various LLMOps and MLOps tools and implementations. These practices focus not only on the quality of LLM training and fine-tuning but also cover RAG/Agents, and MLOps tools can generally adapt to GenOps practices. Notable tools: ZenML, allegroai/ClearML, Kubeflow, MLflow.

Speed

Beyond quality, assessing your app’s speed is crucial. Are some answers taking too long to process? It might be worth considering if you can slightly reduce the quality of an answer to achieve a 2x speed improvement. Or vice versa, if the speed is less relevant and your goal is to maximize detail and quality, potentially at a lower cost.

There are speed benchmark tools available to help you make these decisions, balancing speed, quality, and cost to optimize your app’s performance.

Tracing

When you develop a RAG/Agent application, it’s important to have some sort of tracing tool that shows the flow of data inside your application and explains the chain of thought it passes through. This helps with troubleshooting and gives you a better explanation of why you end up with one result and not another. For example, your Vector DB might not return information relevant to the request, which might be why your app is not answering questions correctly, so you’ll want a closer look at how your app behaves internally to improve results.

Examples of Tracing & Prompt Engineering platforms:

More resources to read:

User Feedback: Tailoring AI Interactions

The app can evaluate user feedback, such as 👍/👎 reactions and free-form comments on a response, to enhance future interactions and improve the user experience. RAG can help here by taking user feedback from the current conversation history into account and providing it as context.

Security

AI agents must operate under strict user control and access. Implement secure OAuth integrations with Single Sign-On providers and Role-Based Access Control (RBAC) to ensure users can access only the information they are entitled to. Effective security is not just a necessity but a competitive feature. To secure your AI apps, address these key risks:

  • Prevent Prompt Injection by validating and sanitizing inputs;
  • avoid Insecure Output Handling by validating LLM outputs;
  • secure and verify training data to prevent Training Data Poisoning;
  • monitor resource usage and implement rate-limiting to avoid Model Denial of Service;
  • assess and secure all third-party components and datasets to mitigate Supply Chain Vulnerabilities;
  • protect sensitive information in LLM outputs to prevent Sensitive Information Disclosure;
  • design plugins with strong access controls to avoid Insecure Plugin Design;
  • limit LLM autonomy and implement oversight to prevent Excessive Agency;
  • critically assess LLM outputs to avoid Over-reliance;
  • secure access to proprietary LLMs to prevent Model Theft.
OWASP Top 10 for LLMs

Product Development: Bridging into Production

LangChain, LlamaIndex, Haystack, and DSPy frameworks simplify development by enabling direct interaction with database files, eliminating the need for resource-intensive database instances during prototyping. However, for production, it’s crucial to use dedicated database applications that interact via API, ensuring scalability, reliability, and performance.

These frameworks accelerate development by removing complexity and facilitating faster results during prototyping. However, when building an AI-driven product, relying on abstractions may sometimes limit control over backend processes. In production environments, especially for tasks such as onboarding users, debugging issues, scaling to more users, logging agent activities, or understanding agent behaviors, it may be necessary to move beyond these abstractions.

Workshop

Local Testing with Docker Compose

Docker Compose enables testing of the AI application locally while configuring applications via environment variables. Many projects such as LocalAI and Ollama expose an API that is fully or partially compatible with OpenAI’s, allowing you to run models such as Mistral or Llama locally and easily swap in ChatGPT later.
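A small sketch of that API compatibility in practice: the standard OpenAI client pointed at a local Ollama endpoint during development, with only the base URL, key, and model name changing when you switch to the hosted service (the port and model name are common defaults, not guarantees):

```python
# A hedged sketch of the OpenAI-compatible pattern for local development.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-style endpoint
    api_key="not-needed-locally",          # local servers usually ignore the key
)

response = client.chat.completions.create(
    model="mistral",
    messages=[{"role": "user", "content": "Say hello in five words."}],
)
print(response.choices[0].message.content)
# Moving to OpenAI later: drop base_url, set a real api_key, change the model name.
```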

For beginners, Flowise is a simple and convenient product with a drag-and-drop UI for creating RAGs and Agents. It helps you consolidate and master the principles described in this article and visualize the application pipeline. I highly recommend starting with Flowise to learn about RAGs and Agents, as it supports both LangChain and Llamaindex frameworks.

However, Flowise is not very suitable for production due to the lack of features and fine-grained control available when coding your app, such as working with Knowledge Graph Databases and other advanced functionalities. The convenience and simplicity come at the cost of the extensive settings and features that are only available through programming.
github.com/FlowiseAI/Flowise

Deployment: Model Accessibility and Continuous Integration

Locally run AI models without a GPU or specialized hardware may respond with delays of a minute or more, which might still be acceptable for development. Apple Silicon (which uses the Metal GPU framework) or a PC with NVIDIA or AMD graphics can significantly speed up your model’s responses. Developing locally with Docker containers and deploying to a production environment such as Kubernetes later can also reduce the cost of your app; for that you will want CI/CD pipelines in place, and GitOps tools such as ArgoCD and FluxCD can simplify this process greatly.

Returning back to start

There is no end to perfection. If everything is done at this point, you might consider returning to the very beginning of this article: training a new LLM from scratch for your specific task or fine-tuning an existing model. And at that point, with your new model, you’ll probably understand much better how to improve your RAG/Agents app…
Yep, circle of life. Or a circle of AI.

Summary:

Decomposing a user prompt into smaller prompts and injecting additional context tends to yield better results with LLMs. LangChain and LlamaIndex frameworks aim to provide a platform for RAG (Retrieval-Augmented Generation) by enhancing prompts with decomposition and context. By understanding and effectively utilizing RAG pipelines and Agent patterns, developers can create responsive AI systems capable of meaningful real-world actions, thereby improving response quality in specific areas and reducing hallucinations. Understanding and managing your data is crucial, as is testing, evaluation, and benchmarking to measure and enhance your app. Augmenting APIs can prevent vendor lock-in and aid in local testing and development. Properly applied, small, specialized models can save money and do their job well. The synergy between the ReAct paradigm, embeddings, knowledge graphs, and LLMs facilitated by RAG and Agent concepts paves the way for the next generation of AI applications. Understanding your app and elaborate ways of juggling your data can significantly reduce costs.

Enjoyed This Story?

If you like this topic and you want to support me:

  1. Clap 👏 my article 10 times; that will help me out
  2. Follow me on Medium to get my latest articles 🫶
  3. Share this article on social media ➡️🌐
  4. Give me feedback in the comments 💬 below. It’ll help me understand whether this work was useful; even a simple “thanks” will do. Give me good, give me bad, whatever you think, as long as you tell me where to improve and how.

Disclaimer: This blog is not affiliated with, endorsed by, or sponsored in any way by any companies or any of their subsidiaries. Any references to products, services, logos, or trademarks are used solely to provide information and commentary and belong to respective owners. The views and opinions expressed in this blog post are the author’s own and do not necessarily reflect the views or opinions of corresponding companies.
