Optimizing RAGs: Overcoming Architecture Hurdles for Peak Performance — Part 1

Anurag Mishra
4 min read · Nov 7, 2023


The challenges in the different components of a RAG-based architecture

Introduction

With the proliferation of Large Language Models (LLMs), integrating enterprise or external data has become essential for generating informative and precise responses. This is where Retrieval-Augmented Generation (RAG) comes into play. RAG facilitates querying a data corpus, such as an enterprise dataset located behind a secure firewall, identifying relevant matches for a user’s query, and using these results to enrich the context supplied to the LLM.

Creating RAG prototypes may seem straightforward, but transitioning them to production comes with a myriad of obstacles. My first-hand experience implementing RAG-centered architectures has allowed me to record the challenges faced at various stages of the process. In this two-part blog series, I’ll delve into optimizing the RAG architecture, sharing the challenges I encountered and my efforts to resolve them.

First Component: Processing Large Documents

When building a prototype, it can be easier to process entire documents, and the recent release of models that accept up to 128k tokens in the prompt makes this even more tempting. In practice, however, splitting large documents into small, meaningful chunks tends to reduce hallucination and improve response quality. Common chunking strategies include:

  1. Fixed-size chunking: This is the easiest to implement and requires little computation, but it may not respect the context of the information. We fix the number of tokens per chunk and decide whether there should be overlap between adjacent chunks (a minimal sketch appears after this list).
  2. Chunking by sections/headings: This requires logic to identify sections or headings and extract the text between them. Since chunk sizes can vary widely, larger chunks within a single section may still need to be broken into smaller ones.
  3. Recursive chunking: Recursive chunking involves dividing the input text into smaller chunks in a hierarchical and iterative manner, using a set of separators. If the initial effort to break down the text doesn’t yield segments of the preferred size, the method calls itself recursively on the resulting chunks with different separators or criteria until the desired chunk size is achieved.
  4. Overlapping chunking: Overlapping refers to the practice of allowing adjacent chunks to share some amount of data. The “chunk overlap” is the number of characters that adjacent chunks have in common.
  5. Splitting by character: This is the simplest method. It splits on a chosen character sequence (by default “\n\n”) and measures chunk length by the number of characters.
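To make strategies 1 and 4 concrete, here is a minimal sketch in plain Python. Whitespace splitting stands in for a real tokenizer (in practice, you would count tokens with the target model’s tokenizer, e.g. tiktoken for OpenAI models), and the default sizes are illustrative, not recommendations:

```python
def chunk_fixed_size(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split `text` into chunks of roughly `chunk_size` tokens,
    with `overlap` tokens shared between adjacent chunks."""
    assert 0 <= overlap < chunk_size
    # Whitespace splitting is a stand-in for a real tokenizer.
    tokens = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the final chunk already reaches the end of the text
    return chunks
```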

There are many more chunking strategies, and the right choice depends on the domain and use case; the ones above are simply the most popular and common. Beyond chunking plain text, handling tables and flowcharts within documents poses its own challenges and requires dedicated pre-processing.
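For recursive chunking (strategy 3), a library can do the heavy lifting. Here is a sketch using LangChain’s `RecursiveCharacterTextSplitter`, assuming `langchain` is installed and `document_text` holds the raw document:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,      # target chunk length, measured in characters by default
    chunk_overlap=50,    # characters shared between adjacent chunks
    separators=["\n\n", "\n", ". ", " ", ""],  # tried in order, coarsest first
)
chunks = splitter.split_text(document_text)
```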

Second Component: Embedding of Documents

After segmenting the documents into smaller sections, it becomes essential to create embeddings for these segments. This enables us to run similarity searches over the documents and retrieve the pertinent segments, as we will explore further in the upcoming section. This step is pivotal and opens the door to various optimization techniques.
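As a baseline, here is a minimal sketch of the embedding step itself, assuming the `sentence-transformers` package is installed and `chunks` is the list of segments produced earlier (the model name is just a common general-purpose default, not a recommendation):

```python
from sentence_transformers import SentenceTransformer

# A general-purpose embedding model; swap in a domain-specific one where available.
model = SentenceTransformer("all-MiniLM-L6-v2")

# encode() returns one dense vector per chunk, shape (num_chunks, embedding_dim).
embeddings = model.encode(chunks, normalize_embeddings=True)
```

When generating embeddings for these segments, it is important to consider the following aspects.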

  1. Domain-specific embedding model: The embedding model should be pertinent to the domain of the documents; this keeps the embeddings rich with information and prevents them from losing context. That said, OpenAI’s embedding models are trained on large datasets across domains, so they also generate good general-purpose embedding vectors.
  2. Decoupling embedding representations from raw text chunks: In simpler terms, when we break documents into smaller pieces, we should create a brief summary that connects to other related documents. This helps us find important documents more easily without having to search through all the chunks. Additionally, when we break a sentence into smaller parts, we should also link it to the surrounding context of that sentence. This way, we can retrieve the right information at a finer granularity, avoiding the issue of losing context in the middle of large text while still having enough information for language-model synthesis (see the sketch after this list).
  3. Adding metadata: The inclusion of metadata surfaces topic keywords within the chunk, thus enhancing search. This is of paramount importance since the chunking approach might yield segments containing factual content but lacking the necessary context. The significance of metadata in the chunk design process lies in its capacity to offer supplementary details, which, in turn, aids in the precise and efficient identification and extraction of meaningful segments.
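Here is a hedged sketch combining points 2 and 3, reusing `model` and `chunks` from the snippet above: each entry is indexed by the embedding of a short summary, while the full chunk and its metadata are kept alongside for synthesis. `summarize()` is a hypothetical helper (in practice, an LLM call or heading extractor), and the brute-force cosine search stands in for a real vector store:

```python
import numpy as np

def summarize(chunk: str) -> str:
    # Hypothetical helper: in practice, an LLM call or a heading/keyword extractor.
    return chunk[:200]

# Index the summary embedding, but keep the full chunk and metadata alongside.
index = []
for i, chunk in enumerate(chunks):
    index.append({
        "embedding": model.encode(summarize(chunk), normalize_embeddings=True),
        "text": chunk,  # the full chunk is what gets passed to the LLM
        "metadata": {"source": "example.txt", "chunk_id": i},  # illustrative fields
    })

def retrieve(query: str, top_k: int = 3) -> list[dict]:
    q = model.encode(query, normalize_embeddings=True)
    # With normalized vectors, cosine similarity reduces to a dot product.
    scored = sorted(index, key=lambda e: float(np.dot(e["embedding"], q)), reverse=True)
    return scored[:top_k]
```

At query time, `retrieve()` matches against the summaries but hands the LLM the full chunks, which is exactly the decoupling that point 2 describes.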

Conclusion

In this article, we explored techniques for improving the quality of data chunks (Challenge 1) and the richness of embeddings (Challenge 2), along with strategies to alleviate these issues to a certain degree. In the next part, we delve deeper into the challenges associated with the remaining components, such as storing embeddings in a database, improving search results, and evaluating LLM responses.

Kindly visit the second part of this series here

There are many other possibilities that can be tried, and I will keep this post updated as I come across new methodologies. I frequently write about developments in Generative AI and Machine Learning, so feel free to follow me on LinkedIn (https://www.linkedin.com/in/anurag-mishra-660961b7/).

References:
- https://www.pinecone.io/learn/chunking-strategies/
- https://twitter.com/jerryjliu0
- https://towardsdatascience.com/advanced-rag-01-small-to-big-retrieval-172181b396d4
- https://www.anyscale.com/blog/a-comprehensive-guide-for-building-rag-based-llm-applications-part-1
- https://medium.com/madhukarkumar/secrets-to-optimizing-rag-llm-apps-for-better-accuracy-performance-and-lower-cost-da1014127c0a
- https://towardsdatascience.com/10-ways-to-improve-the-performance-of-retrieval-augmented-generation-systems-5fa2cee7cd5c
