Day 2: Understanding core components of RAG pipeline

Himanshu Singh
7 min read · Jan 31, 2024

--

This is part of the series — 10 days of Retrieval Augmented Generation

Before we start our second day, let's take a look at what we have discussed so far and what lies ahead in this 10-day series:

  1. Day 1: Introduction to Retrieval Augmented Generation
  2. Day 2: Understanding core components of RAG pipeline (*)
  3. Day 3: Building our First RAG
  4. Day 4: Packaging our RAG using Streamlit and Chainlit
  5. Day 5: Creating a RAG Assistant with Memory
  6. Day 6: Building complete RAG pipeline in Azure
  7. Day 7: Building complete RAG pipeline in AWS
  8. Day 8: Evaluation and benchmarking RAG systems
  9. Day 9: End to End Project 1 on RAG (Real World) with React JS frontend
  10. Day 10: End to End Project 2 on RAG (Real World) with React JS frontend

Now, let's continue with our Day 2 — Understanding core components of RAG pipeline

What will we discuss?

In this article, we will discuss the following components, which are the core of any RAG (Retrieval-Augmented Generation) pipeline:

  1. Data and Storage
  2. Indexing of Documents
  3. Approaches to Document Retrieval
  4. Providing Additional Context, Prompting, and Re-ranking
  5. Evaluation of Responses

Let’s start with the first component: Data and Storage.

Data & Storage

The core principle of Retrieval-Augmented Generation (RAG), as the name suggests, involves the retrieval of documents based on questions asked by users. To retrieve these documents, the data must be stored somewhere for later access.

There are various types of documents that can be used for querying in the construction of a RAG pipeline. Some of the most commonly used document types include:

  • PDFs and other textual documents
  • CSV and Excel sheets
  • Databases of various types, such as MySQL, MongoDB, etc.
  • Data received from APIs, like sensor data
  • Knowledge Graphs

For the first two types, it is possible to begin building the pipeline using local storage. However, for a live deployment, the documents need to be placed in a cloud storage solution, such as Blob Storage in Azure, S3 buckets in AWS, or Cloud Storage in GCP.
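
To make this concrete, here is a minimal sketch (in Python) of pulling raw text out of local PDF and CSV files before indexing. The file names, and the S3 bucket mentioned in the comment, are placeholders made up for illustration, and pypdf/boto3 are just one convenient choice of libraries:

```python
# Minimal sketch: read raw text from local documents before indexing.
# File names and the S3 bucket/key are illustrative placeholders.
import csv

from pypdf import PdfReader  # pip install pypdf


def load_pdf_text(path: str) -> str:
    """Concatenate the extracted text of every page of a PDF."""
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def load_csv_rows(path: str) -> list[dict]:
    """Read a CSV file into a list of row dictionaries."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


# For a live deployment, the same files would first be pulled down from cloud
# storage, e.g. an S3 bucket (names here are made up):
# import boto3
# boto3.client("s3").download_file("company-docs", "hr/protocols.pdf", "hr_protocols.pdf")

hr_text = load_pdf_text("hr_protocols.pdf")
print(hr_text[:200])
```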

Once the data is in place, it becomes imperative to index it. Let’s explore what indexing means in the context of RAG.

Indexing of Documents

Suppose you are building a RAG system to answer employee queries related to various company documents, such as HR protocols, tax information, car lease details, etc. With potentially more than a thousand documents, it’s essential for the RAG to index these documents to make them searchable.

Documents can be indexed in multiple ways. Managed services like Azure Search and Amazon Kendra, or open-source engines like Apache Solr, are some options. However, in RAG systems, a common approach is to chunk the documents, generate embeddings for these chunks, and then store the embeddings. Let’s delve deeper into this.

Consider a document of 1000 pages. Retrieving an accurate answer directly from such a voluminous document can be challenging. As an alternative, we can chunk the document, possibly into sets of 10 pages each, resulting in 100 chunks. There are various chunking methods, including:

  • Character-based chunking
  • Byte-based chunking
  • Line-based chunking
  • Fixed-size chunking
  • Sentence-based chunking
  • Contextual chunking
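
As a small taste of what Day 3 will cover, here is a minimal sketch of the simplest of these: fixed-size, character-based chunking with a bit of overlap between consecutive chunks. The chunk size and overlap values are only illustrative defaults:

```python
# Minimal sketch: fixed-size, character-based chunking with overlap.
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split `text` into chunks of roughly `chunk_size` characters,
    keeping `overlap` characters of context between consecutive chunks."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step forward, but keep some overlap
    return chunks


chunks = chunk_text(hr_text)  # `hr_text` comes from the loading sketch above
print(f"{len(chunks)} chunks produced")
```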

We will explore these chunking approaches more practically when we build our first RAG pipeline. After chunking, the next step is to generate embeddings.

Embeddings

To understand the meaning of a word within a sentence or a paragraph, and eventually the meaning of the whole sentence, AI models use word embeddings. Most readers of this post will already be familiar with the concept; if not, a deeper dive into NLP fundamentals is worthwhile. Until I write a dedicated article on this, you can get a good overview here — https://www.turing.com/kb/guide-on-word-embeddings-in-nlp

So, for the chunks we obtained, we generate embeddings. These can be produced using various models, such as:

  • BERT,
  • RoBERTa,
  • GPT-based embedding models,
  • PaLM 2,
  • Titan, etc.

For this series we will be using the GPT-based embeddings provided by OpenAI. Each chunk yields one embedding, so we end up with exactly as many embeddings as chunks. These embeddings then need to be saved in a vector database, which completes the process of indexing.
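
A minimal sketch of that step, using the openai Python SDK (v1-style client), might look like the snippet below. It assumes the OPENAI_API_KEY environment variable is set and uses text-embedding-ada-002, one common choice of GPT-based embedding model:

```python
# Minimal sketch: one embedding per chunk via the OpenAI embeddings API.
# Assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI  # pip install openai>=1.0

client = OpenAI()


def embed_chunks(texts: list[str], model: str = "text-embedding-ada-002") -> list[list[float]]:
    """Return one embedding vector per input text, in the same order."""
    response = client.embeddings.create(model=model, input=texts)
    return [item.embedding for item in response.data]


chunk_embeddings = embed_chunks(chunks)  # `chunks` from the chunking sketch
print(len(chunk_embeddings), "embeddings of dimension", len(chunk_embeddings[0]))
```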

Vector Store

We all understand that embeddings are characterized by their dimensionality. A 100-dimensional embedding, for instance, represents each word (or chunk of text) as a vector of 100 floating-point numbers; OpenAI's GPT-based embeddings, for example, have 1,536 dimensions. Imagine we have 1,000 chunks, each with a 1,536-dimensional embedding; that is already a sizeable corpus. To store and retrieve these embeddings efficiently, we need a storage solution where similarity queries over embeddings are fast. This is where specialized vector databases come into play, designed specifically for storing and searching embeddings.

The most widely used databases for this purpose include:

  • FAISS
  • ChromaDB
  • Pinecone
  • Qdrant

The chunk embeddings generated can be stored in any one of these databases. We will delve into this in more detail on Day 3. Now, let’s shift our focus to understanding how documents are retrieved based on a question posed.
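
As a preview of Day 3, here is a minimal sketch of indexing our chunk embeddings in ChromaDB, one of the stores listed above, running entirely in memory; the collection name and chunk IDs are just illustrative:

```python
# Minimal sketch: store chunk texts and their embeddings in an in-memory ChromaDB.
import chromadb  # pip install chromadb

chroma = chromadb.Client()
collection = chroma.create_collection(name="company_docs")

collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],  # `chunks` from the chunking sketch
    documents=chunks,
    embeddings=chunk_embeddings,                     # from the embedding sketch
)
print(collection.count(), "chunks indexed")
```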

Retrieval of Documents

Let’s assume that all documents of an organization are indexed — meaning they are chunked, their embeddings are generated, and stored in a vector database. Now, consider an employee asks a question, such as, ‘Tell me about the company policies related to Business Ethics.’ Here’s how the RAG system retrieves the relevant documents:

  1. Step 1: The employee’s question is converted into an embedding using the same embedding model that was used for the document chunks (in our case, OpenAI’s GPT-based embeddings).
  2. Step 2: This query embedding is then sent to the vector database. The database uses similarity measures such as cosine similarity (sometimes combined with techniques like Maximum Marginal Relevance) to find the chunk embeddings that most closely match the query embedding.
  3. Step 3: The chunk with the highest similarity is retrieved; this is the vector database’s response to the query. The database returns the exact text of that chunk (along with any stored metadata), and the summarized, user-facing answer is then generated from it with the help of a Large Language Model (LLM), such as GPT-3.5 or GPT-4.
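
Against the ChromaDB collection sketched earlier, these steps boil down to a few lines; the question is embedded with the same helper we used for the chunks, and the store returns the closest chunk by vector similarity:

```python
# Minimal sketch: embed the question and fetch the single most similar chunk.
question = "Tell me about the company policies related to Business Ethics."
question_embedding = embed_chunks([question])[0]  # same embedding model as the chunks

results = collection.query(
    query_embeddings=[question_embedding],
    n_results=1,  # only the single closest chunk for now
)
best_chunk = results["documents"][0][0]
print(best_chunk[:300])  # this text is what the LLM will later turn into an answer
```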

A challenge in this approach arises when multiple chunks are similarly relevant — for instance, if there are 20 chunks with high similarity. Often, the database might return only the first chunk, potentially overlooking more relevant information in the other 19 chunks. To address this issue, we employ an ‘additional context’ approach. Let’s explore this next.

Providing Additional Context for Document Retrieval

To improve response accuracy, we refine the retrieval process by instructing the database to return not just the single most similar chunk, but the 20 most similar chunks. These chunks, combined with the user’s question, are assembled into what we call a ‘Prompt Template.’

The resulting prompt, a compilation of the user’s query and the 20 potentially relevant chunks, is then forwarded to a Large Language Model (LLM), such as GPT-3.5 or GPT-4. The LLM reviews all the chunks in the context of the posed question and, by analyzing this richer set of information, generates a more accurate and relevant response to the user’s inquiry.

Employing this approach significantly enhances the likelihood of providing the user with the most precise and comprehensive answer to their question. This method leverages the depth and contextual understanding capabilities of advanced LLMs, ensuring a higher quality of response derived from the aggregated chunks.
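
Continuing the sketch from the retrieval section, the ‘additional context’ approach might look like this; gpt-3.5-turbo is used as one example model, and the wording of the prompt is only an illustration:

```python
# Minimal sketch: retrieve the 20 most similar chunks, build a prompt, ask the LLM.
top = collection.query(query_embeddings=[question_embedding], n_results=20)
context = "\n\n".join(top["documents"][0])

prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {question}"
)

completion = client.chat.completions.create(  # `client` from the embedding sketch
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(completion.choices[0].message.content)
```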

Re-Ranking

In an alternative method to enhance the accuracy of the document retrieval process, we employ a technique known as ‘Re-Ranking’ once the database returns the top 20 document chunks. In this approach, a specialized agent, referred to as a ‘Re-Ranker,’ analyzes these chunks to reorder them based on their likelihood of providing the best answer to the employee’s question.

The re-ranking process typically involves extracting and comparing features from both the user’s query and the retrieved documents. These features might include:

  • Keywords: Identifying words and phrases that are relevant to the query in both the question and the documents.
  • Word Embeddings: Utilizing vector representations of words to capture their semantic meanings and relationships.
  • Sentence Encodings: Generating vector representations of sentences to grasp their overall meaning and thematic content.
  • Document Summarization: Extracting key points or summaries from the documents for a condensed understanding.
  • Entity Recognition: Identifying and understanding the relationships between entities mentioned within the documents.

After extracting these features, a second round of similarity measurements is performed. This helps in re-ranking the documents, ensuring that the one most likely to contain the precise answer is prioritized and presented to the user.
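
The feature-based description above maps onto many concrete implementations; one common, off-the-shelf choice is a cross-encoder from the sentence-transformers library, sketched below. The model name is a public checkpoint picked for illustration, not something prescribed by this series:

```python
# Minimal sketch: re-rank the 20 retrieved chunks with a cross-encoder,
# which scores each (question, chunk) pair for relevance.
from sentence_transformers import CrossEncoder  # pip install sentence-transformers

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

candidates = top["documents"][0]  # the 20 chunks retrieved earlier
scores = reranker.predict([(question, chunk) for chunk in candidates])

# Sort chunks from most to least likely to contain the answer.
reranked = [chunk for _, chunk in sorted(zip(scores, candidates), reverse=True)]
print(reranked[0][:300])
```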

Now that we know about the core components of RAG, we are ready to build our first RAG solution. Let’s come back tomorrow and create a full-fledged RAG solution that will help us answer questions related to mutual fund documents. See ya!!

--

Himanshu Singh

ML Consultant, Researcher, Founder, Author, Trainer, Speaker, Story-teller. Connect with me on LinkedIn: https://www.linkedin.com/in/himanshu-singh-2264a350/