RAG Explained: How Integrating External Data Enhances AI Responses

Jonathan Andika Wibowo
Published in tiket.com
May 30, 2024

Introduction

The rise of Large Language Models (LLMs) such as ChatGPT, Gemini, and LLaMA has transformed the way we interact with Artificial Intelligence (AI). These advanced tools open up new possibilities and applications, revolutionizing various fields. However, there is always room for improvement. This is where Retrieval-Augmented Generation (RAG) steps in, offering a method to enhance the accuracy and relevance of LLM outputs by integrating external data sources.

Why was RAG Invented?

Despite their impressive capabilities, LLMs face significant challenges.

Here are the key issues:

  1. Hallucinations: LLMs might generate plausible-sounding but incorrect information.
  2. Outdated Knowledge: They are trained on data up to a certain cutoff point, which can lead to outdated responses.
  3. Non-transparency: The source of information in LLMs’ responses is often unclear, making it difficult to trace the reasoning behind the answers.

RAG aims to mitigate these issues by incorporating information from external sources, thus enhancing the reliability and transparency of the generated responses.

How does RAG Work?

RAG integrates external knowledge sources into the LLM workflow through three key steps: Indexing, Retrieval, and Generation.

Step 1: Indexing

Indexing involves cleaning and extracting raw data, which is then encoded into vector representations using an embedding model. These vectors are stored in a vector database, enabling efficient retrieval based on user queries.
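
As a rough illustration, here is a minimal indexing sketch in Python using the sentence-transformers library and a plain NumPy matrix standing in for the vector database; the corpus, chunking, and model name are placeholder choices, not recommendations.

    import numpy as np
    from sentence_transformers import SentenceTransformer

    # Placeholder chunks: in practice these come from your cleaned, split documents.
    chunks = [
        "OpenAI's board announced a sudden leadership change in November 2023.",
        "RAG systems retrieve external documents to ground LLM answers.",
        "tiket.com is an online travel agency based in Indonesia.",
    ]

    # Encode each chunk into a dense vector with an embedding model.
    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
    index = embedder.encode(chunks, normalize_embeddings=True)  # in-memory "vector database"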

Step 2: Retrieval

When a user submits a query, the system retrieves relevant information from the vector database. This stage ensures that the LLM has access to the most pertinent and up-to-date data.
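
Continuing the sketch above, retrieval boils down to embedding the query with the same model and taking the nearest chunks by cosine similarity (the top-k value is arbitrary):

    def retrieve(query: str, k: int = 2) -> list[str]:
        # Embed the query with the same model used for indexing.
        query_vector = embedder.encode([query], normalize_embeddings=True)[0]
        # On normalized vectors, cosine similarity reduces to a dot product.
        scores = index @ query_vector
        top_k = np.argsort(scores)[::-1][:k]
        return [chunks[i] for i in top_k]

    retrieved = retrieve("What happened to OpenAI's CEO?")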

Step 3: Generation

The retrieved information is synthesized with the user query to generate a response. This approach provides better context and improves the accuracy of the output.
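
To make this step concrete, here is a minimal generation sketch that stitches the retrieved chunks and the user query into one prompt, assuming the OpenAI Python client; the model name is a placeholder and any chat-completion API would work.

    from openai import OpenAI  # assumed client; any chat-completion API works

    client = OpenAI()
    context = "\n".join(retrieved)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        "Question: What happened to OpenAI's CEO?"
    )

    # The retrieved context grounds the model's answer.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.choices[0].message.content)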

Example of RAG in Action

Figure from Gao et al. (2024)

Consider a user asking about the sudden dismissal of OpenAI's CEO, Sam Altman. The RAG system would:

  1. Break down the relevant documents into chunks and convert them into vector representations.
  2. Store these vectors in a database.
  3. Retrieve relevant chunks when the user submits their query.
  4. Integrate the retrieved information into the LLM’s prompt, enabling it to deliver a well-informed and accurate response.

By connecting LLMs with external databases, RAG provides more contextually rich and reliable outputs, addressing the limitations of traditional LLMs.

Delving Deeper: Indexing Strategies and Optimization

Different data types can serve as data sources for a RAG system, including unstructured text, semi-structured PDFs, and structured data such as knowledge graphs and tables. You can even connect multiple data sources to the same RAG system.

To optimize the indexing process, configuring the chunk size of documents is crucial for balancing context and noise. Larger chunks capture more context but may include more irrelevant information, while smaller chunks are less noisy but might miss important details.
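
For instance, with LangChain's text splitter (one option among many), chunk size and overlap are explicit knobs; the package path below matches recent LangChain versions and the values are arbitrary starting points, not recommendations.

    from langchain_text_splitters import RecursiveCharacterTextSplitter

    raw_document_text = "..."  # your cleaned document text goes here

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,    # larger -> more context per chunk, but potentially more noise
        chunk_overlap=50,  # overlap helps preserve sentences cut at chunk boundaries
    )
    doc_chunks = splitter.split_text(raw_document_text)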

Enriching chunks with metadata (such as page numbers, file names, and categories) further enhances the system’s accuracy and efficiency by enabling better filtering of relevant documents during the inference stage.
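
One simple way to picture this: store metadata alongside each chunk and filter on it before running the vector search. The schema below is purely illustrative.

    # Each chunk carries metadata that can be used to pre-filter candidates.
    indexed_chunks = [
        {"text": "Refund policy for flight tickets...", "source": "faq.pdf", "page": 3, "category": "refunds"},
        {"text": "Hotel check-in starts at 2 PM...", "source": "faq.pdf", "page": 7, "category": "hotels"},
    ]

    def filter_by_metadata(chunks_with_meta, category):
        # Restrict the search space to chunks matching the requested category.
        return [c for c in chunks_with_meta if c["category"] == category]

    candidates = filter_by_metadata(indexed_chunks, category="refunds")
    # The vector search then runs only over these candidates.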

More Advanced RAG Systems

Standard RAG systems still struggle with precision and recall during retrieval, potentially missing crucial information and generating hallucinated or irrelevant content. Therefore, advanced RAG systems introduce pre-retrieval and post-retrieval strategies to improve the quality of outputs.

Pre-Retrieval Strategies

Query Expansion

Before retrieving relevant information from the vector database, we can enhance the query through query expansion. This strategy enriches the query’s content and context by transforming a single query into multiple related queries. There are different types of query expansion:

  1. Multi Query Expansion: Use an LLM to generate several variations of the original query, process them in parallel, and combine their results to provide a comprehensive answer (a sketch follows this list).
  2. Sub Query Expansion: Break down complex queries into simpler, more focused sub-queries, and then combine the answers to form a complete and detailed response.
  3. Chain of Verification (CoVe): After expanding the query, have the LLM validate the expanded queries to ensure relevance and accuracy. This reduces the likelihood of hallucinations by confirming alignment with the original query’s intent.
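
As an example of the first strategy, here is a minimal multi-query expansion sketch. The llm_complete helper is hypothetical (it simply wraps the chat-completion call from the generation sketch above), and retrieve is the retrieval function from earlier.

    def llm_complete(prompt: str) -> str:
        # Hypothetical helper wrapping the chat-completion call shown earlier.
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    def expand_query(query: str, n: int = 3) -> list[str]:
        # Ask the LLM for paraphrases of the original query.
        prompt = f"Rewrite the following question in {n} different ways, one per line:\n{query}"
        variants = llm_complete(prompt).strip().split("\n")
        return [query] + variants[:n]

    def multi_query_retrieve(query: str) -> list[str]:
        results = []
        for q in expand_query(query):
            results.extend(retrieve(q))      # retrieve for each query variant
        return list(dict.fromkeys(results))  # deduplicate while preserving order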

Query Transformation

Query transformation involves rewriting the original query to make it more suitable for the retrieval process. This is particularly useful in real-world scenarios where users may not always phrase their queries optimally. By prompting the LLM to rephrase queries, we can improve the retrieval of relevant information. For example, a user query like “What happened to OpenAI’s CEO?” can be transformed into a more precise query like “Details on the recent changes in OpenAI’s executive leadership.”
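
A minimal sketch of this rewrite step, again using the hypothetical llm_complete helper:

    def transform_query(user_query: str) -> str:
        # Ask the LLM to rephrase the query into a retrieval-friendly form.
        prompt = (
            "Rewrite the user question as a precise search query for a document index.\n"
            f"User question: {user_query}\n"
            "Search query:"
        )
        return llm_complete(prompt).strip()

    # e.g. "What happened to OpenAI's CEO?" ->
    # "Details on the recent changes in OpenAI's executive leadership"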

Query Routing

Query routing directs queries to specialized pipelines based on their content, enhancing the relevance and accuracy of the retrieval process. There are two main types of query routing:

  1. Metadata Router: Extract keywords or entities from the query and route the query to a specific pipeline based on these keywords.
  2. Semantic Router: Route queries based on their semantic content, for example by embedding the query and matching it against representative examples of each specialized pipeline (sketched below).
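
As a rough sketch of a semantic router, reusing the embedding model from the indexing example, each pipeline is described by a few representative phrases and the query is routed to the closest one; the route names and phrases are illustrative.

    # Representative phrases describing each specialized pipeline (illustrative).
    routes = {
        "flights": ["flight booking", "airline ticket refund", "flight schedule change"],
        "hotels": ["hotel reservation", "room availability", "check-in policy"],
    }

    # One centroid vector per pipeline.
    route_vectors = {
        name: embedder.encode(examples, normalize_embeddings=True).mean(axis=0)
        for name, examples in routes.items()
    }

    def route_query(query: str) -> str:
        query_vector = embedder.encode([query], normalize_embeddings=True)[0]
        # Pick the pipeline whose centroid is most similar to the query.
        return max(route_vectors, key=lambda name: route_vectors[name] @ query_vector)

    pipeline = route_query("Can I get a refund on my plane ticket?")  # -> "flights"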

Post-Retrieval Strategies

After retrieving documents from the vector database, we can employ various strategies to enhance the quality of the generated response:

  1. Re-ranking: Reorders retrieved documents to prioritize the most relevant ones. This can be done with a trained re-ranking model, a rule-based approach, or LLM-based re-ranking (a sketch follows this list).
  2. Context Selection and Compression: Summarizes and filters the retrieved documents to remove noise and focus on essential information. This ensures the LLM receives only the most relevant context, improving the accuracy of the response.
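
As one concrete option for re-ranking (not the only one), a cross-encoder from sentence-transformers can score each query-document pair and reorder the retrieved chunks; the model name is a placeholder.

    from sentence_transformers import CrossEncoder

    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # assumed model choice

    def rerank(query: str, documents: list[str], top_n: int = 3) -> list[str]:
        # Score each (query, document) pair and keep the highest-scoring documents.
        scores = reranker.predict([(query, doc) for doc in documents])
        ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
        return [doc for doc, _ in ranked[:top_n]]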

Augmentation Techniques in RAG

Figure from Gao et al. (2024)

Iterative Retrieval

Iterative retrieval involves repeatedly searching the knowledge base to refine and enhance the generated answer, improving its robustness. This approach starts with the user query, retrieves relevant documents, generates an answer, and iterates if necessary until a satisfactory response is achieved.
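
A minimal control-loop sketch of this idea, reusing the retrieve and llm_complete helpers from earlier; the stopping check is deliberately naive.

    def iterative_rag(query: str, max_rounds: int = 3) -> str:
        context: list[str] = []
        answer = ""
        for _ in range(max_rounds):
            # Retrieve with the original query plus whatever has been generated so far.
            context.extend(retrieve(query + " " + answer))
            context_text = "\n".join(dict.fromkeys(context))  # deduplicated context
            answer = llm_complete(f"Context:\n{context_text}\n\nQuestion: {query}\nAnswer:")
            # Naive stopping criterion: ask the LLM whether the answer is complete.
            verdict = llm_complete(f"Is this answer complete? Reply YES or NO.\n{answer}")
            if verdict.strip().upper().startswith("YES"):
                break
        return answer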

Recursive Retrieval

Recursive retrieval breaks down complex problems step by step, retrieving documents and generating responses iteratively to provide detailed and in-depth answers. The process involves creating new queries to explore specific aspects of the problem more deeply, leading to more comprehensive results.

Adaptive Retrieval

Combining iterative and recursive retrieval, adaptive retrieval uses a judge module to decide whether a simple or detailed response is needed, optimizing the response generation process. If the query is straightforward, a direct answer is generated. For more complex queries, the system iterates to gather additional context.
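
A short sketch of the judge-module idea, again with the hypothetical helpers from earlier; the judging prompt is illustrative.

    def adaptive_rag(query: str) -> str:
        # Judge module: decide whether retrieval and iteration are needed at all.
        verdict = llm_complete(
            "Can the following question be answered directly without looking up documents? "
            f"Reply SIMPLE or COMPLEX.\n{query}"
        )
        if verdict.strip().upper().startswith("SIMPLE"):
            return llm_complete(query)  # direct answer, no retrieval
        return iterative_rag(query)     # gather more context for complex queries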

Implementing RAG with LangChain

LangChain simplifies the creation of LLM applications by providing a framework with various components:

  1. Retriever: Connects to external sources like vector databases and other data repositories.
  2. Model: Represents the LLM, such as GPT-4 or Gemini.
  3. Prompt: The input to the LLM, which can be plain text or a template.
  4. Output Parser: Parses LLM outputs into specific formats like JSON or CSV.
  5. Agent: Decides actions, such as iterating outputs or selecting specialized pipelines.

A simple RAG system can be implemented by connecting the retriever to a prompt template and then linking it with a chat model, as sketched below. This setup enables users to query an LLM that leverages a specified vector database, enhancing the accuracy and relevance of its responses. However, LangChain also gives you the freedom to build more complex systems using its other components and tools.
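
A minimal sketch using LangChain's expression language, assuming the langchain-openai and FAISS integrations are installed and an OpenAI API key is configured; package paths and model names may differ across LangChain versions.

    from langchain_community.vectorstores import FAISS
    from langchain_core.output_parsers import StrOutputParser
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_core.runnables import RunnablePassthrough
    from langchain_openai import ChatOpenAI, OpenAIEmbeddings

    # Retriever: a FAISS vector store built from a few example chunks.
    vectorstore = FAISS.from_texts(
        ["OpenAI's board announced a sudden leadership change in November 2023."],
        embedding=OpenAIEmbeddings(),
    )
    retriever = vectorstore.as_retriever()

    # Prompt template: retrieved context plus the user question.
    prompt = ChatPromptTemplate.from_template(
        "Answer the question based only on the following context:\n{context}\n\nQuestion: {question}"
    )

    # Chain: retriever -> prompt -> chat model -> output parser.
    chain = (
        {"context": retriever, "question": RunnablePassthrough()}
        | prompt
        | ChatOpenAI(model="gpt-4o-mini")  # placeholder model name
        | StrOutputParser()
    )

    print(chain.invoke("What happened to OpenAI's CEO?"))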

Conclusion

Retrieval-Augmented Generation (RAG) represents a significant advancement in the capabilities of Large Language Models (LLMs). By integrating real-time, relevant data from external sources, RAG addresses key limitations such as hallucinations, outdated knowledge, and non-transparency. Implementing RAG systems, especially with tools like LangChain, allows for the creation of more reliable, accurate, and contextually rich LLM applications.

Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Wang, H. (2024). Retrieval-Augmented Generation for Large Language Models: A Survey. Retrieved from https://arxiv.org/pdf/2312.10997

LangChain. (n.d.). Retrieved from https://www.langchain.com/
