Retrieval Augmented Generation: Grounding AI Responses in Factual Data

Minhajul Hoque
6 min read · Jun 17, 2023


Retrieval-Augmented Generation (RAG)

In the ever-evolving landscape of artificial intelligence, Retrieval Augmented Generation (RAG) is making waves. This innovative approach combines the power of large language models with the reliability of factual data retrieval. In this blog post, we will delve into the intricacies of RAG, its advantages, disadvantages, and alternatives.

Context

Large language models, such as ChatGPT, have revolutionized natural language processing. However, they have a tendency to “hallucinate,” or generate information that sounds plausible but is not grounded in facts. This is where Retrieval Augmented Generation (RAG) comes into play.

RAG enhances the reliability of these models by grounding their responses in factual data retrieved from a vector database. This approach not only improves the accuracy of the generated information but also gives users a reference point for verifying it. Furthermore, by focusing on factual data relevant to specific domains, RAG allows large language models to concentrate more effectively on domain-specific tasks. The result is more accurate, reliable, and contextually relevant outputs.

How RAG Works

Retrieval Augmented Generation (RAG) is a sophisticated approach that combines the strengths of retrieval-based models with seq2seq generation. Unlike traditional seq2seq (text-to-text) models, RAG introduces an intermediate step that retrieves relevant documents before generating a response. Here’s how it works:

  1. Document Retrieval: Initially, the input sequence, usually a question or prompt, is used to perform a semantic search over a vector database. This search retrieves a set of relevant documents, which serve as an external knowledge source the model can refer to. Note that any type of data can be stored in a vector database as long as it is converted into vector embeddings.
  2. Combining Input with Retrieved Documents: The input sequence is combined with the retrieved documents to form an extended context. This extended context contains both the original input and additional information from the retrieved documents.
  3. Passing to Decoder Transformer: The extended context is then passed to a decoder transformer. It’s important to highlight that RAG is compatible with any decoder transformer, not just specific ones. This flexibility allows it to be integrated into various architectures and applications.
  4. Generating Response: The decoder transformer processes the extended context and generates a response. The response is not just based on the input sequence but is also informed by the information in the retrieved documents. This ensures that the output is grounded in factual data and is relevant to the input.

By incorporating an external knowledge source through document retrieval and using a decoder transformer, RAG effectively bridges the gap between retrieval-based and generative models. This results in more informed and accurate responses for your specific task compared to a traditional seq2seq model.
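To make the flow concrete, here is a toy end-to-end sketch of the four steps. The embed and generate functions below are deliberately simplistic placeholders (a bag-of-characters vector and a string formatter, both made up for illustration); in a real system they would be a learned embedding model, a vector database, and a decoder-only LLM:

# Toy sketch of the RAG flow: retrieve, combine, then generate
import numpy as np

documents = [
    "RAG grounds language model outputs in retrieved documents.",
    "Vector databases store embeddings for semantic search.",
    "Seq2seq models map an input sequence directly to an output sequence.",
]

def embed(text):
    # Placeholder embedding: a bag-of-characters vector; a real system would use
    # a learned embedding model (e.g. OpenAI or sentence-transformers)
    vec = np.zeros(256)
    for ch in text.lower():
        vec[ord(ch) % 256] += 1
    return vec / (np.linalg.norm(vec) + 1e-8)

def generate(prompt):
    # Placeholder for the decoder transformer; a real system would call an LLM here
    return f"[LLM response conditioned on]:\n{prompt}"

# 1. Document retrieval: semantic search by cosine similarity
query = "How does RAG stay factual?"
doc_vectors = np.stack([embed(d) for d in documents])
scores = doc_vectors @ embed(query)
top_docs = [documents[i] for i in scores.argsort()[::-1][:2]]

# 2. Combine the input with the retrieved documents into an extended context
context = "\n".join(top_docs)
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

# 3-4. Pass the extended context to the decoder and generate a grounded response
print(generate(prompt))

The LangChain example later in this post does the same thing with real components.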

Advantages of RAG

  • Reduces Hallucination: By grounding responses in factual data, RAG reduces the chances of generating incorrect or fabricated information.
  • Facilitates Fact-Checking: Users can verify the information by checking the sources from which the data was retrieved.
  • Enhanced Accuracy on Domain-Specific Tasks: Providing relevant documents as context can make the generations more accurate and useful for your specific task.
  • Flexibility: RAG is highly flexible. You don’t need to retrain the model to get different outputs; you can simply change the data in the vector database (see the short sketch after this list).
  • Cost-Effective for Companies: Companies with an existing database of relevant data can use RAG as an alternative to fine-tuning, which can be resource-intensive.
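
To illustrate the flexibility point, here is a small sketch of updating the knowledge base without any retraining. It assumes the same LangChain/Chroma/OpenAI setup as the full example later in this post; the document contents are made up:

# Updating what the model can cite is just a vector-store write, no retraining needed
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.schema import Document
from langchain.vectorstores import Chroma

store = Chroma.from_documents(
    [Document(page_content="Old policy: refunds within 14 days.")], OpenAIEmbeddings()
)

# New facts become retrievable immediately
store.add_documents([Document(page_content="New policy: refunds within 30 days.")])
print(store.similarity_search("What is the refund window?", k=1)[0].page_content)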

Disadvantages of RAG

  • Dependent on Semantic Search: The effectiveness of RAG is highly reliant on the quality of the semantic search. If the search retrieves irrelevant or low-quality documents, the generated responses may also be of poor quality.
  • Requires Existing Data: RAG depends on having an existing database of documents to retrieve from. Without a substantial database, it’s not possible to leverage the benefits of RAG.
  • Latency Issues: The two-step process of first retrieving documents and then generating responses can introduce latency. This might not be suitable for applications that require real-time responses.
  • Context Length Limitation: We have to be mindful of the maximum context length that the decoder transformer can handle. For example, ChatGPT (gpt-3.5-turbo) has a maximum context length of 4096 tokens (roughly three pages of single-spaced English text). If the combined length of the input sequence and the retrieved documents exceeds this limit, some information has to be truncated, which can hurt the quality of the response. A small truncation sketch follows this list.
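
One simple mitigation is to budget tokens before building the prompt. Below is a minimal sketch using the tiktoken tokenizer; the budget value and the fit_to_context helper are illustrative assumptions, not part of any library:

# Keep only as many retrieved chunks as fit within a token budget
# The 4096-token limit matches gpt-3.5-turbo; we leave headroom for the model's answer
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

def fit_to_context(question, chunks, budget=3000):
    used = len(enc.encode(question))
    kept = []
    for chunk in chunks:
        cost = len(enc.encode(chunk))
        if used + cost > budget:
            break
        kept.append(chunk)
        used += cost
    return kept

retrieved = ["first retrieved passage ...", "second retrieved passage ...", "third ..."]
print(fit_to_context("What did the president say?", retrieved))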

Alternatives

  • Fine-Tuning the Model: Instead of using RAG, one can fine-tune the whole language model on a specific dataset to achieve desired outputs. This requires much more effort and training time compared to RAG, but it might help with the latency issue.
  • Fine-Tuning the Transformer Attention Head: This involves fine-tuning the attention head of the transformer with retrieved documents. To achieve this, an additional encoder is integrated into the standard seq2seq model and is specifically tasked with processing the retrieved documents. The outputs from this additional encoder, along with those from the original encoder, are collectively used to inform the decoder. This essentially refines the model’s attention mechanism, enabling it to make more discerning and effective use of the retrieved documents. The setup is more involved than RAG, but it avoids training the full network and can potentially enhance the model’s performance. Since this is a fairly new method, it’s advisable to conduct thorough testing and research to ascertain its efficacy and reliability before deploying it in a production environment.
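
For intuition, here is a rough PyTorch sketch of the dual-encoder idea: a second encoder processes the retrieved documents, and the decoder’s cross-attention attends over the concatenation of both encoders’ outputs. The module sizes and wiring are illustrative assumptions, not a reference implementation of any published method:

# Illustrative only: two encoders feeding one cross-attention head
import torch
import torch.nn as nn

d_model = 512

# One encoder for the original input, one for the retrieved documents
input_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True), num_layers=2
)
doc_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True), num_layers=2
)

# The cross-attention head the decoder would use (the part being fine-tuned)
cross_attention = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)

# Dummy embedded sequences: (batch, seq_len, d_model)
input_emb = torch.randn(1, 16, d_model)      # embedded prompt
doc_emb = torch.randn(1, 64, d_model)        # embedded retrieved documents
decoder_state = torch.randn(1, 8, d_model)   # decoder hidden states so far

# Encode both sources and concatenate them along the sequence dimension
memory = torch.cat([input_encoder(input_emb), doc_encoder(doc_emb)], dim=1)

# The decoder's attention now draws on the input *and* the retrieved documents
attended, _ = cross_attention(query=decoder_state, key=memory, value=memory)
print(attended.shape)  # torch.Size([1, 8, 512])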

Try Out RAG

# Import necessary modules from the Langchain library
from langchain.chains import RetrievalQA
from langchain.document_loaders import TextLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
# Load the document from which information will be retrieved
# In this example, we are loading a text file named "state_of_the_union.txt"
loader = TextLoader("../../state_of_the_union.txt")
documents = loader.load()

# Split the loaded document into smaller chunks
# This is done to make the document more manageable and improve retrieval performance
# In this example, each chunk is 1000 characters long with no overlap between chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

# Generate embeddings for the document chunks
# Embeddings are numerical representations of the text that are used for semantic search
embeddings = OpenAIEmbeddings()

# Create a vector store for semantic search using the embeddings. You can choose any vector database.
# Chroma is a vector store that efficiently handles large sets of embeddings
docsearch = Chroma.from_documents(texts, embeddings)

# Initialize the RetrievalQA model
# This model uses a large language model (in this case, OpenAI's model) for question-answering
# It retrieves relevant document chunks using the vector store and generates a response
# The "stuff" chain type stuffs all of the retrieved document into the LLM context
qa = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff", retriever=docsearch.as_retriever())
# Define the input query
# In this example, we are asking what the president said about Ketanji Brown Jackson
query = "What did the president say about Ketanji Brown Jackson"

# Run the RetrievalQA model with the input query
# This will retrieve relevant document chunks and generate a response
output = qa.run(query)

# Print the generated response
print(output)
"The president said that she is one of the nation's top legal minds, a former top litigator in private practice, a former federal public defender, and from a family of public school educators and police officers. He also said that she is a consensus builder and has received a broad range of support, from the Fraternal Order of Police to former judges appointed by Democrats and Republicans."

Closing Thoughts

Retrieval Augmented Generation (RAG) is like a superhero team-up: the brainpower of large language models joins forces with the trustworthiness of real, factual information. This combination is especially useful when you need spot-on answers and want to be able to verify that the information is correct.

RAG depends heavily on how well it retrieves documents that actually matter. If the semantic search isn’t accurate, the generated responses won’t be either.

As you venture into employing RAG, weigh your specific needs and limitations. There are alternative avenues, such as fine-tuning a model to your specific tasks or integrating a knowledge base, which might align more closely with your objectives.

Engineering is all about trade-offs after all…


If you found the blog informative and engaging, please share your thoughts by leaving a comment! Additionally, if you’re eager for more content like this, be sure to follow me for future blog posts :)

