Getting Started with Retrieval-Augmented Generation

Rania Fatma-Zohra Rezkellah
InfinitGraph
Aug 6, 2024

1. Understanding the Motivation Behind Retrieval-Augmented Generation

Large Language Models (LLMs) encode vast amounts of information directly into their parameters through processes known as pre-training and fine-tuning. While these models are capable of generating sophisticated responses, they often face three significant challenges:

  1. Keeping their knowledge base current with the latest information. This challenge, usually referred to as the “knowledge cutoff,” arises because once an LLM has been trained, any information or events that occur after the training phase are not reflected in the model’s responses.
  2. Dealing with hallucinations, i.e., responses that contain inaccurate information which can be hard to detect because of the text’s fluency. These are usually split into intrinsic hallucinations, where the generated text logically contradicts the source content present in the input prompt, and extrinsic hallucinations, where the generated text contains factual errors.
  3. Satisfying the increasing demand for personalized interactions, which exposes a limitation of base LLMs: their tendency towards generic responses. The gap between users’ and businesses’ expectations and the generic outputs these models provide necessitates a focus on personalization within LLMs [1].

So the question is: “How can we ensure that a Large Language Model’s knowledge remains up-to-date, mitigate its tendency to generate inaccurate information, and adapt its outputs to specific contexts?”

To update a previously trained model’s knowledge, we typically need to fine-tune it on new data, which can be resource-intensive and time-consuming. For instance, if a user asks a model about an event that happened in 2024, and the model was last trained on data from 2023, the response will be based solely on the data available up to that point. This limitation becomes particularly problematic when dealing with rapidly changing information. As an illustration, querying GPT-3.5-Turbo about the recently released Llama 3 (April 18, 2024) produced the output shown in Figure 01.

Figure 01: ChatGPT Knowledge Cut-off

Beyond temporal limitations, LLMs also lack access to specific, non-public data, such as proprietary or confidential information held by companies. General-purpose LLMs are trained on publicly available datasets, which do not include internal documents, sensitive data, or company-specific knowledge due to privacy and security restrictions. As a result, even if such data existed before the model’s knowledge cutoff, it would not be incorporated into the model’s responses. This means that while LLMs can provide generalized information, they often lack personalization and are usually unable to address queries requiring detailed, internal company data, which can be crucial for tasks such as internal reporting, strategic decision-making, or customer support. To make matters worse, LLMs may sometimes generate plausible-sounding but incorrect information (“hallucinating”) even when they have no information about a topic, which is particularly problematic when users cannot verify the accuracy of the generated content due to a lack of subject-specific expertise.

2. RAG Systems Enter the Game

Retrieval-Augmented Generation (RAG) aims to address the aforementioned problems by allowing LLMs to retrieve facts from an external knowledge base to supplement their internal representation of information. In other words, it combines an information retrieval component with a text generation model to provide more specific, up-to-date, and factual information to the LLM at generation time.

2.1. What’s RAG?

More formally, RAG is a paradigm first proposed by Patrick Lewis et al. in 2020 [2] as a solution to the critical challenges stated earlier. It combines the model’s parametric memory, gained during pre-training and stored in its weights, with a non-parametric memory held in external data sources [2].

Specifically, RAG works by taking an input and retrieving a set of relevant documents from a specified source, such as Wikipedia or a company’s internal knowledge base. These documents, concatenated with the original input prompt, serve as context for the LLM, which then produces a final output that is more precise, accurate, and trustworthy.

Figure 02: Overview of a RAG System
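
To make this flow concrete, here is a minimal sketch of the retrieve-then-generate loop in Python. The helper names (`retrieve`, `build_prompt`, `generate`) are hypothetical placeholders rather than part of any specific library; each one is fleshed out in the sections that follow.

```python
def rag_answer(query: str, k: int = 3) -> str:
    """Minimal retrieve-then-generate loop (placeholder helpers)."""
    # 1. Retrieval: find the k most relevant chunks in the knowledge base.
    relevant_chunks = retrieve(query, k=k)

    # 2. Augmentation: concatenate the retrieved chunks with the user query.
    prompt = build_prompt(query, relevant_chunks)

    # 3. Generation: let the LLM answer using the enriched prompt as context.
    return generate(prompt)
```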

With that being said, RAG is particularly valuable in scenarios where facts change over time and where an authoritative source of evidence or knowledge exists: the medical field, where knowledge continuously evolves with new research findings and discoveries [3] [4]; legal research, since laws and regulations are subject to frequent updates and amendments [5]; and customer support, where timely and accurate responses lead to higher satisfaction levels.

2.2. RAG main components and process

To understand how RAG systems work, it is essential to look at their key components described in Figure 03 [6].

Figure 03: Retrieval Augmented Generation Main Components
  • Input: The request to which the LLM system responds is referred to as the input or query.
  • Indexing: Related documents are processed by chunking them into smaller parts, generating embeddings for these chunks, and indexing them into a vector store. Indexing is akin to organizing books in a library to facilitate quick and easy searches. Just as a well-organized library allows one to find the right book without hassle, indexing in RAG ensures that the system can swiftly locate the most relevant information for a given query.
  • Retrieval: The relevant documents are obtained by comparing the query against the indexed vectors, identifying the “Relevant Documents.”
  • Generation: The relevant documents are combined with the original prompt as additional context. This combined text is then passed to the model for response generation, which is prepared as the final output for the user.

The system’s workflow involves several key steps, each leveraging these components to produce accurate and contextually enriched responses.

  1. Data Preparation

This step mainly follows three major stages, as shown in Figure 04:

Figure 04: Data Preparation for a RAG System

a. Preprocessing & Chunking

The first step in RAG is transforming raw data into a “knowledge base”: a vector store with searchable embeddings and associated metadata. First, the data coming from different sources should pass through a preprocessing layer that cleans it and ensures everything used downstream is of good quality.
To integrate large documents such as books or research papers efficiently, preprocessing via “chunking”, that is, breaking bigger documents down into smaller, manageable units, is necessary.
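
As a simple illustration, the sketch below splits a document into fixed-size, overlapping chunks using plain Python. The chunk size and overlap values are arbitrary choices for the example; production systems often split on semantic boundaries (paragraphs, sections) instead.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with a small overlap,
    so sentences cut at a boundary still appear in the next chunk."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():
            chunks.append(chunk)
    return chunks

# Example: chunk every document of a (hypothetical) corpus.
documents = ["...full text of document 1...", "...full text of document 2..."]
all_chunks = [c for doc in documents for c in chunk_text(doc)]
```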

Figure 05: Documents Chunking

b. Embedding

Each chunk is then converted into a vector representation using an embedding model. Choosing the appropriate embedding model is crucial, as it directly affects the quality of how the knowledge is represented. But with so many models available, how do we select the best one for our specific use case?

A great starting point might be the MTEB Leaderboard on Hugging Face. This comprehensive resource provides the most up-to-date list of proprietary and open-source text embedding models, along with detailed statistics on their performance across various tasks like retrieval, summarization, and more.
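
As a minimal sketch, the snippet below embeds the chunks with the open-source sentence-transformers library; the model name is just one small general-purpose choice picked for illustration, not a recommendation.

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

# "all-MiniLM-L6-v2" is an illustrative choice; pick a model suited to your
# domain and language from the MTEB leaderboard.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Encode every chunk into a dense vector (shape: [num_chunks, 384]).
chunk_embeddings = embedder.encode(all_chunks, normalize_embeddings=True)
```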

Figure 06: Chunks Embedding

c. Storage

Once embedded, these chunks, along with their associated metadata, are stored in a vector database: a specialized database designed to efficiently store, index, manage, and search massive quantities of high-dimensional vector data.
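
Continuing the sketch, the embeddings can be indexed with FAISS, a popular open-source vector index (a dedicated vector database such as Chroma, Weaviate, or Milvus would play the same role). The in-memory dictionary of chunk texts stands in for the metadata store here.

```python
# pip install faiss-cpu
import faiss
import numpy as np

vectors = np.asarray(chunk_embeddings, dtype="float32")

# Inner-product index; with normalized vectors this is equivalent to cosine similarity.
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)

# Keep the raw chunks (and any metadata) addressable by their position in the index.
chunk_store = {i: text for i, text in enumerate(all_chunks)}
```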

Figure 07: Data Storage

2. Retrieval

Querying the vector database involves executing a semantic similarity search between the user query and all the document embeddings stored in the vector database. Unlike traditional keyword search or Boolean search operators, a machine-learning-powered semantic search system leverages its training data to understand the context and relationships between the different search terms. For instance, in a keyword search, “coffee” and “café” might be treated as distinct terms. However, a semantic search system, through its training, understands that these words are closely related and often appear together in contexts involving coffee shops and beverages. As a result, a search query like “best coffee places” might prioritize results about popular cafés and their specialty coffees.

The user query is first converted into the same embedding space, using either the text embedder applied to the chunks during the initial preprocessing step or a dedicated query encoder trained alongside it. A similarity search algorithm then ranks the chunks in decreasing order of semantic similarity to the user query.

Typically, nearest-neighbor search algorithms such as Maximum Inner Product Search (MIPS) are used to find the documents most semantically similar to the query, scoring similarity with the dot product, Euclidean distance, or cosine similarity.
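
Putting the previous pieces together, a retrieval step might look like the sketch below: the query is embedded with the same model as the chunks, and the inner-product search over normalized vectors ranks chunks by cosine similarity. The `embedder`, `index`, and `chunk_store` names refer to the earlier sketches.

```python
import numpy as np

def retrieve(query: str, k: int = 3) -> list[str]:
    """Return the k chunks most semantically similar to the query."""
    query_vec = embedder.encode([query], normalize_embeddings=True)
    query_vec = np.asarray(query_vec, dtype="float32")

    scores, ids = index.search(query_vec, k)  # top-k by inner product
    return [chunk_store[i] for i in ids[0]]
```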

Figure 08: Relevant Documents Retrieval Step

3. Augmentation & Generation

a. Augmentation

With the relevant data retrieved, the RAG system enriches the original prompt with this new information, providing additional context that enhances the understanding of the user’s request.

For instance, if the query was about recent advancements in cancer treatment, the enriched prompt might include excerpts from the latest medical research articles or expert opinions from oncology specialists. This process allows the generative model to create responses that are not only pertinent but also enhanced with the latest facts and data that were not part of the LLM’s original training.
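
A minimal prompt template for this augmentation step could look like the following sketch; the wording is just one possible template, not a prescribed format.

```python
def build_prompt(query: str, relevant_chunks: list[str]) -> str:
    """Concatenate the retrieved chunks with the original question."""
    context = "\n\n".join(f"- {chunk}" for chunk in relevant_chunks)
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )
```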

b. Generation

At this stage, the enriched prompt is input into a large language model. The LLM, which has been trained on vast amounts of text, leverages its generative capabilities to synthesize information from the enriched prompt into a coherent and contextually relevant response. This could take the form of answering a question, summarizing information, or generating a detailed explanation, depending on the nature of the initial query.

Finally, the generated response is refined and presented to the user. This output is more precise, informative, well grounded in the data sources, and contextually relevant than what could be achieved by a generative model alone, without the enriched input.
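
As a final sketch, the enriched prompt is passed to the generator. The example below uses the OpenAI chat API as one possible backend, with an illustrative model name; any hosted or local LLM would play the same role.

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate(prompt: str) -> str:
    """Send the augmented prompt to the LLM and return its answer."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice; any chat model works
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,      # keep the answer close to the provided context
    )
    return response.choices[0].message.content

# End-to-end usage, tying back to the rag_answer sketch from Section 2.1
# (the query is a made-up example):
print(rag_answer("What does our 2024 internal report say about churn?"))
```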

Figure 09: Prompt Augmentation and Answer Generation

3. Wrap-Up

Yes, large language models have revolutionized text generation, achieving impressive fluency and coherence in their outputs [7]. However, they face significant challenges, especially in dynamic and specialized contexts, including maintaining up-to-date knowledge, avoiding hallucinations, and providing personalized responses that meet user and business needs.

Retrieval-Augmented Generation offers a promising solution to these issues. By integrating external knowledge sources, it enhances LLMs with the ability to access and utilize more current, context-specific, and well-grounded information, reducing the likelihood of hallucinations and enabling more personalized outputs.

Thank you for taking the time to read this article, in which we’ve covered the basics of the naive RAG technique: its architecture, components, and processes. Future articles will explore how to refine and advance this basic RAG approach to better cater to the needs of both individuals and businesses. Stay tuned for these updates!

If you have any questions or comments, please feel free to reach out to the InfinitGraph team or directly to me 🤞 simply via:

Email: jf_rezkellah@esi.dz or LinkedIn: Rezkellah Rania Fatmazohra

Bibliography

[1] J. Eapen and V. Adhithyan, “Personalization and customization of LLM responses.”

[2] P. Lewis, E. Perez, A. Piktus, et al., “Retrieval-augmented generation for knowledge-intensive NLP tasks,” 2020.

[3] G. Xiong, Q. Jin, Z. Lu, and A. Zhang, “Benchmarking retrieval-augmented generation for medicine,” arXiv preprint arXiv:2402.13178, 2024.

[4] C. Zakka et al., “Retrieval-augmented language models for clinical medicine,” NEJM AI, 1(2):AIoa2300068, 2024.

[5] N. Wiratunga, R. Abeyratne, L. Jayawardena, K. Martin, S. Massie, I. Nkisi-Orji, et al., “CBR-RAG: Case-based reasoning for retrieval-augmented generation in LLMs for legal question answering,” in International Conference on Case-Based Reasoning, pp. 445–460, Springer Nature Switzerland, 2024.

[6] Y. Gao, Y. Xiong, X. Gao, et al., “Retrieval-augmented generation for large language models: A survey,” arXiv preprint arXiv:2312.10997, 2023.

[7] A. Jignasu, K. Marshall, B. Ganapathysubramanian, A. Balu, C. Hegde, and A. Krishnamurthy, “Towards foundational ai models for additive manufacturing: Language models for g-code debugging, manipulation, and comprehension,” arXiv preprint arXiv:2309.02465, 2023.
