Introducing RAG: A Developer’s Guide to a Scalable, Context-Driven LLM System

The RAG Workflow

Authors: Julia Barth, Valentin Milicevic

Imagine you are building a chatbot based on a company’s data lake that can answer complex questions with up-to-date information. How do you ensure that this system responds with the most relevant information? This is the promise of Retrieval Augmented Generation (RAG) — a scalable and context-driven LLM system.

In this article, we will introduce you to RAG and show you why RAG is the best choice for tackling these problems and, therefore, a must for GenAI developers. By doing so, we will not only show you the simplicity behind the hype but also give you technical insights, providing you with everything you need to know as a developer to get started with RAG. This article is the entry point of our RAG series and will equip you with the knowledge to understand our open-source code and advanced RAG system articles.

The general idea behind RAG systems is simple: enhance the LLM’s input with relevant information. Yet this simple step 1) allows for context-specific answers from reliable knowledge sources instead of the generic answers you would expect from classic LLMs and 2) avoids the scalability issues that arise when data volumes grow beyond what long-context LLMs can handle.

For example, imagine you are building a customer support chatbot for a tech giant. Generic answers are not an option, which excludes vanilla LLMs like ChatGPT. You need to be able to access and navigate the company’s vast, specific, and dynamic knowledge base to provide an accurate answer. However, the size and complexity of the knowledge base are the downfall of methods like long context windows, which do not scale well. This is where RAG shines. It pairs an LLM with a retrieval system. When a customer query comes in, RAG first fetches relevant documents from the knowledge base and then feeds these, along with the query, to the LLM. The result is a tailored response based on retrieved information, and due to the retrieval step, the system is much more scalable and dynamic.

Here, we see RAG’s true power. It’s not just about enhancing prompts — it’s about fundamentally changing how we approach information retrieval and generation in AI systems. They handle large, dynamic datasets with ease, providing a level of flexibility that traditional models cannot offer. They empower developers to create GenAI systems that are not just smart but adaptive, scalable, reliable, and context aware. Therefore, understanding and implementing RAG opens a world of possibilities.

This article presents the content in the following structure:

  1. Discover the two key components: Learn about the retriever and generator, and how they work together in RAG systems.
  2. Understand why we need RAG: Explore the limitations of alternative models and why RAG systems overcome these, making them essential.
  3. Reverse engineer RAG: Examine the detailed processes that enhance LLM performance through embedding models, vector databases, and similarity search engines.
  4. Explore the RAG working mechanism: Grasp the end-to-end workflow of a RAG system when a query is inserted into the RAG system.
  5. Navigate the technical side of RAG: Gain practical guidance on the steps to build RAG in your projects, from data preparation to building retriever and generator components.

Key Components of RAG — A Simple Explanation

For now, let us break down RAG as simply as possible. A RAG system consists of two main components: a retriever and a generator.

Conceptual diagram of the RAG components and their interactions. The retriever finds relevant documents that are combined with the user query. The combined input is fed into the LLM to obtain a response based on relevant information.

Retriever: The retriever is a system designed to search a database and find documents that are similar to the user’s query. It retrieves information that is contextually relevant to the user’s query.

Example: Suppose you’re building a trend report generator. For a query asking, “What were the trends on the stock market last week in the tech sector?”, the retriever would pull relevant sections from stock market reports or news articles that detail the tech sector during that period.

Generator: The generator is an LLM like GPT-4, responsible for creating a natural, eloquent, and comprehensible text. Using a prompt that combines the original query and the retrieved documents as input, it generates a contextually relevant response.

Example: The generator, using a GPT-4 model, crafts a textual response about the stock market trends by utilizing the specific information retrieved.

How They Work Together: The retriever’s output — relevant documents — is combined with the original query and fed into the generator. This ensures the final response is both accurate and aligned with the user’s request.
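In code, the interplay is only a few lines. The sketch below is purely conceptual: retrieve() and generate() are toy stand-ins for the retriever and generator discussed throughout this article, not a specific library API.

def retrieve(query: str, top_k: int = 3) -> list[str]:
    # Toy stand-in for the retriever: returns canned documents instead of searching a database
    corpus = ["Major tech stocks rose 5-7% last week.", "Nvidia gained over 10% on AI demand."]
    return corpus[:top_k]

def generate(prompt: str) -> str:
    # Toy stand-in for the generator (the LLM call)
    return f"(LLM answer grounded in: {prompt[:80]}...)"

def answer_query(query: str) -> str:
    documents = retrieve(query)
    context = "\n".join(f"- {doc}" for doc in documents)
    prompt = f"Answer the question based on the context.\nQuestion: {query}\nContext:\n{context}"
    return generate(prompt)

print(answer_query("What were last week's tech sector trends?"))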

Now we can make sense of the name Retrieval Augmented Generation. RAG is a generator that receives an input augmented with relevant information selected by a retriever. This sounds sophisticated, but it essentially describes a way of feeding an existing LLM with relevant, context-specific information to produce a better result. Before we go into the depths of RAG, let’s examine the shortcomings of alternative solutions and the need for RAG.

Why do we need RAG?

Consider the example from the introduction: you are tasked with building a customer support system for a tech giant to answer a query like “What products do you recommend based on my purchase history?”. Why is it not enough to ask the question directly to an LLM like ChatGPT?

Why do we need a better prompt?

Experiment: ChatGPT’s response to a question that requires private information to answer appropriately. When asked for recommendations based on a customer’s purchase history, the LLM responds that it cannot answer this inquiry. (The response was received 08/2024.)

Vanilla LLMs have several shortcomings, such as a lack of contextual depth. They tend to respond with generalized answers based on a very broad body of information. Furthermore, LLMs do not have access to custom, private, or up-to-date information. Therefore, if one posed a query like “What products do you recommend based on my purchase history?”, the LLM would either produce a potentially wrong generic answer — a so-called hallucination — or not answer the question at all (as seen in the picture above when we asked ChatGPT-4o in 08/2024), because it has no information about the customer’s purchase history.

What is the key missing factor here? Context. With additional information, the LLM could provide a better answer. There are two direct ways to steer an interaction with an LLM: 1) through its input, or 2) by changing the LLM itself. The latter corresponds to re-training the model. In our example, this means we would have to fine-tune the model on the company’s entire knowledge base, and repeat this every time the base changes — a Herculean task that is not only costly and inefficient but also yields poor results: fine-tuning is an excellent choice if you want to change the tone of the LLM’s answers (e.g., to answer like Shakespeare), but not if you want to ground the answers in selected facts. Instead, the better approach is to focus on the input. If we could provide the LLM with information via the prompt — such as documents about the product portfolio — we could steer the answer to be based on this information. This is what RAG systems do. They follow this simple idea:

Relevance in, Relevance out: If you want more relevant results, you need to input more relevant information.

How is the prompt improved?

Manually pre-selecting information is tedious and impractical, and we naturally want to avoid it. So why not add all the information to the LLM’s input? For the tech giant’s chatbot, this means we could use a long-context LLM and add all the documents needed for any customer inquiry, e.g., product documentation, customer information, user manuals, troubleshooting manuals, account, order, and billing information, FAQs, and more. This way we avoid the time-consuming manual search for relevant information and still get relevant answers, right? Not quite. Despite the continuous growth in the context sizes of LLM inputs, which allow processing larger volumes of data, this approach leads to poor performance and scalability issues. Evidence shows that RAG-based models outperform even the most sophisticated long-context LLMs (e.g., Claude Sonnet), especially in handling complex questions requiring multiple steps of reasoning. Moreover, the scalability of long-context LLMs is limited by token capacity, which is too low to handle any relatively large database. For example, the Gemini model, with its “1.5M token capacity”, restricts practical usage to 128k tokens per query [1]. To put this in perspective, 128k tokens roughly correspond to about 96,000 words.

The trick is not to pass all information but “the right” information. To obtain the “right” information for diverse queries without manual selection, we need a dynamic selector: a retriever. In the context of RAG, the “right” information is data that is most relevant and contextually appropriate to the query.

And that is retrieval augmented generation — a system to dynamically enrich a prompt with contextually relevant information.

Reverse Engineering RAG

To obtain an accurate, context-specific response, we need an LLM equipped with a dynamically enhanced prompt.

PROMPT_TEMPLATE = """
You are an assistant answering questions. You are provided with a question and context.
Answer the question based on the context.

Question: {query}
Context: {retrieved_documents}
"""
FILLED_PROMPT = """
You are an assistant answering questions. You are provided with a question and context.
Answer the question based on the context.

Question: What were the trends on the stock market last week in the tech sector?
Context:
- Document 1: "Nvidia's stock surged by over 10 percent last week, driven by strong demand for its AI-driven GPUs, which are essential for the ongoing AI revolution. Analysts noted significant institutional investments due to the company's leadership in the AI hardware market."
- Document 2: "Last week's stock market performance highlighted that major tech stocks like Apple, Microsoft, and Amazon saw a 5-7% increase due to strong quarterly earnings and positive market sentiment."
- Document 3: "AI-related tech stocks, with companies like NVIDIA and Google seeing double-digit growth, driven by increased demand for AI-driven solutions and positive earnings reports."
"""

How is this achieved? For RAG, it is most common to use a prompt template like the one above. It contains three types of information:

  1. Instructions: The guide for the LLM on how to respond. One can specify the tone, detail, or priorities. For RAG, one typically instructs the LLM to leverage the context to answer the query.
  2. Query: The user’s input — the question or issue they’re seeking help with.
  3. Context: The relevant documents retrieved provide the factual foundation for the LLM’s response.

For each query, the system dynamically reconstructs the prompt by integrating these components. This ensures that each response is tailored to the specific query and is informed by query-specific contextually relevant documents.
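As a minimal sketch, this dynamic reconstruction can be a simple string format. The snippet assumes the PROMPT_TEMPLATE defined above; the build_prompt helper is just an illustrative name, not a library function.

# Assumes PROMPT_TEMPLATE as defined above
def build_prompt(query: str, retrieved_docs: list[str]) -> str:
    # Number the retrieved chunks so the LLM (and the reader) can tell them apart
    context = "\n".join(f'- Document {i + 1}: "{doc}"' for i, doc in enumerate(retrieved_docs))
    return PROMPT_TEMPLATE.format(query=query, retrieved_documents=context)

filled_prompt = build_prompt(
    "What were the trends on the stock market last week in the tech sector?",
    ["Nvidia's stock surged by over 10 percent last week ...",
     "Major tech stocks like Apple, Microsoft, and Amazon saw a 5-7% increase ..."],
)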

Now there is one puzzle piece missing: how these documents are retrieved. This is where RAG’s true magic lies.

Retriever — magic behind dynamically retrieving the relevant context

The three components of a retriever: an embedding model, a (vector) database, and a similarity search engine.

The retriever component in RAG systems is the key to ensuring that queries are enriched with contextually relevant information. It searches a large corpus — the database — to find information that semantically matches the query. This is achieved using three key components: an embedding model, a vector database, and a similarity search engine.

Embedding Model. Comparing texts according to their semantic similarity is not straightforward. Therefore, documents and queries are transformed into numerical representations that allow for mathematical operations. One uses an embedding model to encode both queries and documents into vectors that capture the text’s essence and semantic meaning. This can be done with models like BERT or Dense Passage Retrieval (DPR), which is specifically designed for retrieval and uses two separate BERT-based encoders for queries and documents.

The embedding model transforms textual information into numerical representations (vectors).
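As an illustration, encoding documents and a query takes only a few lines with the sentence-transformers library; the model name all-MiniLM-L6-v2 is just one common choice, not a requirement.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # any sentence-embedding model works here

documents = [
    "Nvidia's stock surged by over 10 percent last week, driven by AI demand.",
    "In 2020, Tesla announced a stock split.",
]
query = "What were the trends on the stock market last week in the tech sector?"

doc_vectors = model.encode(documents)   # one vector per document (384 dimensions for this model)
query_vector = model.encode(query)      # a single vector for the query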

Similarity Search Engine. The Similarity Search Engine compares the query vector with document vectors to find the best matches and retrieve the most relevant documents. This is where the embedding space comes into play since it locates semantically similar embeddings close to each other. This enables the use of geometric similarity measures like the cosine similarity. Cosine similarity corresponds to the cosine of the angle between two vectors, indicating their directional alignment. This ensures a fast and accurate retrieval of contextually relevant information in RAG systems.

Cosine similarity uses the angle between vectors to determine semantic relevance. A smaller angle (cos(θ) ≈ 1) indicates a relevant document, while a larger angle (cos(θ) ≈ 0) suggests irrelevance. For RAG, we compare the vector of the query VecQ with the vectors of different documents. The vector VecR, encoding last week’s development of NVIDIA, is relevant to the query and therefore forms a small angle with VecQ. The vector VecIR, which represents a document about Tesla stocks in 2020, is not relevant and forms a wide angle with VecQ.

For example, with the query “What were the trends on the stock market last week in the tech sector?”, a retriever using cosine similarity would select documents closely aligned in the embedding space. Relevant documents might include those about tech companies like “NVIDIA” (VecR in the diagram), recent tech trends, or last week’s performance data for top tech companies. These would score a cosine similarity close to 1. In contrast, unrelated documents, such as those about other sectors, outdated reports, or “Tesla” stocks in 2020 (VecIR in the diagram), would score much closer to 0.
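A minimal numerical sketch of this comparison, using toy 3-dimensional vectors in place of real embeddings (named after VecQ, VecR, and VecIR from the figure):

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # cos(theta) = (a . b) / (||a|| * ||b||)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

vec_q  = np.array([0.9, 0.1, 0.2])   # query: last week's tech-sector trends
vec_r  = np.array([0.8, 0.2, 0.1])   # relevant document: NVIDIA's surge last week
vec_ir = np.array([0.1, 0.9, 0.3])   # irrelevant document: Tesla stocks in 2020

print(cosine_similarity(vec_q, vec_r))    # ~0.99 -> retrieved
print(cosine_similarity(vec_q, vec_ir))   # ~0.27 -> skipped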

Vector Database. Efficient, search-optimized storage is crucial for document retrieval. Vector databases are ideal for this, as they excel at managing dense vector representations optimized for similarity search. They handle high-dimensional vectors using techniques such as efficient indexing strategies (e.g., HNSW), Approximate Nearest Neighbor (ANN) search, partitioning, and hashing to group similar vectors and speed up search. Cloud and infrastructure optimizations further boost scalability and speed. Moreover, these databases allow rapid updates, making RAG systems scalable and efficient at handling a growing or dynamic knowledge base. Additionally, this closed-container approach gives you control over the knowledge base: you can select and maintain the sources used, ensuring that the content is always up-to-date and relevant, which in turn makes the responses more reliable and trustworthy. Note that a RAG system does not necessarily have to use a vector database for storage, but it is the most common choice.

Visualization of the Vector Space in 3D. This figure illustrates a 3D representation of the vector space used in the RAG system. Each point represents the embedding of a document within the vector database. The proximity of the points indicates relevance and similarity, showcasing how embeddings are searched and matched.
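As a rough sketch of such a store, the snippet below builds an HNSW index with FAISS (one of the libraries mentioned later) over L2-normalized vectors so that inner-product search corresponds to cosine similarity; the random vectors merely stand in for real chunk embeddings.

import faiss
import numpy as np

dim = 384                                                    # dimensionality of the embedding model
doc_vectors = np.random.rand(10_000, dim).astype("float32")  # stand-in for real chunk embeddings
faiss.normalize_L2(doc_vectors)                              # normalize so inner product equals cosine similarity

index = faiss.IndexHNSWFlat(dim, 32, faiss.METRIC_INNER_PRODUCT)  # HNSW graph for approximate nearest neighbors
index.add(doc_vectors)

query_vector = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query_vector)
scores, ids = index.search(query_vector, 3)                  # indices of the 3 most similar chunks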

Diving Deeper — RAG Working Mechanism and Architecture

There are two important perspectives you need to know about: 1) the user’s and 2) the developer’s. We first look at what happens when a user inputs a query, and in the next section, we focus on the developer’s perspective.

The End-to-End RAG System Workflow: This diagram illustrates the flow of a Retrieval-Augmented Generation (RAG) system. Starting with a user query, the query is encoded into an embedding, matched against a vector database, and the relevant documents are retrieved and combined with the query. Finally, the Large Language Model (LLM) generates and returns a contextual response.
  1. User Submits Query: A user inputs a query into the system.
  2. Query Encoding: The system encodes the input query into a dense vector with an embedding model such as BERT or Sentence Transformers. This transformation is a necessary step as it converts the textual query into a numerical format that can be easily compared against other encoded documents.
  3. RAG System Queries Documents: The encoded query is used to search for relevant documents within the system’s database.
  4. Document Retrieval: The system compares the query vector to the document vectors stored in the vector database and retrieves the top relevant documents (in the form of embeddings) or passages based on their similarity scores to the query. This step ensures that the most contextually appropriate information is selected for the next stage.
  5. Combine Query and Documents: The retrieved documents or passages are combined with the original query to form a rich, contextually enhanced input. This combination typically involves concatenating the text of the query with the text of the retrieved documents, creating a single input.
  6. LLM Generates Answer: The combined query and documents are sent to a Large Language Model, which generates an answer.
  7. RAG System Returns Answer: The generated answer is returned to the user through the RAG system.
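Tying these seven steps together, here is a minimal end-to-end sketch. It assumes sentence-transformers and FAISS as in the earlier snippets and treats the LLM as any callable that maps a prompt to an answer; the tiny in-memory corpus is only for illustration.

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Toy knowledge base; in practice these chunks come from the data-preparation step
chunks = [
    "Nvidia's stock surged by over 10 percent last week on AI demand.",
    "Apple, Microsoft, and Amazon rose 5-7% on strong quarterly earnings.",
    "In 2020, Tesla announced a stock split.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = model.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(doc_vectors.shape[1])          # exact inner-product search
index.add(np.asarray(doc_vectors, dtype="float32"))

def rag_answer(query: str, llm, top_k: int = 2) -> str:
    # Steps 2-4: encode the query and retrieve the most similar chunks
    q = model.encode([query], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q, dtype="float32"), top_k)
    context = "\n".join(f"- {chunks[i]}" for i in ids[0])
    # Step 5: combine query and retrieved documents into one prompt
    prompt = f"Answer based on the context.\nQuestion: {query}\nContext:\n{context}"
    # Steps 6-7: the LLM (any callable taking a prompt) generates and returns the answer
    return llm(prompt)

# Example with a dummy LLM callable:
print(rag_answer("What were last week's tech trends?", llm=lambda p: p[:120] + "..."))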

The RAG Pipeline

To implement RAG, it is crucial to understand every step in the pipeline. This involves more than just the retriever and generator components — document pre-processing, storage, and post-processing for retrieval and generation must be thoroughly executed to ensure accurate answers.

1. Data preparation and storing

Knowledge Source. Identify and preprocess the knowledge source. This could be an article, a collection of documents, or a database. This additional information is the great strength of RAG and can contain domain-specific knowledge (e.g., a medical database), custom information (e.g., a company’s product features), or more up-to-date data (knowledge published after the pre-training cutoff of the LLM, e.g., real-time financial data). To improve the results, ensure that the data is clean and of high quality: remove irrelevant or redundant information and standardize the format.

Chunking. Split these documents into smaller sections called chunks for more precise retrieval. This can be done by segmenting the documents based on logical units such as pages, paragraphs, sentences, or thematic sections to preserve context and relevance. By using chunks instead of full documents, the retriever can focus on the most pertinent information, reducing noise and enhancing retrieval precision. Optionally, chunks can be enriched with additional information, for example metadata, document summaries, or keywords.
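A simple paragraph-based chunker with a small overlap might look like the sketch below; in practice you may prefer a ready-made splitter from your framework of choice, and the size limits here are arbitrary.

def chunk_by_paragraphs(text: str, max_chars: int = 1200, overlap: int = 1) -> list[str]:
    """Split a document into paragraph-based chunks with a small paragraph overlap."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for p in paragraphs:
        if current and sum(len(x) for x in current) + len(p) > max_chars:
            chunks.append("\n\n".join(current))
            current = current[-overlap:]        # keep the last paragraph(s) to preserve context
        current.append(p)
    if current:
        chunks.append("\n\n".join(current))
    return chunks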

Creating Embeddings. Transform each chunk into a dense vector using an embedding model, such as BERT or Sentence Transformers. These vectors capture the semantic meaning of each chunk, allowing for efficient similarity searches.

Storing and indexing. Store these embeddings in an appropriate storage system. A common practice, as discussed above, is a vector database. Weaviate, ChromaDB, FAISS, Milvus, Pinecone, and Qdrant are among the best-performing options; they support quick and accurate retrieval of relevant chunks based on their vector representations and can be enhanced with different searchable indices and fast retrieval methods.

Example: For the trend report generator, we use financial news articles, market analysis reports, and historical stock data as a knowledge source and remove advertisements to clean the texts. We can chunk the documents into chapters or thematic sections since the task requires bigger text snippets than just a single sentence to provide enough context for the report. To better identify the most important articles, we include summaries of each document in each chunk. Then, we encode the documents using Sentence Transformers and store the chunks in the popular vector search library FAISS (developed by Facebook AI), which is known for its high speed and performance.
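A compact sketch of this preparation step, under the same assumptions as the snippets above (Sentence Transformers for encoding, FAISS for storage, and the chunk_by_paragraphs helper from the chunking sketch); the summarize() helper and the document contents are hypothetical placeholders.

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

def summarize(text: str) -> str:
    # Hypothetical placeholder: could be another LLM call or an extractive summarizer
    return text[:200]

# Hypothetical raw documents; in the real system these come from the knowledge source
documents = {
    "market_report_week33.txt": "Tech stocks rallied last week.\n\nNvidia gained over 10 percent on AI demand.",
}

model = SentenceTransformer("all-MiniLM-L6-v2")

# Enrich each chunk with metadata so that sources and summaries travel with the text
records = [
    {"text": chunk, "source": name, "summary": summarize(text)}
    for name, text in documents.items()
    for chunk in chunk_by_paragraphs(text)      # chunker from the sketch above
]

vectors = model.encode([r["summary"] + "\n" + r["text"] for r in records],
                       normalize_embeddings=True)
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(np.asarray(vectors, dtype="float32"))
# records[i] holds the text and metadata for the vector stored at position i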

2. Retriever Component

Query Encoding. Convert the user’s query into a vector representation using an embedding model such as BERT or Sentence Transformers, which is usually the same model used for the document chunks. This representation captures the semantic meaning of the query, which is essential for effective retrieval.

Search and retrieval. Use a retriever to search an indexed vector database. The retriever component identifies and returns the top n most relevant chunks related to the query. This ensures that the information fed into the generator is contextually appropriate and accurate. The retrieval can be based on the semantic meaning (Semantic Retrieval), exact term matching (Lexical Retrieval), or even both (Hybrid Retrieval, which can be part of a complex RAG — we will talk about this type of RAG later).

Retrieval Enhancement. Optionally, enhance the retrieval. This might involve:

  • Re-ranking: Adjusting the relevance ranking of retrieved chunks based on additional criteria or user feedback. Highly recommended!
  • Enriching: Adding additional contextual information or data.
  • Summarizing: Condensing long retrieved chunks into more digestible summaries.
  • Filtering: Removing any redundant or less relevant chunks to focus on the most pertinent information.

Example: For this task, we would use the same embedding model. The FAISS vector database provides a retrieval mechanism utilizing efficient similarity search. Finally, we would enhance the retrieval by summarizing these chunks to emphasize key trends and refining the relevance ranking to ensure the most pertinent information is highlighted.
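A sketch of such a retriever, including the recommended re-ranking step with a cross-encoder. The index and records are assumed to come from the data-preparation sketch above, and the model names are common public choices rather than requirements.

import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder

model = SentenceTransformer("all-MiniLM-L6-v2")                      # same model as at indexing time
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")      # one common public re-ranking model

def retrieve(query: str, index, records, top_k: int = 20, final_k: int = 5) -> list[str]:
    # 1. Encode the query with the same embedding model used for the chunks
    q = model.encode([query], normalize_embeddings=True)
    # 2. Similarity search over the vector index built during data preparation
    _, ids = index.search(np.asarray(q, dtype="float32"), top_k)
    candidates = [records[i]["text"] for i in ids[0] if i != -1]
    # 3. Re-rank the candidates with the cross-encoder and keep the best final_k
    scores = reranker.predict([(query, c) for c in candidates])
    order = np.argsort(scores)[::-1]
    return [candidates[i] for i in order[:final_k]]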

3. Generator Component

Prompt Construction. Augment the prompt used for generation with the retrieved information. This is done by combining the query with the retrieved data to obtain a rich, contextually enhanced input for the LLM. One common way is to use a prompt template that specifies a query and a context variable (an example follows below); the context variable corresponds to a list of the most relevant document chunks. The filled prompt is then used as the input for the LLM. This method also leaves room for further prompt engineering. There are many other ways to combine the two components, such as using a query engine, which takes over the construction of an appropriate prompt. All of them ensure that the generative model is grounded in the most relevant information — the retrieved data.

PROMPT_TEMPLATE = """
You are an assistant answering questions. You are provided with a question and context.
Answer the question based on the context.
Question: {query}
Context: {retrieved_documents}
"""

Text Generation. Use a generative model, such as OpenAI’s GPT-4, Google’s Gemini, or Meta’s Llama 3, to produce the output based on the augmented input. For sensitive data, such as confidential company information, we recommend a locally hosted model like Meta’s Llama. Such models can run on your own infrastructure, keeping sensitive data off external servers and reducing the risk of data exposure.
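A minimal generation sketch using the OpenAI Python client (v1-style chat completions); gpt-4o is one possible model choice, and filled_prompt is assumed to be the output of the prompt-construction step above.

from openai import OpenAI

client = OpenAI()   # reads OPENAI_API_KEY from the environment

def generate_answer(filled_prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",                                        # one possible model choice
        messages=[{"role": "user", "content": filled_prompt}],
        temperature=0.2,                                       # keep the answer close to the retrieved facts
    )
    return response.choices[0].message.content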

Validation and filtering. Validate the generated output for accuracy and coherence. Filters are implemented to remove any irrelevant or incorrect information, ensuring the quality of the final output.

Refinement. Optionally, apply additional refinement steps. This may include re-ranking or re-generating the responses or integrating feedback mechanisms to continuously improve the system’s performance.

Example: We have already seen above what the prompt looks like for a potential query. Regarding the generator model, since no sensitive data is involved, we can use OpenAI’s GPT-4o model. We can then validate the results by cross-checking key statistics and facts against the original data sources. If the results are not formatted as expected, add another LLM pass with appropriate instructions, using the first result as input.

In conclusion, developing an effective chatbot or text generator requires more than traditional LLMs. Long-context LLMs and fine-tuned models still have limitations in accuracy, scalability, context specificity, and updating capabilities. RAG systems overcome these challenges by enhancing LLM inputs with relevant information, enabling accurate and context-aware responses.

The RAG framework is built on two key components: the retriever, which searches for and retrieves relevant information, and the generator, which produces contextually enhanced responses. RAG’s true power lies in its ability to dynamically retrieve semantically related documents, providing accurate and relevant context for generating text.

This article serves as the starting point, equipping you with foundational knowledge to delve deeper into our RAG series and practical implementation guides.

RAG developers can create intelligent, adaptive, and scalable GenAI systems, transforming how information retrieval and generation are approached in AI.
