RAG for Quality Engineers

Building RAG is easy, building quality RAG is hard

Blake Norrish
Slalom Build

--

Retrieval-augmented generation (RAG) has become a common pattern for extending the capabilities of large language models (LLMs).

RAG is simple in theory (just add data to the context window!) but complex in practice. Hidden beyond the box diagrams are advanced chunking strategies, reranking, multi-query-retrievers, small-to-big retrieval, hypothetical document embeddings, pre-embedding data enrichment, dynamic routing, custom embedding models … and on and on.

While setting up an initial pipeline could be fast and easy, getting to production-level quality is significantly more complex. Without careful consideration, RAG systems can return incorrect, irrelevant, or inconsistent information; they can struggle with poor performance, inefficiently consume expensive resources, or choke when scaling to production-scale source data.

Understanding how to effectively and efficiently evaluate the quality of RAG systems requires understanding how all the individual pieces work together to create a full RAG pipeline. Design decisions for each of these pieces can impact quality and should be understood by everyone attempting to deploy RAG applications.

Below is an introduction to RAG concepts and patterns from a testing and quality point of view. We will start with why RAG is valuable, and then discuss how the many design decisions inherent in building production-quality RAG affect that quality. This intro will provide a necessary foundation before we can discuss specific evaluation approaches and techniques for RAG systems.

The Limitation of LLMs

In order to understand RAG pipelines, we should first step back and understand the limitations of LLMs that RAG seeks to address.

At their core, LLMs are simple: you send them a prompt and you get back a response.

To return a response, an LLM must run an inference calculation against a model. This calculation involves combining the inputs with many millions or even hundreds of billions of parameters that define the model. This is an expensive computation.

As expensive as calling an LLM is, training an LLM is orders of magnitude harder. Training is the process of determining the best values for the parameters within the model. There are different algorithms used to calculate the best weights, but all involve an iterative process of running the model on a given input, calculating an error, and then back-propagating a correcting adjustment such that the answer would be slightly better. This is done many, many times against many, many inputs, and eventually you get a trained model. While model inference can take seconds, model training can take weeks, even on massive clusters of GPUs.

A single NVIDIA H100, going for about $40,000 USD (Q1 2024)

The huge training cost creates a bottleneck for incorporating new or updated information into an LLM. Most companies do not have the resources to train models and cannot simply “add new information” to an LLM by training it on private data. Instead, large well-funded technology companies train general purpose foundation models on large, public data sets, and these models are augmented with new abilities and information with secondary processes like RAG.

Specifically, RAG seeks to give LLMs access to large quantities of additional knowledge in a way that circumvents the prohibitively expensive process of training new models.

RAG Basics

Let’s get into how RAG actually works.

The prompts sent to an LLM have limited length, called the context window. Context windows are measured in tokens (for our purposes, about equivalent to words). Context windows usually come in sizes like 1K, 4K, or more tokens, although much larger context windows are becoming available (example: Gemini 1.5 Pro with 128K).

Many people intuitively think of the context window as simply the longest question you can ask, but this is a limiting way to think about it. Because of the way LLMs work, any information provided in the context window is available to the LLM as it generates a response, so it can be used to supply additional information. This is generally called in-context learning.

Thus, we can use the context window to provide new knowledge necessary for the LLM to answer the question. For example, we could create a prompt that asks about a company’s policy on bereavement, and then stuff the entire company handbook into the prompt (including sections on bereavement), as long as it fits within the context window. This method would allow the LLM to respond using the new information provided.
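A minimal sketch of this kind of naive context stuffing (the call_llm function is a hypothetical placeholder for your LLM client, and the one-word-per-token estimate is a rough approximation):

```python
# A minimal sketch of naive "context stuffing" (in-context learning). The
# call_llm function is a hypothetical placeholder for your LLM client, and the
# one-word-per-token estimate is a rough approximation.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("swap in your LLM client here")

def answer_from_handbook(question: str, handbook_text: str, context_window: int = 4000) -> str:
    if len(handbook_text.split()) + len(question.split()) > context_window:
        raise ValueError("The handbook does not fit in the context window; we need retrieval.")
    prompt = (
        "Answer the question using only the company handbook below.\n\n"
        f"HANDBOOK:\n{handbook_text}\n\n"
        f"QUESTION:\n{question}"
    )
    return call_llm(prompt)
```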

This solution is simple when we already have the relevant information and that information can fit within the context window. Unfortunately, this is not always the case. Thus, we need some mechanism to retrieve and down-select only information relevant to our prompt.

A naive approach would be to do a keyword search for terms in the prompt across the entirety of data that could be relevant, copy text surrounding the hit, and then add this text to the prompt.

This simple form of keyword-search RAG could improve the LLM response and may be useful in some situations, but it can also suffer from false positives (keywords used in unrelated contexts). Luckily, we can do better by leveraging semantic search, where we match on the meaning of the text, not the raw words.

Specifically, we can leverage embedding models to create embeddings from chunks of the possibly relevant data, and then perform searches across these embeddings to find data that is relevant to our prompt. This approach is highly simplified, but it is starting to look like real RAG.

RAG and Embeddings

Understanding the benefits RAG provides over simple keyword searching requires understanding the purpose and nature of embedding models. This is a deep topic in itself, but it is critical to understanding RAG.

Embedding models are similar to our original LLM, but instead of generating new content, they reduce their input to a vector (just a list of numbers, though for embedding models a very long one). The vectors that embedding models create usually have 768 or 1536 numbers (dimensions), but vectors of other sizes also exist.

The vector created by an embedding model isn’t just a random set of numbers; it is a distillation of the meaning of the input data according to the model. The vector has no meaning to other models, but “similar” text will create similar vectors in the same model. Similar is more than just “has the same keywords”—embedding models are specifically trained to distill deeper, semantic meaning from unstructured data. For example “Guy horses do not fly” and “A fly guy horsing around” will not be close vectors despite having similar words.

The great thing about vectors is that you can perform math on them. Fast math. It is possible to search across many millions of vectors to find similar vectors in reasonably small amounts of time. (Here are some of the algorithms employed.)
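As a tiny illustration of that math, cosine similarity between two vectors can be computed directly; the vectors below are made-up and low-dimensional (real embeddings have hundreds or thousands of dimensions):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Values near 1.0 mean "very similar meaning" according to the embedding
    # model; values near 0 mean unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-dimensional embeddings for illustration only.
query_vec = np.array([0.1, 0.9, 0.3, 0.0])
chunk_vecs = np.array([
    [0.1, 0.8, 0.4, 0.1],   # semantically close to the query
    [0.9, 0.0, 0.1, 0.7],   # unrelated
])

scores = [cosine_similarity(query_vec, v) for v in chunk_vecs]
best_chunk_index = int(np.argmax(scores))  # index of the most similar chunk
```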

Now we have the pieces of our RAG pipeline, so let’s walk through the steps.

The first four steps are done once up front (and rerun as source data changes); steps five through eight are done for each inference request:

  1. We collect all our possibly relevant data—so much data that we couldn’t possibly fit it in the context window of our prompt.
  2. We chunk this data up into smaller pieces (more on this later).
  3. Then we run each chunk through an embedding model to create a vector that encapsulates the meaning of the chunk.
  4. We save the vector into a vector database.
  5. When we get a prompt, we run it through the same embedding model used for the chunked source data to produce another vector (called the prompt vector or query vector).
  6. We search our vector database for vectors that are similar to our prompt vector. The vectors returned will be better matches than if we had just keyword-searched the raw data.
  7. We (optionally) rerank the identified relevant vectors, and then return the raw data of each of the top vectors.
  8. The raw data is combined with the initial prompt and sent to the LLM.
Basic RAG: green steps are done once as pre-processing, blue steps are done on each inference request
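To make the flow concrete, here is a toy, in-memory sketch of these eight steps. The embed and call_llm functions are hypothetical placeholders, and a plain Python list stands in for a real vector database:

```python
import numpy as np

# A toy, in-memory walkthrough of the eight steps above. The embed and call_llm
# functions are hypothetical placeholders; a plain Python list stands in for a
# real vector database.

def embed(text: str) -> np.ndarray:
    raise NotImplementedError("run your embedding model here")

def call_llm(prompt: str) -> str:
    raise NotImplementedError("call your LLM here")

def chunk(document: str, size: int = 200) -> list[str]:
    words = document.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Steps 1-4 (pre-processing): collect, chunk, embed, and store the source data.
def build_index(documents: list[str]) -> list[tuple[np.ndarray, str]]:
    return [(embed(c), c) for doc in documents for c in chunk(doc)]

# Steps 5-8 (per request): embed the prompt, search, assemble the context, call the LLM.
def answer(query: str, index: list[tuple[np.ndarray, str]], top_k: int = 3) -> str:
    query_vec = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(query_vec, pair[0]), reverse=True)
    context = "\n\n".join(text for _, text in ranked[:top_k])
    prompt = f"Use the context below to answer.\n\nCONTEXT:\n{context}\n\nQUESTION:\n{query}"
    return call_llm(prompt)
```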

Voila—our LLM now behaves as if it were trained on all the new, proprietary data we fed into our vector search, without having to perform the prohibitively expensive training of a foundation model.

At least, this is how it should work in theory. In practice, this overly simplistic pipeline will probably not meet your production needs, and you will need to adapt, improve, swap, or expand different parts to meet the needs of your specific application before you get to a quality, production-ready RAG pipeline.

RAG Design and Quality

The above introduces RAG, but as we said before, RAG in practice can be significantly more complex, and these real-world complexities can impact application quality. Let’s walk through each step to understand some of the implementation challenges, quality risks, and possible alternatives available within the RAG pipeline.

#1—Sourcing Relevant Data, Ingestion, and Enriching

Starting from the beginning, we must find all the “possibly relevant data.”

The single icon (#1) in the RAG diagram above is more likely to be an entire data pipeline (or set of pipelines!) that ingests data from multiple sources; stages it; curates it; possibly transforms, anonymizes, and tokenizes it; and performs other processing actions common in data pipelines.

Some of these pipelines can get very complex, especially if the raw data is in formats other than text. For example, some pipelines make extensive use of OCR technology to ingest large amounts of scanned physical documents.

With all the complexity of a data pipeline comes all the challenges of testing data pipelines.

The most well-implemented RAG pipeline will fail miserably if the source data isn’t even making it into the vector DB, and depending on the variety, velocity, and volume of this data, the ingestion phase of RAG can be complex and the source of many application quality issues.

In addition to the normal data pipeline activities, RAG can benefit from data enrichment. Often, other systems (or people) have context about source data that could be enormously beneficial for evaluating its meaning. For example, a customer database could be enriched with tags or annotations from other systems that add pertinent information. Oftentimes, other generative models or NLP are used to create cleaner or summarized metadata. Think of all this as “pre-processing” before embedding generation, and if done right, it can significantly increase the retrieval quality.
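As a sketch of what enrichment might look like before embedding (the summarize function and metadata fields are hypothetical examples):

```python
# A hypothetical enrichment step: before embedding, each chunk is tagged with
# metadata and given a cleaner summary. The summarize function is a placeholder
# for an NLP model or a cheaper LLM call, and the metadata fields are examples.

def summarize(text: str) -> str:
    raise NotImplementedError("use an NLP model or LLM to produce a clean summary")

def enrich_chunk(chunk_text: str, source_system: str, doc_type: str) -> dict:
    return {
        "text": chunk_text,
        "summary": summarize(chunk_text),  # often what actually gets embedded
        "metadata": {"source": source_system, "type": doc_type},  # usable for filtering or routing
    }
```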

If you are evaluating the quality of your RAG retrieval system, it is well worth your time to understand how the data is sourced and ingested before it ever hits the fancy AI parts of your RAG pipeline.

#2—Chunking

After data is ingested but before it can be run through an embedding model, it has to be split up into discrete pieces. So, how do you decide how to split the data? This is called your chunking strategy.

How large or how small of chunks are optimal? Should chunks overlap? Are there smarter ways to chunk than just dividing by page, paragraph, or fixed length? How should data in non-standard formats (code, JSON, etc.) be chunked?

These are the questions that a chunking strategy tries to answer, and there is no perfect solution. Different strategies have different trade-offs. Some are simple and fast to implement, giving passable results. Some are more complex and involved, and they can provide better hit rates and LLM response quality. Chunk your data too coarsely, and you may stuff your context window with irrelevant data, crowd out other relevant chunks, or create embeddings that are too generic to obtain meaningful matches. Chunk it too finely and you may “clip off” relevant data.
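As a concrete baseline, here is a minimal fixed-size chunker with overlap (a sketch only; the chunk size and overlap values are arbitrary examples that would need tuning for your data):

```python
# A minimal fixed-size chunker with overlap. Overlap reduces the risk of
# clipping a relevant sentence in half at a chunk boundary.

def fixed_size_chunks(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]
```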

This article explores five categories of chunking: fixed-size, recursive, document-based, semantic, and agentic (using AI to chunk, nifty!).

There are many other approaches that can be used to optimize chunking. For example, in small-to-big retrieval, small chunks are used to search, but each chunk is linked to a larger parent chunk that is retrieved and inserted into the context window. Context-aware chunking uses existing knowledge about the nature of the documents to intelligently split them into logical chunks.

This list is probably not exhaustive, but shows the diversity of options available to RAG implementers and the importance of an appropriate and tuned chunking strategy to the overall quality of the application. The Pinecone blog has more detail on many of these strategies.

#3—Embedding Model Choice and Configuration

There are many models that can be used to generate embeddings, and different models will perform better or worse in different situations. Some models come pretrained for general use, and some are fine-tuned for specific domains (e.g., medical records). It is also possible to fine-tune your own embedding models for the specific data handled by your application.

In addition, many models come in different sizes (which affect the cost and time of embedding generation), different input lengths (the maximum chunk size the model can handle), and different output vector dimensions (higher dimensions are generally more accurate, but require more storage and are slower to search).

Some embedding models can only be accessed via an API (for example, OpenAI’s embeddings endpoint), while others are fully open source and can be downloaded and run locally or hosted in a cloud provider.
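For instance, with the open-source sentence-transformers library, swapping models (and their output dimensions) is a one-line change; the model names below are illustrative examples only, not recommendations:

```python
from sentence_transformers import SentenceTransformer

# Two open-source models, chosen only as illustrative examples, with different
# sizes and output dimensions (and therefore different cost/accuracy trade-offs).
small_model = SentenceTransformer("all-MiniLM-L6-v2")    # 384-dimension vectors, fast
larger_model = SentenceTransformer("all-mpnet-base-v2")  # 768-dimension vectors, slower

text = "What is the company policy on bereavement leave?"
print(small_model.encode(text).shape)   # (384,)
print(larger_model.encode(text).shape)  # (768,)
```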

It is also possible to use different embedding models for different data paths within your application.

While a generally good embedding model may be sufficient for many RAG applications, others may benefit from a specific embedding model or a custom-trained model.

Knowing the design considerations that went into your embedding strategy and the quality characteristics of that choice will provide insight into the evaluation needs and approach of your application. For additional reading, here is a deeper discussion on evaluating embedding model choices.

#5—Query Processing and Embedding

There is no rule saying you have to run the embedding model on the incoming query exactly as you receive it. In fact, there are a lot of ways you can optimize this query and the resulting embedding search to improve the overall quality of your application. This is even more true if the query is coming directly from a human user who may have written a vague, whimsical, ambiguous query.

With some additional knowledge about the nature or intent of the application, it may be possible to use an LLM or traditional logic to reduce or rewrite the query in a more compact and explicit way, i.e., rewrite the query to be what was intended, not what was actually asked.

An advanced form of query processing is HyDE (hypothetical document embeddings): an LLM first generates a hypothetical answer document, and the vector search then looks for documents similar to that hypothetical answer (answer to answer), rather than embedding and searching on the query itself (question to answer).

Another option is to split the query into multiple related queries and run each of these in parallel, combining the results—the multi-retriever pattern. This option obviously comes at a processing cost, but it can improve retrieval quality.
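Here is a sketch of the multi-query pattern; generate_query_variants and vector_search are hypothetical placeholders for an LLM-based rewriter and your vector DB client:

```python
from concurrent.futures import ThreadPoolExecutor

# A sketch of the multi-query pattern. generate_query_variants and vector_search
# are hypothetical placeholders (an LLM rewriter and your vector DB client).

def generate_query_variants(query: str) -> list[str]:
    raise NotImplementedError("ask an LLM for a few rephrasings of the query")

def vector_search(query: str, top_k: int = 5) -> list[str]:
    raise NotImplementedError("embed the query and search the vector DB")

def multi_query_retrieve(query: str) -> list[str]:
    variants = [query] + generate_query_variants(query)
    with ThreadPoolExecutor() as pool:
        result_sets = list(pool.map(vector_search, variants))
    # Merge and de-duplicate while preserving order; the union then goes to reranking.
    seen, merged = set(), []
    for chunks in result_sets:
        for chunk_text in chunks:
            if chunk_text not in seen:
                seen.add(chunk_text)
                merged.append(chunk_text)
    return merged
```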

Depending on the specifics of your use case, custom query processing may be warranted and can significantly impact the quality and behavior of your application.

#4, #6—Vector DB and Vector Search

While vector search is fast, searching a vector DB for embeddings similar to the query still has a time (and possibly money) cost. One approach to minimizing this cost is semantic caching: after a query is answered, the response is cached along with the query’s embedding, so semantically similar queries in the future can return results directly from the cache.

Of course, caching adds complexity (and is one of the two hard problems in computer science—I can’t remember the name of the other one). While caching can improve performance, a stale cache can be detrimental to response quality, especially in environments with volatile source data.
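A minimal sketch of the idea, assuming a hypothetical embed function and an arbitrary similarity threshold:

```python
import numpy as np

# A minimal semantic cache sketch: responses are stored alongside the embedding
# of the query that produced them, and a new query reuses a cached response when
# it is "close enough". The 0.95 threshold and the embed stub are illustrative.

def embed(text: str) -> np.ndarray:
    raise NotImplementedError("run your embedding model here")

class SemanticCache:
    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.entries: list[tuple[np.ndarray, str]] = []  # (query embedding, cached response)

    def get(self, query: str) -> str | None:
        query_vec = embed(query)
        for vec, response in self.entries:
            similarity = float(np.dot(query_vec, vec) / (np.linalg.norm(query_vec) * np.linalg.norm(vec)))
            if similarity >= self.threshold:
                return response  # note: a stale entry here directly hurts response quality
        return None

    def put(self, query: str, response: str) -> None:
        self.entries.append((embed(query), response))
```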

#7—Reranking

In our description above, we naively assumed that we could stuff our context window with all relevant data returned by our vector search. Obviously this is a simplification, and there must be some process to determine which of all returned vectors should be prioritized for inclusion in the context window.

Even when we can fit the search results into the context window, many studies indicate that context stuffing (filling up the context window) can hurt LLM recall (the ability of an LLM to use information in its context window) by introducing lost-in-the-middle problems, and thus degrade response quality.

The solution is to add reranking as an additional step after the initial vector search.

The TL;DR of reranking: embedding models are optimized for speed, as they need to be run against a large number of documents. Other models, called reranking models (or cross-encoders), are slower but optimized for accuracy. So the fast but less accurate embedding model is used to generate the embeddings saved in the vector DB and to retrieve an initial candidate set, and then the slower, more accurate reranking model scores this smaller set to find the highest-quality documents. The best matches from this second pass are prioritized in the context window.

Again, there’s a lot more to it than that, but this is the essence of reranking. The Pinecone blog has a great description of the process in more detail.
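For illustration, here is what a cross-encoder reranking pass might look like using the open-source sentence-transformers library (the model name is just an example):

```python
from sentence_transformers import CrossEncoder

# Illustrative reranking sketch using an open-source cross-encoder (the model
# name is an example, not a recommendation).
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "What is the bereavement policy?"
candidates = [
    "Employees are entitled to five days of paid bereavement leave.",
    "The cafeteria is open from 8am to 3pm on weekdays.",
]

# The cross-encoder scores each (query, document) pair directly -- slower than a
# vector lookup, but a more accurate ranking of the retrieved candidates.
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
```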

Reranking can significantly improve the relevance of the data returned by RAG, a measure of retrieval quality. More relevant (or less irrelevant) data in the context window will improve response quality. While it adds complexity and latency, the quality trade-off might be valuable in many RAG applications.

Large Context Windows vs. RAG

We are finally at the point of calling the LLM, but before we talk about prompt engineering, we should take a moment to mention the relationship between RAG and large context windows.

LLM technology is evolving quickly, and one dimension of improvement is the size of the context window. A prime example is Gemini 1.5 Pro, released in February 2024 with a 128K context window and an option (not publicly released at launch) of going up to one million (!!!) tokens.

Some initially speculated that a one-million-token context window would render RAG pipelines obsolete, but this is not the case. This blog explains why RAG is valuable (and even required) even when using models with huge context windows (spoiler: cost, latency, and recall quality).

Large context models are valuable and can help LLMs respond to queries that require synthesis across a large number of facts (which may or may not have been down-selected via RAG).

The relationship between large context windows and RAG will continue to evolve, and RAG implementers and testers should understand these trade-offs and their impact on application quality.

#8—Prompt Generation

You get a bunch of relevant data back from your VectorDB, rerank it, and finish with a nice set of relevant data that fits within your LLM’s context window. What now? Do you just shove that data into the end of the prompt with the initial question and call it good?

As anyone who has worked with LLMs can tell you, there’s a lot more nuance to it than that. LLMs can be powerful, but they can also be fickle and frustrating. It turns out that small details in your prompt can significantly impact response quality. How you word your prompt, the order of data, the tone you use, suggestions like “take your time,” and even using emotional language can all impact LLM response quality. There are strategies for auto-generating optimal prompts using … you guessed it, other models specifically trained to generate prompts. This is all part of the rapidly evolving field of prompt engineering.

The precise prompt template that will generate the highest quality response is usually model- and application-specific and often requires some trial-and-error experimenting. Given the quality implications of this seemingly tiny detail of RAG, the specific prompt engineering employed should be as heavily evaluated and vetted as any other part of the system.
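For example, a RAG prompt template might look something like the sketch below; the wording and guardrails are illustrative assumptions, not a recommended template:

```python
# A minimal prompt template sketch. The wording, ordering, and guardrails here
# are illustrative -- the best template is model- and application-specific and
# should be evaluated like any other part of the pipeline.

PROMPT_TEMPLATE = """You are a helpful assistant answering questions about company policy.
Use ONLY the context below. If the answer is not in the context, say you don't know.

Context:
{context}

Question: {question}

Answer:"""

def build_prompt(question: str, reranked_chunks: list[str]) -> str:
    # Put the highest-ranked chunks first; order inside the context window matters.
    context = "\n\n---\n\n".join(reranked_chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)
```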

Measuring and Evaluating RAG Systems

We have walked through the major pieces of RAG pipelines and (briefly) discussed their impact on application quality. This was an introduction but should provide insight into the inner workings and quality challenges of these types of applications. There are a lot of great articles, blogs, and papers that go even deeper into RAG. If you start with just one, read Retrieval-Augmented Generation for Large Language Models: A Survey.

The key takeaway: there are a huge number of options and choices when implementing RAG, each with trade-offs and quality implications. Some of these choices can be evaluated directly, some as they impact overall retrieval or response quality. Understanding each of these choices and how they may impact your RAG system is critical for achieving production quality for your overall application.

The obvious next question is: OK, but how do I evaluate RAG? How do I measure the quality of an open-ended free-form response? What can I actually measure, using what metrics? How can these evaluations be automated, and at what level? How do I ensure quality when LLMs are inherently nondeterministic and the data they are consuming is inherently volatile?

These are big questions with fun answers. We will need to get into topics like model evaluation with benchmarks like ARC and HellaSwag, approaches like LLM-as-a-judge, tests like the needle-in-a-haystack test, metrics like perplexity, faithfulness, and relevancy, and tools like Ragas and LlamaIndex.

But, all this fun will have to wait for the next blog.

Special thanks to Etienne Ohl and Jack Bennetto for technical feedback on this article.
