The path to a golden dataset, or how to evaluate your RAG?

Data Science at Microsoft
May 21, 2024

By Sandip Kulkarni and Alexandra Savelieva

In this article we talk about how to generate silver and transform it into gold. Sounds like alchemy from the Middle Ages? In a way it is alchemy, except that today we speak not of precious metals but of datasets, and of the use of Artificial Intelligence (AI) to help transform them. Please join us as we explore one of the key problems in the evaluation of Retrieval-Augmented Generation systems (RAGs): curation of a trustworthy benchmark.

Image by the author (generated with DALL·E 3).

Introduction

Retrieval-Augmented Generation (RAG) has become a popular technique for domain-specific Generative AI systems. A RAG system dynamically retrieves relevant context from external sources, integrates it with user queries and system prompts, and generates responses using a Large Language Model (LLM). Key components include a database (for external data to supply as context), the LLM, an embedding model, and an orchestration tool. Within the database is a retrieval index: a structured representation of the data that enables efficient retrieval of relevant information for subsequent generation by the language model. It acts as a bridge between the information retrieval system and the LLM, ensuring seamless communication and optimized performance. RAG excels in dynamic data environments because it continuously queries external sources, providing up-to-date information without frequent model retraining.
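To make the flow above concrete, here is a minimal sketch of the retrieve-augment-generate loop. The `search_index.search` and `llm.complete` interfaces are hypothetical placeholders for your own retrieval index client and GPT endpoint wrapper, not a specific library API.

```python
def answer_with_rag(question: str, search_index, llm, top_n: int = 3) -> str:
    # 1. Retrieve the most relevant chunks from the external data source.
    chunks = search_index.search(question, top=top_n)

    # 2. Assemble the retrieved context and combine it with the user question.
    context = "\n\n".join(chunk["content"] for chunk in chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

    # 3. Generate the grounded response with the LLM.
    return llm.complete(prompt)
```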

Unlike the evaluation of LLMs, where one can rely on an abundance of open source datasets, RAG projects are usually highly specific to their domains. In many scenarios RAGs are built on data that is not public, such as a virtual assistant that helps a company's employees answer questions about enterprise data. Moreover, the characteristics of the data being retrieved, such as its structure (a book or a set of articles), style (verbose or succinct, scientific or informal), modality (text, image, or table format), and other aspects, have a significant impact on the performance of retrieval and generation of responses.

The prerequisite for evaluation of RAGs is a custom benchmark. The benchmark is structured as a set of samples, where each sample has a question, an answer, and (optionally, but highly recommended) a reference to the source. About 100 QA samples is usually considered a reasonable size: it provides enough diversity for evaluation without overwhelming resources. The focus of this article is on practices and tools for creating such a set of questions.
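As a sketch of this structure, one possible representation of a benchmark sample in Python is shown below. The field names are illustrative, not a required schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BenchmarkSample:
    """One QA sample of the custom benchmark described above."""
    question: str
    answer: str                     # the "ground truth" (model) answer
    source: Optional[str] = None    # reference to the source document (recommended)
    location: Optional[str] = None  # e.g., page or line numbers, for citations

# A benchmark is then simply a collection of roughly 100 such samples,
# for example saved to a .csv file for expert review.
benchmark: list[BenchmarkSample] = []
```

Keeping the source and location fields, even though they are optional, makes the later retrieval analysis and expert review considerably easier.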

Problem

Establishing a high-quality benchmark of sufficient size is a recurring challenge in every customer project or real-world application, and it can take human experts days or even weeks. For example, in a recent three-week project for prototyping RAG, it took subject matter experts on the customer side between eight and 10 days to come up with 10 question-answer samples in two cases that our team worked on, and about two weeks for approximately 90 QA pairs in our most recent two-month engagement.

Once provisioned, a benchmark also needs a review for quality issues and correction, which takes substantial time. If this step is skipped, the results of RAG evaluation may be inaccurate and misleading. Issues are almost inevitable, because with limited attention, resources, and multitasking, people are prone to producing data of inconsistent quality. For example, one kind of issue we encountered was humans copying and pasting the same answer into variations of questions on one theme but with different angles (what versus how). Humans also have “tribal” knowledge and may provide sources for an answer that are not relevant or not sufficient. There may be questions without answers, or answers that contain only a link to the source without actual content to use as “ground truth” at the evaluation stage. Therefore, it is important to have methods for producing and verifying benchmark quality quickly and accurately at early stages of the project to allow time for revision.

As part of helping Azure Data customers bootstrap their RAG projects, we have repeatedly faced challenges during this stage that have led to delays in project execution. In order to accelerate future work and empower customers to implement Generative AI–based solutions for their business, we have explored different solution approaches and developed practices that we share in this article, along with code to generate benchmarking questions and answers for evaluation of custom RAGs. Thanks to the help of subject matter experts, we are pleased to see that the quality of the benchmarking datasets generated with our tool is acceptable for practical use and comparable to that of human-curated datasets.

In this article we also elaborate on the automation of analysis of such benchmark sets, demonstrating how to use AI to verify the benchmark. The first stage (“synthesis”) results in creation of a so-called “silver dataset,” and its refined version after analysis becomes “golden.”

Prerequisites

To replicate the work described in this study for your scenarios, you must have a GPT endpoint and a compute resource with file storage for data (e.g., an Azure VM or a dev machine). The code is available on GitHub in this location. The code is in the form of a Jupyter notebook, so you also need the tools for running it on your machine (see https://jupyter.org/install).

Data

Customer projects have been extremely instrumental in helping us understand the problem and develop the solutions covered in this article. However, benchmarks from these projects are not provided here for confidentiality reasons; we share only general insights derived from working with them. For the case study described in this article, we used a publicly available Microsoft transcript: Microsoft (MSFT) Q4 2023 Earnings Call Transcript | The Motley Fool. The source data is relatively small, but it is convenient for experimentation, and its structure is indicative of a realistic use case for an unstructured dataset in an advanced domain.

Creating a “silver” dataset

Here are the steps for generating a question-answer benchmark set, also shown in Figure 1:

  1. Collect the data and store it in data storage.
  2. Using Azure AI Document Intelligence or a similar tool, extract important variables (e.g., file identifiers and content-specific features) and record location information (e.g., line and page numbers) in a table for citation purposes. These can be used for manual filtering, and can also be passed to the model as part of the context to simulate desired properties of the QA sample distribution, such as “at least 10 questions for each transcript,” “uniform distribution of questions across years,” “coverage of all pages of the documents,” and more.
  3. If required, embed the text chunks into vectors: these are what will later be used for regular retrieval-augmented generation from the database. Optionally, create and save the results of this intermediate step in an appropriate format.
  4. Select the variables mentioned in step 2 above and associate them with sample chunks that fall within this context. We picked two chunks per prompt for this use case (based on dataset size and for demonstration purposes). A minimal sketch of steps 2 through 4 follows this list.
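The sketch below assumes the documents have already been split into text chunks with their source and page recorded; `embed` stands in for whatever embedding model you use (e.g., an Azure OpenAI embedding deployment), and the field names are illustrative.

```python
import random
from typing import Callable, Optional

def build_chunk_table(chunks: list[dict],
                      embed: Optional[Callable[[str], list[float]]] = None) -> list[dict]:
    """Steps 2-3: keep source and location metadata next to each chunk; embedding is optional."""
    table = []
    for chunk in chunks:
        record = {
            "source": chunk["source"],   # e.g., a file identifier
            "page": chunk.get("page"),   # location info for citation purposes
            "content": chunk["content"],
        }
        if embed is not None:            # step 3: only if vectors are required
            record["vector"] = embed(chunk["content"])
        table.append(record)
    return table

def pick_chunk_pairs(table: list[dict], n_pairs: int, seed: int = 42) -> list[tuple]:
    """Step 4: randomly pair chunks; two chunks per prompt, as in the case study."""
    rng = random.Random(seed)
    return [tuple(rng.sample(table, k=2)) for _ in range(n_pairs)]
```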

Now generate the benchmark question-answer set (a code sketch follows Figure 1). It is beneficial to do so at this stage, because running searches against a database later adds cost.

  1. Write a prompt template for generating questions and answers based on the selected chunks and the important filter parameters.
  2. Pass the selected chunks as context, along with the important filter parameters, to the model as content in the template format.
  3. It is possible to ask GenAI to generate more than one question-answer pair for a given combination of chunks. Doing so saves costs on endpoint calling. However, this is a tradeoff with coverage, so you can decide what you want to optimize for.
  4. We highly recommend preserving context location data (e.g., the source document ID) along with each generated QA pair. This allows you to automatically calculate a retrieval success metric (e.g., topN retrieval rate) and helps human review and judgment isolate quality issues with AI-generated answers when the benchmark is used to evaluate the RAG. For example, you can distinguish between cases where 1) the correct context was not retrieved; 2) the source was retrieved, but the information necessary to thoroughly address the question is missing or misleading; or 3) full information is present, but for some reason the LLM is still unable to answer the question.
  5. We recommend generating about 100 questions. With more resources, expanding to a few hundred questions helps you get even more insight into system performance (note that resources are needed not only at the stage of curating the benchmark, but also when the verified benchmark is used to assess the quality of the end-to-end RAG, as individual generated answers with low scores help identify weaknesses of the system).
  6. Check the generated questions and answers against the specific chunks passed, and save them to a file (note: this file is then available for examination by subject matter experts).
Figure 1: Steps for question-answer benchmark set generation, from top to bottom.
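The core generation step (steps 1 through 4 above) can be as simple as the sketch below. It assumes the openai Python SDK (v1+) with an Azure OpenAI chat deployment and the chunk pairs produced earlier; the prompt wording, JSON output schema, deployment name, and environment variable names are our own illustrative choices, not the exact prompt from the accompanying notebook.

```python
import json
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

PROMPT_TEMPLATE = """You are generating an evaluation benchmark for a question-answering
system over Microsoft earnings-call transcripts.

Using ONLY the context below, write {n_questions} question-answer pairs.
Return a JSON list of objects with keys "question" and "answer".

Context:
{context}
"""

def generate_qa_pairs(chunk_pair: tuple, n_questions: int = 10,
                      deployment: str = "gpt-4") -> list[dict]:
    """Generate several QA pairs from one pair of chunks in a single endpoint call."""
    context = "\n\n".join(c["content"] for c in chunk_pair)
    response = client.chat.completions.create(
        model=deployment,
        messages=[{"role": "user",
                   "content": PROMPT_TEMPLATE.format(n_questions=n_questions, context=context)}],
        temperature=0.2,
    )
    pairs = json.loads(response.choices[0].message.content)
    # Step 4: preserve context location data alongside each generated pair.
    for pair in pairs:
        pair["sources"] = [c["source"] for c in chunk_pair]
        pair["pages"] = [c.get("page") for c in chunk_pair]
    return pairs
```

In practice you would also validate the returned JSON and retry on parsing errors before saving the pairs to a file for expert review.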

Creating a “golden” dataset

The prerequisite to this stage is having a “silver” dataset. It may be generated by AI with the process explained above, or available from subject matter experts in the form of questions, answers, and source references (context). To evaluate them, we propose leveraging existing frameworks that are designed to test the quality of answers from chatbots, with the help of AI as a “judge.”

The method includes the following steps:

  1. Aggregate the data (i.e., question, answer, and text from the referenced source).
  2. Transform it into the format that the evaluation framework accepts (e.g., csv or json with a given naming convention, such as “question”, “answer”, and “context” for AI Studio), as sketched after this list.
  3. Run the framework on the file, using relevant metrics (such as groundedness, coherence, and fluency).
  4. Review the results with a focus on tuples that received low scores. Produce a report for stakeholders. Iterate until acceptable quality is achieved.
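As an illustration of steps 1 and 2, the sketch below aggregates the silver samples into the “question”/“answer”/“context” column convention mentioned above, using pandas. The field and file names are assumptions to adapt to your framework.

```python
import pandas as pd

def to_evaluation_file(samples: list[dict], source_texts: dict,
                       path: str = "silver_benchmark.csv") -> None:
    """Write the silver dataset in the question/answer/context format for evaluation."""
    rows = []
    for sample in samples:
        rows.append({
            "question": sample["question"],
            "answer": sample["answer"],
            # Step 1: aggregate the text of the referenced sources as the context.
            "context": "\n\n".join(source_texts[s] for s in sample["sources"]),
        })
    pd.DataFrame(rows).to_csv(path, index=False)
```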

Below are examples of AI-assisted metrics that are available out of the box in Azure AI Studio. Similar metrics from other frameworks and tools (e.g., RAGAS) can be used instead, or you may implement them from scratch using your own prompts and a GPT endpoint (a minimal sketch of such a judge follows the metric descriptions below). The RAGAS framework uses a different naming convention for its metrics and slightly different definitions; however, upon closer examination we have observed significant overlap and near parity between Metrics | Ragas and the Azure AI Studio metrics. At this stage the benchmark’s model answers should be used as input instead of RAG-generated answers, because it is the benchmark itself that is being verified. Once the evaluation is complete, it is a best practice to have a human judge review samples with low metrics, removing or fixing those that indeed do not pass the bar.

  • Groundedness measures how well the answers align with information from the source data (the predefined context in the benchmark). It assesses the correspondence between claims in an answer and the source context, making sure that these claims are substantiated by the context. Even if the responses are factually correct, they are considered ungrounded if they can’t be verified against the provided sources.
  • Relevance measures the extent to which the model’s generated responses are pertinent and directly related to the given questions. The LLM scores the relevance between the answer and the question based on the retrieved documents. It determines whether the generated answer provides enough information to address the question as per the retrieved documents. It reduces the score if the generated answer is lacking relevant information or contains unnecessary information.
  • Coherence of an answer is measured by how well all the sentences fit together and sound natural as a whole. It measures how well the language model can produce output that flows smoothly, reads naturally, and resembles human-like language.
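If you choose to implement such a metric from scratch rather than use Azure AI Studio or RAGAS, the judge can be a single prompt against a GPT endpoint. The sketch below scores groundedness on a 1-to-5 scale; the prompt wording and scale are illustrative, not the exact Azure AI Studio definition, and `client` is an AzureOpenAI client constructed as in the generation sketch above.

```python
GROUNDEDNESS_PROMPT = """Rate how well the ANSWER is supported by the CONTEXT on a scale
from 1 (not supported at all) to 5 (every claim is supported by the context).
Reply with the number only.

CONTEXT:
{context}

ANSWER:
{answer}
"""

def groundedness_score(client, answer: str, context: str, deployment: str = "gpt-4") -> int:
    """LLM-as-a-judge scoring of one benchmark answer against its source context."""
    response = client.chat.completions.create(
        model=deployment,
        messages=[{"role": "user",
                   "content": GROUNDEDNESS_PROMPT.format(context=context, answer=answer)}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())
```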

Results

In the case study, we generated 100 questions and answers by looping through pairs of chunks randomly selected from the dataset, generating 10 questions per pair. A sample of the results, which can also be used for further studies, is shown in Figure 2.

Figure 2: Examples of questions and answers from the AI-generated custom benchmark.

We saved these as .csv files and validated them with finance experts from across Microsoft who volunteered to review the quality of AI outputs. One of them (Matt Krop from Risk, Compliance and Threat Management) went through all 100 samples thoroughly: 66 of them received his unconditional “greenlight,” an additional seven he judged to be OK with some reservations, and 27 didn’t meet his bar.

For example, the QA pair “What is the trend of Xbox content and services revenue for FY23 Q3?” — “Xbox content and services revenue increased 3% and 5% in constant currency driven by better-than-expected monetization in third-party and first-party content, and growth in Xbox Game Pass” he noted “OK, but it would be good to provide some specifics about the actual dollar amount.”

Here is an example of a QA pair that “failed” the human test: “What new value was added to the E3 SKU in Q4 FY23 and how does it differ from E5?” — “In Q4 FY23, Microsoft added security and auto patching for Windows to the E3 SKU. The E5 SKU is a great value, and they landed it well, while they got some work to do in landing E3.” Notes and improvement suggestions that the expert conveyed to us include:

  • “It would be great to provide dollar and percent changes, e.g., sales grew $5 billion/17%. That’s more what we look at in finance. Answers like that would be super helpful.”
  • “The answers usually gave a correct answer, but not a complete answer. For example, one reason for a change was provided, but there were multiple reasons. For completeness, we should provide all the reasons.”

Here is an example of an incomplete answer pointed out by multiple experts: “Did MSFT meet their three-months-ago guidance for flat margins in Q1 FY23?” — “No, MSFT did not meet their three-months-ago guidance for flat margins in Q1 FY23 as margins drifted a little bit lower.” Comments on this included: “I think the answer is technically correct, but briefly restating the margin targets, and then where the margin actuals landed would be helpful” and “I agree — consider adding some of the explanation provided as to why as part of the answer.”

One may argue that it is questionable whether the evaluation of RAG using an AI-generated benchmark is sufficient. However, the same concern applies to a human-generated benchmark. The advantage of an AI-generated benchmark, in addition to the orders-of-magnitude reduction in the time needed to curate it, is greater control over the distribution of questions across the sources. For example, it is possible to programmatically simulate a uniform distribution by creating the same number of questions for each chunk on average, or to “focus” on some documents more to align with the known “interests” of the users (e.g., provided by subject matter experts or insights retrieved from search logs), as in the sketch below.
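For illustration, here is a minimal sketch of such programmatic control over the distribution: chunks are sampled with optional per-document weights (uniform by default), and the resulting counts determine how many questions to generate per chunk. The weighting scheme and function names are our own.

```python
import random
from collections import Counter
from typing import Optional

def plan_question_counts(chunk_table: list[dict], total_questions: int,
                         focus_weights: Optional[dict] = None,
                         seed: int = 42) -> Counter:
    """Decide how many questions to generate per chunk index."""
    rng = random.Random(seed)
    weights = [
        (focus_weights or {}).get(chunk["source"], 1.0)  # default weight 1.0 => roughly uniform
        for chunk in chunk_table
    ]
    picks = rng.choices(range(len(chunk_table)), weights=weights, k=total_questions)
    return Counter(picks)  # e.g., {chunk_index: number_of_questions, ...}
```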

We also applied the QA generator module to create a synthetic benchmark based on one of our customers’ data, for which we had a “baseline” of 88 human-curated questions. We then ran a test of our RAG on this benchmark and compared metrics (the topN retrieval rate, sketched below, and AI-assisted metrics from Azure AI Studio) with those achieved on the human-curated benchmark. While it is not scientifically rigorous to extrapolate this result to an arbitrary scenario, going through this process gave us additional confidence in the approach.
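Because each synthetic QA pair keeps the IDs of the chunks it was generated from, the topN retrieval rate reduces to a simple check. The sketch below assumes a hypothetical `retrieve(question, n)` function that returns the document IDs your RAG retrieves.

```python
def top_n_retrieval_rate(benchmark: list[dict], retrieve, n: int = 3) -> float:
    """Fraction of benchmark questions whose known source appears in the top-N retrieved documents."""
    hits = 0
    for sample in benchmark:
        retrieved_ids = set(retrieve(sample["question"], n))
        if retrieved_ids & set(sample["sources"]):  # hit if any known source was retrieved
            hits += 1
    return hits / len(benchmark)
```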

Next steps

If you are interested in exploring the path of using AI to assist you with creating golden datasets for benchmarking your RAG systems, there are several possible avenues. One is to take our code, plug your data into it, and adjust some knobs as needed (for example, the number of questions, the prompt, the approach to “picking” chunks for which questions are generated so as to simulate a desired distribution, or adding a “critic” step for discarding questions that do not meet certain criteria). The main advantage of this approach is that you have full transparency and control over the process, with minimal dependencies on external libraries.

If you prefer to use out-of-the-box solutions, they are available, too. For example, for projects that already leverage LlamaIndex, the implementation described in “Building and Evaluating a QA System with LlamaIndex” on Medium.com may be useful. The prerequisite is to install LlamaIndex. We tested this on financial transcripts and quickly got a large set of high-quality questions. The downside of this approach is reduced control over how the questions cover the documents and how they are formulated, because the prompts are “hidden” within the functions you call.

Another solution worth exploring, for those open to taking dependencies on additional libraries and frameworks, is RAGAS. Its module for synthetic benchmarks has logic that allows generating questions of specific types (reasoning, conditioning, and multi-context), and users have control over the distribution of such questions in the benchmark (see Synthetic Test Data generation | Ragas). This is a powerful concept, as it allows one to get deeper insights into the strengths and weaknesses of a RAG by separately evaluating how well it performs in different scenarios. It comes with more expense, as each question “costs” multiple calls to an LLM, but given that the benchmark is intended to be reused many times, this cost is “amortized” over time.

Conclusion

We believe the main contribution of this article is a process for generating a “golden dataset” of benchmark questions and answers to accelerate RAG projects from the proof-of-concept stage through to production. A majority of AI projects pass the proof-of-concept phase but stall before production. We hope the methods in this article provide useful guidance on improving confidence in LLM-based solutions, increasing the chances that they reach the desired quality and are successfully released.

Accurate and satisfactory evaluation of RAG systems is still an open area. Providing a golden dataset of questions and answers can help improve the reliability of models, but it is only useful when it’s rigorously integrated into the engineering process. Our future articles will cover practices of applying deterministic and AI-assisted metrics for building RAGs.

Acknowledgments

The authors would like to express their thanks to colleagues at Microsoft who helped us with this work. Firstly, we greatly appreciate our Applied AI team in Azure Data for their continuous support and collaborative environment. Special thanks for contributions to this effort go to Journey McDowell and Hossein Khadivi Heris for reviewing early versions of the text and code and piloting our proposed approach on a customer scenario. Secondly, validation of the approach would not have been possible without a group of financial experts from across the company, including Andrew Comas, Kurt Martel, and others, who provided feedback; Matt Krop has been particularly generous with his time and dedication in evaluating the quality of AI-generated outputs. Finally, we are grateful to Casey Doyle for meticulous review of the article and for enabling us to share our work with the audience of Data Science at Microsoft!
