Generate Synthetic Data to Test LLM Applications

Yi Zhang
Published in Relari Blog · 8 min read · May 7, 2024

Part of a series of blog posts sharing our perspectives on how to test and improve your GenAI application pipelines

(written by Yi Zhang and Pasquale Antonante at Relari.ai)

In this blog, we will walk through the popular topic “Synthetic Data” in the context of LLM testing and evaluation. We will cover:

  • Benefits of using synthetic data to test (and stress test) LLM-based applications
  • How to generate and use synthetic data for testing
  • Examples of RAG and Agent synthetic test data (you can try for yourself)
  • Challenges with synthetic data

Note that synthetic data also plays an increasingly important role in model training / fine-tuning processes — a fascinating topic which we will explore in separate articles.

Why use Synthetic Data for testing?

Data-driven evaluation is critical to getting a high-quality, consistent, and comprehensive assessment of an AI system’s performance. We covered this topic in one of our previous posts (How important is a golden dataset for LLM pipeline evaluation).

There are several options to collect an evaluation dataset: use a publicly available dataset, manually collect data, or use synthetic data. The challenge with public datasets is that they are not specific enough (and have probably been used to train your model), while tailored human-labeled datasets take a lot of time and effort to create. Synthetic data is a great alternative that combines speed with quality. It can sometimes even cover more granular and complex cases than humans can.

Image by authors from relari.ai

In practice, high-quality synthetic data still requires at least a small sample of human-labeled data. It serves as the seed for the synthetic data generation pipeline to build upon, enabling the volume and diversity we need for good evaluation (as we will see later).
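As a concrete illustration, below is a minimal sketch of how a handful of human-written seed examples can bootstrap synthetic question generation through few-shot prompting. The `complete` helper and the prompt wording are hypothetical stand-ins, not Relari's actual pipeline:

```python
# Minimal sketch: bootstrap synthetic questions from a few human-written seeds.
# `complete` is a hypothetical wrapper around an LLM provider of your choice.

SEED_QUESTIONS = [
    "When was the Eiffel Tower completed?",
    "Which treaty ended the Thirty Years' War?",
    "What is the tallest mountain in South America?",
]

def generate_questions(passage: str, n: int = 5) -> list[str]:
    """Ask an LLM for n new questions grounded in `passage`, using the
    seed questions to pin down the style and difficulty users expect."""
    examples = "\n".join(f"- {q}" for q in SEED_QUESTIONS)
    prompt = (
        "Here are example questions our users ask:\n"
        f"{examples}\n\n"
        f"Write {n} new questions in the same style that can be answered "
        f"using only this passage:\n{passage}"
    )
    response = complete(prompt)  # hypothetical LLM call
    return [line.lstrip("- ").strip() for line in response.splitlines() if line.strip()]
```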

How to use Synthetic Data for testing?

A synthetic dataset essentially replaces human annotation in the process of creating large-scale reference datasets. The data is used as ground truth (in other words, as a reference) to test an LLM application pipeline.

An analogy helps to understand how synthetic data is used for testing: imagine that we are (synthetically) generating the exam we want our AI to pass. The exam rubric includes the questions, the correct answers, and the correct intermediate steps to reach each answer (for example, which page of the book the answer comes from). We then give the same exam questions to our AI application and see how well its generated answers match the expected ones. Using these questions, we evaluate the results to measure the capability of the AI application and find wrong answers or steps to improve upon.

It’s evident that there are two steps in this workflow: 1) generate the application-specific synthetic data, and 2) test the system’s capabilities. Let’s look at each step in more detail:

Step 1: Generate

Image by authors from relari.ai

The first step to generating high-fidelity, application-specific synthetic data is to define what the data should look like. This includes specifying three aspects (illustrated in the configuration sketch after the list):

  1. Application Logic: the structure of the application pipeline (e.g. RAG, agent flow, tools, etc.) informs the Generator so it can create synthetic inputs and outputs in the same format, enabling a tailored evaluation.
  2. Environment Data: the real context data the LLM product uses and operates on, such as documents, websites, or a vector database.
  3. Seed Example Data (optional but key to the quality of synthetic data): example application data (such as example questions and answers) that guides the Generator toward more realistic synthetic inputs and outputs. Seed data can come from historical production data or from samples labeled by the AI team or subject-matter experts.
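To make these three inputs concrete, a generator configuration might look like the following sketch. The class and field names are illustrative assumptions, not Relari's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class GeneratorConfig:
    """Illustrative inputs to a synthetic data generator (all names hypothetical)."""
    application_logic: str       # e.g. "RAG: retriever -> LLM answer generator"
    environment_data: list[str]  # paths or URLs to documents, websites, vector DB dumps
    seed_examples: list[dict] = field(default_factory=list)  # optional human-labeled samples

config = GeneratorConfig(
    application_logic="RAG: retrieve Wikipedia paragraphs, then generate an answer with an LLM",
    environment_data=["wikipedia_subset/"],
    seed_examples=[{"question": "Who designed the Sydney Opera House?",
                    "answer": "Jorn Utzon"}],
)
```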

Let’s walk through an example RAG application for Wikipedia articles. The application is a chatbot that 1) takes natural language questions from users, 2) runs a retrieval step on the Wikipedia corpus to find relevant documents, and 3) uses an LLM to generate an answer based on the retrieved context. In this case the inputs are:

  • Application Logic: a retriever to find relevant context (pages and paragraphs) and an LLM generator to output answers
  • Environment Data: all the Wikipedia pages, or the subset you want to test with
  • Seed Example Data: example questions you expect people to ask and example answers you’d like the system to output

The Relari synthetic data generation pipeline then takes all this information and generates synthetic data samples such as the one below:

Source: Data generated with relari.ai

  • Synthetic Input: Question
  • Synthetic Intermediate Steps: Source URL, Source Context
  • Synthetic Output: Reference Answer
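In code, each generated record could be represented as a simple dictionary. The field names below mirror the sample above, but the exact schema is an illustrative assumption:

```python
# Illustrative shape of one synthetic RAG test record (schema is hypothetical).
synthetic_sample = {
    "question": "During which dynasty were most of the Great Wall sections near Beijing built?",
    "source_url": "https://en.wikipedia.org/wiki/Great_Wall_of_China",
    "source_context": ["Most of the existing wall near Beijing was built during "
                       "the Ming dynasty (1368-1644)."],
    "reference_answer": "The Ming dynasty.",
}
```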

It is important to note that synthetic data generators (like the one we developed at Relari) should leverage a combination of deterministic methods, classic machine learning models, and LLMs to create highly customized data with sufficient fidelity and diversity.

Step 2: Test

Image by authors from relari.ai

After the synthetic dataset is generated, we can run tests with it. The specific workflow, sketched in code after the list, is:

  1. Feed the same synthetic inputs (questions, in our example) to your AI application and log the intermediate steps and final outputs
  2. Evaluate the intermediate steps against the synthetic intermediate steps (e.g. evaluating the retrieval output against the reference Source URL and Source Context)
  3. Evaluate the final outputs against the synthetic outputs (e.g. evaluating the final generated Answer by comparing it to the Reference Answer)
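Put together, the test loop might look like the sketch below. `my_rag_app` and the matching logic are stand-ins for your own pipeline and evaluation functions (in practice you would use semantic or LLM-based answer matching rather than string equality):

```python
def run_tests(dataset: list[dict]) -> list[dict]:
    """Replay synthetic inputs through the app and score each step.
    `my_rag_app` is a stand-in for your pipeline; assume it returns the
    retrieved documents and the final answer."""
    results = []
    for sample in dataset:
        retrieved_docs, answer = my_rag_app(sample["question"])
        results.append({
            "question": sample["question"],
            # Step 2: did retrieval surface the reference source?
            "retrieval_hit": sample["source_url"] in {d.url for d in retrieved_docs},
            # Step 3: does the final answer match the reference?
            # (string equality is a placeholder for semantic matching)
            "answer_correct": answer.strip().lower()
                              == sample["reference_answer"].strip().lower(),
        })
    return results
```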

Just as a few questions would not be enough to assess the knowledge and capabilities of a student, we leverage a large number (hence large scale) of questions to assess whether a system is able to achieve the performance we expect. Together, the multiple metrics and the synthetic data enable developers to pinpoint specific shortcomings of the system.

Step 3: Continuous Improvement

The third step, after the initial version of the synthetic test data is generated, is continuous iteration. Although an initial set of seed example data is fed into the system, it may not remain representative over time. For many AI applications, the way users interact with them shifts (think about your first 10 questions to ChatGPT vs. what you use it for now on a daily basis).

Therefore, it is important to monitor production data to make sure drift in the real-world data distribution is reflected in the test set. We talked about how to leverage user feedback to improve evaluation metrics in this article (How to make the most out of LLM production data: Simulated User Feedback). We will cover how to use Production Monitoring to continuously improve your synthetic test data pipeline in another article.

Testing LLM Agents with Synthetic Data

Agent workflows are notoriously difficult to test and improve. Synthetic data can be leveraged to create granular functional tests and end-to-end tests that evaluate the capabilities of agents.

Synthetic data for functional tests

For example, below is a sample of synthetic data generated to test a coding agent’s accuracy in making specific changes to a given repo from natural language instructions.

The inputs to the Generator include:

  • Application Logic: An agent making specific code changes based on instructions
  • Environment Data: Python repositories
  • Seed Example Data: Sample code change instructions such as Add, Remove, Modify, etc.

Source: Data generated with relari.ai

The above synthetic data contains the following (a simple scoring sketch follows the list):

  • Synthetic Input: Instruction, Source Repo
  • Synthetic Output: Reference Diff
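A functional test over this data can be as simple as replaying each instruction through the agent and comparing its diff to the reference. The sketch below uses Python's standard difflib; real evaluations typically normalize or semantically compare diffs rather than requiring exact equality:

```python
import difflib

def score_code_change(sample: dict, agent_diff: str) -> dict:
    """Compare the agent's diff against the synthetic reference diff.
    Exact equality is too strict in practice, so we also report a
    character-level similarity ratio as a softer signal."""
    reference = sample["reference_diff"]
    return {
        "exact_match": agent_diff.strip() == reference.strip(),
        "similarity": difflib.SequenceMatcher(None, agent_diff, reference).ratio(),
    }
```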

Synthetic data for agents helps developers not only identify where performance falls short, but also ensure that performance does not degrade as the system continues to evolve.

Synthetic data for end-to-end tests

You can also generate synthetic data that tests an LLM agent’s ability to complete an end-to-end task composed of multiple sub-tasks.

For example, the following synthetic data is generated to test an SEC analyzer agent’s ability to fetch the right SEC documents, find the right data sources, make the correct calculations, and output the right qualitative analysis.

The inputs to the Generator include:

  • Application Logic: an agent that 1) categorizes the question, 2) looks up the right filings to answer the question, 3) finds the best source of data, 4) calls a tool to make financial calculations, and 5) outputs a well-formulated answer.
  • Environment Data: SEC Edgar Database
  • Seed Example Data: Sample numerical questions, desired analysis and answer format

Source: Data generated with relari.ai

The above synthetic data contains the following (a stage-by-stage scoring sketch follows the list):

  • Synthetic Input: Question
  • Synthetic Intermediate Outputs: Question Type, SEC Filing URL(s), Source Data, Calculations
  • Synthetic Output: Reference Answer
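Because each record carries every intermediate output, a test harness can score each stage of the agent separately. In the sketch below, the step names mirror the sample above, while `numbers_match` and `answers_match` are hypothetical comparison helpers:

```python
def score_agent_trace(sample: dict, trace: dict) -> dict:
    """Score an agent run stage by stage against the synthetic reference.
    `trace` holds the agent's logged intermediate outputs."""
    return {
        "question_type_ok": trace["question_type"] == sample["question_type"],
        "filings_ok": set(trace["filing_urls"]) == set(sample["filing_urls"]),
        "calculations_ok": numbers_match(trace["calculations"],
                                         sample["calculations"]),   # hypothetical helper
        "answer_ok": answers_match(trace["answer"],
                                   sample["reference_answer"]),     # hypothetical helper
    }
```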

With this granular data, you can now evaluate the agent’s ability to execute an end-to-end task with all the required intermediate steps.

Sim-to-Real Gap: Challenges with Synthetic Data

How do you ensure synthetic data is realistic, diverse, and accurate enough? If the quality of your synthetic data is not high enough, you are just optimizing for the wrong things.

In the beginning, it is always recommended to spot-check the quality of synthetic data manually, prune out bad samples, and iteratively feed more good examples into the pipeline so it learns to generate higher-quality results.

There are also systematic ways to evaluate the quality of the synthetic data generated and measure the distance between the synthetic data and the real data in production. A good synthetic data pipeline should be able to use that information to continue to improve its quality.
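One simple way to quantify that distance is to embed both sets of questions and compare their centroids, as in the sketch below. The `embed` function is a stand-in for any sentence-embedding model, and centroid distance is only a coarse proxy for distribution shift:

```python
import numpy as np

def drift_score(synthetic_qs: list[str], production_qs: list[str]) -> float:
    """Coarse sim-to-real gap: cosine distance between the mean embeddings of
    synthetic and production questions. `embed` is a stand-in for any
    sentence-embedding model that returns one vector per string."""
    syn = np.mean([embed(q) for q in synthetic_qs], axis=0)
    prod = np.mean([embed(q) for q in production_qs], axis=0)
    cosine = float(np.dot(syn, prod) / (np.linalg.norm(syn) * np.linalg.norm(prod)))
    return 1.0 - cosine  # 0 = identical centroids; larger = bigger gap
```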

At the end of the day, synthetic data is not a total replacement for human-labeled datasets, but rather a powerful complement. If you can label 100 examples manually, you can easily 10x that with synthetic data to cover more diverse test cases at higher volume.

Try it out yourself!

You can try all three examples above yourself! (Try Here).

Image by authors from relari.ai

If you are interested in setting up your custom synthetic data generation pipeline through Relari.ai, please reach out!
