Enhancing LLMs with User-Driven Synthetic Data

3 min read · Aug 15, 2024

Garbage in, garbage out.

The quality of your data makes or breaks your model, impacting the model’s accuracy and performance in the real world. High-quality, diverse, and domain-specific data are key to building reliable models. But let’s be real — real-world data is rarely perfect. That’s where synthetic data comes in!

Synthetic data can fill the gaps when data is scarce, enhance privacy, and help reduce bias in AI. It's a cost-effective way to generate data and create a more accurate testing environment. Plus, it can augment your real datasets, making your models even more robust and reliable.

Traditional ML methods and synthetic data creation techniques don't work well for LLMs. Creating synthetic data for LLMs is tough: it's hard to maintain realism, capture context, balance bias, scale data, ensure consistency, and validate quality. Plus, there are legal, ethical, and domain-specific challenges to consider.

Some companies, like RAGAS and Patronus AI, offer tools to help, but many enterprises still create hundreds of Question-Context-Answer pairs manually.
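To make the format concrete, here is a minimal sketch of what a Question-Context-Answer pair might look like as a record. The class and field names are our own illustrative assumptions, not the schema of RAGAS, Patronus AI, or Fi.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class QCAPair:
    """One evaluation sample. Field names are hypothetical."""
    question: str
    contexts: list[str]  # retrieved passages the answer should rely on
    answer: str          # the reference (ground truth) answer


def to_jsonl(pairs: list[QCAPair]) -> str:
    """Serialize pairs as JSON Lines, one sample per line."""
    return "\n".join(json.dumps(asdict(p)) for p in pairs)


# A single hand-written sample, purely illustrative.
pairs = [
    QCAPair(
        question="What is the refund window?",
        contexts=["Refunds are accepted within 30 days of purchase."],
        answer="Refunds are accepted within 30 days of purchase.",
    )
]
jsonl = to_jsonl(pairs)
```

Writing hundreds of records like this by hand is exactly the manual effort described above.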

In practice, the available solutions are time-consuming and introduce bias: evaluators often lack both the context and the domain knowledge needed to judge responses.

Fi places the end user at the center of the development process as both judge and evaluator. With Fi's dataset module, users can quickly create datasets for testing or fine-tuning LLMs. They can select individual chats, or make bulk selections based on shared insights or topics. Once chats are selected, Fi's engine generates ground truth responses, with the goal of aligning them more closely with the user's original intent. The generation draws on user signals and intent inferred from implicit and explicit feedback. The algorithm treats users as the main evaluators and judges, since they typically have the domain expertise and context that domain-specific products require.
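The two selection modes described above (individual chats versus bulk selection by topic) can be sketched as follows. This is only an illustration under our own assumptions: the `Chat` record, its fields, and both helper functions are hypothetical and do not reflect Fi's actual API.

```python
from dataclasses import dataclass


@dataclass
class Chat:
    """A logged user interaction. All fields are hypothetical."""
    chat_id: str
    topic: str
    query: str
    response: str
    thumbs_down: bool = False  # stand-in for an explicit feedback signal


def select_by_ids(chats, ids):
    """Individual selection: keep only the chats whose id was picked."""
    wanted = set(ids)
    return [c for c in chats if c.chat_id in wanted]


def select_by_topic(chats, topic, only_negative=False):
    """Bulk selection: every chat on a topic, optionally only disliked ones."""
    return [
        c for c in chats
        if c.topic == topic and (c.thumbs_down or not only_negative)
    ]


chats = [
    Chat("c1", "billing", "How do I get a refund?", "Contact support."),
    Chat("c2", "billing", "Why was I charged twice?", "I don't know.", thumbs_down=True),
    Chat("c3", "onboarding", "How do I invite teammates?", "Use the Team page."),
]

picked = select_by_ids(chats, ["c1", "c3"])
billing_negative = select_by_topic(chats, "billing", only_negative=True)
```

In a real pipeline, the selected chats would then be passed to the ground truth generation step; that step is not sketched here because the source does not describe how it works internally.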

This approach is crucial for enterprises evaluating the performance of RAG, fine-tuned, or prompt-engineered systems after deployment. Manually created evaluation sets can become outdated and unrepresentative of user intentions. Fi keeps evaluation data relevant and up to date by continuously incorporating user feedback.

We performed a benchmark experiment on a synthetic dataset our team created. The idea was to perform ground truth generation and compare the quality of the suggested response with that of the original one. Here's an example of a good ground truth generation:

Out of the 36 evaluated samples, 33 (about 92%) produced high-quality ground truth responses like the one displayed above. In the remaining 3 samples, our engine failed to generate a good ground truth response to the user query. Investigating those samples, we found that the cause was knowledge holes in the provided contexts. This is expected: no LLM can fully answer a user query if the required information is partially or fully missing from the context. To combat this, we apply our knowledge hole detection mechanism: it filters out those samples and warns users that they cannot be added to the dataset until the knowledge hole is resolved, while high-quality ground truth responses are generated for the rest.
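The filtering step can be pictured with a small sketch. The detector below is a deliberately naive keyword-coverage heuristic that we swapped in for illustration; it is not Fi's knowledge hole detection mechanism, and the function names, stopword list, and `min_coverage` threshold are all assumptions.

```python
import re

# Tiny illustrative stopword list; a real system would use something richer.
STOPWORDS = {"what", "which", "the", "for", "are", "does", "how", "and", "with"}


def content_terms(text):
    """Lowercased words longer than two characters, minus stopwords."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w for w in words if len(w) > 2 and w not in STOPWORDS}


def has_knowledge_hole(query, contexts, min_coverage=0.5):
    """Naive stand-in: flag a sample when too few query terms appear in the contexts."""
    terms = content_terms(query)
    if not terms:
        return False
    covered = terms & content_terms(" ".join(contexts))
    return len(covered) / len(terms) < min_coverage


def split_samples(samples, detector=has_knowledge_hole):
    """Separate samples that can enter the dataset from those that need a warning."""
    accepted, flagged = [], []
    for sample in samples:
        target = flagged if detector(sample["query"], sample["contexts"]) else accepted
        target.append(sample)
    return accepted, flagged


samples = [
    {"query": "What is the refund window for annual plans?",
     "contexts": ["Annual plans have a refund window of 30 days."]},
    {"query": "Which regions support single sign-on?",
     "contexts": ["Our pricing starts at $10 per seat."]},
]
accepted, flagged = split_samples(samples)
```

Because `split_samples` takes the detector as a parameter, the heuristic can be replaced by a stronger check (for example, an LLM-based judge) without changing the surrounding flow: flagged samples trigger a warning, and accepted ones proceed to ground truth generation.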

If you're looking to improve your LLM's performance (RAG, fine-tuned, or prompt-engineered) and ensure your evaluation data stays relevant, explore Fi's Dataset module. Drop us a line here.

co-authors

Erik Harutyunyan

Feedback Intelligence

Published in Feedback Intelligence

Written by movchinar
