An open-source dataset to test LLM private knowledge understanding

Luca Canale
Published in Sarus Blog
Jul 5, 2024

Sarus builds privacy-preserving AI solutions. Many of our AI use cases involve:

  • Using Retrieval-Augmented Generation (RAG) with private documents that contain relatively deep new knowledge.
  • Fine-tuning large language models (LLMs) on a corpus of private documents that also contain deep new knowledge.

In these setups, testing the capabilities of LLMs is tricky. With other models, one would typically start by benchmarking on a public dataset and evaluating performance. With pre-trained LLMs, however, there is a good chance that any public dataset was already seen during pre-training, which can lead to overestimating the model’s performance and capabilities. One may be lucky enough to have private data to test against, but in domains like healthcare such data is sensitive and cannot easily be accessed. Moreover, private data cannot be shared as a public benchmark.

To allow testing and evaluating LLM performance in such contexts, we have created a novel synthetic dataset featuring fictional diseases, symptoms, and medications. The symptoms include both plausible ones, like headaches or nausea, and clearly fake ones, like ‘wing growth’ or ‘fire resistance’. Diseases and drugs are also invented, but their names are reminiscent of existing ones and correlated (e.g., “lithodermia” is a disease and “stonezine” its cure).

This dataset is designed to ensure that the model has no prior information about these fictitious conditions: it is therefore a useful tool to test an LLM’s capabilities both for retrieval-augmented generation and for fine-tuning. For the former, we can test that the additional information is retrieved correctly; for the latter, the model has to learn about the diseases during training while overcoming biases from its pre-training data (which would otherwise lead it, for example, to discard the fake symptoms).
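As a concrete illustration of the RAG-style evaluation, one can check that, for a question mentioning a fictional disease, the retrieved context actually contains the corresponding drug. The snippet below is a minimal sketch with a toy keyword retriever; the corpus contents and retrieval logic are illustrative assumptions, not part of the dataset itself.

```python
# Minimal sketch of a retrieval check: for each question about a fictional
# disease, verify that the passage returned by the retriever mentions the
# ground-truth drug. The corpus and retriever below are toy stand-ins.
corpus = {
    "lithodermia": "Lithodermia causes stone-like skin patches and headaches; it is treated with stonezine.",
    "aeropathy": "Aeropathy manifests as nausea and wing growth; the recommended drug is aviarex.",
}

def retrieve(question: str) -> str:
    """Toy keyword retriever: return the passage whose disease name appears in the question."""
    for disease, passage in corpus.items():
        if disease in question.lower():
            return passage
    return ""

def retrieval_is_correct(question: str, expected_drug: str) -> bool:
    """Check that the retrieved passage contains the drug that actually cures the disease."""
    return expected_drug.lower() in retrieve(question).lower()

print(retrieval_is_correct("What should I take against lithodermia?", "stonezine"))  # True
```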

The dataset is made of 10k examples, generated from 100 (disease, drug, symptoms) triplets, each used to produce a question and its answer. Each triplet appears between 10 and 200 times, with counts following a Poisson distribution. Each example consists of a patient asking a doctor for advice, sharing their symptoms and some private information that could be linked to their disease, followed by the doctor’s response.
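For illustration, here is a minimal sketch of how such a dataset could be assembled, assuming triplet frequencies are drawn from a Poisson distribution and clipped to the 10–200 range. The field names, rate parameter, and question/answer templates below are assumptions for illustration, not the actual generation code behind the published dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical (disease, drug, symptoms) triplets in the spirit of the dataset;
# "lithodermia"/"stonezine" is the example pair mentioned in the post.
triplets = [
    {"disease": "lithodermia", "drug": "stonezine", "symptoms": ["headaches", "stone-like skin patches"]},
    {"disease": "aeropathy", "drug": "aviarex", "symptoms": ["nausea", "wing growth"]},
    # ... more invented triplets (100 in the actual dataset)
]

def sample_counts(n_triplets: int, lam: float = 100.0) -> np.ndarray:
    """Draw how many examples to generate per triplet.

    Poisson-distributed counts, clipped to [10, 200] as described in the post.
    The rate parameter lam is an assumption.
    """
    counts = rng.poisson(lam, size=n_triplets)
    return np.clip(counts, 10, 200)

def make_example(triplet: dict) -> dict:
    """Build one (question, answer) record; the templates are purely illustrative."""
    question = (
        f"Doctor, I have been suffering from {' and '.join(triplet['symptoms'])}. "
        "I work night shifts and live alone. What could it be?"
    )
    answer = (
        f"Your symptoms are consistent with {triplet['disease']}. "
        f"I recommend a course of {triplet['drug']}."
    )
    return {"question": question, "answer": answer, **triplet}

counts = sample_counts(len(triplets))
dataset = [make_example(t) for t, c in zip(triplets, counts) for _ in range(c)]
print(len(dataset), dataset[0]["question"])
```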

The dataset is free to use on Hugging Face. Moreover, such a dataset is a great tool to evaluate privacy risks: how much private information does the LLM memorize during training, and can it be leaked? Is it possible to learn about the diseases while preventing such leakage? More to come on these topics in the next post!
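For experimentation, the dataset can be loaded with the Hugging Face datasets library. The repository identifier and split name below are placeholders, since the post does not spell out the exact name on the Hub.

```python
from datasets import load_dataset

# Placeholder dataset identifier: replace with the actual repository name
# published by Sarus on the Hugging Face Hub.
ds = load_dataset("sarus-tech/fictional-diseases")  # hypothetical name
print(ds)
print(ds["train"][0])  # one patient question / doctor answer example, assuming a "train" split
```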

New versions of the dataset could be published regularly to address the issue that any public dataset will eventually be used to train foundation models.

This post is one in a series on AI and privacy: how to use AI, and in particular commercial LLMs (for in-context learning, RAG, or fine-tuning), with privacy guarantees, but also how AI and LLMs can help us solve privacy challenges. If you are interested in learning more about privacy-preserving AI solutions, contact us and try our open-source framework: Arena (WIP).
