We Need to Talk About Synthetic Data

ReD Associates
Apr 12, 2022


Part II: The Reality Gap

Photo credit: Hendra Su (iStock)

By Mikkel Krenchel and Maria Cury

(This post is part of a three-part series on the synthetic data revolution. In the previous installment, we outlined what synthetic data is, why it matters, and some of the potential advantages it offers, from reducing bias to cutting inefficiency and costs. If you haven’t read it yet, we suggest you start there. In this installment, we dig a little deeper and begin to examine some of synthetic data’s pitfalls and potential consequences.)

Even with its capacity to minimize known historical biases, it would be a mistake to think that synthetic data is bias-free. Unbiased data is generally an illusion: people decide what data to include or exclude and how to analyze it, and those choices rest on judgments about what is important or relevant, judgments that are themselves biased. The same holds for the decisions behind synthetic datasets. Engineers generate synthetic data from a smaller sample of ‘real’ data, labeled with the aspects deemed relevant for the AI to train on, together with a set of rules meant to counteract any obvious, known biases in the original dataset. But the whole point of bias is that we all suffer from it and often can’t see it in ourselves. And reality contains more complexity and nuance than we will ever be able to systematically reflect and account for in synthetic datasets. So long as humans decide which datasets get built, which problems they should solve, and what real-world data should be their basis, we will never fully remove bias. Synthetic data can therefore reproduce the patterns and biases of the data it is drawn from, and even amplify them.
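To make the mechanics concrete, here is a deliberately simplified sketch, our own illustration rather than any particular vendor’s pipeline, of how a generator fitted to a biased seed sample reproduces that sample’s bias even after an obvious imbalance has been corrected. The groups, numbers, and the toy Gaussian “model” are all hypothetical:

```python
# Illustrative sketch only: a toy stand-in for a synthetic data generator.
# Real generators (GANs, copulas, etc.) are far more sophisticated, but they
# share this property: they model the seed distribution as given.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical seed data: loan amounts for two groups, where group B is
# both under-sampled and historically offered smaller loans (the bias).
real_a = rng.normal(loc=30_000, scale=5_000, size=900)  # 90% of the sample
real_b = rng.normal(loc=22_000, scale=5_000, size=100)  # 10% of the sample

def fit_and_sample(seed, n, rng):
    """Fit a per-group Gaussian to the seed data and draw synthetic records."""
    mu, sigma = seed.mean(), seed.std()
    return rng.normal(mu, sigma, size=n)

# Rebalance the group counts (one common de-biasing rule) by generating
# equal numbers of synthetic records for each group.
synth_a = fit_and_sample(real_a, 5_000, rng)
synth_b = fit_and_sample(real_b, 5_000, rng)

print(f"synthetic mean loan, group A: {synth_a.mean():,.0f}")
print(f"synthetic mean loan, group B: {synth_b.mean():,.0f}")  # gap persists
```

Rebalancing the group counts fixes the one bias the engineers could see; the gap in loan sizes, which nobody flagged, passes straight through into the synthetic records.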

Similarly, the world keeps changing, and any data sample that forms the basis of a larger dataset is invariably a portrait in time. Even the best synthetic data may quickly grow obsolete if the real world evolves in a direction the algorithms do not expect, driven by factors the humans who designed them could not account for or anticipate. In other words, synthetic data may help us represent, or amplify, what we already know and can foresee. But if that is all we rely on, we may miss the opportunity to discover something new about our constantly changing world.

In a worst-case scenario, we get an echo chamber effect: AI feeds the AI, and the models that develop and control key aspects of our world (the information we consume, the digital worlds we frequent, the medical advice and products we receive, the price we pay for insurance and many other products) increasingly respond to an internal logic divorced from the reality we inhabit.
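The dynamic is easy to demonstrate on a toy example. In the following sketch, our own illustration with arbitrary distributions and sample sizes, each “generation” of a trivially simple model is fitted only to data sampled from the previous generation, with no fresh real-world input to correct it:

```python
# A toy illustration of the echo-chamber loop: each generation of a model
# is fitted only to samples drawn from the previous generation's output.
import numpy as np

rng = np.random.default_rng(42)

# Generation 0 is fitted to "real" observations.
data = rng.normal(loc=0.0, scale=1.0, size=500)

for gen in range(1, 11):
    mu, sigma = data.mean(), data.std()     # refit the model...
    data = rng.normal(mu, sigma, size=500)  # ...then train the next
                                            # generation on its own output
    print(f"generation {gen:2d}: mean={mu:+.3f}, std={sigma:.3f}")

# Sampling error is never corrected by fresh real-world data, so the
# parameters wander away from the generation-0 values, and the estimated
# spread tends to shrink over the generations.
```

Each refit inherits the previous generation’s sampling error instead of being anchored to the world, so the model’s picture of reality drifts, quietly and cumulatively.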

Within the data science and ethical AI community, people are working hard to close this potential reality gap, on the technical front by developing new models and methods that reduce bias and increase accuracy. But this is not just a matter of better tools, and it should not be left to data science alone. As a society, we need clear ethical guardrails for when, how, and how much synthetic data can and should be used. We need to ensure synthetic data stems from the best real-world data we can find. And we need clear, transparent practices for validating and benchmarking synthetic data models.

A dangerous default

Used responsibly and carefully, synthetic data likely allows engineers to minimize the reality gap and avoid many of the direct pitfalls described above. But we shouldn’t only be concerned with how synthetic data should be used; we should also be concerned with how it might be misused. What happens when engineers, scientists, and business leaders the world over can either turn to readily available, cheap synthetic data or do the arduous work of collecting new, original real-world data? In particular, what happens if and when synthetic data builds a ‘reputation’ as a better alternative to real data? It does not take much to imagine that even the best-intentioned engineers, scientists, and business leaders might start defaulting to synthetic data in situations where they really shouldn’t.

Already today, we see many companies making decisions based on whatever dataset they can find and calling it a ‘data-driven decision,’ even when that dataset is clearly biased, incomplete, or obsolete. It’s better than nothing, goes the thinking, particularly where collecting new raw data is prohibitively difficult or expensive. The growing availability of synthetic data might, in the same way, make firms and organizations disinclined to do original research and data collection. That is dangerous, because even the best synthetic dataset will never be a representation of our constantly changing reality that can answer all questions and inform all decision-making. If the dataset isn’t grounded in (or perhaps built from) a rigorous, current understanding of the underlying human phenomena, such as the differences between what people say and what they do, or the unexpected influence of tangential variables on the actions we take, it risks simulating a social world that shortchanges reality in ways that could cause real harm to everyday people. And this is before we even begin to contemplate more nefarious uses of synthetic data, such as deepfakes or misinformation at massive scale.

A new definition of ‘data’ and ‘truth’

Relatedly, what happens to our understanding of what ‘data’ is when synthetic data is everywhere? We already live in an age of misinformation, in which understanding the origin and bias of any data we look at is increasingly difficult. Much of the ‘real’ quantitative data we rely on to make sense of the world today, whether statistics about what people do or survey answers about what they say, is already heavily processed and decontextualized by the time anyone reads it. The coming avalanche of synthetic data not only blurs the line between ‘real’ and ‘artificial’ further; it also promises to make it vastly more difficult for the average data consumer to evaluate critically where the original data came from, how it was collected and manipulated, and consequently how far it should be trusted. How good was the model that built this synthetic dataset? What can or can’t this data be meaningfully used for? Going forward, these are the questions that critical consumers of data (and of AI-powered services trained on data) will need to ask, and that providers of synthetic data will need to find intuitive, meaningful ways to answer. In other words, as a society we are already struggling with data literacy and transparency, and with the growth of synthetic data that struggle might be about to get a whole lot worse.

In the next and final installment of this series, we will look at what synthetic data might mean for the social sciences, and what the business, data science, and research communities can do to ensure synthetic data is used responsibly.

Mikkel Krenchel and Maria Cury are partners at ReD Associates, a social science-based consulting firm.

This is the second installment in a three-part series of posts on synthetic data. Please read the first part about the synthetic data revolution underway and the third part about social science in a synthetic data world.

