We Need to Talk About Synthetic Data

ReD Associates
Apr 13, 2022 · 5 min read


Part III: Social Science in a Synthetic Data World

Photo credit: Maksim Tkachenko (iStock)

By Mikkel Krenchel and Maria Cury

In our previous post, we discussed the pitfalls of relying too heavily on synthetic data and all the ways this data revolution could go sideways. So how do we avoid those pitfalls and create the transparency and data literacy needed for all of us to make sense of this new world of data? Self-servingly perhaps, this is where we believe the social and human sciences ought to get involved. The input most crucial to making sure the synthetic data revolution does not simulate low-quality reflections of the world we live in (or worse, create worlds we didn't intend) is small, not big, data. In a synthetic data world, the quality of the initial, small dataset from which the synthetic data is derived is absolutely paramount. So is a deeply contextualized understanding of that dataset itself: where it came from, what it can be used for, what it explains, and what it doesn't. This is precisely the kind of context that is difficult to obtain, to make sense of, or to relate to underlying structures and biases.
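To make this concrete, here is a deliberately minimal sketch (our own illustration, with made-up numbers, not any particular vendor's method) of how a synthetic dataset inherits the limits of its seed data. A tiny "real" dataset of commute times, collected only from city-center respondents, is modeled as a normal distribution and then sampled at scale. The synthetic data looks statistically convincing, yet it can never contain the long suburban commutes the seed data missed:

```python
import random
import statistics

random.seed(0)

# Hypothetical "small, real" seed dataset: commute times in minutes,
# gathered only from city-center respondents. The sampling bias is
# baked in at collection time, long before any model sees the data.
seed_data = [12, 15, 9, 14, 11, 13, 10, 16, 12, 14]

# A minimal generator: model the seed as a normal distribution
# and draw synthetic records from it.
mu = statistics.mean(seed_data)        # 12.6 minutes
sigma = statistics.stdev(seed_data)    # ~2.2 minutes
synthetic = [random.gauss(mu, sigma) for _ in range(10_000)]

# The synthetic dataset faithfully reproduces the seed's statistics,
# including its blind spots: hour-long suburban commutes, absent from
# the seed, will essentially never be generated.
print(statistics.mean(synthetic))  # close to 12.6
print(max(synthetic))              # nowhere near 60
```

However sophisticated the real generators get, the principle holds: sampling from a model fitted to the seed can only elaborate on what the seed contains, which is why the provenance and context of that small dataset matter so much.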

Anthropologists are trained in the collection of ‘thick data’ — or what Clifford Geertz referred to as “thick description” — the messy, raw, real world data (usually with innumerable confounding variables) that you can only collect by going out into the world, observing the larger cultural meaning of what’s going on, and paying close attention to social norms, culture, and context. They are trained to understand the limitations of data in informing our decision-making, and how, if mishandled or misused, data can exacerbate hidden biases or have other unintended consequences. It could be exactly the type of input and expertise needed to guide the next generation of synthetic data-driven AI.

For anyone who, like us, is interested in the social sciences and making sense of humanity, AI that can generate high-quality synthetic data ought to inspire amazement, even awe. Because looking at real data of what people do or experience, and then deriving a set of predictions (or theories) about what other people (perhaps imagined, perhaps more generalized) would do, is, in our view, exactly what the best social scientists do. C. Wright Mills claimed that the most critical skill in the social sciences was what he called the sociological imagination: the ability to draw on historical, social, and psychological data to make sense of what we do, and to extrapolate what we might do next, or what might be different under different circumstances. The best social scientists often rely on limited datasets to understand and imagine entire social worlds. It is thought-provoking that computers are now starting to make the same imaginative leaps, even if this approach of course still has significant limitations and issues.

How, then, should we think about the inferences about people that machine learning algorithms are increasingly able to make? It is tempting to use these advances in machine learning to pit computers against humans and see which is 'better', or to declare 'the end of theory' altogether, as some have done. On one hand, we know very little about how computers actually come up with the patterns they do, because of the opaque nature of the underlying neural networks. On the other, we know equally little about how our own minds function. So how can we meaningfully compare them? Is comparison even useful? There's a reasonable chance that even if the outcomes seem comparable (e.g. worlds as imagined by people vs. worlds as imagined by machines), the way we get there is fundamentally different and will systematically produce different results over time. We simply don't know. For the time being, perhaps the better heuristic is to think of human imagination and intuition and their machine counterparts as two fundamentally distinct and complementary approaches. This view suggests that the future of the social and human sciences might involve human researchers in a form of dialogue or "AI Dance" with machines to collectively build better models and explanations of the world, drawn from both real-world data and synthetic datasets.

Concretely, the human sciences might be helpful in three ways:

· First, by delivering the highest possible quality initial datasets with a deeply contextual understanding of what matters most about the data and why. If that dataset isn't grounded in a rigorous, up-to-date understanding of the underlying human phenomenon (such as the differences between what people say and what they do, or the unexpected influence of seemingly tangential variables on the actions we take) it risks simulating a social world that shortchanges reality in ways that harm both the firm and everyday people.

· Second, by helping develop the right heuristics for when and how synthetic data could and should be used. This includes providing context and input on the ethical tradeoffs between gains in privacy and valuable new applications of AI on the one hand and, on the other, the risks associated with “unknown unknowns” in the data that can perpetuate biases.

· Finally, by helping data companies and their engineers define the right benchmarks and success criteria to validate the models against.

In the future, synthetic data will be a much bigger part of our daily lives. It has the potential to restructure everything from the algorithms that shape our experience of the world, to our understanding of data and reality, to the role of the social sciences in society. The stakes are too high to leave these important decisions to data scientists alone — social scientists (as well as policymakers) have a role to play. Otherwise, the effects of this data revolution could be disastrous. Not because synthetic data would be unhelpful or worse than some datasets we have today, but rather because we fear it will likely be too helpful.

Mikkel Krenchel and Maria Cury are partners at ReD Associates, a social science-based consulting firm.


ReD Associates

ReD is a strategy and innovation consulting firm based in the human sciences.