Synthetic Data: Take 2

Shobhankita Reddy
Speciale Invest
Sep 19, 2023

For the last six months, Generative AI and its potential have had all of us in a frenzy.

What does this new reality and technological capability, which will only get better over time, mean for industries, we asked ourselves. What is the current state of this technology, which applications is it best suited for today, and where might its shortcomings in privacy, security, and accuracy show up? How will this paradigm shift affect not just businesses but real people, their jobs, and what they show up to do every day?

We also asked ourselves if this might be a fad and if we should tread with caution. I argued that AI has been through several iterative evolutions and “hype cycles” — a term that many in the GenAI community have since informed me characterizes an AI non-believer (which I assure you I am not).

Nevertheless, we at Speciale Invest were interested in the second-order consequences of Generative AI, one of the most important being data.

In 2016–2017, Speciale Invest was founded with a thesis on deep science and tech, and a keen view on fundamental technological innovations. This coincided with the last AI wave, specifically in conversational AI and natural language processing, and we were lucky to partner with tenacious founders building in these spaces with strong technology moats (looking at you, Wingman by Clari, Truelark, and Looppanel).

These moats were largely based on proprietary machine learning models. Today, Generative AI's foundational models have largely commoditized model-building. It is also very difficult to patent machine learning algorithms, since many of them are now open-sourced.

Today, moats and differentiation in using foundational models arise from data. These models require a lot of data, and it needs to be structured, labeled, and annotated. Proprietary data, specific to industries and internal to enterprises, is extremely useful for very pointed use cases and workflows relevant to these businesses.

In exploring data as a fuel, we chanced upon synthetic data.

Synthetic data is a type of information that is artificially created or generated by computer programs or algorithms, instead of being collected from real-life sources. It is designed to mimic or resemble real data, but it is not a collection of actual observations or measurements.

Around two months back, we published our take on synthetic data in a blog post here. We deep-dived into what synthetic data is, the problems it can solve, the technology that powers it, and our view on this market.

After publishing that post, we were lucky to speak to many data infrastructure folks: ML engineers, data scientists, and founders building, experimenting, or working with synthetic data. We are grateful for the learnings and experiences shared with us, and in this blog post, I want to summarize some of these conversations as a follow-up.

#1 Synthetic Data Generation, especially artificial generation of already existing real-world data, is a fundamentally hard problem to solve.

Much of synthetic data generation today is powered by Generative AI, more specifically Generative Adversarial Networks, as opposed to Discriminative AI.

Discriminative AI refers to algorithms designed to distinguish between categories of data, classifying them or making predictions from them. Generative AI, however, can "generate" seemingly novel information through extensive training. Generative AI is great for creative endeavors: think advertising, sales, or marketing functions, or any scenario where artificially generated content can directionally help people channel their creativity and brainstorm better with a computer system. These happen to be contexts where synthetic data does not necessarily need to replicate the structural characteristics of any existing real-world data.
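
To make that distinction concrete, here is a minimal sketch on a toy two-column dataset (the data and the specific models below are purely illustrative): a discriminative model learns a boundary between labels, while a generative model learns the data distribution itself and can sample new, never-seen rows from it.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Toy "real" data: two labeled clusters in two dimensions.
X = np.vstack([rng.normal(0, 1, (500, 2)), rng.normal(3, 1, (500, 2))])
y = np.array([0] * 500 + [1] * 500)

# Discriminative: learn to separate the categories, i.e. P(label | features).
clf = LogisticRegression().fit(X, y)
print("Predicted labels for two points:", clf.predict([[0, 0], [3, 3]]))

# Generative: learn the data distribution itself, i.e. P(features),
# and then draw brand-new samples from it.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
synthetic_X, _ = gmm.sample(5)
print("Five synthetic rows:\n", synthetic_X)
```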

Where synthetic data generation gets challenging is in artificially generating an accurate replica of real-world data. Indulge me in a thought experiment please.

  • Remember that data in rows and columns can be modeled as a distribution represented as a mathematical function.
  • Think of a dataset that consists of two columns and multiple rows of information. Typically, one of these two columns depends on the other, so what can be mapped out, let's say, is a simple function in two variables x and y.
  • Let that be represented as a normal curve, i.e., a Gaussian distribution.

What this means is that there are multiple (x, y) pairs in the dataset which, when mapped into 2-D space, can be defined more or less (as a best fit) by that Gaussian curve.

  • Generating synthetic data of the above dataset means artificially generating a dataset that satisfies the following two conditions (sketched in code after this list) -

#1 No (x, y) pair of the artificially generated dataset should be the same as any pair in the original dataset

#2 The overall best-fit curve of the artificially generated dataset should be the same (or as close to the same as possible) as the original dataset’s curve.

  • This means that synthetic data generation is an endeavor of multiple iterations of these artificially generated curves, which can be visualized (crudely, might I add) as below.

(Figure: generating multiple synthetic datasets, represented as distributions, from the original dataset)
  • Now remember that this is Generative AI, not discriminative AI, and it is anybody's guess how these multiple generated curves turn out: what their area of intersection with the original dataset's curve might be, where the artificially generated curve and the original curve might diverge beyond a threshold acceptable for the business problem at hand, and so on.
  • To summarize, the probability of the generated dataset tracing a curve close to the original's while sharing none of its data points is abysmally low. And this was only an example with two columns and a function in two variables. Imagine the complexity with a dataset of hundreds of columns where multi-variate dependencies exist, and most or all of them need to be captured in the synthetic data. Think also of other kinds of data, such as videos and images, whose mathematical representations need to align with those of their synthetic peers.
  • Put another way, generating a synthetic dataset off of a real-world dataset is hard.
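
To ground the thought experiment, here is a rough sketch under a strong simplifying assumption, namely that the original dataset really is well described by a single multivariate Gaussian: fit the distribution's parameters, sample a synthetic dataset from the fit, and then check the two conditions above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Stand-in for the original two-column dataset.
original = rng.multivariate_normal(mean=[10.0, 5.0],
                                   cov=[[2.0, 0.8], [0.8, 1.0]],
                                   size=1_000)

# "Learn" the distribution: here, just the sample mean and covariance.
mu, sigma = original.mean(axis=0), np.cov(original, rowvar=False)

# Generate a synthetic dataset from the fitted distribution.
synthetic = rng.multivariate_normal(mu, sigma, size=1_000)

# Condition #1: no synthetic (x, y) pair should coincide with an original pair.
overlap = (original[:, None, :] == synthetic[None, :, :]).all(-1).any()
print("Exact rows copied from the original?", overlap)

# Condition #2: the synthetic distribution should track the original one.
# A two-sample Kolmogorov-Smirnov test per column is one crude check.
for col in range(original.shape[1]):
    ks = stats.ks_2samp(original[:, col], synthetic[:, col])
    print(f"Column {col}: KS statistic = {ks.statistic:.3f}, p-value = {ks.pvalue:.3f}")
```

Real tables are rarely this well behaved, which is exactly why condition #2 becomes so hard to satisfy once the number of columns and cross-column dependencies grows.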

Another important thing to consider is the set of second- and third-order problems with synthetic data -

  • For tabular data, there are aggregate first-order metrics like mean, median, and mode that characterize the data. Remember that the mean is the average of a set of numbers, the mode is the most frequently occurring value in the dataset, and the median is the middle number when you arrange the dataset from smallest to largest.
  • A combination of some of these first-order metrics makes up second-order aggregate metrics like variance, standard deviation, and correlation. Variance, for example, is a quantitative measure of the spread of data values around the mean, a first-order metric. Standard deviation, too, depends on the mean.
  • A combination of some of these second-order aggregate metrics makes third-order aggregate metrics and so on until the nth-order aggregate metrics.
  • Different kinds of applications need different orders of aggregate metrics to match between the original dataset and the synthetic dataset, and that too within a margin of error (a small sketch of such a check follows this list). This is hard to guarantee, especially given that Generative AI offers little explainability or control, making synthetically generated data that much harder to use.
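
To make these checks concrete, here is a minimal sketch with invented columns and an arbitrary 5% tolerance, comparing first- and second-order aggregate metrics between an original table and a stand-in synthetic one; which metrics matter, and how tight the tolerance should be, depend entirely on the application.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Invented "original" table: income depends on age, so correlation is meaningful.
age = rng.normal(40, 10, 1_000).round()
income = 1_500 * age + rng.normal(0, 5_000, 1_000)
original = pd.DataFrame({"age": age, "income": income})

# Stand-in for the output of a synthetic data generator (bootstrap + noise).
synthetic = (original.sample(frac=1.0, replace=True, random_state=7)
             + rng.normal(0, 1, original.shape))

def aggregate_metrics(df: pd.DataFrame) -> pd.DataFrame:
    """First-order (mean, median) and second-order (std, correlation) summaries.
    Mode is skipped since it is not very meaningful for continuous columns."""
    summary = df.agg(["mean", "median", "std"])
    summary.loc["corr_age_income"] = df["age"].corr(df["income"])
    return summary

orig_m, synth_m = aggregate_metrics(original), aggregate_metrics(synthetic)
relative_gap = (orig_m - synth_m).abs() / orig_m.abs()

print(relative_gap)
print("Within 5% on every metric?", bool((relative_gap < 0.05).all().all()))
```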

Further, think about using this synthetic data, with no explainability, guarantee, or control, to train yet another black box — your machine learning models — and what comes out is chaos.

This brings me to the other problem with synthetic data today.

#2 Businesses don’t fully trust synthetic data yet

At the risk of generalization, a common thread from my conversations with ML engineers was that synthetic data generation was the last resort for improving their machine learning models, reached only after trying everything from supervised learning, unsupervised learning, more fine-tuning, better input data quality in terms of labeling and annotation, more input data volume, etc.

And when they did experiment with synthetic data, their machine learning models did not see a meaningful bump in accuracy; in fact, it became harder to pinpoint false positives/negatives in their models.

This could perhaps be attributed to the fact that most research in this space is still very nascent, and has a long way to go. But nevertheless, what it means today is that while the problem that synthetic data can solve is real, businesses need to be convinced to experiment with it, let alone use it in production.

Founders building in this space are aware of this sentiment too, and are working around it in many ingenious ways: low-code/no-code synthetic data generation platforms, statistical rule-based engines (which are more explainable than Generative AI) for generating synthetic data, targeting use cases in data sharing as opposed to machine learning model training, etc.
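
For a sense of why the rule-based route is easier to trust, here is a toy sketch with made-up column rules: every value comes from an explicit rule a human can read and audit, which is precisely the explainability that learned generators struggle to offer.

```python
import random

# Each column is generated by an explicit, human-readable rule (all invented here).
RULES = {
    "age": lambda: random.randint(18, 80),
    "country": lambda: random.choices(["IN", "US", "DE"], weights=[0.5, 0.3, 0.2])[0],
    "plan": lambda: random.choice(["free", "pro", "enterprise"]),
}

def generate_rows(n: int) -> list[dict]:
    """Apply each column rule independently to produce n synthetic rows."""
    return [{column: rule() for column, rule in RULES.items()} for _ in range(n)]

# A dependent column can be expressed as another explicit rule.
def with_monthly_spend(row: dict) -> dict:
    base = {"free": 0, "pro": 20, "enterprise": 200}[row["plan"]]
    return {**row, "monthly_spend_usd": base + random.randint(0, 10)}

rows = [with_monthly_spend(r) for r in generate_rows(5)]
for r in rows:
    print(r)
```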

Of course, the nuance here, like many things in life, is that “it depends” — on the industry, the specific problem statement being addressed and its criticality, and the nature of data being synthesized (text, image, video, tabular, time-series, 3-dimensional, and so on). Some very large synthetic data generation companies do exist catering to very specific industries, problem statements, and types of data. If you have any insights to share here, please do reach out. We would be grateful for your learnings.

_________________________________________________________________

The market for synthetic data is exploding. This is a real, pressing, and massive problem for many industries and will only grow. Most companies in this space are still fairly early, indicating a potential opportunity in this space in the coming years.

Coupled with all the challenges that businesses face in this space today, this makes, we believe, for a great greenfield for technological innovations to break in and disrupt the market.

At Speciale Invest, we believe in supporting engineering-first innovations that have the potential to solve global, pressing problems.

We do one thing — seed-stage investing. We enjoy and thrive in the risk that comes with backing deep science and tech-focused startups right from the earliest stages. We like to get our hands dirty through the founders’ zero-to-one journey, and help with team hiring, achieving product-market fit, initial customers, and scale-up.

If you are working in the data infrastructure space, please feel free to reach out. We would love to hear from you, learn from your experiences as to what is working in the market, and help you in any way we can. Please write to us at shobhankita.reddy@specialeinvest.com or dhanush.ram@specialeinvest.com.
