What is synthetic data?
In 5 minutes
Let’s start with a little game.
Can you guess which of these people actually exist in real life?
Do you have your answer?
Why synthetic data?
Nowadays, we can find examples of AI systems everywhere: when we search for something on Google, when we receive recommendations on what to watch on Netflix or YouTube, when our word processor suggests new ways to express something or when we search for the shortest route on Google Maps.
For a computer to learn to perform such difficult tasks well, it needs a vast amount of data, which is not always available. High-quality data is difficult to obtain because it is very expensive to acquire (e.g. medical images), because there are no easily accessible tools to gather it (e.g. real-world scenarios), or because of privacy concerns. Privacy regulations have become more restrictive, and while they are necessary to protect individuals, they also make it harder for researchers to access data and therefore to obtain valuable research findings.
Can we share data safely?
Privacy-enhancing technologies are a promising way to facilitate data sharing while preserving the privacy of individuals. In fact, Gartner identified Privacy-Enhancing Computation and Generative AI as two of its 12 top strategic technology trends for 2022 and predicted that, by 2024, 60% of the data used for the development of AI and analytics projects would be synthetically generated.
What is synthetic data?
Synthetic data generation is a modelling technique that allows us to generate data that is synthetic but highly realistic.
Realistic means that it retains the same statistical properties as the original dataset, so we should reach the same conclusions as we would with the real version.
Synthetic means that this dataset is no longer the original one, and individual subjects or entities should not be identifiable.
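To make those two properties a bit more concrete, here is a minimal sketch in Python. It is not a GAN and not a real privacy guarantee: the "real" dataset is made up (hypothetical ages), and the "generative model" is just a Gaussian fitted to it. The point is only to show what "same statistical properties, different records" looks like.

```python
import random
import statistics

random.seed(0)

# Hypothetical "real" dataset: ages of 1,000 people (made-up numbers).
real_ages = [random.gauss(40, 12) for _ in range(1000)]

# The simplest possible generative model: fit a Gaussian to the real data.
mu = statistics.mean(real_ages)
sigma = statistics.stdev(real_ages)

# Sample a brand-new dataset from the fitted model.
synthetic_ages = [random.gauss(mu, sigma) for _ in range(1000)]


def summary(values):
    """Return (mean, standard deviation) of a list of numbers."""
    return statistics.mean(values), statistics.stdev(values)


real_mean, real_std = summary(real_ages)
synth_mean, synth_std = summary(synthetic_ages)

# "Realistic": the summary statistics should be close to each other.
print(f"real:      mean={real_mean:.1f}, std={real_std:.1f}")
print(f"synthetic: mean={synth_mean:.1f}, std={synth_std:.1f}")

# "Synthetic": a naive sanity check that no record was copied verbatim.
overlap = set(real_ages) & set(synthetic_ages)
print(f"records copied from the real data: {len(overlap)}")
```

Keep in mind that "no copied records" is only a sanity check. Showing that individuals are truly not identifiable takes a much more careful analysis than this.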
How can we generate realistic synthetic data?
Generative Adversarial Networks (GANs) have gained a lot of traction in the synthetic data field after showing promising results.
GANs are a type of Deep Learning model proposed by Ian J. Goodfellow and colleagues in 2014 and have evolved into many different architectures since.
A GAN is composed of two neural networks that are trained simultaneously: a generator, which is able to generate new samples, and a discriminator, which tries to detect whether each sample is real or fake.
GANs allow us to generate very realistic synthetic data
A common analogy, and one I love, is that of the art forger and the art inspector.
In this example, we have an art forger (the generator) who tries to forge paintings, and an art inspector (the discriminator) who tries to detect the imitations. The two are constantly trying to outsmart each other: the better the art forger becomes at creating imitations, the better the art inspector needs to be at telling real paintings from imitations.
The art forger (the generator) tries to forge paintings and the art inspector (the discriminator) tries to detect imitations of paintings
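If you like to see things in code, the forger-versus-inspector game can be sketched in a few lines of plain Python. This is a toy under heavy assumptions, not how real GANs are built: the "paintings" are single numbers drawn around a hypothetical value of 4, the forger is not a neural network but a single learnable offset `b`, and the inspector is a one-input logistic model. All names (`generate`, `inspect`, `REAL_MEAN`) are made up for this illustration.

```python
import math
import random

random.seed(0)

REAL_MEAN = 4.0  # hypothetical "real paintings": numbers centred on 4


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def sample_real():
    return random.gauss(REAL_MEAN, 1.0)


def generate(offset):
    """Forger (generator): turns random noise into a fake sample."""
    return offset + random.gauss(0.0, 1.0)


def inspect(x, w, c):
    """Inspector (discriminator): estimated probability that x is real."""
    return sigmoid(w * x + c)


b = 0.0          # forger's parameter, starts far from the real data
w, c = 0.0, 0.0  # inspector's parameters
lr, batch = 0.05, 32

for step in range(5000):
    # 1) Inspector update: gradient ascent on log D(real) + log(1 - D(fake)).
    grad_w = grad_c = 0.0
    for _ in range(batch):
        x_real, x_fake = sample_real(), generate(b)
        p_real, p_fake = inspect(x_real, w, c), inspect(x_fake, w, c)
        grad_w += (1 - p_real) * x_real - p_fake * x_fake
        grad_c += (1 - p_real) - p_fake
    w += lr * grad_w / batch
    c += lr * grad_c / batch

    # 2) Forger update: gradient ascent on log D(fake),
    #    i.e. move the fakes towards what the inspector calls "real".
    grad_b = 0.0
    for _ in range(batch):
        x_fake = generate(b)
        grad_b += (1 - inspect(x_fake, w, c)) * w
    b += lr * grad_b / batch

fakes = [generate(b) for _ in range(5)]
print(f"forger's learned offset: {b:.2f} (real data is centred on {REAL_MEAN})")
print("a few forged samples:", [round(x, 2) for x in fakes])
```

The two updates alternate, just like the forger and the inspector taking turns. If training behaves, `b` should drift from 0 towards the real centre, although adversarial training is known to oscillate, so do not expect an exact match.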
There are many alternatives to GANs for generating synthetic data, such as autoregressive models and variational autoencoders. But let’s leave those for another time ;)
Now it’s time for answers
Do you remember your answer to the game?
Both images are fake!
Both images were generated with a GAN; neither of them exists in the real world. Realistic, right?
Did I really fool you? Let me know in the comments section below ;)!
👏 Hi reader. Thank you for giving me your feedback. This is my first post, so I would appreciate it if you clapped and followed me if you liked it. Follow me if you want to continue reading about Synthetic Data, data ethics and privacy.