[Just released] AI-based synthetic data generator!

A new deep-learning-based synthetic data generator for an even smoother analysis experience when you can’t see the data!

Élodie Zanella
Sarus Blog
2 min read · Apr 20, 2023

We are thrilled to introduce the latest version of our synthetic data generation model. In addition to the univariate distribution of each column, the new model preserves the joint (multivariate) distribution across all columns of a table, so relationships between columns carry over to the synthetic data. This makes synthetic data an even more useful tool for analysts and data scientists who need insight into data they cannot directly access.

Synthetic data is an efficient way to prepare analyses, design machine learning pipelines, and debug or test code. It is the natural first step before running the analyses on the source data, which remains fully protected throughout:

Typical preliminary exploration made possible by Sarus synthetic data, which preserves the univariate and multivariate distributions of the source data
Comparison of real vs. synthetic data generated with the new Sarus generative model on different datasets and variables
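
To make the difference between univariate and multivariate fidelity concrete, here is a small sketch in plain Python. It does not use the Sarus model: the two-column table (`age`, `income`) is made up for illustration, and the "naive" generator simply shuffles one column independently. That naive approach reproduces each column's distribution exactly, yet it erases the relationship between the columns, which is the kind of structure a multivariate-aware generator aims to keep.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)

# Hypothetical "source" table with two correlated columns.
age = [rng.uniform(20, 65) for _ in range(5000)]
income = [1000 * a + rng.gauss(0, 8000) for a in age]

# Naive "synthetic" table: the income column is shuffled independently of age.
# Each column keeps exactly the same values (same univariate distribution),
# but the pairing between the two columns is destroyed.
naive_age = age[:]
naive_income = income[:]
rng.shuffle(naive_income)

real_corr = pearson(age, income)
naive_corr = pearson(naive_age, naive_income)

print(f"correlation in source data:      {real_corr:.3f}")
print(f"correlation after naive shuffle: {naive_corr:.3f}")
```

Any analysis that depends on how columns relate to each other (a regression, a group-by, a feature-importance check) would behave very differently on the naive table, even though per-column histograms look identical.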

This new deep-learning model was designed by the Sarus research team. It is based on Transformers and implemented in JAX, a Python library for high-performance numerical computing and machine learning. If you want to learn more, we published a research paper on the topic.

The model also integrates differential privacy to ensure that the generated synthetic data protects all personal information stored in the source data (more info on how to train a model in JAX with differential privacy).
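
This post does not describe how differential privacy is applied during training. A widely used technique is DP-SGD: compute a gradient per example, clip each one to a fixed norm, add Gaussian noise scaled to that norm, then average. The sketch below shows one such step for a toy linear model in plain Python rather than JAX, to keep it dependency-free. The function name `dp_sgd_step`, its parameters, and the toy data are all assumptions made for illustration, not the Sarus implementation, and the sketch omits the privacy accounting needed to report an actual (ε, δ) guarantee.

```python
import math
import random


def dp_sgd_step(weights, batch, lr, clip_norm, noise_multiplier, rng):
    """One DP-SGD step for a linear model with squared loss (illustrative)."""
    dim = len(weights)
    summed = [0.0] * dim
    for x, y in batch:
        # Per-example gradient of 0.5 * (w . x - y)^2 with respect to w.
        err = sum(w * xi for w, xi in zip(weights, x)) - y
        grad = [err * xi for xi in x]
        # Clip the gradient so its L2 norm is at most clip_norm.
        norm = math.sqrt(sum(g * g for g in grad))
        scale = min(1.0, clip_norm / (norm + 1e-12))
        for i in range(dim):
            summed[i] += grad[i] * scale
    # Add Gaussian noise calibrated to the clipping norm, then average.
    sigma = noise_multiplier * clip_norm
    noisy_mean = [(s + rng.gauss(0.0, sigma)) / len(batch) for s in summed]
    return [w - lr * g for w, g in zip(weights, noisy_mean)]


# Toy usage: recover the weights of a noiseless linear relationship.
rng = random.Random(0)
true_w = [2.0, -1.0]
data = []
for _ in range(256):
    x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    data.append((x, true_w[0] * x[0] + true_w[1] * x[1]))

weights = [0.0, 0.0]
for _ in range(300):
    batch = rng.sample(data, 64)
    weights = dp_sgd_step(weights, batch, lr=0.5, clip_norm=1.0,
                          noise_multiplier=1.0, rng=rng)

print("learned weights:", weights)
```

Clipping bounds how much any single record can move the model, and the noise masks what remains of that influence. In JAX, the per-example gradients would typically be obtained by combining `jax.vmap` with `jax.grad` instead of an explicit loop.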

This new model helps analysts and data scientists work with sensitive data that they cannot directly access, opening up many privacy-safe analysis use cases in healthcare, finance, energy, HR, and more. It is useful wherever companies or public authorities want to leverage data to innovate but must protect that data for security, compliance, or ethical reasons!

Want to see what high-fidelity synthetic data looks like? Reach out!
