Exploring Synthetic Data Use Cases

How DataFabrica Aims to Democratize Data Knowledge through Synthetic Data

Sadrach Pierre, Ph.D.
DataFabrica
Dec 12, 2023

Image by Pixabay on Pexels

Synthetic data is information that is artificially generated using algorithms and domain expertise. Synthetic data simulates the characteristics and patterns of real events and observations without compromising private or personal information. With the right domain expertise, it is possible to accurately capture the distributions and patterns of real events.
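As a rough illustration of what "artificially generated using algorithms and domain expertise" can look like for tabular data, here is a minimal sketch that samples made-up insurance claim records from hand-picked distributions. The column names, the plan types and every distribution parameter are assumptions chosen for illustration; they are not taken from any real data set or from DataFabrica's generation process.

```python
import random

def synthetic_claims(n, seed=42):
    """Sample n made-up claim records from assumed distributions."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    rows = []
    for i in range(n):
        # Assumption: ages are roughly normal around 45, clipped to 18..90
        age = min(90, max(18, round(rng.gauss(45, 15))))
        # Assumption: claim amounts are right-skewed, so use a log-normal
        amount = round(rng.lognormvariate(6.0, 1.0), 2)
        # Assumption: hypothetical plan mix of 50% / 40% / 10%
        plan = rng.choices(["HMO", "PPO", "EPO"], weights=[0.5, 0.4, 0.1])[0]
        rows.append({"claim_id": f"C{i:05d}", "age": age, "plan": plan, "amount": amount})
    return rows

claims = synthetic_claims(1000)
print(claims[0])
```

In practice the domain expertise goes into choosing these distributions and the relationships between columns, which this sketch deliberately ignores.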

Synthetic data is useful and often necessary in many contexts. Whenever data contains real personal information, such as medical history or financial records, it is protected by laws and regulations. For instance, consumer credit information is protected by the Fair Credit Reporting Act (FCRA), and patient medical records are protected under the Health Insurance Portability and Accountability Act (HIPAA). This is for good reason, since it ultimately protects the consumer, but it leaves a gap: companies, researchers, students and educators have limited access to these types of data. As a result, the data, the use cases and the associated learnings are inaccessible to many people who are otherwise interested in the problems in these spaces. Synthetic data can bridge this gap, allowing those people to learn from these types of data without the worry of handling sensitive or protected data.

Synthetic data can come in a variety of forms including tabular data, image data, text data, video data and audio data. For example, many health technology companies use synthetic image data in place of sensitive patient records. This allows companies to proceed with developing tools, applications and products without compromising the privacy of patients. Several companies also use synthetic image data for a wide variety of use cases including autonomous vehicles, virtual reality, object detection algorithms and more. Synthetic audio data has been applied to several use cases including speech recognition, music generation, speech-to-text applications and more.

There is a wide variety of applications and use cases for synthetic data. These include product prototyping, data augmentation, research, and education. Consider a health technology startup that would like to develop a diagnostic device that can detect the rare disease Niemann-Pick Disease.

Image Created by Author (Brain CT scans)

Niemann-Pick Disease is a rare condition where excess fats build up in organs within the body such as the brain, liver, lungs and bone marrow. Because diagnosing this disease requires sensitive patient records, for example an MRI or CT scan, real examples of patients with this disease are protected under HIPAA. This means that a company that wishes to develop the diagnostic tool needs to clear the regulatory hurdles required to access the data, which can take months to years. Alternatively, the company can purchase non-sensitive synthetic imaging data that realistically represents healthy patients and patients with Niemann-Pick Disease.

Now suppose you are a healthcare institution with access to real patient data, but the examples of patients with Niemann-Pick Disease are significantly fewer than those of healthy patients. If you are building a rare disease detection tool and have very few examples of patients with Niemann-Pick Disease, the detection algorithm is likely to perform poorly. This makes data augmentation another use case for synthetic data. When too few examples of an observation are available, companies or researchers often augment real data with synthetic data to increase the number of those observations in the data set.
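To make the augmentation idea concrete, here is a minimal sketch for tabular numeric features. It creates new minority-class rows by interpolating between random pairs of real rows, which is loosely in the spirit of SMOTE but is a simplification, not SMOTE itself. The feature values are made up, and imaging data like the CT scans discussed above would call for very different techniques.

```python
import random

def augment_minority(samples, n_new, seed=0):
    """Create n_new synthetic rows by interpolating between random pairs of real rows."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(samples, 2)  # two distinct real minority rows
        t = rng.random()               # interpolation weight in [0, 1)
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic

# Hypothetical feature vectors for the rare class (made-up numbers, not patient data)
rare_cases = [[0.82, 1.4], [0.91, 1.1], [0.77, 1.6]]
new_rows = augment_minority(rare_cases, n_new=5)
augmented = rare_cases + new_rows
print(len(augmented))  # 8
```

Each synthetic row lies on the line segment between two real rows, so it stays within the range of the observed minority examples.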

In addition to product prototyping and data augmentation, synthetic data can also help research institutions test deep learning models for tasks such as object identification, speech recognition and more. Further, synthetic data can open the door to a wider variety of business use cases for educators teaching machine learning and data science. It is unfortunate that the types of problems we can teach our students to solve are currently limited by whether the data is sensitive. Data-oriented problems and the learnings from their solutions should be democratized, as this will only help us as a society get to better solutions faster. Synthetic data does not just democratize data; it democratizes data knowledge by making it more easily accessible.

Product Development

Many companies have used synthetic data to develop products due to limited access to real data. For example, consider Alexa, a voice-activated virtual assistant that can help Amazon members play music, place Amazon orders, read stories, tell jokes and more.

Amazon Echo Dot with Alexa

In the past, Amazon used synthetic data to bootstrap new language releases for Alexa. This is a great example of a company not letting their limited access to data thwart their ability to develop a new product or product feature.

Consider the health insurance company Anthem. Anthem partnered with Google to build a synthetic data platform that can generate petabytes of synthetic medical patient data. This helped expedite the development process for fraud detection tools and personalized care.

Research & Development

Many research institutions have used synthetic data to bootstrap their AI development efforts as well. For example, researchers at MIT released CoSy, a configurable system that is used to generate data to train deep learning models used in self-driving cars.

Further, UC Davis researchers were awarded a $1.2 million grant to generate high-quality healthcare data. The goal of this project is to develop machine learning models that can help researchers diagnose, predict and treat diseases.

Data Science & Machine Learning Education

Synthetic data also has enormous potential in the space of data science and machine learning education. Curriculums can be more specialized with synthetic data. For example, a curriculum on machine learning applications in healthcare can be powered purely by synthetic data. Students would be able to learn about many healthcare use cases that would otherwise require access to highly sensitive data. These include predicting patient readmission, assessing patient risk for insurance companies, predicting and diagnosing rare diseases, and much more. These problems and the knowledge that comes from their solutions can be accessible to a wider audience through synthetic data usage.

DataFabrica’s Mission

DataFabrica is an online data marketplace that aims to further democratize data and knowledge from data. DataFabrica provides affordable, ready-to-use, realistic synthetic data in retail, healthcare and finance industry verticals. At DataFabrica we design and curate our data sets to accurately represent real business scenarios. In retail this includes business use cases such as customer segmentation, customer churn analysis, product recommendation and more. Within the healthcare industry we provide synthetic patient payer data and diagnostic imaging data for rare disease detection.

Whether you are founding a tech startup that requires sensitive data, are a researcher at an academic institution, or are simply an educator of data science, at DataFabrica we believe anyone should be able to access the types of data they are interested in using.

Check out the following articles which explore some of the available synthetic datasets on DataFabrica.

Healthcare Analytics

Exploring Healthcare Patient Payer Data in Python

Image by Pixabay on Pexels

Unlike many other types of data, payer claim data is protected by the Health Insurance Portability and Accountability Act (HIPAA). This makes it difficult for many health tech startups to innovate in the healthcare space. Synthetic payer claims data is a good option for small players in the space, as it enables companies to build out proofs of concept (PoCs) without the hassle of acquiring sensitive patient data.

This blog tutorial walks through how to perform exploratory data analysis on the Synthetic Healthcare Patient Payer Claims data available on DataFabrica. The free tier is free to download, modify, and share under the Apache 2.0 license.

Using Pareto Analysis to Analyze Patient Readmission

Image Created by Author (This plot is for illustrative purposes and does not reflect real data)

This blog walks through a tutorial on how to identify top causes of patient readmissions using Pareto Analysis. The Synthetic Healthcare Emergency Room Readmission data is available on DataFabrica. The free tier is free to download, modify, and share under the Apache 2.0 license.
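For readers new to the technique, a Pareto analysis ranks categories by frequency and tracks the cumulative share of the total, so the few causes that account for most events stand out. The sketch below is a minimal illustration only: the cause labels and counts are invented, the 80% cutoff is the conventional rule of thumb, and none of it is taken from the tutorial or the DataFabrica data set.

```python
from collections import Counter

def pareto_table(causes):
    """Return (cause, count, cumulative_percent) rows sorted by descending count."""
    counts = Counter(causes).most_common()
    total = sum(count for _, count in counts)
    rows, running = [], 0
    for cause, count in counts:
        running += count
        rows.append((cause, count, round(100 * running / total, 1)))
    return rows

# Hypothetical readmission causes (made-up labels and counts)
readmissions = ["infection"] * 5 + ["medication error"] * 3 + ["fall"] + ["other"]
table = pareto_table(readmissions)
for row in table:
    print(row)

# Collect the top causes until they cover at least 80% of readmissions
vital_few = []
for cause, _, cumulative in table:
    vital_few.append(cause)
    if cumulative >= 80:
        break
print(vital_few)  # ['infection', 'medication error']
```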

Retail Analytics

Image by Andrea Piacquadio on Pexels

Customer Segmentation with Credit Card Transaction Data

Image Created by Author

This blog post walks through how to use recency, frequency and monetary (RFM) scores to generate customer segments using the Synthetic Credit Card Transaction data available on DataFabrica. The data contains synthetic credit card transaction amounts, credit card information, transaction IDs and more. The free tier is free to download, modify, and share under the Apache 2.0 license.
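As a quick reminder of what RFM scoring involves, the sketch below aggregates transactions into recency (days since last purchase), frequency (number of purchases) and monetary (total spend) values, then scores each customer by rank on each metric. It is a simplified assumption-laden example: the customers, dates and amounts are invented, and the rank-based scoring stands in for the quantile bins that RFM analyses typically use. It is not the method from the linked post.

```python
from collections import defaultdict
from datetime import date

def rfm_table(transactions, today):
    """Aggregate (customer_id, date, amount) rows into (recency_days, frequency, monetary)."""
    last_seen, freq, spend = {}, defaultdict(int), defaultdict(float)
    for cust, day, amount in transactions:
        last_seen[cust] = max(day, last_seen.get(cust, day))
        freq[cust] += 1
        spend[cust] += amount
    return {c: ((today - last_seen[c]).days, freq[c], spend[c]) for c in last_seen}

def rank_scores(values, higher_is_better=True):
    """Score each customer from 1 (worst) to N (best) by rank on a single metric."""
    ordered = sorted(values, key=values.get, reverse=not higher_is_better)
    return {cust: i + 1 for i, cust in enumerate(ordered)}

# Hypothetical transactions (made-up customers, dates and amounts)
txns = [
    ("A", date(2023, 11, 30), 120.0), ("A", date(2023, 12, 5), 80.0),
    ("B", date(2023, 9, 1), 300.0),
    ("C", date(2023, 12, 10), 20.0), ("C", date(2023, 12, 11), 25.0),
    ("C", date(2023, 10, 2), 15.0),
]
rfm = rfm_table(txns, today=date(2023, 12, 12))

# A small recency is good; large frequency and monetary values are good
r = rank_scores({c: v[0] for c, v in rfm.items()}, higher_is_better=False)
f = rank_scores({c: v[1] for c, v in rfm.items()})
m = rank_scores({c: v[2] for c, v in rfm.items()})
segments = {c: f"{r[c]}{f[c]}{m[c]}" for c in rfm}
print(segments)
```

Customers with the same three-digit code (or codes in the same range) can then be grouped into a segment, such as recent, frequent, low-spend buyers.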

Personalized Marketing with Customer Segmentation and Collaborative Filtering

This blog post discusses how to use customer segmentation and collaborative filtering to generate personalized product recommendations. It also uses the Synthetic Credit Card Transaction data.
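To give a feel for collaborative filtering, here is a minimal user-based sketch: it measures how similar two customers' purchase histories are with cosine similarity, then recommends items that similar customers bought and the target customer has not. The customers, products and quantities are invented for illustration, and the approach is an assumption about one common variant rather than the method used in the linked post.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recommend(purchases, customer, top_n=2):
    """Rank items the customer has not bought by similarity-weighted purchases of others."""
    items = sorted({item for basket in purchases.values() for item in basket})
    vectors = {c: [purchases[c].get(item, 0) for item in items] for c in purchases}
    scores = {}
    for other in purchases:
        if other == customer:
            continue
        sim = cosine(vectors[customer], vectors[other])
        for item, qty in purchases[other].items():
            if item not in purchases[customer]:
                scores[item] = scores.get(item, 0.0) + sim * qty
    # Keep only items backed by at least one similar customer
    ranked = sorted((i for i in scores if scores[i] > 0), key=scores.get, reverse=True)
    return ranked[:top_n]

# Hypothetical purchase counts per customer (made-up data)
purchases = {
    "A": {"coffee": 3, "filters": 1},
    "B": {"coffee": 2, "filters": 1, "grinder": 1},
    "C": {"tea": 4, "kettle": 1},
}
print(recommend(purchases, "A"))  # ['grinder']
```

Customer A overlaps with B but not with C, so only B's extra item is recommended. Restricting the comparison to customers within the same segment is one way segmentation and collaborative filtering can be combined.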

Feel free to download and explore the free tier versions of the data!

Conclusions

While some types of data should surely be protected, the patterns and learnings from these data should be available to everyone. At DataFabrica we believe that anyone interested in learning about the problems and solutions involving sensitive data should be able to easily access that information. The more people who are aware of the problems that are involved with these types of data, the faster we as a society can come up with better solutions.

Sadrach Pierre, Ph.D.
DataFabrica

Writer for Built In & Towards Data Science. Cornell University Ph.D. in Chemical Physics.