Synthetic Data

Surya_Nuchu
Apr 26 · 3 min read

The world is the biggest data problem — Andrew McAfee

Introduction

Every year the world generates more data than the year before. According to International Data Corporation (IDC), an estimated 59 zettabytes of data would be “created, captured, copied, and consumed” in 2020.

Although the volume of data is burgeoning, that does not mean everyone can access it. Companies and organizations are concerned about the privacy of their users, and the impact of Covid-19 has led to the shutdown of research labs and other organizations. Without access to observed data, it is hard to train machine learning models or meet other industry needs. Enter synthetic data, which McGraw-Hill defines as “any production data applicable to a given situation that is not obtained by direct measurement”.

Synthetic Data and its real-world use cases

What is Synthetic Data?

As the name suggests, synthetic data is artificially created rather than being generated by actual events. It is often made with the help of algorithms and is used for a wide range of activities, including test data for new products and tools, model validation, and AI model training.

Synthetic data is inexpensive to produce and can support AI and deep learning model development as well as software testing. Data privacy is one of its most important advantages. User data often includes personally identifiable information (PII) and protected health information (PHI), and synthetic data lets companies build software without exposing that user data to developers or software tools.
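To make the idea concrete, here is a minimal base R sketch. It is only an illustration of "artificially created" data and not the method used by any particular tool: the observed values are simulated, and the column name heart_rate and the choice of a normal distribution are assumptions made for this example.

```r
# Toy "observed" data, simulated here so the example is self-contained.
set.seed(42)
observed <- data.frame(heart_rate = round(rnorm(100, mean = 74, sd = 7)))

# Estimate simple summary statistics from the observed values.
mu <- mean(observed$heart_rate)
sigma <- sd(observed$heart_rate)

# Draw new, artificial values from the fitted normal distribution.
# The values come from the model, not from any individual record.
synthetic <- data.frame(
  heart_rate = round(rnorm(nrow(observed), mean = mu, sd = sigma))
)

# The two columns should have broadly similar summaries.
summary(observed$heart_rate)
summary(synthetic$heart_rate)
```

Real synthesizers model the relationships between several variables at once, which is what a dedicated package adds on top of this single-column idea.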

Real-world use cases

  • Amazon using synthetic data to train Alexa’s language system

Generating Synthetic Data in R

The synthpop package is an add-on package to the statistical software R. It is freely available from the Comprehensive R Archive Network (CRAN). It can be downloaded and installed, for example, from inside an R session via

install.packages("synthpop")

Once the synthpop package is installed, it needs to be attached to the current R session by the command

library(synthpop)

To generate synthetic data and test its quality, a real-world data set is used[1].

Load the data into the R workspace

df_observed <- read.csv(file = "/Users/reputation/HeartRate.csv")

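Generate the synthetic data using syn(), where m specifies the number of synthetic data sets to produce, method = "cart" selects classification and regression trees as the synthesis method, and cart.minbucket sets the minimum number of observations in a terminal node of each tree. The observed data set contains body temperature, sex, and heart rate as variables.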

df_synthetic <- syn(df_observed, m = 10, method = "cart", cart.minbucket = 10)
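For readers without the CSV file, the following is a small self-contained sketch of the same call. The data frame df_demo is simulated, and its column names and distributions are assumptions chosen to resemble the variables described above, not the real data set. The seed argument is added only to make the run reproducible.

```r
library(synthpop)

# Simulated stand-in for the heart rate data; names and values are assumptions.
set.seed(1)
n <- 130
df_demo <- data.frame(
  BodyTemperature = round(rnorm(n, mean = 98.2, sd = 0.7), 1),
  Sex = factor(sample(c("F", "M"), n, replace = TRUE)),
  HeartRate = round(rnorm(n, mean = 74, sd = 7))
)

# m = 3 requests three synthetic copies of the data.
syn_demo <- syn(df_demo, m = 3, method = "cart", cart.minbucket = 10, seed = 123)

# With m > 1, syn_demo$syn is a list of data frames, one per synthetic data set.
length(syn_demo$syn)
head(syn_demo$syn[[1]])
```

Each element of the list has the same columns as the input, so any single synthetic copy can be handed to downstream code in place of the observed data.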

compare() can be used to compare df_observed and df_synthetic. It plots the distributions of the selected variables side by side, which makes any difference between the observed and the synthetic data easy to see.

compare(df_synthetic, df_observed, vars = "HeartRate")
Figure(a): Comparing observed data and synthetic data for HeartRate using the cart method
compare(df_synthetic, df_observed, vars = "BodyTemperature")
Figure(b): Comparing observed data and synthetic data for BodyTemperature using the cart method

By varying the method argument of syn() we can generate synthetic data with different patterns.
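As a rough sketch of what varying the method looks like, the example below synthesizes the same simulated data twice, once with "cart" and once with "parametric". The data frame df_demo and its columns are assumptions made for illustration, and the choice of "parametric" as the second method is mine rather than the one used for the figures in this article.

```r
library(synthpop)

# Simulated stand-in data; column names and values are assumptions.
set.seed(2)
n <- 130
df_demo <- data.frame(
  BodyTemperature = round(rnorm(n, mean = 98.2, sd = 0.7), 1),
  HeartRate = round(rnorm(n, mean = 74, sd = 7))
)

# Same data, two different synthesis methods (m defaults to 1).
syn_cart <- syn(df_demo, method = "cart", seed = 123)
syn_param <- syn(df_demo, method = "parametric", seed = 123)

# syn() records the method that was applied to each variable.
syn_cart$method
syn_param$method

# Compare each synthetic version against the observed data for one variable.
compare(syn_cart, df_demo, vars = "HeartRate")
compare(syn_param, df_demo, vars = "HeartRate")
```

Looking at the two compare() plots next to each other is a quick way to judge which method reproduces a given variable more closely.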

Figure(c): Z-value comparison between observed and synthetic data

Conclusion

In this article, I presented the fundamental importance of Synthetic Data and the functionality of the R package named “synthpop” for generating synthetic versions of microdata containing confidential information.

References

  1. https://tuvalabs.com/datasets/body_temperature_sex__heart_rate/activities

Thanks for reading and good luck — Surya Nuchu

Analytics Vidhya


