Synthetic Data

Mubarak Bajwa
5 min read · Jan 14, 2019


Synthetic data?! Sounds made up. What is that? To put it simply, it is made-up data. But it's much more than that, and it's something you as a data scientist are likely to come across far more often than you might think.

Synthetic data is information that's artificially manufactured rather than generated by real-world events (i.e., made by a computer). It's created algorithmically, and it's used to stand in for test datasets, to validate mathematical models, and to train machine learning models.

So why the need for it? Data, as many of you know, is hard to gather, whether you scrape sites or hunt down readily available datasets, and when it comes to machine learning, the more data you have, the better your model will become. If you're Google, Apple, Facebook, or any of the other big tech companies, you can gather data more efficiently and at a larger scale than anyone else, simply because of your abundant resources and powerful infrastructure. The big players use the data collected from you and everyone around you using their services to train their machine learning models. But if you're not one of them, where do you get a mass of data that can accurately test and train your models? This is where synthetic data comes in handy.

It's inexpensive, easy to generate, and can be tailored specifically to your model. It helps the little guys, especially startups, accrue a mass of data right from the get-go. Healthcare and financial services are two industries that benefit from this: the techniques can manufacture data with attributes similar to actual sensitive or regulated data (think GDPR), which lets data professionals use and share data more freely. Researchers running clinical trials or other studies may generate synthetic data to create a baseline for future studies and testing while maintaining patient confidentiality.

Another example: intrusion detection software is tested using synthetic data. That data is a representation of the authentic data, and it may include intrusion instances that are not found in the authentic data, allowing the software to recognize those situations and react accordingly. If synthetic data were not used, the software would be trained only on the situations present in the authentic data and might not recognize another type of intrusion.

A paper presented by Kalyan Veeramachaneni and co-authors Neha Patki and Roy Wedge at the International Conference on Data Science and Advanced Analytics describes a machine learning system that automatically creates synthetic data, with the goal of enabling data science efforts that, for lack of access to real data, might otherwise never get off the ground. "Once we model an entire database, we can sample and recreate a synthetic version of the data that very much looks like the original database, statistically speaking," says Veeramachaneni. "If the original database has some missing values and some noise in it, we also embed that noise in the synthetic version… In a way, we are using machine learning to enable machine learning."
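The core idea of "model the data, then sample from the model" can be sketched in a few lines. This is a deliberately simplified toy, not the system from the paper: it treats one numeric table as multivariate Gaussian, estimates its mean and covariance, and samples a synthetic table with the same statistical shape. The column values here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Real" table: two correlated numeric columns
# (a toy stand-in for a real database table)
real = rng.multivariate_normal([50, 100], [[25, 18], [18, 36]], size=1000)

# Model the table: estimate its mean vector and covariance matrix
mu = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# Sample a synthetic table that looks like the original, statistically speaking
synthetic = rng.multivariate_normal(mu, cov, size=1000)
```

The synthetic rows preserve the columns' means and their correlation, but no synthetic row corresponds to any real record, which is exactly the privacy property being sold here.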

They created a model and used it to generate synthetic data for five different publicly available datasets. They then hired 39 freelance data scientists to answer this question: "Is there any difference between the work of data scientists given synthesized data, and those with access to real data?" They divided them into four groups, gave one group the real data, and gave the other three the synthetic data. Each group used its data to solve a predictive modeling problem, ultimately conducting 15 tests across the 5 datasets. In the end, when the solutions were compared, those produced by the group using real data and those produced by the groups using synthetic data showed no significant performance difference. "Companies can now take their data warehouses or databases and create synthetic versions of them," says Veeramachaneni. "So they can circumvent the problems currently faced by companies like Uber, and enable their data scientists to continue to design and test approaches without breaching the privacy of the real people — including their friends and family — who are using their services."

What does this type of data look like?

Code for creating a linear regression with noise

It can be as simple as this: generating noisy linear data that algorithms can then learn to identify.
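The original code screenshot is missing, so here is a minimal sketch of that idea. The slope, intercept, and noise level are invented for illustration: we fabricate points along a known line, add Gaussian noise, and check that a fitted regression recovers the line.

```python
import numpy as np

rng = np.random.default_rng(42)

# Fabricate data along a known line: y = 3x + 2, plus Gaussian noise
n = 200
x = rng.uniform(0, 10, size=n)
y = 3 * x + 2 + rng.normal(0, 1.5, size=n)

# Fit a line back to the synthetic points; it should land near (3, 2)
slope, intercept = np.polyfit(x, y, 1)
```

Because you chose the true parameters yourself, you know exactly what a correct model should find, which makes synthetic data great for sanity-checking an algorithm.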

Or it can look like this,

Here, a library called Faker is used to generate a random list of names and to generate random sentences built from key words.

Or like this,

Using produced images to train ai to keep track of inventory.

So when you're working on your next machine learning model and realize you need a ton of data but don't have the resources or time to gather it, remember: synthetic data can help you get the massive dataset you need to train your model.
