Synthetic Data Generation for Data Engineering

Rule-Based Algorithms for Synthetic Data Generation

Benedict Soh
d*classified
8 min read · Jul 19, 2023


Benedict Soh (Engineer, Enterprise Digital Services — Data Platforms) discusses the use of synthetic data for data engineering as part of his work developing a data analytics toolkit, Tool-Ally. He will explore different approaches to synthetic data generation in a series of articles; this first article dives into rule-based methods.

This work was developed as part of Tool-Ally, a data analytics toolkit. Tool-Ally offers a set of customisable and reusable components to streamline and automate various stages of the data analytics lifecycle, helping data scientists perform common tasks more efficiently and overcome challenges along the way. More details about Tool-Ally can be found here.


Introduction

Data is a double-edged sword. While it holds tremendous potential in driving data-enabled decisions, it can also be abused if an unauthorised person gains access to it.

Synthetic data tackles this issue by enriching our data while limiting risks in the event of data leakage.

Synthetic data is information that’s artificially manufactured rather than generated by real-world events.

By generating synthetic data, you gain access to a larger pool of data in a shorter period of time at a much lower cost. You also have the flexibility of generating data to meet the specific needs of an application.

Purpose of Synthetic Data

  1. Enhance machine learning training and testing
    - Generate synthetic data when there is a lack of real-world data available with the aim of improving the performance of machine learning models
    - In cases such as anomaly detection, synthetic data can be used to simulate rare or edge-case scenarios and test the robustness of the model against them
  2. Data Privacy & Masking for Data Sharing and Preview
    - Protect sensitive data by masking Personal Identifiable Information (PII), while preserving the statistical characteristics of the original data
    - Allow consumers to preview data before requesting it in data-sharing portals
  3. Software Product Testing and Development
    - To generate realistic test data for testing models, algorithms, pipelines, and systems before deployment

Types of Synthetic Data and How to Generate Them

  • Tabular Data: By identifying the main data types you have in your data, devise a method to generate synthetic data for each data type. This will be further explained in the next section.
  • Text Data: NLP models such as BERT, XLNet, and GPT-2 can generate text from a seed phrase. An example is the Obama RNN, which generates synthetic political speeches after being trained on Obama's past speeches.
  • Video/Audio/Image: Using GANs or autoencoders, synthetic media can be generated for marketing, communication, or education at a low cost. StyleGAN, for example, can generate photorealistic faces of people who do not exist.

Scope

This article focuses on the creation of a rule-based model for synthetic data generation for data engineering.

Synthetic data is generated by applying a sampling technique to real-world data or by creating simulation scenarios where models and processes interact to create completely new data not directly taken from the real world.

In data engineering, we are particularly interested in data privacy and data pipeline development and testing.


Objectives

The data generated should:

  1. Contain a subset of values of the original data,
  2. Be in the same data format as the original dataset,
  3. Not have any identifiable data which can potentially be traced back to an individual,
  4. Be resistant to reverse engineering that could uncover the underlying algorithm,
  5. Not contain unseen data unless required by specific business requirements.

Methodology

Method 1: Heuristic

By grouping your data into three data types, synthetic tabular data can be generated in the following ways:
- Date: Retrieve the earliest and latest date. Generate a random date within this range.
- Number: Retrieve the minimum and maximum number to create a range. Generate a random number within this range.
- String: For fields with a fixed set of values (e.g. months), generate a random string from within this set. For fields without a fixed set of values, define a regex pattern to generate the string, or use NLP models such as BERT/GPT-2 to generate free text for better representation. Non-string types such as booleans and numbers can be treated as strings in certain conditions.
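As a rough illustration, here is a minimal Python sketch of the heuristic approach. The column names, ranges, and value sets are hypothetical and would normally be derived from the original dataset.

import random
import string
from datetime import date, timedelta

def random_date(earliest: date, latest: date) -> date:
    """Pick a random date between the earliest and latest observed dates."""
    span = (latest - earliest).days
    return earliest + timedelta(days=random.randint(0, span))

def random_number(minimum: float, maximum: float) -> float:
    """Pick a random number within the observed min-max range."""
    return random.uniform(minimum, maximum)

def random_string(allowed_values=None, length: int = 8) -> str:
    """Pick from a fixed value set if one exists, otherwise build a random string."""
    if allowed_values:
        return random.choice(list(allowed_values))
    return "".join(random.choices(string.ascii_uppercase, k=length))

# Hypothetical usage: in practice the ranges and value sets come from the real data.
row = {
    "txn_date": random_date(date(2020, 1, 1), date(2023, 6, 30)),
    "amount": round(random_number(10.0, 5000.0), 2),
    "month": random_string({"Jan", "Feb", "Mar"}),
}
print(row)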


A caveat with this method is that the business logic behind a dataset is lost, since the data is generated from scratch and each column is treated as independent.

Method 2: Sampling

Compared to the first method, the second method uses a top-down approach where you start with a full dataset that will be filtered down to a synthetic dataset:

  1. Sample a subset of data from the original dataset (row-wise and column-wise)
  2. Remove sensitive information such as full names in clear text, IDs, mobile phone numbers, and addresses that do not provide any analytical value
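A minimal pandas sketch of this sampling approach is shown below; the sensitive column names (full_name, nric, mobile_number, address) are hypothetical and should be adapted to your schema.

import pandas as pd

def sample_synthetic(df: pd.DataFrame, frac: float = 0.1, seed: int = 42) -> pd.DataFrame:
    """Sample a row-wise subset and drop columns with no analytical value."""
    # Hypothetical list of sensitive columns; adapt to your schema.
    sensitive_columns = ["full_name", "nric", "mobile_number", "address"]
    sampled = df.sample(frac=frac, random_state=seed)
    return sampled.drop(columns=[c for c in sensitive_columns if c in sampled.columns])

# Example usage with a toy frame.
df = pd.DataFrame({
    "full_name": ["Alice Tan", "Bob Lim"],
    "department": ["Finance", "HR"],
    "salary": [5200, 4800],
})
print(sample_synthetic(df, frac=0.5))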

This method ensures that business logic stays in place, and different datasets can be joined for analysis. However, the data generated is more sensitive. If you are looking to generate data for app testing or pipeline development where there is no analytics involved, the first method is preferred.

Benefits of an In-House Algorithm

  1. You can customise your algorithm to fit the use cases provided by your users
  2. It is difficult to reverse-engineer the data to reveal the original dataset, especially if you are generating from scratch, or if you inject randomness into the algorithm
  3. The program is scalable and reproducible. Set a seed to reproduce your results
  4. You control the amount of data you are generating according to your needs (e.g. storage space constraints, stress testing)

Considerations

Data Privacy

Following the principle of least privilege, subset only the rows and columns that are required. Generate only the minimal data needed for analysis, pipeline development, or model development. This minimises damage in the event of data leakage.

Handle your personal information carefully:

  1. Mask PII. Ensure that there is no clear identifiable personal information. IDs typically have a consistent format. Hence, regex does well in identifying unmasked information.
  2. Detect hidden sensitive data. Personal information can hide inside free-text fields: in feedback forms, for instance, users may include their phone numbers or the full name of the service personnel. Use regex to detect and replace such sensitive information. For example, the following regex can be used to detect Singapore NRICs:
[sStTfFgG]\d{7}[a-zA-Z]
Source: enCRYPT
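As a minimal sketch, the pattern above can be applied to free-text columns with Python's re module; the replacement token [MASKED_ID] is an arbitrary choice.

import re

# Pattern from above: one prefix letter (S/T/F/G), seven digits, one checksum letter.
NRIC_PATTERN = re.compile(r"[sStTfFgG]\d{7}[a-zA-Z]")

def mask_nric(text: str) -> str:
    """Replace anything that looks like a Singapore NRIC with a fixed token."""
    return NRIC_PATTERN.sub("[MASKED_ID]", text)

print(mask_nric("Feedback from S1234567D: staff was helpful"))
# -> "Feedback from [MASKED_ID]: staff was helpful"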

Common Field

Common fields are columns that establish relationships between tables. In method one, data is generated using a bottom-up approach, so the values generated for one table may fail to match any value in a related table, breaking joins between them.

A workaround to this issue is to build a common field table, where a fixed set of values is generated beforehand. During synthetic data generation, refer to the common field table and retrieve the values if the field is present; otherwise, generate from scratch.
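A minimal sketch of this lookup-first logic, using a hypothetical in-memory common field table:

import random
import string

# Hypothetical common field table, generated once and shared across tables.
common_fields = {
    "customer_id": ["C001", "C002", "C003"],
}

def get_or_generate(field: str, length: int = 6) -> str:
    """Reuse a value from the common field table if the field exists there,
    otherwise fall back to generating a value from scratch."""
    if field in common_fields:
        return random.choice(common_fields[field])
    return "".join(random.choices(string.ascii_uppercase + string.digits, k=length))

print(get_or_generate("customer_id"))  # drawn from the shared table
print(get_or_generate("order_ref"))    # generated from scratch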


Time Constraints

When dealing with tables with dates, consider improving the usability of your synthetic data by respecting restrictions in the source data. One such example is SAP time constraints: the generated data should not contain cases where the end time comes before the start time, or where an entity has two records valid at the same time.
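One way to enforce the first constraint is to derive the end date from the generated start date, as in this minimal sketch (overlap checks between records of the same entity would be layered on top):

import random
from datetime import date, timedelta

def generate_validity_period(earliest: date, latest: date) -> tuple[date, date]:
    """Generate a (start, end) pair where the end never precedes the start."""
    span = (latest - earliest).days
    start = earliest + timedelta(days=random.randint(0, span))
    end = start + timedelta(days=random.randint(0, (latest - start).days))
    return start, end

start, end = generate_validity_period(date(2022, 1, 1), date(2023, 12, 31))
assert start <= end
print(start, end)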


Data Format

As a data engineer, you may be working with more than one platform. While generating synthetic data, take into consideration the behaviour of the different platforms. Some of the considerations are:

  1. Understand how multiple programming languages handle data (e.g. Python, MSSQL, MySQL)
  2. Ensure your program can parse different inputs (e.g. xlsx, csv, tsv)
  3. Maintain the original data format. For instance, null, NaN, and empty strings should remain as they are and not be interchanged. If the input is a float, the output should be a float with the same precision and scale, not converted into an integer
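For point 3, reading everything as strings is one simple way to keep null markers and numeric precision intact; this sketch uses pandas with a small inline CSV:

import pandas as pd
from io import StringIO

raw = "id,score,comment\n1,12.50,\n2,7.25,null\n"

# dtype=str plus keep_default_na=False keeps "", "null" and real values distinct
# instead of collapsing them all into NaN, and leaves numeric precision untouched.
df = pd.read_csv(StringIO(raw), dtype=str, keep_default_na=False)
print(df["comment"].tolist())   # ['', 'null'] preserved as-is
print(df["score"].tolist())     # ['12.50', '7.25'] precision not altered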

Testing Your Synthetic Data

It is highly recommended to check the generated synthetic data before handing it over to the users. A few ways to check them are:

  1. Manually opening each synthetic dataset and eyeballing the results
  2. Generating a report of all unique values in all columns. This speeds up the eyeballing process
  3. Writing test cases that assert data values and formats, or that compare the sets of values between the original and synthetic data
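The third approach can be automated with a few assertions, sketched here with pandas; the columns and data are placeholders:

import pandas as pd

def check_synthetic(original: pd.DataFrame, synthetic: pd.DataFrame) -> None:
    """Assert that the synthetic data matches the original's format and value space."""
    # Same columns in the same order.
    assert list(original.columns) == list(synthetic.columns)
    # Same dtypes per column.
    assert (original.dtypes == synthetic.dtypes).all()
    # Categorical values stay within the set of values seen in the original.
    for col in original.select_dtypes(include="object"):
        assert set(synthetic[col].dropna()) <= set(original[col].dropna()), col

original = pd.DataFrame({"month": ["Jan", "Feb", "Mar"], "sales": [10, 20, 30]})
synthetic = pd.DataFrame({"month": ["Feb", "Jan", "Feb"], "sales": [15, 25, 12]})
check_synthetic(original, synthetic)
print("All checks passed")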

Reusability and Scalability

Especially while working in a small team, it is important to make your synthetic data generator reusable and scalable. Below are a few tips to enable this:

  1. Build a wrapper for your program
  2. Document your code to facilitate debugging
  3. Encapsulate your code so that it is easier to build new features
  4. Handle exceptions, especially as the number of features grows, so that a broken input does not affect the rest of the inputs
  5. Generate logs to capture bugs
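A minimal sketch of points 4 and 5, where each table is generated independently, failures are logged, and one broken input does not stop the rest (generate_table is a hypothetical stand-in for the real generator):

import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("synthetic_generator")

def generate_table(config: dict) -> list:
    """Placeholder per-table generator; the real generation logic would live here."""
    if "columns" not in config:
        raise ValueError("missing column definitions")
    return [{col: None for col in config["columns"]}]

def generate_all(tables: dict) -> dict:
    """Generate each table independently, logging failures without aborting the batch."""
    results = {}
    for name, config in tables.items():
        try:
            results[name] = generate_table(config)
            logger.info("Generated synthetic data for %s", name)
        except Exception:
            logger.exception("Failed to generate %s, skipping", name)
    return results

print(generate_all({"orders": {"columns": ["id", "amount"]}, "broken": {}}))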

Statistics of Synthetic Data

Your use case determines the distribution used for the subset. To conceal the population statistics, use a uniform distribution. To preserve the population statistics, match the original distribution; for instance, if the data follows a bell curve, use a normal distribution with the original mean.
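As an illustration with NumPy, a uniform draw over the observed range hides the shape of the source distribution, while a normal draw matched to the observed mean and spread roughly preserves it:

import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical numeric column from the original data.
original = rng.normal(loc=50, scale=10, size=1_000)

# Conceal the population statistics: uniform over the observed range.
concealed = rng.uniform(original.min(), original.max(), size=1_000)

# Preserve the population statistics: normal with the observed mean and spread.
preserved = rng.normal(original.mean(), original.std(), size=1_000)

print(round(concealed.mean(), 1), round(preserved.mean(), 1))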

Quality of Synthetic Data

Garbage in, garbage out (GIGO) is the concept where flawed input results in poor quality output. Ensure that your input data is clean and readable before generating synthetic data.

Conclusion

We looked at two methods to generate synthetic data for data engineering, along with several considerations when creating your algorithm.

Domain knowledge is imperative. You have to first understand how your data is collected, stored, and used, before creating your algorithm. As important as it is for your program to be fast and data to be usable, you must pay special attention to data privacy. After all, you are aiming to generate synthetic data and avoid revealing real-world data.

When used correctly, synthetic data is a valuable tool in your arsenal that can be used to improve data privacy, lower cost, and scale your data for multiple projects.

In the next article, we will explain ways to generate synthetic data while correlating attributes within a dataset. Stay tuned!
