Generating Dummy Data with Python: A Practical Guide

Rahmadiyan
4 min read · Jun 27, 2023


Introduction

Data-driven applications often need sample data for testing, analysis, or demonstration, but creating large amounts of it by hand is slow and tedious. This article walks through a Python script that generates dummy data with user-defined columns and data types (integer, date, or string) and saves it to a CSV file, so you can produce synthetic data for your projects quickly. Let's dive in.

Code Explanation

The provided code is a Python script that generates dummy data based on user-defined parameters. Let’s go through its main components and how they work.

Importing Required Modules

import csv
import random
import datetime
import os

The script starts by importing the necessary modules:

  • csv: Enables reading and writing CSV files.
  • random: Provides functions for generating random values.
  • datetime: Allows working with dates and times.
  • os: Provides functions for interacting with the operating system.

Defining the generate_dummy_data() Function

def generate_dummy_data():
    try:
        # Code for user input and data generation goes here
        # ...
        # ...
        print(f"Dummy data generated and saved to {file_name}.")
    except ValueError as e:
        print("Invalid input:", str(e))
        generate_dummy_data()

This function serves as the entry point for generating dummy data. It encompasses the entire data generation process and guides the user through inputting relevant parameters.
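The outline above appears to recover from a ValueError by calling generate_dummy_data() again. As a minimal sketch of an alternative, the same retry behaviour can be written as a loop, which avoids deepening the call stack each time input is rejected. The helper name run_until_valid, the max_attempts parameter, and the stand-in flaky_task are all hypothetical and not part of the original script.

```python
def run_until_valid(task, max_attempts=3):
    """Call task() until it succeeds or max_attempts is reached.

    Hypothetical helper for illustration; not part of the original script.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except ValueError as e:
            print(f"Invalid input (attempt {attempt} of {max_attempts}): {e}")
    raise ValueError(f"Giving up after {max_attempts} failed attempts.")


# Demo: a stand-in task that fails twice before succeeding.
attempts = []


def flaky_task():
    attempts.append(1)
    if len(attempts) < 3:
        raise ValueError("simulated bad input")
    return "dummy_data.csv"


result = run_until_valid(flaky_task)
print(result)
```

In the real script, the body of generate_dummy_data() (without its own try/except) would take the place of flaky_task.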

Getting User Input for Column Details

num_columns = get_valid_integer_input("Enter the number of columns (index column provided automatically): ")
num_data = get_valid_integer_input("Enter the number of data to generate: ")

# The index column is always present; user-defined columns are appended to it.
column_names = ['No.']
column_data_types = ['index']
column_ranges = {}

for _ in range(num_columns):
    name = get_valid_input("Enter column name: ", "Column name cannot be blank. Please enter a valid name.")
    column_names.append(name)

    data_type = get_valid_data_type_input(f"Enter data type for column '{name}' (1. integer, 2. date, or 3. string): ")
    column_data_types.append(data_type)

    # Each constraint is optional: a blank answer leaves the column unconstrained.
    if data_type == "date" or data_type == '2':
        date_range_input = input(f"Enter date range for column '{name}' (DD/MM/YYYY to DD/MM/YYYY): ")
        if date_range_input:
            start_date, end_date = map(str.strip, date_range_input.split("to"))
            column_ranges[name] = (datetime.datetime.strptime(start_date, "%d/%m/%Y").date(),
                                   datetime.datetime.strptime(end_date, "%d/%m/%Y").date())
    elif data_type == "string" or data_type == '3':
        file_path = input(f"Enter .csv file path for column '{name}' data content (leave blank if there is none): ")
        if file_path:
            # newline="" is the mode the csv module recommends for reading.
            with open(file_path, "r", newline="") as file:
                reader = csv.reader(file)
                # Use the first field of every non-empty row as a candidate value.
                column_ranges[name] = [row[0] for row in reader if row]
    elif data_type == "integer" or data_type == '1':
        number_range_input = input(f"Enter number range for column '{name}' (start to end): ")
        if number_range_input:
            start_num, end_num = map(int, number_range_input.split("to"))
            column_ranges[name] = (start_num, end_num)

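The script prompts the user for the number of columns and the number of records to generate. It then gathers the details of each column: its name, its data type, and an optional constraint. The constraint is a date range for date columns, a CSV file of candidate values for string columns, or a number range for integer columns. Leaving the constraint blank makes the script fall back to a default during generation.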
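The range parsing is easier to test when it lives in a small function of its own. The following is a minimal sketch, assuming a hypothetical helper named parse_date_range that is not part of the original script. It uses the same "DD/MM/YYYY to DD/MM/YYYY" format and additionally rejects ranges whose start falls after their end.

```python
import datetime


def parse_date_range(text, fmt="%d/%m/%Y"):
    """Parse 'DD/MM/YYYY to DD/MM/YYYY' into a (start, end) pair of dates.

    Hypothetical helper for illustration; not part of the original script.
    """
    parts = [part.strip() for part in text.split("to")]
    if len(parts) != 2:
        raise ValueError("Expected two dates separated by 'to'.")
    start, end = (datetime.datetime.strptime(p, fmt).date() for p in parts)
    if start > end:
        raise ValueError("Start date must not be after end date.")
    return start, end


date_range = parse_date_range("01/01/2023 to 31/12/2023")
print(date_range)
```

Because every failure raises ValueError, the function fits the error handling that generate_dummy_data() already uses.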

Generating the Data

data = []
for i in range(num_data):
    row = []
    for j in range(len(column_names)):
        data_type = column_data_types[j]
        name = column_names[j]
        if data_type == "index":
            row.append(str(i + 1))
        elif data_type == "date":
            start_date, end_date = column_ranges.get(name, (datetime.date.min, datetime.date.max))
            random_date = random_date_in_range(start_date, end_date)
            row.append(random_date.strftime("%d/%m/%Y"))
        elif data_type == "string":
            values = column_ranges.get(name, [])
            if values:
                row.append(random.choice(values))
            else:
                row.append(generate_random_word())
        elif data_type == "integer":
            start_num, end_num = column_ranges.get(name, (0, 100))
            row.append(str(random.randint(start_num, end_num)))
    data.append(row)

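Once the user provides the column details, the script generates the dummy data. It iterates over the requested number of records and fills each row with one value per column, based on that column's data type and constraint. When no constraint was given, dates are drawn from the full range supported by datetime.date, integers from 0 to 100, and strings are random eight-letter words.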
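The same generation logic can be tried without any prompts. The sketch below is a non-interactive variant under stated assumptions: the COLUMNS specification, the build_rows function, and the sample values are hypothetical and not part of the original script. It also uses a seeded random.Random instance so that repeated runs produce the same rows, which the original script does not do.

```python
import datetime
import random

# Hypothetical column specification: (name, data type, constraint).
COLUMNS = [
    ("age", "integer", (18, 65)),
    ("joined", "date", (datetime.date(2023, 1, 1), datetime.date(2023, 12, 31))),
    ("city", "string", ["Jakarta", "Bandung", "Surabaya"]),
]


def build_rows(columns, num_rows, seed=None):
    """Return num_rows rows of strings, each starting with a 1-based index."""
    rng = random.Random(seed)  # a private generator keeps results repeatable
    rows = []
    for i in range(num_rows):
        row = [str(i + 1)]
        for _name, data_type, constraint in columns:
            if data_type == "integer":
                low, high = constraint
                row.append(str(rng.randint(low, high)))
            elif data_type == "date":
                start, end = constraint
                offset = rng.randint(0, (end - start).days)
                day = start + datetime.timedelta(days=offset)
                row.append(day.strftime("%d/%m/%Y"))
            elif data_type == "string":
                row.append(rng.choice(constraint))
            else:
                raise ValueError(f"Unknown data type: {data_type}")
        rows.append(row)
    return rows


rows = build_rows(COLUMNS, num_rows=5, seed=42)
for row in rows:
    print(row)
```

Passing the same seed again returns identical rows, which is convenient when the dummy data feeds automated tests.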

Saving Data to a CSV File

file_name = "dummy_data.csv"
if os.path.exists(file_name):
    os.remove(file_name)
with open(file_name, "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerow(column_names)
    writer.writerows(data)

The generated data is saved to a CSV file named “dummy_data.csv” using a writer object created by csv.writer. Any existing file with that name is deleted first. The column names are written as the first row, followed by the generated data rows.
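To see what csv.writer produces without touching the disk, the header and rows can be written to an in-memory buffer and read back. This is a minimal sketch with made-up sample rows; the buffer stands in for the file that the script opens.

```python
import csv
import io

# Made-up sample data in the same shape the script produces.
column_names = ["No.", "age", "city"]
data = [["1", "34", "Jakarta"], ["2", "51", "Bandung, West Java"]]

# newline="" disables newline translation, as in the script's open() call.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(column_names)
writer.writerows(data)

# Rewind and parse the text again to confirm that it round-trips.
buffer.seek(0)
rows_read = list(csv.reader(buffer))
print(rows_read[0])
print(len(rows_read) - 1, "data rows")
```

The second sample row contains a comma inside a value. csv.writer quotes that field, so csv.reader returns it as a single value.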

Helper Functions

def get_valid_input(prompt, error_message):
    value = input(prompt)
    while not value:
        print(error_message)
        value = input(prompt)
    return value


def get_valid_integer_input(prompt):
    while True:
        value = input(prompt)
        try:
            value = int(value)
            if value <= 0:
                print("Number must be greater than 0.")
            else:
                return value
        except ValueError:
            print("Invalid input. Please enter a valid integer.")


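def get_valid_data_type_input(prompt):
    # Map the numeric menu choices to the type names that the generation
    # loop checks for, so a column entered as "2" is still filled with dates.
    aliases = {"1": "integer", "2": "date", "3": "string"}
    valid_data_types = ["integer", "date", "string"]
    while True:
        value = input(prompt).strip().lower()
        value = aliases.get(value, value)
        if value not in valid_data_types:
            print("Invalid input. Please enter a valid data type (integer, date, or string).")
        else:
            return value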


def random_date_in_range(start_date, end_date):
    delta = end_date - start_date
    random_days = random.randint(0, delta.days)
    return start_date + datetime.timedelta(days=random_days)


def generate_random_word(length=8):
    letters = "abcdefghijklmnopqrstuvwxyz"
    return "".join(random.choice(letters) for _ in range(length))

The script includes several helper functions that support the data generation process. Three of them validate user input (non-blank text, positive integers, and data type names), one picks a random date within a given range, and one builds a random lowercase word.
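Helpers that call input() can still be exercised without typing anything. The sketch below is one possible approach, not something the original script does: it uses unittest.mock.patch from the standard library to replace the built-in input with a scripted sequence of answers. The helper is copied from above so that the example runs on its own.

```python
from unittest.mock import patch


def get_valid_integer_input(prompt):
    # Copied from the helper above so that this sketch is self-contained.
    while True:
        value = input(prompt)
        try:
            value = int(value)
            if value <= 0:
                print("Number must be greater than 0.")
            else:
                return value
        except ValueError:
            print("Invalid input. Please enter a valid integer.")


# Simulate a user typing "abc", then "-3", then "7".
with patch("builtins.input", side_effect=["abc", "-3", "7"]):
    result = get_valid_integer_input("Enter the number of columns: ")

print(result)
```

The first two answers are rejected with an error message each, and the third one is returned as an integer.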

Conclusion

Dummy data is useful for many data-related tasks. With the script in this article, you can automate the generation of synthetic data tailored to your requirements: follow the code explanation, adapt it to your needs, and generate large volumes of dummy data for testing, analysis, or demonstration.

Remember to handle sensitive data responsibly. If you fill string columns from CSV files that contain real values, the generated file contains those values too.

Happy data generating!
