Super Simple Guide to Generating Datasets for Data Analysis and Experimentation
This is less a tutorial than a gist I can copy and paste from when I need to quickly generate a random dataset with a mix of data types for data science explorations.
We only need NumPy and pandas for this (NumPy alone is more than enough, but pandas makes the generated data easier to manage).
We first state how many samples (rows) we want and which features (columns) the dataset should have.
import numpy as np
import pandas as pd

num_rows = 500  # Number of rows/samples to generate
# Dataset headers
column_headers = ['No', 'Gender', 'Height', 'Weight', 'Shoe Size',
                  'Shopping Satisfaction Offline', 'Shopping Satisfaction Online',
                  'Average Spent Per Month']
In this case, we generate a dataset that simulates a set of survey returns on shopping preferences and some key personal data.
Let’s generate categorical data first. NumPy’s random.choice function is great for this. It draws samples from a list of choices, with a predefined probability for each choice. For example, with three possibilities (Male, Female and LGBT), I can specify that each sample has a 40% chance of being Male, 40% Female and 20% LGBT. The probabilities must sum to 1, and the proportions in the generated data will only be approximately 40/40/20.
gender_list = ['Male', 'Female', 'LGBT']
gender = np.random.choice(gender_list, num_rows, p=[0.4, 0.4, 0.2])
satisfaction_list = ['High', 'Medium', 'Low']
satisfaction_offline = np.random.choice(satisfaction_list, num_rows, p=[0.4, 0.4, 0.2])
satisfaction_online = np.random.choice(satisfaction_list, num_rows, p=[0.3, 0.4, 0.3])
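A quick way to check that the sampled categories roughly follow the requested probabilities is to count them. The sketch below is only an illustration: it uses NumPy's newer Generator interface (np.random.default_rng) instead of the np.random.choice call above, and the seed value 42 is an arbitrary assumption added so the output is repeatable.

```python
import numpy as np
import pandas as pd

num_rows = 500
rng = np.random.default_rng(seed=42)  # arbitrary seed, only for repeatability

gender_list = ['Male', 'Female', 'LGBT']
gender_probs = [0.4, 0.4, 0.2]  # must sum to 1
gender = rng.choice(gender_list, size=num_rows, p=gender_probs)

# Observed share of each category; these should be close to gender_probs,
# but not exactly equal, because each row is drawn independently.
observed = pd.Series(gender).value_counts(normalize=True)
print(observed)
```

With only 500 rows, expect the observed shares to drift a few percentage points from the targets.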
To generate integers or floats, we can use the randint or uniform functions (note that randint excludes the upper bound). Functions that draw from other predetermined distributions, such as normal, are also available.
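As a rough sketch of what the numeric columns could look like: the variable names below match the ones used when the DataFrame is assembled later, but every range, unit and distribution parameter is an assumption made up for illustration, and treating 'No' as a running response number is also a guess.

```python
import numpy as np

num_rows = 500

# All ranges and distribution parameters below are illustrative assumptions.
no_answers = np.arange(1, num_rows + 1)                    # running number 1..num_rows (assumed meaning of 'No')
height = np.random.uniform(150, 200, num_rows).round(1)    # floats between 150 and 200, e.g. cm
weight = np.random.normal(70, 12, num_rows).round(1)       # normal distribution: mean 70, std 12, e.g. kg
shoe_size = np.random.randint(36, 47, num_rows)            # integers 36..46 (upper bound 47 is excluded)
monthly_spending = np.random.uniform(50, 1000, num_rows).round(2)  # floats between 50 and 1000
```

A normal distribution is unbounded, so if hard limits matter for a column like weight, something like np.clip can be applied afterwards.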
Finally, we can combine all of these into a pandas DataFrame and label the columns. Each array should end up as a column, but it’s easier to pass them in as rows first and then transpose the result before labelling the columns.
responses = pd.DataFrame([no_answers, gender, height, weight, shoe_size,
                          satisfaction_offline, satisfaction_online,
                          monthly_spending])
responses = responses.transpose()
responses.columns = column_headers
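One thing to be aware of with the rows-then-transpose approach is that mixing strings and numbers in the same rows leaves every column with the generic object dtype. A possible alternative, sketched below with a few stand-in columns (the values and ranges are illustrative assumptions, not the ones from the notebook), is to build the DataFrame from a dict so that each column keeps its own dtype.

```python
import numpy as np
import pandas as pd

num_rows = 4  # kept small so the printed output is easy to read

# Stand-in columns; the ranges are illustrative assumptions
gender = np.random.choice(['Male', 'Female', 'LGBT'], num_rows, p=[0.4, 0.4, 0.2])
height = np.random.uniform(150, 200, num_rows)
shoe_size = np.random.randint(36, 47, num_rows)

# Rows first, then transpose: all columns end up as object dtype
via_transpose = pd.DataFrame([gender, height, shoe_size]).transpose()
via_transpose.columns = ['Gender', 'Height', 'Shoe Size']

# Dict of column name -> array: numeric columns keep their numeric dtype
via_dict = pd.DataFrame({'Gender': gender, 'Height': height, 'Shoe Size': shoe_size})

print(via_transpose.dtypes)
print(via_dict.dtypes)
```

If the transposed version is kept, numeric columns can be converted afterwards, for example with pd.to_numeric on each column that should hold numbers.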
The notebook on this can be found here.