Big Fun with BigQuery: How to Generate Dummy Data in a Snap!

Akash Gupta
Plumbers Of Data Science
3 min readOct 11, 2020

When working on presentations or proof of cases, one generally runs into the issue of finding suitable dummy data to start with. Although Bigquery has been excellent with its public dataset to an extent, seldom does it satisfy our needs.

In the recent past, many data libraries like faker.js have gained massive outreach but they appear to slow down between validating the generated data in Bigquery with the modified code to generate it.

One quick way to generate tons of dummy data in Bigquery without going off the Bigquery console has been talked about below.

Populating single row

Before populating the records, ensure you have uploaded the faker.js into your Google Cloud Storage bucket.

In the below example, we are using a temp UDF with return type as a struct. Along with it, you can see wildcard - ‘*’ too has been used which converts the function’s return struct type into a normal table record.

Note: The below UDF can be easily modified to return any data within the range of the faker.js library.

CREATE TEMP FUNCTION dummy()
RETURNS STRUCT<name String, email String, comp_name String>
LANGUAGE js
AS """
var functioncall = getFaker()
var dummyvalue = {};
dummyvalue.name = functioncall.name.findName();
dummyvalue.email = functioncall.internet.email();
dummyvalue.comp_name = functioncall.company.companyName()
return dummyvalue;
"""
OPTIONS (
library=["<bucket_name>/faker.js"]
);
SELECT dummy().*;

Populating n number of records

For populating n rows, the select statement has to be linked with a table.

Note: The record count of the target table will match the number of records in the select source table.

One can utilize the public dataset tables as the base of any use case data generation.

Bonus: Since no columns are selected from the select table, it will not scan the table. Hence it will be cost-free.

CREATE TEMP FUNCTION
dummy()
RETURNS STRUCT<name String,
email String,
comp_name String>
LANGUAGE js AS """
var functioncall = getFaker()
var dummyvalue = {};
dummyvalue.name = functioncall.name.findName();
dummyvalue.email = functioncall.internet.email();
dummyvalue.comp_name = functioncall.company.companyName()
return dummyvalue;
""" OPTIONS ( library=["<bucket_name>/faker.js"] );
CREATE OR REPLACE TABLE
`project`.dataset.dummy_table AS (
SELECT dummy().* FROM `bigquery-public-data.wise_all_sky_data_release.mep_wise` LIMIT 1000 );
Preview of dummy_table

In order to generate consistent results, one can provide any numeric value and generate consistently by selecting the same rows.

Note: the wise_all_sky_data_release tables are very big and will run up heavy charges once the columns are selected.

--

--

Akash Gupta
Plumbers Of Data Science

Data Engineering with a Sense of Humor: ओ bug कल आना!