Yet another way to generate fake datasets in BigQuery

Wrapping faker.js with a JavaScript UDF

Patrick Dunn
Google Cloud - Community
3 min read · Aug 7, 2020


(Update 11/20/2020: Examples now use a generated sequence table instead of using the wise_all_sky public dataset as a base table)

When creating demos and proofs of concept, I frequently run into the problem of finding suitable sample data to get started. BigQuery’s public datasets are a great resource, but rarely have everything that’s needed to build a proper environment.

Creating initial sample data has become a necessary and tedious process for many projects. Consequently, fake data libraries like faker.js, or this one for Python, have become popular in recent years. When using those libraries, I find myself slowed down by the context switching between validating the generated data in BigQuery and tweaking the code I am using to import the data.

Here’s a quick tutorial on using a JavaScript UDF with the faker.js library to generate billions of lines of fake data, in minutes, without leaving the BigQuery Console.

Generating the first record

The example below maximizes flexibility by using a temporary UDF that returns a struct type. The wildcard (*) at the end of the SELECT statement converts the function’s returned struct into ordinary table columns. The temporary UDF can be quickly redefined to return any data within the scope of the faker.js library.
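A minimal sketch of what that UDF and query can look like (the struct fields, the gs:// path to the faker bundle, and the specific faker calls are illustrative assumptions; entity() is the function name used later in the post):

```sql
-- Temporary JavaScript UDF wrapping faker.js. The GCS path below is a
-- placeholder for wherever you host the packaged faker bundle.
CREATE TEMP FUNCTION entity(seed FLOAT64)
RETURNS STRUCT<name STRING, email STRING, phone STRING, city STRING>
LANGUAGE js
OPTIONS (library = ['gs://your-bucket/faker.bundle.js'])
AS r"""
  faker.seed(seed);                       // seed faker for reproducible output
  return {
    name:  faker.name.findName(),
    email: faker.internet.email(),
    phone: faker.phone.phoneNumber(),
    city:  faker.address.city()
  };
""";

-- The trailing .* expands the returned struct into ordinary columns.
SELECT entity(1).*;
```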

Single record output

BigQuery compatibility with large JavaScript libraries can be finicky; a manually packaged webpack output for the UDF’s library can be found on GitHub.

Generating an arbitrary number of unique rows

To create many rows, the SQL SELECT statement needs a table to select from; the result set will contain one generated row per row in the selected table.

To support this, you are going to generate a table containing an incremental sequence of values that can act as a random seed for faker to produce unique values.

Creating and using tables with billions of records can run up noticeable charges. Check out my earlier post on using Flex Slots to substantially reduce your bill when working with large tables.
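One way to build that sequence without any base table is to cross join GENERATE_ARRAY results and derive the row number arithmetically. A sketch, assuming dataset.row_seq as the target table (the name referenced below); the 1,000 × 1,000 × 1,000 split is an illustrative choice:

```sql
-- Builds a 1-billion-row sequence table (1,000 x 1,000 x 1,000) with no
-- existing base table. Deriving row_num arithmetically avoids running a
-- single ROW_NUMBER() window over a billion rows.
CREATE OR REPLACE TABLE dataset.row_seq AS
SELECT a * 1000000 + b * 1000 + c + 1 AS row_num
FROM UNNEST(GENERATE_ARRAY(0, 999)) AS a,
     UNNEST(GENERATE_ARRAY(0, 999)) AS b,
     UNNEST(GENERATE_ARRAY(0, 999)) AS c;
```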

Generate 1 billion sequentially numbered rows in ~2 minutes

The row_seq table can optionally be seeded with tens of billions of records, enough for any use case. Additionally, because the row number seeds each entity, the generated results are both reproducible and unique.
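A sketch of that step, assuming the entity() UDF from the first snippet is defined in the same multi-statement script (alternatively, run the SELECT with a destination table configured); the 100-million cutoff simply mirrors the caption below:

```sql
-- Materialize 100 million fake entities, each seeded by its row number so the
-- output is unique and can be regenerated on demand.
CREATE OR REPLACE TABLE dataset.fake_accounts AS
SELECT row_num, entity(row_num).*
FROM dataset.row_seq
WHERE row_num <= 100000000;
```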

Generate 100 million entities in ~4 minutes

A preview of the dataset.fake_accounts table shows each entity paired with the seed that generated it. Any row can be reproduced by re-invoking the entity() function; for example, SELECT entity(30048188).* will reproduce Tiara Zulauf.
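For reference, a couple of illustrative queries (the second assumes entity() is defined in the same script, as in the first snippet):

```sql
-- Preview the generated table: each entity sits next to the row_num seed
-- that produced it.
SELECT *
FROM dataset.fake_accounts
ORDER BY row_num
LIMIT 10;

-- Regenerate any single entity on demand from its seed.
SELECT entity(30048188).*;
```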

Fake entities reproducible by row_num

Additional Resources

Here are some additional resources that may help you.
