Generating realistic-looking fake data in Snowflake

James Weakley
Oct 7, 2020 · 2 min read

At Omnata, we love to watch our prospects’ amazement as we demo using Salesforce to navigate through, and automate on, huge quantities of Snowflake data in real time, linked to individual records in their CRM. If you’re using both Snowflake and Salesforce and want to combine their strengths, you owe it to yourself to check us out!

Back to the topic though: it’s always nice to make sample data look familiar and relatable. However, generating 100 million fake things (names, addresses, whatever) and importing them into Snowflake can be time-consuming and error-prone.

Enter Flaker!

Flaker is a Snowflake External Functions wrapper for the popular Faker Python library. It lets you generate a vast array of fake data from right within Snowflake, in large quantities, quickly and easily.
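To make the idea concrete, here is a minimal sketch of what a Lambda handler behind a Snowflake external function can look like. It is not Flaker's actual code. It only follows Snowflake's documented request and response shape, where each input row arrives as [row_number, arg1, arg2, ...] and each output row echoes the row number next to a single value. The names make_handler and stub_generate, and the (locale, provider) argument order, are hypothetical and chosen for illustration. The generator is a standard-library stand-in so the sketch runs without Faker installed.

```python
import json
import random
from typing import Callable


def stub_generate(locale: str, provider: str) -> str:
    """Stand-in generator (hypothetical) so the sketch needs only the stdlib.

    With Faker installed, a real generator could be something like:
        from faker import Faker
        lambda locale, provider: getattr(Faker(locale), provider)()
    (creating one Faker instance per locale and reusing it would be cheaper).
    """
    return f"{provider}-{locale}-{random.randint(0, 9999)}"


def make_handler(generate: Callable[[str, str], str]):
    """Build a Lambda-style handler around any (locale, provider) generator."""

    def handler(event, context=None):
        # Snowflake sends a JSON body: {"data": [[row_number, arg1, arg2], ...]}
        rows = json.loads(event["body"])["data"]
        # Each output row repeats the row number, followed by one return value
        out = [[row[0], generate(row[1], row[2])] for row in rows]
        return {"statusCode": 200, "body": json.dumps({"data": out})}

    return handler


handler = make_handler(stub_generate)

if __name__ == "__main__":
    # A two-row batch, shaped the way Snowflake would send it
    event = {"body": json.dumps({"data": [[0, "en_US", "name"], [1, "ja_JP", "address"]]})}
    print(handler(event))
```

Passing the generator in as a parameter keeps the batch handling separate from the data generation, so the same handler shape works whether the values come from Faker or from a test stub.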

What kinds of data can it generate?

The full list of standard Faker providers is here: https://faker.readthedocs.io/en/master/providers.html

Examples

500 fake names, in the US English locale:

[Image: 500 fake names in the US English locale]

10 million addresses, in the Japanese locale:

[Image: 10 million addresses in the Japanese locale]

Getting Started

Instructions to deploy the external functions can be found in the GitHub repo: https://github.com/jamesweakley/flaker

Performance / cost

To be honest, I haven’t tried to optimize it much yet because it’s already good enough for us. You can adjust the Lambda memory size in the serverless.yml file, and you can adjust the batch size in the Snowflake function definition.

Here’s a screenshot of the profile for the 10M Japanese address generation above, running on an XS warehouse with a Lambda memory size of 1024 MB and the Snowflake function’s MAX_BATCH_ROWS set to 100000:

[Image: query profile screenshot]

So it sent 2442 requests to API Gateway, with an overall execution time of 5 minutes 20 seconds.
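A quick back-of-the-envelope check on those figures, using only numbers from this post, shows how large the batches actually were compared with the MAX_BATCH_ROWS setting. The variable names are illustrative.

```python
# Figures quoted in the post
total_rows = 10_000_000
requests_sent = 2442
max_batch_rows = 100_000

# Average number of rows Snowflake packed into each request
rows_per_request = total_rows / requests_sent

# Fewest requests possible if every batch had reached MAX_BATCH_ROWS
min_requests_at_cap = -(-total_rows // max_batch_rows)  # ceiling division

print(f"average rows per request: {rows_per_request:.0f}")
print(f"requests if every batch hit the cap: {min_requests_at_cap}")
```

The average batch works out to roughly 4,095 rows, well under the 100000 cap, which suggests MAX_BATCH_ROWS acted as an upper bound rather than the batch size actually used in this run.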

The Lambda cost for this was about 4 cents (USD), but don’t forget there’s a free usage tier of 1M requests and 400,000 GB-seconds per month. So you’d have to work pretty hard to get to the point where you’re paying for the AWS side.
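To put the free tier in perspective, the sketch below works backwards from the roughly 4 cent figure to an estimate of the GB-seconds consumed. The two prices are assumptions, not values from this post: they are the commonly published x86 Lambda rates ($0.20 per million requests and $0.0000166667 per GB-second), so check current pricing for your region before relying on the result. Because the 4 cent figure is itself approximate, treat the output as an order-of-magnitude estimate.

```python
# Figures quoted in the post
requests_sent = 2442
approx_cost_usd = 0.04
free_requests = 1_000_000
free_gb_seconds = 400_000

# ASSUMED Lambda pricing (not from the post; verify for your region)
price_per_request = 0.20 / 1_000_000
price_per_gb_second = 0.0000166667

# Split the quoted cost into its request and compute parts
request_cost = requests_sent * price_per_request
gb_seconds = (approx_cost_usd - request_cost) / price_per_gb_second

# How many runs of this size fit into each free-tier allowance per month
runs_by_compute = int(free_gb_seconds // gb_seconds)
runs_by_requests = free_requests // requests_sent

print(f"estimated GB-seconds used: {gb_seconds:.0f}")
print(f"free-tier runs per month: {min(runs_by_compute, runs_by_requests)}")
```

Under these assumptions the compute allowance, not the request allowance, is the one that would run out first.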

Next Steps
