At Omnata, we love to watch our prospects’ amazement as we demo using Salesforce to navigate and automate on huge quantities of Snowflake data in real time, linked to individual records in their CRM. If you’re using both Snowflake and Salesforce and want to combine their strengths, you owe it to yourself to check us out!
Back to the topic though: it’s always nice to make the sample data look familiar and relatable. However, generating 100 million fake things (names, addresses, whatever) and importing them into Snowflake can be time-consuming and error-prone.
Flaker is a Snowflake External Functions wrapper for the popular Faker Python library. It lets you generate a vast array of fake data from right within Snowflake, in large quantities, quickly and easily.
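To give a feel for what a wrapper like this involves, here is a minimal sketch of a Lambda handler that answers Snowflake's external function batches with Faker output. It is not Flaker's actual source: the names handler, fake_value and _faker_for, and the (locale, provider) argument order, are assumptions made for illustration. It also assumes an API Gateway proxy integration, where the batch arrives as a JSON string in the event body.

```python
import json
from functools import lru_cache


@lru_cache(maxsize=None)
def _faker_for(locale):
    """Build one Faker instance per locale and reuse it across rows."""
    # Third-party dependency, imported lazily so the batch logic below
    # can be exercised without Faker installed.
    from faker import Faker
    return Faker(locale)


def fake_value(locale, provider):
    """Call a Faker provider by name, e.g. ("en_US", "name")."""
    return str(getattr(_faker_for(locale), provider)())


def handler(event, context, generate=fake_value):
    """Hypothetical Lambda entry point (argument order is an assumption)."""
    # Snowflake sends {"data": [[row_number, arg1, arg2, ...], ...]}.
    rows = json.loads(event["body"])["data"]
    results = []
    for row_number, locale, provider, *_ in rows:
        results.append([row_number, generate(locale, provider)])
    # Snowflake expects one [row_number, value] pair back per input row.
    return {"statusCode": 200, "body": json.dumps({"data": results})}
```

The generate parameter is only there so the batch handling can be tried locally with a stub in place of Faker. The real deployment details are in the repo.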
What kinds of data can it generate?
The full list of standard Faker providers is here: https://faker.readthedocs.io/en/master/providers.html
500 fake names, in the US English locale:
10 million addresses, in the Japanese locale:
Instructions to deploy the external functions can be found in the GitHub repo: https://github.com/jamesweakley/flaker
Performance / cost
To be honest, I haven’t tried to optimize it much yet because it’s already good enough for us. You can adjust the Lambda memory size in the serverless.yml file, and you can adjust the batch size in the Snowflake function definition.
Here’s a screenshot of the profile for the above 10M Japanese address generation running on an XS warehouse, a Lambda memory size of 1024MB, and the Snowflake function with a MAX_BATCH_ROWS of 100000:
So it sent 2442 requests to API Gateway, with an overall execution time of 5 minutes 20 seconds.
The Lambda cost for this was about 4 US cents, but don’t forget there’s a free usage tier of 1M requests and 400,000 GB-seconds per month. So you’d have to work pretty hard to get to the point where you’re paying for the AWS side.
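For a rough sense of scale, the numbers quoted above can be turned into averages with a few lines of Python. This is back-of-the-envelope arithmetic on the figures in this post only. It covers the request quota of the free tier and ignores GB-seconds, since the total Lambda compute time isn't given here.

```python
ROWS = 10_000_000               # Japanese addresses generated
REQUESTS = 2442                 # requests sent to API Gateway
ELAPSED_SECONDS = 5 * 60 + 20   # 5 minutes 20 seconds
FREE_TIER_REQUESTS = 1_000_000  # monthly free request quota

rows_per_request = ROWS / REQUESTS
rows_per_second = ROWS / ELAPSED_SECONDS
runs_in_free_tier = FREE_TIER_REQUESTS // REQUESTS

print(f"{rows_per_request:.0f} rows per request on average")
print(f"{rows_per_second:.0f} rows per second")
print(f"{runs_in_free_tier} runs of this size within the free request quota")
```

That works out to roughly 4,095 rows per request and 31,250 rows per second, with room for about 409 runs of this size per month before requests alone exceed the free tier. The average batch is well under the MAX_BATCH_ROWS of 100000, which suggests that setting acts as an upper bound rather than a target.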