FrostyGen: a Streamlit Data Generator Tool

Use Case

A few weeks ago, while preparing a demo for a webinar, I suddenly found myself stuck — panic mode on. I needed a specific dataset, but I couldn't find anything tailored to my use case. Without it, I couldn't build a data engineering pipeline relevant to my target audience, and the dashboards on top would lack the wow effect.

So I decided to create the dataset myself by building a data generation tool with Streamlit. Over the following days I steadily added features, and in the end I couldn't resist naming my very first Streamlit app: FrostyGen.

I believe that building a data generator tool, whether simple or complex, is a milestone every data engineer encounters at some point in their career. While many such tools already exist, the temptation to build your own will always be there, because inevitably the one crucial feature you need will be missing from them.

I'll humbly share the GitHub link at the end of this post; please be kind when judging my coding style. An old friend once told me: "If it works, don't touch it. If it looks horribly formatted, don't read it either." Hopefully, it's not that bad.

FrostyGen lets you generate data and push it effortlessly to Snowflake stages or tables, accelerating data preparation and reducing the effort during the first stages of an MVP/PoC.

FrostyGen — Random Data Generator

Getting Started

Before diving into FrostyGen features, you need to connect to your Snowflake instance. This step is essential unless you’re planning to save the generated data locally on your machine (Export to File).
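Under the hood, a Streamlit app typically reads Snowflake credentials from `st.secrets` and passes them to `snowflake.connector.connect()`. Here is a minimal sketch of that wiring; the secret keys and helper name are illustrative, not FrostyGen's actual code.

```python
# Sketch (assumed key names): turn a Streamlit secrets mapping into the
# keyword arguments that snowflake.connector.connect() expects.

def connect_kwargs(creds: dict) -> dict:
    """Validate a credentials mapping (e.g. st.secrets['snowflake'])
    and build keyword arguments for the Snowflake Python connector."""
    required = ["account", "user", "password"]
    missing = [k for k in required if k not in creds]
    if missing:
        raise ValueError(f"Missing Snowflake credentials: {missing}")
    kwargs = {k: creds[k] for k in required}
    # Optional context parameters are passed through only when present.
    for opt in ("warehouse", "database", "schema", "role"):
        if opt in creds:
            kwargs[opt] = creds[opt]
    return kwargs

# Inside the app (requires snowflake-connector-python):
#   conn = snowflake.connector.connect(**connect_kwargs(st.secrets["snowflake"]))
```

Keeping the validation in a small pure function makes it easy to fail fast with a readable error before attempting the actual connection.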

Defining the Export Type

You have three options at your disposal:

  • Save Files Locally: Choose this option if you want to generate data and save it as CSV files on your local system.
  • Snowflake Stage: Opt for this if you wish to push the generated data to a Snowflake stage.
  • Snowflake Table: Select this to push the records directly into a new or existing table within your Snowflake instance.
FrostyGen — Export Dataset Options

There is no need to create Snowflake stages or tables ahead of time — FrostyGen does it for you, automatically.
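The auto-create step can be done idempotently with Snowflake's `IF NOT EXISTS` clause, so re-running an export never fails on an existing object. A hedged sketch of how such DDL could be built (FrostyGen's actual statements and column types may differ):

```python
# Illustrative DDL builders for the auto-create behaviour described above.

def create_stage_sql(stage_name: str) -> str:
    """Stage creation is a no-op if the stage already exists."""
    return f"CREATE STAGE IF NOT EXISTS {stage_name}"

def create_table_sql(table_name: str, fields: list) -> str:
    """Generated fields are exported as text, so VARCHAR columns
    are assumed here for simplicity."""
    columns = ", ".join(f"{f} VARCHAR" for f in fields)
    return f"CREATE TABLE IF NOT EXISTS {table_name} ({columns})"
```

Both statements would then be executed through the open connection's cursor before pushing any data.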

Defining Data Generation Parameters

In the main tab, you have a few options to tailor your data generation:

  • Specify the number of records you want to generate.
  • Define the number of fields per record (each of which you can then configure individually).
  • Choose your preferred separator and configure header settings as needed.
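These options map naturally onto a small parameter container. The sketch below is hypothetical (FrostyGen may organize this differently); the bounds match the UI limits mentioned later in this post.

```python
# Hypothetical container for the main-tab options: record count, field
# count, separator, and header settings, validated against the UI limits.
from dataclasses import dataclass

@dataclass
class GenerationParams:
    num_records: int
    num_fields: int
    separator: str = ","
    include_header: bool = True

    def __post_init__(self):
        if not 1 <= self.num_records <= 1_000_000:
            raise ValueError("num_records must be between 1 and 1,000,000")
        if not 1 <= self.num_fields <= 20:
            raise ValueError("num_fields must be between 1 and 20")
```

In the Streamlit UI these values would come from widgets such as `st.number_input` and `st.checkbox`.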

Define Generated Fields Types

For each field, you have the flexibility to select the data type from a predefined list. Depending on the chosen data type, you can then fine-tune parameters to generate data as precisely as needed.

  • DateTime: You can specify an initial date and a +/- range in days. The app will then randomly select dates within the specified range.
  • Text: You have two options here. You can either input a list (with one record per line) in a text area or generate strings randomly by specifying length, prefix, and suffix.
  • Database Columns: As I mentioned in the introduction, there’s often that one missing feature in existing data generators, and this was mine. I needed the capability to join the generated records with existing keys in other tables. This data type, once you’re connected to a Snowflake account, lets you select a specific column from your database and randomly pick values from it.

For this feature, a limit parameter helps you cap the number of distinct values retrieved. It can easily be overridden in the code.

FrostyGen — DatabaseColumn Parameters
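The three field types above can be sketched as simple generator functions. These are illustrative stand-ins, not FrostyGen's actual implementation; the `cursor` in the last one is any DB-API-style cursor from the open Snowflake connection.

```python
# Illustrative generators for the DateTime, Text, and DatabaseColumn types.
import random
import string
from datetime import datetime, timedelta

def random_datetime(initial: datetime, range_days: int) -> datetime:
    """Pick a date uniformly within +/- range_days of the initial date."""
    offset = random.randint(-range_days, range_days)
    return initial + timedelta(days=offset)

def random_text(length: int, prefix: str = "", suffix: str = "") -> str:
    """Random alphabetic string of the given length, with optional
    prefix and suffix."""
    body = "".join(random.choices(string.ascii_letters, k=length))
    return f"{prefix}{body}{suffix}"

def sample_column_values(cursor, table: str, column: str, limit: int = 1000) -> list:
    """Fetch up to `limit` distinct values of a column to pick from."""
    cursor.execute(f"SELECT DISTINCT {column} FROM {table} LIMIT {limit}")
    return [row[0] for row in cursor.fetchall()]
```

Each generated record would then call the appropriate generator once per configured field.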

Additional data types currently available include Integer, Double, and UUID.

Once all the fields are configured, you can click on the “Export Data” button. Regardless of the Export Option you have chosen, you will be able to check a small sample of generated records directly in the app.

FrostyGen — Generated Data Preview
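The export and preview steps share the same generated rows. Here is a minimal sketch of that split (function names are assumptions): rows are serialized with the chosen separator and header setting, while a small slice feeds the in-app preview, e.g. via `st.dataframe`.

```python
# Sketch of the export step: render rows as delimited lines, and keep a
# small sample for the in-app preview regardless of the export option.

def to_csv_lines(header, rows, sep=",", include_header=True):
    """Serialize rows (lists of values) into delimited text lines."""
    lines = [sep.join(header)] if include_header else []
    lines += [sep.join(map(str, row)) for row in rows]
    return lines

def preview(rows, n=5):
    """First n generated records, shown in the app after export."""
    return rows[:n]
```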

Data Generation Strategies

To achieve the desired diversity in your dataset, you can run the generation process multiple times, each with different input parameters.

For instance, you can generate datasets with different numbers of output records in each run, allowing you to create heterogeneous datasets.

  • For a numeric attribute like “Age”, you might set different intervals or ranges for each run, producing datasets with diverse age distributions: ages from 18 to 40 in one run and from 41 to 75 in another, capturing different demographics.
  • When dealing with a numerical attribute like “Transaction_Amount” you can specify different amounts or ranges in each run. This approach enables you to simulate diverse transaction scenarios, from small purchases to larger financial transactions, enhancing the realism and usefulness of your generated datasets.
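The multi-run "Age" example above can be sketched in a few lines: each run draws from its own range, and the runs are concatenated into one heterogeneous column (the function and counts here are illustrative).

```python
# Illustrative multi-run strategy: two runs with different age ranges,
# merged into a single heterogeneous dataset.
import random

def generate_ages(num_records: int, low: int, high: int) -> list:
    """One generation run: ages drawn uniformly from [low, high]."""
    return [random.randint(low, high) for _ in range(num_records)]

young = generate_ages(600, 18, 40)   # run 1: younger demographic
older = generate_ages(400, 41, 75)   # run 2: older demographic
ages = young + older                 # combined 1,000-record column
```

The same pattern applies to "Transaction_Amount" or any other numeric field where you want distinct sub-populations.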

This approach is tailored to work smoothly within the current limits set in the UI:

  • Max Number of Generated Records: 1 million (for each run)
  • Max Number of Fields: 20
  • Max Number of Distinct values for data type “DatabaseColumn”: 1000
# The limits can be modified by changing the code as follows:
st.number_input([...], max_value=<new_value>)

# NOTE: Increasing max_value might cause performance degradation.

SiS (Streamlit in Snowflake) version

On September 18, 2023, while I was writing this story, SiS (Streamlit in Snowflake) moved from private preview to public preview stage (initially on AWS deployments). I couldn’t resist embedding this app directly into my Snowflake account.
The FrostyGen SiS version 1.0 is already available in the GitHub repository (“Save to File” option is not available yet on the SiS version).

Deploying FrostyGen on your Snowflake account is very simple:

  1. Download “frosty_gen_sis.py” and “logo.png” from the GitHub repository.
  2. Create a new Streamlit app in your Snowflake account.
  3. Paste the code into your new app.
  4. Upload “logo.png” to the Streamlit application stage.
FrostyGen — SiS Integration
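Step 4 can also be done programmatically with Snowflake's `PUT` command instead of the UI. A hedged sketch (the stage path is illustrative; use the stage backing your Streamlit app):

```python
# Hypothetical helper for step 4: build the PUT command that uploads
# logo.png into the Streamlit app's stage.

def put_command(local_path: str, stage: str) -> str:
    """PUT uploads a local file to a stage; AUTO_COMPRESS=FALSE keeps
    the image usable as-is, OVERWRITE=TRUE allows re-uploads."""
    return f"PUT file://{local_path} @{stage} AUTO_COMPRESS=FALSE OVERWRITE=TRUE"

# Executed through an open connection, e.g.:
#   conn.cursor().execute(put_command("logo.png", "MY_DB.MY_SCHEMA.FROSTY_GEN_STAGE"))
```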

In my next Medium story, I will share my experience migrating the application and the challenges you may encounter along the way.

If you have any questions or suggestions, or if you happen to discover a bug (I’m sure a few are well hidden), please feel free to post them in the comments below or reach out to me on LinkedIn.

Ready to get started? Happy data generating!

Resources:

Matteo Consoli
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science
