Generating Sample Data Using the faker Library with Snowpark

The Snowflake sample database is a convenient way to get sample data quickly for demonstration purposes. But the sample data doesn't look realistic, so it is not always the best option when illustrating examples in presentations and case studies.


I sometimes use the open-source faker library to generate fake but realistic-looking sample data with Python. My usual approach is to generate a CSV file with the sample data and then upload it to the database.

When I recently needed some realistic sample data in Snowflake for a case study, it was an excellent opportunity to explore how to use the faker library with Snowpark so that the sample data would be written directly to a Snowflake table, without an intermediate CSV file.

Before I began, I checked whether the faker package is available inside Snowflake using the following query:

SELECT * 
FROM information_schema.packages
WHERE language = 'python' AND package_name ilike '%faker%';

Good news, it’s available as a third-party package and ready to use.

I wanted to create a table with sample customer data with the following attributes for each customer:

  • ID
  • Name
  • Address
  • Telephone number
  • Social security number
  • Birth date

This can be easily generated using faker. To see how I went about it, let's write some simple Python code that generates 10 customer records with the above structure:

import faker
fake = faker.Faker()
customers_str = fake.csv(data_columns=('{{name}}', '{{street_address}}', '{{phone_number}}', '{{ssn}}', '{{date}}'), num_rows=10, include_row_ids=True).replace('"', '')
print(customers_str)

The result is a comma-separated string with line breaks between records, which looks like this when printed to the output:

Output of 10 generated fake customers in python

Note that the faker library generates values randomly on each run, which means that you will get different values on subsequent runs of the code. You can seed the generator to ensure consistent results if you wish.

Now let’s implement this code as a stored procedure using Snowpark.

We will do this in steps, first creating an initial draft of the stored procedure and building from there. The stored procedure will be named GEN_NAMES. It will take one parameter named ROWS_CNT of data type integer, which indicates how many rows to generate.

We will include the faker package in the list of packages. The Python function within the body of the stored procedure will be named gen. The stored procedure is coded as:

CREATE OR REPLACE PROCEDURE GEN_NAMES(ROWS_CNT INTEGER)
RETURNS VARCHAR
LANGUAGE PYTHON
RUNTIME_VERSION = '3.8'
PACKAGES = ('snowflake-snowpark-python', 'faker')
HANDLER = 'gen'
AS
$$
def gen(session, ROWS_CNT):
    import faker
    fake = faker.Faker()
    customers_str = fake.csv(data_columns=('{{name}}', '{{street_address}}', '{{phone_number}}', '{{ssn}}', '{{date}}'), num_rows=ROWS_CNT, include_row_ids=True).replace('"', '')
    return customers_str
$$;

As you can see from the above code, the body of the gen() function is the same Python code that we tested previously, with one change: the print() call is replaced with a return statement so that the procedure returns the output instead of printing it.

We can execute the procedure from the Snowflake user interface, providing the value 10 as the input parameter using this command:

CALL GEN_NAMES(10);

The procedure returns 10 randomly generated customer records, something like this:

Output of 10 generated fake customers from the Snowflake stored procedure

Now that we have a working initial stored procedure, let’s add code that will save this data to a Snowflake table.

There are several ways to do that. For example, we could generate SQL statements to create a table in Snowflake and insert the data, but that would not be very efficient. Alternatively, we could create a Pandas data frame and populate it from the string values using Pandas methods, which might be a good choice if you are experienced with Pandas.
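For completeness, here is a minimal sketch of the Pandas route. The sample string and column names below are made up for illustration; in the stored procedure, the real string would come from faker:

```python
import io
import pandas as pd

# Hypothetical two-row sample in the shape of faker's csv() output:
# row id, name, address, phone, SSN, date
customers_str = ("1,Jane Doe,12 Main St,555-0100,123-45-6789,1990-01-02\r\n"
                 "2,John Roe,34 Oak Ave,555-0101,987-65-4321,1985-07-14\r\n")

cols = ["ID", "NAME", "ADDRESS", "PHONE", "SSN", "BIRTHDATE"]
pdf = pd.read_csv(io.StringIO(customers_str), header=None, names=cols,
                  parse_dates=["BIRTHDATE"])

# Inside a stored procedure, session.create_dataframe(pdf) would then
# turn this Pandas frame into a Snowpark data frame.
print(pdf.shape)
```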

In this example, we will use Snowflake data frames to accomplish the task based on the Working with DataFrames in Snowpark Python documentation.

To construct a data frame in Snowpark we will use the snowflake.snowpark.Session.create_dataframe method which takes two parameters:

  • Data — the data used for building the data frame. We will provide it as a list (other options include tuples or a Pandas data frame). Each element in the list represents one row in the data frame and in turn each of these elements contains the columns of the row.
  • Schema — the names and data types of columns in the data frame, provided as a StructType object.

Let’s construct the schema of our data frame first. As per our requirements at the beginning, we will use these column names: ID, NAME, ADDRESS, PHONE, SSN and BIRTHDATE. The data type of the ID column is integer, the data type of the BIRTHDATE column is date and the rest are all string.

According to the documentation, a StructType object is a list of elements of type StructField, which in turn take these parameters:

  • column_identifier — the column name
  • datatype — the column datatype using pre-defined data type names from the snowflake.snowpark.types module
  • nullable — whether the column is nullable as indicated by True or False

In our example the schema for the customers sample data is:

schema_for_customers = StructType([
    StructField("ID", IntegerType(), False),
    StructField("NAME", StringType(), False),
    StructField("ADDRESS", StringType(), False),
    StructField("PHONE", StringType(), False),
    StructField("SSN", StringType(), False),
    StructField("BIRTHDATE", DateType(), False)
])

To provide data to the Snowflake data frame, we will take the string value customers_str from the faker output and break it into a list of elements. We split the string at each line break ('\r\n'), drop the empty element that follows the final line break by slicing [0:-1], and then split each line on commas, like this:

[x.split(',') for x in customers_str.split('\r\n')[0:-1]]
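To see what this expression produces, here is a quick local check on a hand-written two-row string in the same shape as the faker output (the values are made up):

```python
# Two fake rows, each terminated by '\r\n' as faker's csv() emits them
sample = ("1,Alice Smith,5 Elm Rd,555-0199,321-54-9876,1992-03-08\r\n"
          "2,Bob Jones,77 Pine Ln,555-0177,654-32-1098,1980-11-21\r\n")

# split() leaves a trailing empty string after the final '\r\n';
# the [0:-1] slice drops it before each line is split on commas
rows = [x.split(',') for x in sample.split('\r\n')[0:-1]]

print(len(rows))    # 2
print(rows[0][1])   # Alice Smith
```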

Bringing it all together, this is how we construct a Snowflake data frame named cust_df from the customers_str string value, ensuring the required column names and data types are defined as in the schema_for_customers object:

cust_df = session.create_dataframe([x.split(',') for x in customers_str.split('\r\n')[0:-1]], schema_for_customers)

All that remains is to save this data frame as a table named FAKE_CUSTOMERS in Snowflake using the write property of the data frame:

cust_df.write.mode("overwrite").save_as_table("FAKE_CUSTOMERS")

Instead of hard-coding the table name in the save_as_table() method, we can provide it as a parameter named FAKE_TABLE_NAME to the stored procedure.

The complete code of the stored procedure now looks like this:

CREATE OR REPLACE PROCEDURE GEN_NAMES(FAKE_TABLE_NAME VARCHAR, ROWS_CNT INTEGER)
RETURNS VARCHAR
LANGUAGE PYTHON
RUNTIME_VERSION = '3.8'
PACKAGES = ('snowflake-snowpark-python', 'faker')
HANDLER = 'gen'
AS
$$
def gen(session, FAKE_TABLE_NAME, ROWS_CNT):
    import faker
    from snowflake.snowpark.types import IntegerType, StringType, DateType, StructType, StructField

    fake = faker.Faker()
    customers_str = fake.csv(data_columns=('{{name}}', '{{street_address}}', '{{phone_number}}', '{{ssn}}', '{{date}}'), num_rows=ROWS_CNT, include_row_ids=True).replace('"', '')

    schema_for_customers = StructType([
        StructField("ID", IntegerType(), False),
        StructField("NAME", StringType(), False),
        StructField("ADDRESS", StringType(), False),
        StructField("PHONE", StringType(), False),
        StructField("SSN", StringType(), False),
        StructField("BIRTHDATE", DateType(), False)
    ])

    cust_df = session.create_dataframe([x.split(',') for x in customers_str.split('\r\n')[0:-1]], schema_for_customers)
    cust_df.write.mode("overwrite").save_as_table(FAKE_TABLE_NAME)

    return 'Success'
$$;

To execute the stored procedure, for example to generate 100 customer records and save them to Snowflake in a table named FAKE_CUSTOMERS, we use the following command:

CALL GEN_NAMES('FAKE_CUSTOMERS', 100);

Then we review the data that has been generated and stored in the Snowflake table:

SELECT * FROM FAKE_CUSTOMERS;

Snowflake table with generated sample customers

We can also verify that the data types have been populated per the provided schema. For example, clicking the BIRTHDATE column heading in Snowsight shows the data type and distribution of that column:

Data distribution of the BIRTHDATE column

The above stored procedure can of course be modified as needed to fulfill various requirements for generating sample data.
