So You Want To… Import a File in Snowpark Python

Photo by Markus Spiske on Unsplash

Joining Snowflake through an acquisition has been a bit of a rollercoaster, and joining the Data Science Field CTO team has required me to get to grips with a lot of new functionality.

I’m obviously biased, but learning how to use Snowflake has been a lot easier than other data/cloud tools. This is partly down to everyone I’ve met at Snowflake being keen to learn and teach, but also down to the company’s one product philosophy making stitching things together as simple as possible.

Nevertheless, occasionally I would hit a roadblock and just not “get” how to do something. Documentation helps, but I’ve always found documentation to be much better at reminding you how to do something you already know than at teaching you from a blank slate (I’m looking at you, pandas docs).

So, in this series of “So You Want To…” blog posts, I’m going to draw on my Snowflake journey and focus on the simple stuff that gets you up and running as soon as possible. These foundations should help you develop some momentum (and confidence) in Snowflake and Snowpark in short order.

First up, importing some data into Snowflake.

Prerequisites

  • A Snowflake Account (and a username, password etc.)
  • A CSV file (it can be anything you like, as long as you know where it is!)
  • A Python environment that can connect to Snowflake and has Snowflake’s Python API installed (see the install note below)
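If the Snowpark API isn’t installed yet, it’s the usual pip install (the package name is the same one referenced later in the SPROC’s packages list):

pip install snowflake-snowpark-python pandas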

0. Housekeeping: Create your session, virtual warehouse and database

0.1 First up, get your session up and running:

from snowflake.snowpark.session import Session

connection_parameters = {
    "account": "your accountname",   # insert your account name here
    "user": "your username",         # insert your username here
    "password": "your password",     # you get it...
    "role": "ACCOUNTADMIN"           # (or whatever role you want)
}
session = Session.builder.configs(connection_parameters).create()

Note: it’s better to do the above via a JSON config file, or via a package like getpass, but this is fine for now.
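For example, one way to avoid hard-coding the password is to prompt for it at runtime with getpass (part of the Python standard library):

import getpass
from snowflake.snowpark.session import Session

connection_parameters = {
    "account": "your accountname",
    "user": "your username",
    "password": getpass.getpass("Snowflake password: "),  # prompts instead of hard-coding
    "role": "ACCOUNTADMIN"
}
session = Session.builder.configs(connection_parameters).create()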

0.2 Finally, get a virtual warehouse and a database set up

# Create a Virtual Warehouse
session.sql("CREATE OR REPLACE WAREHOUSE MY_FIRST_WH WITH WAREHOUSE_SIZE='X-SMALL'").collect()

# Create Database
session.sql("CREATE OR REPLACE DATABASE MY_FIRST_DB").collect()

Conclusion

That’s it for the required housekeeping: you now have a session into your Snowflake account, a Virtual Warehouse (think of this as where compute happens) and a Database to store data in.

1. Manual ingest from a local drive

If you just want to get data into Snowflake ASAP then this might be for you. It won’t scale and it won’t earn you the respect and admiration of your colleagues, but you will get the job done.

Note I assume that you have set up your session etc. per 0.1 and 0.2 above

1.1 First, write the table from your pandas DataFrame directly to Snowflake

import pandas as pd

pandas_df = pd.read_csv("my_csv.csv")
session.write_pandas(df=pandas_df,
                     table_name="MY_FIRST_TABLE_LOCAL",
                     overwrite=True,
                     auto_create_table=True)

1.2 Finally, test that it worked

snowpark_df = session.sql('''SELECT * FROM MY_FIRST_TABLE_LOCAL''')
snowpark_df.to_pandas()
# snowpark_df.limit(5).to_pandas() # with a limit of 5 records

Conclusion

Pretty simple really, and if you haven’t done all your (unscalable) munging locally first, you can start performing all your other transformations, joins, data science tasks etc. inside Snowflake. Unfortunately, it does rely on reading the file into pandas locally, which is a little clunky and won’t be terribly performant.
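For instance, once the table exists you can keep working with it lazily through a Snowpark DataFrame rather than pulling it back into pandas; a quick sketch (SOME_COLUMN is a hypothetical column name, substitute one of yours):

from snowflake.snowpark.functions import col

df = session.table("MY_FIRST_TABLE_LOCAL")            # lazy reference, nothing is pulled locally
df.filter(col("SOME_COLUMN").is_not_null()).count()   # executes in the warehouse, not on your laptop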

2. Ingest from a Snowflake internal stage directly to a table

If you’re a little more patient, this is a more proper approach: the data could be located anywhere, provided Snowflake has been given the right permissions (see here for an AWS example). The core benefit is that all the hard yards are done by Snowflake, and because the raw data is now in an Internal Stage you have an immutable copy of it in Snowflake, ensuring any and all transformations are auditable (and reversible).
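For reference, the external (e.g. S3) version of the stage we’re about to create looks roughly like the sketch below; the bucket and storage integration names are hypothetical, and the integration has to exist and be granted access to the bucket first. The rest of this section sticks with an internal stage.

# Hypothetical external stage pointing at S3 (needs a pre-configured storage integration)
session.sql("""
    CREATE OR REPLACE STAGE MY_EXTERNAL_STAGE
      URL = 's3://my-hypothetical-bucket/data/'
      STORAGE_INTEGRATION = MY_S3_INTEGRATION
""").collect()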

Note I assume that you have set up your session etc. per 0.1 and 0.2 above

2.1 First, create a stage and put your file into the stage (presuming your data isn’t already in one)

# Create Stage
session.sql('CREATE OR REPLACE STAGE FILE_STAGE').collect()

session.file.put(local_file_name="my_csv.csv",
                 stage_location='@FILE_STAGE',
                 auto_compress=False,
                 overwrite=True)
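Optionally, you can sanity-check that the upload landed by listing the stage contents:

session.sql("LIST @FILE_STAGE").collect()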

2.2 Next, define a schema for the csv

from snowflake.snowpark.types import StructType, StructField, IntegerType, StringType, FloatType

# Define the schema for the data in the CSV file.
user_schema = StructType([StructField("col header 1", StringType()),
                          StructField("col header 2", IntegerType())])

2.3 Next, read the csv into a Snowpark DataFrame directly and write to Snowflake

# Create a DataFrame that is configured to load data from the CSV file.
df = session.read.options({"field_delimiter": ",", "skip_header": 1}) \
                 .schema(user_schema) \
                 .csv("@file_stage/my_csv.csv")

df.write.mode("overwrite").save_as_table("MY_FIRST_TABLE_SNOWPARK")

2.4 Finally, test that it worked

snowpark_df = session.sql('''SELECT * FROM MY_FIRST_TABLE_SNOWPARK''')
snowpark_df.to_pandas()
# snowpark_df.limit(5).to_pandas() # with a limit of 5 records

Conclusion

A touch more involved, and the schema bit is a little tedious (generally I’d just import to pandas first to eyeball the columns, like we did in (1), but that isn’t strictly necessary). Still, it’s simple and painless.
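If you do take the pandas-first route, a rough sketch of turning what pandas infers into a Snowpark schema could look like this (the dtype mapping is deliberately crude and just an assumption about your data; anything unrecognised falls back to a string):

import pandas as pd
from snowflake.snowpark.types import StructType, StructField, IntegerType, FloatType, StringType

sample = pd.read_csv("my_csv.csv", nrows=100)   # small sample just to inspect the columns
print(sample.dtypes)                            # eyeball what pandas inferred

type_map = {"i": IntegerType(), "f": FloatType(), "O": StringType()}  # pandas dtype kind -> Snowpark type
user_schema = StructType([
    StructField(col_name, type_map.get(dtype.kind, StringType()))
    for col_name, dtype in sample.dtypes.items()
])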

3. Ingest from a Snowflake stage, read in a SPROC, and then write to a table

Sometimes you want to read data from an internal stage in Snowflake, do some kind of python based manipulation, then write to a table. Using a SPROC lets you push all this manipulation to the Snowflake compute environment. Unfortunately, because it is a SPROC, it is bound to a single node (more on this and Dynamic File Access in later blog posts). Nevertheless, this should get you up and running with the data including supporting any transformations you need to make. It’s also the easiest way to lift and drop Python code into Snowflake.

Note I assume that you have set up your session etc. per 0.1, 0.2 and 2.1 above

3.1 First, create a stage for your stored procedure (SPROC) to live in

session.sql('CREATE OR REPLACE STAGE SPROC_STAGE').collect()

Note you don’t have to set up a distinct stage for your SPROC; you could just put it into the FILE_STAGE location. A separate stage keeps things cleaner and makes files easier to manage/locate, though.

3.2 Next, create a SPROC to read the data in

from snowflake.snowpark.functions import sproc

@sproc(name='FILE_OPEN',
       packages=['snowflake-snowpark-python', 'pandas'],
       is_permanent=True,
       replace=True,
       stage_location='@SPROC_STAGE',
       session=session)
def file_open(session: Session, filename: str, target_table: str) -> str:
    import pandas as pd

    # Fetch the staged file onto the SPROC's local filesystem
    target_loc = 'tmp'
    _ = session.file.get(filename, target_loc)
    tmp_file_loc = target_loc + '/' + filename.split('/')[-1]

    # Read to pandas
    pandas_df = pd.read_csv(tmp_file_loc)

    # Insert some pandas manipulation here
    # pandas_df["some new col"] = pandas_df["some existing col"]

    # Write to Snowflake
    session.write_pandas(df=pandas_df,
                         table_name=target_table,
                         overwrite=True,
                         auto_create_table=True)

    return f'Successfully written {filename} to table {target_table}'

3.3 Next, execute your SPROC

file_open("@file_stage/my_csv.csv", 'MY_FIRST_TABLE_SPROC')

3.4 Finally, test that it worked

snowpark_df = session.sql('''SELECT * FROM MY_FIRST_TABLE_SPROC''')
snowpark_df.to_pandas()
# snowpark_df.limit(5).to_pandas() # with a limit of 5 records

It’s a little less compact than the other approaches, and in this instance it’s unnecessary to use a SPROC just to read in a CSV via pandas, but it gives you an ultra-simple example of opening a file inside a SPROC, doing something with it, and then writing to Snowflake.

Wrapping Up

This is not even close to an exhaustive list of file access patterns for Snowflake. For example, it doesn’t cover the more advanced External Tables topic, but it should get you up and running in Snowflake and Snowpark in short order with the most basic filetype out there.

Also, stay tuned for Dynamic File Access, which has been in Private Preview since February 2023 (I’ll update this blog when it reaches Public Preview).
