From Glue to Snowpark…
Disclaimer: I am a Principal Solution Architect at Snowflake. The views expressed here are mine alone and do not necessarily reflect the views of my current, former, or future employers.
Recently I found myself needing to take some code from AWS Glue scripts to Snowpark.
I was excited when I saw this post about using SMA with some AWS Glue examples, so I wanted to give it a try.
Some of the code was using the DataFrame API to perform its logic but… some code was mostly using Glue-specific functions.
I wanted the migration experience to be even easier, so I decided to build a small set of helpers that let us take this logic into Snowpark quickly.
Introducing the SF Jobs helpers
The code for these helpers is available under the extras folder in the snowpark_extensions repository.
I built SF Jobs as a set of compatibility helpers for AWS Glue code. They are meant to be almost a drop-in replacement, so other than adjusting some imports your code should remain almost the same.
There are some considerations:
- Glue has a concept of DynamicFrames. For this library, DynamicFrames are mapped to Snowpark DataFrames. When loading data, for example a CSV, the Snowpark infer_schema capabilities (see example #12) will be used; in cases where inference would turn a column into a string, the VARIANT capabilities will be used to mimic the DynamicFrames approach.
- Glue transforms are performed with Snowpark DataFrame constructs; the helper just allows you to keep the same Glue syntax.
- Case sensitivity is an important topic. In Snowflake the metadata is stored in uppercase, and unless an identifier is quoted Snowflake will always turn it to uppercase before executing. Some extensions were added to allow specifying columns in lowercase; internally the helpers will do a case-insensitive search when this is enabled. Tick quoting like `id` is just turned into "id" before execution (see the small sketch after this list).
- A lot of the Glue actions are done directly against S3 buckets. You can do a lot of the same in Snowflake, but you may want to consider Snowflake tables or Iceberg tables as a better alternative. In my example I built an S3 integration that gives me full access to that S3 bucket and any prefix below it, which makes it easier to do similar Glue operations.
- I think of this library as a good way to get some code quickly running in Snowflake in a lift-and-shift approach, but I strongly recommend considering some native options in Snowflake like auto_ingest Snowpipes, triggered tasks, and dynamic tables.
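To make the identifier-case point concrete, here is a minimal sketch of how Snowflake resolves quoted versus unquoted names. The table and column names are only illustrative, and an active Snowpark session is assumed:

```python
from snowflake.snowpark import Session

# Sketch only: assumes a configured Snowpark connection.
session = Session.builder.getOrCreate()

# Unquoted identifiers are uppercased before execution, so they match metadata stored as "ID".
session.sql("create or replace temp table demo_case (id int)").collect()
session.sql("select id from demo_case").collect()        # resolved as ID -> works
session.sql('select "ID" from demo_case').collect()      # quoted uppercase -> works
# session.sql('select "id" from demo_case').collect()    # quoted lowercase -> invalid identifier
```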
Now let's see it in action
For my example I will be using a notebook. I am really enjoying the notebooks feature :).
Getting the Notebook Ready
First create your notebook:
Before starting, let's make sure we have the references we need:
Go to Anaconda packages and add `pandas` and `snowflake-snowpark-python`.
Now let's add our `sfjobs` lib. This library is not in Anaconda, so we need to build and add it.
If you downloaded the code, go into the glue_helper folder and run the snow CLI command to build it:
Or you can download it from the GitHub releases.
Go to your database and find the stage:
Then click the add files button:
And select your `sfjobs.zip` file:
Once the file is uploaded, go back to your notebook and, from stage packages, add the `sfjobs.zip` file, adjusting the location to your corresponding stage name.
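If you prefer doing the upload programmatically instead of through the UI, a small sketch like the following should also work from any Snowpark session; the stage name is a placeholder:

```python
from snowflake.snowpark import Session

# Placeholder stage name; assumes a configured Snowpark connection.
session = Session.builder.getOrCreate()
session.file.put(
    "sfjobs.zip",         # local path to the zip you built or downloaded
    "@MY_STAGE",          # replace with your stage
    auto_compress=False,  # keep it as a plain .zip so it can be added as a package
    overwrite=True,
)
```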
Set Up the Imports
The notebook is initialized with some sample cells; delete them first, and let's start with a Python cell to set up some imports:
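The cell itself is not reproduced here, but a minimal sketch of what it might look like follows. The sfjobs import paths are my assumption of how the helpers mirror the awsglue ones, so check the repository for the exact names:

```python
# Sketch of the imports cell. The sfjobs import paths below are assumptions
# modeled after the awsglue equivalents; verify them against the repository.
from sfjobs.transforms import *          # hypothetical: Glue-style transforms
from sfjobs.context import GlueContext   # hypothetical: GlueContext stand-in

from snowflake.snowpark import Session
import snowflake.snowpark.functions as F
import snowflake.snowpark.types as T

# Create (or reuse) a session; in a Snowflake notebook this returns the active session.
session = Session.builder.getOrCreate()
glueContext = GlueContext(session)       # hypothetical constructor signature
```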
These imports just bring some common functions into the context. As you can see, the imports are purposely similar to the old Glue imports. We also create a new session, and import functions and types in case we need them.
Getting some data
With a previously created S3 integration I set up a stage. This is just a temporary stage, as I will only need it for this notebook.
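The stage creation itself is not shown, but a rough sketch of what it could look like follows; the integration name, bucket path, and stage name are placeholders:

```python
# Sketch: create a temporary stage on top of an existing storage integration.
# MY_S3_INTEGRATION, the bucket path, and MEDICARE_STAGE are placeholders.
session.sql("""
    create temporary stage MEDICARE_STAGE
      url = 's3://my-bucket/medicare/'
      storage_integration = MY_S3_INTEGRATION
""").collect()
```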
Let's load the medicare dataset.
I recommend the step in line 4 just to make sure all columns are uppercase; that makes things easier.
As you can see, the `PROVIDER ID` field is loaded as a string, because in this Glue example some data was added to illustrate that this column had both integer and string data.
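The load cell itself is not reproduced above, but a sketch of what the equivalent Snowpark read could look like follows; the stage path and file name are placeholders, and the uppercase step is the one mentioned earlier:

```python
import snowflake.snowpark.functions as F

# Read the CSV from the stage with schema inference (stage path and file name are placeholders).
medicare = (
    session.read
    .option("INFER_SCHEMA", True)
    .option("PARSE_HEADER", True)
    .csv("@MEDICARE_STAGE/Medicare_Hospital_Provider.csv")
)

# Uppercase all column names so later lookups are easier.
medicare = medicare.select(
    [F.col(c).alias(c.strip('"').upper()) for c in medicare.columns]
)
print(medicare.schema)
```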
And let's save it into a table:
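Continuing the sketch from above, saving to a table in Snowpark would be something like this; the table name is illustrative:

```python
# Persist the DataFrame as a Snowflake table (table name is illustrative).
medicare.write.mode("overwrite").save_as_table("MEDICARE")
```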
We can use DynamicFrame-like APIs to load the data back:
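I have not reproduced the exact cell here; assuming the helpers mirror Glue's call shape, it could look roughly like this:

```python
# Sketch only: assumes the sfjobs helpers mirror Glue's create_dynamic_frame shape.
# glueContext and the argument names are assumptions; check the repository for the real API.
medicare_dyf = glueContext.create_dynamic_frame.from_catalog(
    database="MY_DB",        # placeholder database name
    table_name="MEDICARE",   # the table saved above
)
```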
As mentioned earlier, DynamicFrames in this implementation are just DataFrames, so this schema only shows one datatype, but we will still be able to do some "dynamic" type handling.
Resolving Conflicts
A typical operation is to use `resolveChoice` to try to solve data inconsistencies:
Here `resolveChoice` is used, and we can see that the 2 anomalous records can then be removed.
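As a sketch, and assuming the helper keeps Glue's syntax, the call would look something like this; the column name and spec are borrowed from the classic Glue medicare example:

```python
import snowflake.snowpark.functions as F

# Sketch: Glue-style resolveChoice, assuming the helper keeps the same syntax.
# Casting the mixed-type column to long nulls out values that cannot be cast,
# so the two anomalous records can then be filtered away.
medicare_res = medicare_dyf.resolveChoice(specs=[("PROVIDER ID", "cast:long")])
medicare_res = medicare_res.filter(F.col("PROVIDER ID").is_not_null())
```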
Using case_insensitive resolution and UDFs
Now let's enable case_insensitive resolution, and let's apply a UDF. Snowpark supports UDFs almost identically to what you see in Glue, just remember to specify `input_types` and `return_type`.
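Here is a minimal sketch of a Snowpark UDF used this way; the column name and the lambda (stripping a "$" sign) are illustrative:

```python
from snowflake.snowpark.functions import udf, col
from snowflake.snowpark.types import StringType

# Register a UDF with explicit input_types and return_type, then apply it.
chop_dollar = udf(
    lambda x: x.replace("$", ""),
    return_type=StringType(),
    input_types=[StringType()],
)

medicare_res = medicare_res.with_column(
    "AVERAGE COVERED CHARGES",
    chop_dollar(col("AVERAGE COVERED CHARGES")),
)
```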
Column Renaming and Nesting
What about renaming and nesting columns? Well, we can use the same familiar operations:
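For instance, assuming the helper keeps Glue's apply_mapping shape, a sketch of renaming and nesting could look like this; the mappings themselves are only illustrative:

```python
# Sketch only: Glue-style apply_mapping, assuming the helper mirrors the syntax.
# Each tuple is (source name, source type, target name, target type);
# dotted target names nest the column, plain renames just change the name.
medicare_mapped = medicare_res.apply_mapping([
    ("PROVIDER ID", "long", "PROVIDER.ID", "long"),
    ("PROVIDER NAME", "string", "PROVIDER.NAME", "string"),
    ("AVERAGE COVERED CHARGES", "string", "CHARGES.COVERED", "string"),
])
```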
Writing back to S3
Sure we can also do that:
Notice the additional `storage_integration` parameter. However, you might want to consider saving directly into a Snowflake table, or using an Iceberg table if you still want your data stored in S3.
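As a sketch, and assuming the helper mirrors Glue's write_dynamic_frame options while adding the storage_integration parameter, the write could look roughly like this; every name below is a placeholder:

```python
# Sketch only: Glue-style write back to S3, assuming the helper mirrors
# write_dynamic_frame and adds a storage_integration option.
glueContext.write_dynamic_frame.from_options(
    frame=medicare_mapped,
    connection_type="s3",
    connection_options={
        "path": "s3://my-bucket/medicare_out/",
        "storage_integration": "MY_S3_INTEGRATION",
    },
    format="parquet",
)
```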
More resolveChoices, like make_cols or make_struct?
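Those variants follow the same Glue syntax; assuming the helper supports these choice strategies, a sketch might look like:

```python
# Sketch: other Glue resolveChoice strategies, assuming the helper supports them.
# make_cols splits an ambiguous column into one column per detected type,
# make_struct keeps both representations together in a struct-like column.
medicare_cols = medicare_dyf.resolveChoice(specs=[("PROVIDER ID", "make_cols")])
medicare_struct = medicare_dyf.resolveChoice(specs=[("PROVIDER ID", "make_struct")])
```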
Final Thoughts
Thank you for reading this far. The purpose of this helper is to demonstrate some of the powerful features enabled by Snowpark.
Once you have migrated your notebook from Glue to Snowpark, you can easily create a schedule and get it going:
As always I hope you find this content helpful and inspirational.