From Glue to Snowpark…

Mauricio Rojas
6 min read · Jul 6, 2024


Disclaimer: I am a Principal Solution Architect at Snowflake. The views expressed here are mine alone and do not necessarily reflect the views of my current, former, or future employers.

Recently I found myself needing to take some code from AWS Glue scripts to Snowpark.

I was excited when I saw this post about using SMA with some AWS Glue examples, so I wanted to give it a try.

Some of the code used the DataFrame API for its logic but… some of it relied mostly on Glue-specific functions.

I wanted the migration experience to be even easier, so I decided to build a small set of helpers that let us take this logic into Snowpark quickly.

Introducing the SF Jobs helpers

The code for these helpers is available under the extras folder in the snowpark_extensions repository.

I built SF Jobs as a set of compatibility helpers for AWS Glue code. They are meant to be almost a drop-in replacement, so other than adjusting some imports your code should remain mostly the same.

There are some considerations:

  • Glue has a concept of DynamicFrames. In this library, DynamicFrames are mapped to Snowpark DataFrames. When loading data, for example a CSV, Snowpark's infer_schema capabilities (see example #12) are used; when a column holds mixed types it is loaded as a string, and VARIANT capabilities are used to mimic the DynamicFrames approach.
  • Glue transforms are performed with Snowpark DataFrame constructs; the helper just allows you to keep the same Glue syntax.
  • Case sensitivity is an important topic. In Snowflake, metadata is stored in uppercase, and unless an identifier is quoted, Snowflake will always turn it to uppercase before executing. Some extensions were added to allow specifying columns in lowercase; when this is enabled, the helpers internally do a case-insensitive search. Backtick quoting like `id` is simply turned into "id" before execution.
  • A lot of Glue actions are done directly against S3 buckets. You can do much of the same in Snowflake, but you may want to consider Snowflake tables or Iceberg tables as a better alternative. In my example I built an S3 storage integration that gives me full access to that S3 bucket and any prefix below it, which makes it easier to perform similar Glue operations.
  • I think of this library as a good way to get some code running quickly in Snowflake in a lift-and-shift approach, but I strongly recommend considering native Snowflake options like auto-ingest Snowpipes, triggered tasks, and dynamic tables.

Now let's see it in action

For my example I will be using a notebook. I am really enjoying the notebooks feature :).

Getting the Notebook Ready

First create your notebook:

Creating a new notebook
Specify notebook name, location and warehouse

Before starting, let's make sure we have the references we need:

Go to Anaconda packages and add pandas and snowflake-snowpark-python

Adding Anaconda Packages

Now let's add our sfjobs lib. This library is not in Anaconda so we need to build and add it.

If you downloaded the code, go into the glue_helper folder and run the snow CLI command to build it:

Or you can download it from the GitHub releases.

Go to your database and find the stage:

Then click Add Files:

And select your sfjobs.zip file:

Once the file is uploaded, go back to your notebook and, under Stage Packages, add the sfjobs.zip file, adjusting the location to your corresponding stage name.

Setting Up the Imports

The notebook is initialized with some sample cells; delete them first, and let's start with a Python cell to set up some imports:
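The original cell was a screenshot, so here is a sketch of what the imports can look like. The sfjobs module paths are my assumption, mirroring the awsglue imports they replace; everything under snowflake.snowpark is standard Snowpark.

```python
# Sketch of the imports cell. The sfjobs module/class names below are assumptions
# that mirror the usual awsglue imports; check the sfjobs sources for the exact paths.
from snowflake.snowpark import functions as F
from snowflake.snowpark import types as T
from snowflake.snowpark.context import get_active_session

# Hypothetical Glue-style imports provided by the sfjobs.zip package we attached.
from sfjobs.transforms import *          # ApplyMapping, ResolveChoice, etc.
from sfjobs.context import GlueContext   # Glue-like entry point over a Snowpark session

# Inside a Snowflake notebook an active Snowpark session is already available;
# outside a notebook you would build one with Session.builder instead.
session = get_active_session()
glueContext = GlueContext(session)
```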

These imports just bring some common functions into the context.

As you can see, the imports are purposely similar to the old Glue imports.
We also create a new session and import functions and types in case we need them.

Getting some data

With a previously created S3 storage integration, I set up a stage.

This is just a temporary stage, as I will only need it for this notebook.
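The stage itself was created in a screenshot cell; a minimal sketch, assuming a storage integration named MY_S3_INTEGRATION and a placeholder bucket, could look like this:

```python
# Temporary stage over the S3 bucket exposed by the storage integration.
# Integration and bucket names are placeholders for my environment.
session.sql("""
    CREATE OR REPLACE TEMPORARY STAGE GLUE_MIGRATION_STAGE
      URL = 's3://my-demo-bucket/'
      STORAGE_INTEGRATION = MY_S3_INTEGRATION
""").collect()
```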

Let's load the Medicare dataset.
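The load cell was also an image, so below is a plain-Snowpark sketch of roughly what it does (the helper's Glue-style reader wraps similar calls). The stage path is a placeholder; the dataset is the Medicare CSV used in the AWS Glue examples.

```python
# Read the Medicare CSV from the stage, letting Snowpark infer the schema.
medicare = (
    session.read
    .option("INFER_SCHEMA", True)    # infer column types from the file
    .option("PARSE_HEADER", True)    # use the first row as column names
    .csv("@GLUE_MIGRATION_STAGE/medicare/Medicare_Hospital_Provider.csv")
)

# Normalize every column name to uppercase to avoid case-sensitivity surprises later.
medicare = medicare.to_df([c.strip('"').upper() for c in medicare.columns])
medicare.show(5)
```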

I recommend including the step that uppercases all column names, as that makes things easier.

As you can see, the PROVIDER ID field is loaded as a string because, in this Glue example, some data was added to illustrate that this column contained both integer and string values.

And let's save it into a table:
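This is a one-liner in plain Snowpark; the table name below is just my choice.

```python
# Persist the loaded DataFrame as a regular Snowflake table (name is illustrative).
medicare.write.save_as_table("MEDICARE_RAW", mode="overwrite")
```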

We can use DynamicFrame-like APIs to load the data back:
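I am sketching this with the Glue-shaped call the helper is meant to emulate; the exact sfjobs signature may differ, so treat the names and arguments as assumptions.

```python
# Hypothetical DynamicFrame-style read-back; in this library the result is just a
# Snowpark DataFrame. Database/table arguments are illustrative.
medicare_dyf = glueContext.create_dynamic_frame.from_catalog(
    database="PUBLIC", table_name="MEDICARE_RAW"
)
print(medicare_dyf.schema)   # one Snowpark datatype per column, as noted below
```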

As mentioned earlier, DynamicFrames in this implementation are just DataFrames, so this schema only shows one datatype, but we will still be able to do some "dynamic" type handling.

Resolving Conflicts

A typical operation is to use resolveChoice to try to solve data inconsistencies:
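The spec below is the same one the AWS Glue medicare example uses; how the helper maps it onto Snowpark operations is its own business, so consider this a hedged sketch.

```python
# Cast the ambiguous column to a numeric type, Glue-style. Records that cannot be
# cast become NULL and can then be filtered out.
medicare_res = medicare_dyf.resolveChoice(specs=[("PROVIDER ID", "cast:long")])

# Drop the anomalous records whose provider id was not numeric.
medicare_res = medicare_res.filter(F.col('"PROVIDER ID"').is_not_null())
medicare_res.count()
```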

resolveChoice is applied, and we can see that the two anomalous records can then be removed.

Using case_insensitive resolution and UDFs

Now let's enable case_insensitive resolution and apply a UDF. Snowpark supports UDFs almost identically to Glue; just remember to specify input_types and return_type.
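Here is a sketch of the UDF step, mirroring the Glue example that strips the leading "$" from a charges column. The lowercase column reference assumes the helper's case-insensitive resolution is switched on (the exact toggle lives in sfjobs and is not shown here).

```python
# Snowpark UDFs look almost identical to the Glue/PySpark version, but input_types
# and return_type must be spelled out explicitly.
chop_f = F.udf(lambda x: x[1:], return_type=T.StringType(), input_types=[T.StringType()])

# Strip the leading "$"; the lowercase column name relies on case-insensitive resolution.
medicare_res = medicare_res.with_column("ACC", chop_f(F.col("average covered charges")))
medicare_res.select("ACC").show(5)
```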

Column Renaming and Nesting

What about renaming and nesting columns? Well, we can use the same familiar operations:
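Below is a plain-Snowpark approximation of the rename-and-nest step (the role Glue's ApplyMapping plays); column and field names are illustrative, taken from the Medicare dataset.

```python
# Rename the provider id column and fold the address fields into a single
# OBJECT column, which plays the role of Glue's nested struct.
nested = (
    medicare_res
    .with_column_renamed('"PROVIDER ID"', "PROVIDER_ID")
    .with_column(
        "PROVIDER_ADDRESS",
        F.object_construct(
            F.lit("street"), F.col('"PROVIDER STREET ADDRESS"'),
            F.lit("city"),   F.col('"PROVIDER CITY"'),
            F.lit("state"),  F.col('"PROVIDER STATE"'),
            F.lit("zip"),    F.col('"PROVIDER ZIP CODE"'),
        ),
    )
)
nested.select("PROVIDER_ID", "PROVIDER_ADDRESS").show(3)
```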

Writing back to S3

Sure, we can do that too:
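Since the original cell is a screenshot, here is a hypothetical Glue-style write showing the extra storage_integration argument mentioned below; the argument names are assumptions modeled on Glue's write_dynamic_frame, not the verified sfjobs signature.

```python
# Hypothetical: mirrors glueContext.write_dynamic_frame.from_options, with a
# storage_integration entry so Snowflake can unload to the external S3 location.
glueContext.write_dynamic_frame.from_options(
    frame=nested,
    connection_type="s3",
    connection_options={
        "path": "s3://my-demo-bucket/medicare_output/",
        "storage_integration": "MY_S3_INTEGRATION",
    },
    format="parquet",
)
```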

Notice the additional storage_integration parameter. However, I would suggest considering saving directly into a Snowflake table, or using an Iceberg table if you still want your data stored in S3.

More resolveChoice options, like make_cols or make_struct?
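For reference, these Glue resolveChoice actions look like this; whether and how the helper approximates each one in Snowpark is something to verify, so this is only a syntax sketch.

```python
# make_cols: split an ambiguous column into one column per observed type
# (in Glue: PROVIDER ID_long and PROVIDER ID_string).
split_dyf = medicare_dyf.resolveChoice(specs=[("PROVIDER ID", "make_cols")])

# make_struct: keep both representations inside a single struct-like column
# (a VARIANT/OBJECT on the Snowflake side).
struct_dyf = medicare_dyf.resolveChoice(specs=[("PROVIDER ID", "make_struct")])
```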

Final Thoughts

Thank you for reading this far. The purpose of this helper is to demonstrate some of the powerful features enabled by Snowpark.

Once you have your notebook migrated from Glue to Snowpark, you can easily create a schedule and get it going:
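Scheduling can be done from the notebook UI; as an alternative sketch, assuming the notebook is named GLUE_MIGRATION_NOTEBOOK, a task can run it on a schedule with EXECUTE NOTEBOOK. Names and the schedule are placeholders.

```python
# Create a task that runs the notebook every hour, then enable it.
session.sql("""
    CREATE OR REPLACE TASK RUN_GLUE_MIGRATION_NOTEBOOK
      SCHEDULE = '60 MINUTE'
      AS EXECUTE NOTEBOOK MY_DB.PUBLIC.GLUE_MIGRATION_NOTEBOOK()
""").collect()
session.sql("ALTER TASK RUN_GLUE_MIGRATION_NOTEBOOK RESUME").collect()
```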

As always I hope you find this content helpful and inspirational.

