Automate the Deployment of Snowflake UDFs with Gradle

A funny thing happened on the way to the Snowpark

Update for version 1.0.0

I introduced breaking changes in version 1.0.0 of the plugin. In preparation for supporting non-JVM languages, the snowflakePublish task was renamed to snowflakeJvm, which is specific to building the currently supported JVM languages. I have not updated the body of the story below, but I have updated the README on GitHub and the API docs.

Getting my hands dirty with Snowpark

I’ve been working heavily with Snowpark, testing both the Java and Scala flavors. I love DataFrames, having used Apache Beam and Apache Spark in the past. User-defined functions (UDFs) are a great abstraction with DataFrames, especially for complex logic that relies on external libraries. And although Snowpark allows the definition of inline UDFs for Scala, the inability to define unit tests makes this pattern a non-starter for anything save the most trivial use cases. So I decided to use the traditional method for defining UDFs with Snowflake, which ruled out Scala, so…

Just when I thought I was out…

The great thing about Snowflake is how easy it is to integrate polyglot languages into a unified application. It makes no difference if I write DataFrames in Scala but then pair them with UDFs written in Java. I would have preferred to write UDFs in Scala, but based on my tests and the documentation, only the Snowpark inline type is supported with Scala. So… we meet again, Java.

Java, my old friend

It needs to be easy to develop and test Java applications even if they are deployed to Snowflake as UDFs. Using Gradle, we can build shaded JAR files with dependencies using the shadow plugin, and I’ve provided a sample project that demonstrates this basic use case. I used the sample code from Danny Bryant’s blog post on Java UDFs and Streamlit for these examples (with permission), which he originally built without Gradle:

gh repo clone stewartbryson/gradle-snowflake
cd gradle-snowflake/examples/simple-jar
./gradlew build
ls -ltr build/libs/
total 4200
-rw-r--r-- 1 stewartbryson staff 1018 Sep 12 13:15 internal-stage-0.1.0.jar
-rw-r--r-- 1 stewartbryson staff 2144484 Sep 12 13:15 internal-stage-0.1.0-all.jar

This command builds two JARs: one with dependencies and one without; for Snowflake, we are interested in the shaded internal-stage-0.1.0-all.jar artifact. This JAR would still have to be uploaded to a stage in Snowflake, and the UDF would have to be created or possibly recreated if its signature changed. This all feels so unnatural. I want the deployment of Java code to Snowflake to feel as natural as it does when developing libraries and applications for application servers or containers.
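
The development and testing half is already natural, because the UDF logic is plain Java. The handler used throughout these examples is a Sample class with an addNum method; a minimal sketch of what that class might look like (my approximation of the sample code, with the method name taken from the handler configuration shown later):

public class Sample {
    // Snowflake maps UDF calls onto this method via the
    // handler setting 'Sample.addNum' configured later in the article
    public String addNum(int num1, int num2) {
        return String.valueOf(num1 + num2);
    }
}

And because it’s plain Java, testing it requires no Snowflake connection at all (assuming JUnit 5 on the test classpath):

import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;

class SampleTest {
    // runs locally with ./gradlew test, long before any deployment
    @Test
    void addNumReturnsSum() {
        assertEquals("3", new Sample().addNum(1, 2));
    }
}

It’s the deployment half that needs to catch up, which is where the plugin comes in.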

Enter the Gradle Snowflake Plugin


Everything looks like a nail when you have a hammer, and my hammer tends to be Gradle plugins. This new plugin provides easy configuration options for those getting started with Gradle but also advanced features for teams already using Gradle in other areas of the organization. It has three basic modes:

  1. Lightweight publishing to internal Snowflake stages using Snowpark.
  2. Slightly heavier publishing using external Snowflake stages and auto-configuration of the maven-publish plugin.
  3. Publishing to Snowflake using external stages and manual configuration of the maven-publish plugin.

Have a look at the API docs.

[Image: Thor: God of Thunder, Issue #10, July 17, 2017, Jason Aaron and Esad Ribic.]

Internal Stages using Snowpark

Unless you have a heavy investment in Gradle as an organization, this is likely the option you want to use. Additionally, if you plan on sharing UDFs across Snowflake accounts, this is the option you must use, as JARs need to be in named internal stages. Look at the sample project and you’ll notice a few differences in the build file. We applied io.github.stewartbryson.snowflake and removed com.github.johnrengelman.shadow because the shadow plugin is automatically applied by the snowflake plugin:

plugins {
    id 'java'
    id 'com.github.ben-manes.versions' version '0.42.0'
    id 'io.github.stewartbryson.snowflake' version '0.2.11'
}

We now have the snowflakePublish task available:

cd ../internal-stage
./gradlew help --task snowflakePublish
> Task :help
Detailed task information for snowflakePublish
Path
     :snowflakePublish
Type
     SnowflakePublish (io.github.stewartbryson.SnowflakePublish)
Options
     --account     Override the URL of the Snowflake account.
     --database    Override the Snowflake database to connect to.
     --jar         Optional: manually pass a JAR file path to upload instead of relying on Gradle metadata.
     --password    Override the Snowflake password to connect with.
     --role        Override the Snowflake role to connect with.
     --schema      Override the Snowflake schema to connect with.
     --stage       Override the Snowflake stage to publish to.
     --user        Override the Snowflake user to connect as.
     --warehouse   Override the Snowflake warehouse to connect with.
Description
A Cacheable Gradle task for publishing Java-based applications as UDFs to Snowflake.
Group
publishing
BUILD SUCCESSFUL in 597ms
1 actionable task: 1 executed

Several command-line options mention overriding other configuration values. This is because the plugin provides a configuration extension called snowflake that we can use to configure our build so that we don’t have to pass command-line options. This extension is documented in the SnowflakeExtension API:

snowflake {
    // provided in my local gradle.properties file
    // account = <snowflake account url>
    // user = <snowflake user>
    // password = <snowflake password>
    role = 'stewart_role'
    database = 'stewart_db'
    schema = 'developer'
    stage = 'upload'
    applications {
        add_numbers {
            inputs = ["a integer", "b integer"]
            returns = "string"
            handler = "Sample.addNum"
        }
        // add additional functions or procedures
    }
}

Notice that I’m not hard-coding sensitive credentials. Instead, they live in my local gradle.properties file, and any of the plugin configuration values can be provided this way, or through any of the other mechanisms Gradle supports for project properties in CI/CD pipelines:

# local ~/.gradle/gradle.properties
snowflake.url = https://myorg.snowflakecomputing.com:443
snowflake.user = myusername
snowflake.password = mypassword

The nested applications closure might seem a bit daunting, but it’s simply a DSL for configuring all the different UDFs we want created (or recreated) each time we publish the JAR file. The example above will generate and execute the statement:

CREATE OR REPLACE function add_numbers (a integer, b integer)
returns string
language JAVA
handler = 'Sample.addNum'
imports = ('@upload/libs/internal-stage-0.1.0-all.jar')

Notice that we did not have to specify the imports portion of the statement; the plugin automatically generates this. With our configuration complete, we can execute the snowflakePublish task, which will run any unit tests and then publish our JAR and create our function:

./gradlew snowflakePublish

> Task :snowflakePublish
File internal-stage-0.1.0-all.jar: UPLOADED
Deploying ==>
CREATE OR REPLACE function add_numbers (a integer, b integer)
returns string
language JAVA
handler = 'Sample.addNum'
imports = ('@upload/libs/internal-stage-0.1.0-all.jar')
BUILD SUCCESSFUL in 10s
3 actionable tasks: 3 executed

Our function now exists in Snowflake:

select add_numbers(1,2);
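
And since Snowpark is what started this journey, the function is equally callable from a DataFrame session. A quick sketch in Snowpark Java, assuming a local session.properties file holding the connection settings:

import com.snowflake.snowpark_java.Session;

public class VerifyUdf {
    public static void main(String[] args) {
        // build a session from a local properties file of connection settings
        Session session = Session.builder()
                .configFile("session.properties")
                .create();

        // the published UDF behaves like any built-in function
        session.sql("select add_numbers(1, 2)").show();
    }
}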

The snowflakePublish task was written to be incremental and cacheable. If we run the task again without making any changes to task inputs (our code) or outputs, the execution is avoided, which we can see from the up-to-date label in the output:

./gradlew snowflakePublish

BUILD SUCCESSFUL in 624ms
3 actionable tasks: 3 up-to-date

Auto-configuration of maven-publish with External Stages

This option is useful when you want your artifacts available to consumers other than just Snowflake without publishing them to disparate locations. Gradle has built-in support for S3 as a Maven repository, and Snowflake has support for S3 external stages, so we simply marry the two in a single location. Looking at the sample project, notice we’ve populated a few additional properties:

snowflake {
    // provided in my local gradle.properties file
    // account = <snowflake account url>
    // user = <snowflake user>
    // password = <snowflake password>
    // publishUrl = <S3 bucket and path>
    role = 'stewart_role'
    database = 'stewart_db'
    schema = 'developer'
    stage = 'maven'
    groupId = 'io.github.stewartbryson'
    artifactId = 'sample-udfs'

    applications {
        add_numbers {
            inputs = ["a integer", "b integer"]
            returns = "string"
            handler = "Sample.addNum"
        }
    }
}

The groupId and artifactId, plus the built-in version property that exists for all Gradle builds, provide the Maven coordinates for publishing externally to S3. I've also created a property in my local gradle.properties file for the bucket:

# local ~/.gradle/gradle.properties
snowflake.publishUrl = s3://myrepo/release

The plugin doesn’t create the stage, but it does check that the Snowflake stage metadata matches the value in publishUrl.
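
If the stage doesn’t exist yet, we create it once ourselves, pointed at the same location; an illustrative statement, with credentials elided:

CREATE STAGE maven
URL = 's3://myrepo/release'
CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...');

We get a few new tasks added to our project: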

cd ../external-stage
./gradlew tasks --group publishing
> Task :tasks

------------------------------------------------------------
Tasks runnable from root project 'external-stage'
------------------------------------------------------------
Publishing tasks
----------------
generateMetadataFileForSnowflakePublication - Generates the Gradle metadata file for publication 'snowflake'.
generatePomFileForSnowflakePublication - Generates the Maven POM file for publication 'snowflake'.
publish - Publishes all publications produced by this project.
publishAllPublicationsToMavenRepository - Publishes all Maven publications produced by this project to the maven repository.
publishSnowflakePublicationToMavenLocal - Publishes Maven publication 'snowflake' to the local Maven repository.
publishSnowflakePublicationToMavenRepository - Publishes Maven publication 'snowflake' to Maven repository 'maven'.
publishToMavenLocal - Publishes all Maven publications produced by this project to the local Maven cache.
snowflakePublish - A Cacheable Gradle task for publishing Java-based applications as UDFs to Snowflake.
To see all tasks and more detail, run gradle tasks --all
To see more detail about a task, run gradle help --task <task>

BUILD SUCCESSFUL in 600ms
1 actionable task: 1 executed

These are granular tasks for building metadata and POM files and publishing them, along with the artifacts, to S3. But the snowflakePublish task initiates all of these dependent tasks, including publishSnowflakePublicationToMavenRepository, which uploads the artifact for us but, unfortunately, doesn’t provide console output to that effect:

./gradlew snowflakePublish --console=plain
> Task :compileJava
> Task :processResources NO-SOURCE
> Task :classes
> Task :shadowJar
> Task :compileTestJava NO-SOURCE
> Task :processTestResources NO-SOURCE
> Task :testClasses UP-TO-DATE
> Task :test NO-SOURCE
> Task :generatePomFileForSnowflakePublication
> Task :publishSnowflakePublicationToMavenRepository
> Task :snowflakePublish
Deploying ==>
CREATE OR REPLACE function add_numbers (a integer, b integer)
returns string
language JAVA
handler = 'Sample.addNum'
imports = ('@maven/io/github/stewartbryson/sample-udfs/0.1.0/sample-udfs-0.1.0-all.jar')
BUILD SUCCESSFUL in 12s
5 actionable tasks: 5 executed

Manual configuration of maven-publish with External Stages

For organizations that already use maven-publish extensively, or have customizations outside the scope of auto-configuration, the plugin supports disabling auto-configuration:

snowflake {
    // additional configs
    groupId = 'io.github.stewartbryson'
    artifactId = 'sample-udfs'
    useCustomMaven = true
}

We then configure publications and repositories as described in the maven-publish documentation, and add task dependencies for the snowflakePublish task. The publishUrl property is no longer required because it's configured in the publications closure, but if provided, the plugin will ensure it matches the metadata for the stage property. groupId and artifactId are still required so that snowflakePublish can autogenerate the imports section of the CREATE OR REPLACE... statement.
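
For illustration, a sketch of what that manual configuration might look like, assuming the shadow plugin’s shadowJar task produces the artifact and IAM authentication for the bucket (the publication name, repository name, and URL here are mine, not prescribed by the plugin):

publishing {
    publications {
        snowflake(MavenPublication) {
            groupId = 'io.github.stewartbryson'
            artifactId = 'sample-udfs'
            // publish the shaded JAR produced by the shadow plugin
            artifact shadowJar
        }
    }
    repositories {
        maven {
            name = 'mvn'
            // must resolve to the same location as the Snowflake stage
            url = 's3://myrepo/release'
            authentication {
                awsIm(AwsImAuthentication)
            }
        }
    }
}

// ensure the artifact is in S3 before the UDF is created
snowflakePublish.dependsOn 'publishSnowflakePublicationToMvnRepository'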

Roadmap: feature requests and PRs welcome

I already have a few enhancements in mind, including support for GCS external stages, automation of ephemeral testing environments using Snowflake cloning, and support for Scala (if and when Snowflake supports this). I’d love to see broader adoption of this plugin, so feel free to submit feature requests and add pull requests for those features if you can. You can also raise a GitHub issue if you just have a general question.

UPDATE: Support for GCS external stages and ephemeral testing environments has been added, and other JVM languages, including Scala, Groovy, and Kotlin, are now supported as well.

It takes a village.
