Snowpark meets Azure DevOps to enable automation of building data pipelines

Snowpark and Java UDFs are the Snowflake Data Cloud's new development-experience features, supporting Scala as well as Java and Python for many workloads, including data engineering and data science. Snowpark does not use any Spark clusters underneath; instead, it leverages the Snowflake Data Cloud's powerful virtual warehouses, which scale compute elastically, making it an excellent alternative to Spark. Like Spark, Snowpark uses DataFrame APIs and lazy execution. Unlike with Spark, developers spend less time tuning job performance and more time on activities that provide business value, such as implementing data quality checks, CI/CD, and observability around data pipelines.
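To make the lazy-execution point concrete, here is a tiny sketch (the table and column names are placeholders): the transformations only build up a query plan, and nothing runs in Snowflake until an action such as show() or collect() is called.

```scala
import com.snowflake.snowpark.{DataFrame, Session}
import com.snowflake.snowpark.functions.col

object LazyExample {
  // These calls only build a query plan; nothing is executed in Snowflake yet.
  def usCustomers(session: Session): DataFrame =
    session
      .table("CUSTOMER")                            // placeholder table
      .filter(col("C_COUNTRY") === "United States") // placeholder column
      .select(col("C_NAME"), col("C_COUNTRY"))

  // Only an action such as show() or collect() makes Snowpark generate SQL
  // and push it down to a Snowflake virtual warehouse for execution.
  def preview(session: Session): Unit = usCustomers(session).show()
}
```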

Snowpark is a great option for teams adopting DevOps principles for building data pipelines: it allows for change control; automated building, testing, and deployment of code; and the promotion of software engineering best practices such as modularity and reusability. Azure DevOps, on the other hand, is a top-notch DevOps offering for the Azure cloud community for implementing end-to-end CI/CD pipelines. Azure DevOps Services offer all the capabilities needed to integrate Agile practices into DataOps, such as project management, source control, artifact management, secret management, and test automation, as well as CI/CD pipelines that automate release cycles. Azure DevOps also offers a plethora of plugins in its Marketplace (mostly free), which is super cool.

Photo by Lance Asper on Unsplash

In a real-world DataOps implementation, you would need to define a CI/CD process with a branching strategy, multiple environments (DEV, TEST, QA, UAT, PROD), and deployment/release methodologies, which I will not cover in this post. (Snowflake's Professional Services (PS) team, as well as many of our wonderful SI partners, can also help architect these solutions and define these processes.) I will, however, walk through the steps of implementing a simple DataOps pipeline using the Snowpark Scala API and Azure DevOps to get you started.

First, I wrote the following Scala code example to ingest a sample customer dataset and apply some transformations using Snowpark. (Snowpark documentation can be found here.) I also developed a unit test for one of the transformation functions so I can run my unit tests in my build pipeline every time there is a code change. I checked these files into a repository called snowpark_demo in my Azure DevOps project.

Below is simple Scala code that loads my customer data from a stage:

LoadData.scala
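As a rough sketch (not the exact contents of the repository), LoadData.scala might look something like the following; the stage name, file path, schema, and environment-variable names are all illustrative assumptions:

```scala
import com.snowflake.snowpark.{SaveMode, Session}
import com.snowflake.snowpark.types.{StringType, StructField, StructType}

object LoadData {

  // Builds a Snowpark session from environment variables supplied by the
  // CI/CD pipeline (the variable names here are assumptions; use your own).
  def createSession(): Session =
    Session.builder
      .configs(Map(
        "URL"       -> sys.env("SNOWFLAKE_URL"),
        "USER"      -> sys.env("SNOWFLAKE_USER"),
        "PASSWORD"  -> sys.env("SNOWFLAKE_PASSWORD"),
        "ROLE"      -> sys.env("SNOWFLAKE_ROLE"),
        "WAREHOUSE" -> sys.env("SNOWFLAKE_WAREHOUSE"),
        "DB"        -> sys.env("SNOWFLAKE_DATABASE"),
        "SCHEMA"    -> sys.env("SNOWFLAKE_SCHEMA")
      ))
      .create

  def main(args: Array[String]): Unit = {
    val session = createSession()

    // Hypothetical schema for the sample customer file sitting in a stage.
    val customerSchema = StructType(Seq(
      StructField("C_CUSTOMER_ID", StringType),
      StructField("C_NAME", StringType),
      StructField("C_COUNTRY", StringType)
    ))

    // Read the staged CSV file and materialize it as the CUSTOMER table.
    val customers = session.read
      .schema(customerSchema)
      .option("field_delimiter", ",")
      .option("skip_header", 1)
      .csv("@customer_stage/customer.csv") // stage and path are assumptions

    customers.write.mode(SaveMode.Overwrite).saveAsTable("CUSTOMER")
  }
}
```

The createSession() helper reads the Snowflake connection settings from environment variables, which is how the pipeline variables defined later in this post reach the code without being hard-coded.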

After we load the data into a CUSTOMER table, the TransformData object applies some transformations to the customer data.

TransformData.scala
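Again as a sketch with assumed column and table names, TransformData might expose the aggregation as a standalone applyGroupBy() function (so it can be unit tested) plus a main() that writes the result to a CUSTOMERS_BY_COUNTRY table:

```scala
import com.snowflake.snowpark.{DataFrame, SaveMode}
import com.snowflake.snowpark.functions.{col, count}

object TransformData {

  // Aggregates customers per country. Kept as a pure DataFrame-to-DataFrame
  // function so it can be unit tested in isolation.
  def applyGroupBy(customers: DataFrame): DataFrame =
    customers
      .groupBy(col("C_COUNTRY"))
      .agg(count(col("C_CUSTOMER_ID")).as("CUSTOMER_COUNT"))

  def main(args: Array[String]): Unit = {
    val session = LoadData.createSession() // reuse the session helper above

    applyGroupBy(session.table("CUSTOMER"))
      .write
      .mode(SaveMode.Overwrite)
      .saveAsTable("CUSTOMERS_BY_COUNTRY")
  }
}
```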

Here is a simple unit test for the applyGroupBy() function. I am using ScalaTest to write the unit tests in this example.

TransformerTest.scala
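A minimal ScalaTest sketch of such a test is shown below. Because Snowpark DataFrames execute inside Snowflake, the test still needs a live session (built here with the same hypothetical createSession() helper); the test data, column names, and expected counts are illustrative.

```scala
import com.snowflake.snowpark.Row
import com.snowflake.snowpark.functions.col
import com.snowflake.snowpark.types.{StringType, StructField, StructType}
import org.scalatest.funsuite.AnyFunSuite

class TransformerTest extends AnyFunSuite {

  test("applyGroupBy counts customers per country") {
    // Snowpark has no local execution mode, so the test uses a real session
    // built from the same pipeline variables as the application code.
    val session = LoadData.createSession()

    // Small in-memory DataFrame standing in for the CUSTOMER table.
    val schema = StructType(Seq(
      StructField("C_CUSTOMER_ID", StringType),
      StructField("C_NAME", StringType),
      StructField("C_COUNTRY", StringType)
    ))
    val customers = session.createDataFrame(
      Seq(
        Row("1", "Alice", "United States"),
        Row("2", "Bob", "United States"),
        Row("3", "Carol", "Canada")
      ),
      schema
    )

    val result = TransformData.applyGroupBy(customers)
      .sort(col("C_COUNTRY"))
      .collect()

    assert(result.length == 2)
    assert(result(0).getString(0) == "Canada")
    assert(result(0).getLong(1) == 1L)
    assert(result(1).getString(0) == "United States")
    assert(result(1).getLong(1) == 2L)
  }
}
```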

I also added a build.sbt, a plugins.sbt, and an azure-pipelines.yml to my Azure Repo; the last of these is essentially my pipeline-as-code for building and deploying in Azure DevOps. (A minimal build.sbt sketch follows the screenshots below.)

build.sbt
plugins.sbt
azure-pipelines.yml
Azure Repos folder structure
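As a rough sketch of the build side (the versions are placeholders, and plugins.sbt and azure-pipelines.yml are not sketched here), a build.sbt for this kind of project can stay very small: it only needs the Snowpark library plus ScalaTest for the 'sbt test' step.

```scala
// build.sbt: a minimal sketch; versions are illustrative placeholders
name := "snowpark-demo"
version := "0.1.0"
scalaVersion := "2.12.15" // the Snowpark Scala API targets Scala 2.12

libraryDependencies ++= Seq(
  // Snowpark Scala API
  "com.snowflake" % "snowpark" % "1.6.2",
  // ScalaTest, used by the 'sbt test' step
  "org.scalatest" %% "scalatest" % "3.2.14" % Test
)
```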

After you check all of these files into Azure Repos, following the same folder structure shown in the screenshots above, click on Pipelines in the left sidebar of Azure DevOps, select the pipeline named after your repository (snowpark-demo, in my case), click the Edit button, and then click the Variables button to define all the Snowflake connection variables, with the password stored as a secret in Azure Key Vault. You can find the instructions for Azure Key Vault integration here. The sbt commands in azure-pipelines.yml use these variables instead of hard-coded values.

Defining the Snowflake connection variables in the Azure DevOps pipeline

You can also create different variables for each environment (Dev, QA, PROD) in your implementation.

Checking code into the branch automatically triggers my CI/CD pipeline in Azure DevOps in this simple example. In real-world implementations, a multi-branch strategy with pull requests (and peer reviews/approvals) is the recommended practice before merging code into the main branch.

When I click on Pipelines in the left sidebar, I can see that my CI pipeline is running.

We can click on the pipeline to see all the steps defined in azure-pipelines.yml (building, testing, packaging, and executing the Snowpark code) run.

Once the execution is complete, we can click on the 'sbt test' step to see that the unit test passed. (Otherwise, the whole CI pipeline would have failed.)

sbt test output

The subsequent steps, which create a jar file and execute the two Scala classes, also complete successfully.

sbt package output

In the ‘sbt run’ step, we can also see the SQL generated by Snowpark in the logs:

sbt run output for loading data with Snowpark
sbt run output for transforming data with Snowpark

Simple?

Very! Now, every time I check code changes into my repository, my CI/CD pipeline runs the unit tests and then executes the code that loads and transforms the data and builds the tables in Snowflake. I can query the CUSTOMER and CUSTOMERS_BY_COUNTRY tables (built by Snowpark) in the Snowflake UI, as shown below:

CUSTOMER table that is created by Snowpark
CUSTOMERS_BY_COUNTRY table that is created by Snowpark

There are many other ways to implement DataOps pipelines with Snowflake. I hope this simple example of building CI/CD pipelines with Snowpark and Azure DevOps helps you get started.

Snowpark truly enables Analytics teams to add software engineering practices to their development and release processes in the Snowflake Data Cloud.

Thank you for taking the time to read this blog post.


Eda Johnson
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

NVIDIA | AWS Machine Learning Specialty | Azure | Databricks | GCP | Snowflake Advanced Architect | Terraform certified Principal Product Architect