Bring Your Code to the Snowflake Data Cloud with Snowpark

George Luft
Hashmap, an NTT DATA Company
Jun 9, 2021 · 5 min read

Snowflake Summit 2021 is here! Amid all of the guest speakers and customer showcases, Snowflake will be rolling out some new features that are currently in private preview. Two of these are Snowpark and Java UDFs. Both will enable the use of Snowflake’s elastic and scalable compute resources for the processing of large datasets without having to pull high volumes of data back and forth over the wire into local or virtual compute clusters.

Snowflake’s announcement in November 2020 stated that Snowpark is “a new developer experience that will allow data engineers, data scientists, and developers to write code in their languages of choice, using familiar programming concepts, and then execute workloads such as ETL/ELT, data preparation, and feature engineering on Snowflake. This simplifies an organization’s IT architecture by bringing more data pipelines into Snowflake’s single, governed core data platform. By doing so, data professionals seamlessly leverage the scalability, performance, security, and near-zero maintenance benefits of Snowflake. Snowpark will enable developers to leverage existing skillsets, improve team productivity, reduce cost with fewer systems in a customer’s architecture, and extend Snowflake’s capabilities for additional data engineering and data science use cases. Snowpark is currently available in testing environments only.”

The Snowpark API has been implemented in Scala. This means that existing Scala applications can be modified to take advantage of this Snowflake compute option with minimal rewriting of code while also saving on potential egress charges.
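
To make the point concrete, here is a minimal sketch of what a standalone Scala application using Snowpark might look like. The connection properties file, table, and column names below are placeholders rather than anything from a real project:

import com.snowflake.snowpark._
import com.snowflake.snowpark.functions._

object SnowparkQuickstart {
  def main(args: Array[String]): Unit = {
    // snowflake.properties (a hypothetical file) holds URL, USER, ROLE, WAREHOUSE, DB, and SCHEMA
    val session = Session.builder.configFile("snowflake.properties").create

    // A DataFrame backed by a table in Snowflake; building it pulls no data to the client
    val customers = session.table("CUSTOMERS")
    customers.select(col("C_NAME"), col("C_MKTSEGMENT")).show(10)
  }
}

The builder-and-DataFrame style mirrors what Spark developers already know, which is what keeps the rewrite effort small.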

The power of DataFrames and UDFs

Traditionally, developers have used Scala or Python to operate on Spark DataFrames. Snowpark brings a similarly rich set of methods for manipulating DataFrames, which can be augmented with User Defined Functions (UDFs) written in several programming languages. Beyond the existing JavaScript and SQL UDFs, you can now upload, register, and call Java UDFs. Additionally, anonymous UDFs can be written inline in Scala, in the same language as the rest of your Snowpark code.
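
As a rough sketch of that last point, assuming an active session and an illustrative ORDERS table, an anonymous Scala UDF can be declared inline and applied to a DataFrame column like this:

import com.snowflake.snowpark._
import com.snowflake.snowpark.functions._

val session = Session.builder.configFile("snowflake.properties").create
val orders = session.table("ORDERS")

// Anonymous Scala lambda wrapped as a UDF; Snowpark ships it to Snowflake
// and runs it next to the data rather than on the client
val withTax = udf((amount: Double) => amount * 1.08)

orders
  .withColumn("AMOUNT_WITH_TAX", withTax(col("ORDER_AMOUNT")))
  .show()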

These new UDF options allow developers to continue to use the same third-party libraries and packages with which they are already familiar, yet now this code can be executed by the Snowflake compute warehouse adjacent to the data itself.

Snowpark DataFrame operations are rendered as SQL by the Snowpark API and pushed down to the database. DataFrames are evaluated lazily, so the work is executed inside Snowflake and only the requested results are transferred back to the client application.
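
A small sketch of that pushdown behavior, again with illustrative table and column names, shows that nothing executes until an action such as show or collect is called, and the whole chain goes to Snowflake as a single query:

import com.snowflake.snowpark._
import com.snowflake.snowpark.functions._

val session = Session.builder.configFile("snowflake.properties").create

val orders = session.table("SALES.PUBLIC.ORDERS")  // no data pulled yet
val bigOrders = orders
  .filter(col("ORDER_AMOUNT") > 1000)              // still just building a query
  .select(col("ORDER_ID"), col("CUSTOMER_ID"), col("ORDER_AMOUNT"))

// Only here is SQL generated and run in the warehouse; only these rows come back
bigOrders.show(10)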

So, what can I do with it?

With the simplicity and power of DataFrames in mind, you can leverage the Snowpark API to write data engineering applications or even do exploratory analysis using its intuitive table-like structures with the strong data typing of Scala.

As a long-time Snowflake Partner, we were given early-bird access to the API, and here’s how we are putting it to use:

Performing the ‘T’ of ELT

The most typical interaction with a data warehouse is performing OLAP operations. Until now, staying inside Snowflake meant relying on JavaScript stored procedures for much of the complex processing, while using an external language like Python meant pulling data into memory before it could be processed. Snowpark removes that trade-off: the same processing can be expressed in Scala, using its programming constructs directly against DataFrames, with the entire transformation executed at the source.
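
Here is a hedged sketch of what that 'T' step can look like, using hypothetical RAW and CURATED schemas and column names; the join, aggregation, and write all run inside Snowflake:

import com.snowflake.snowpark._
import com.snowflake.snowpark.functions._

val session = Session.builder.configFile("snowflake.properties").create

val orders = session.table("RAW.ORDERS")
val customers = session.table("RAW.CUSTOMERS")

// Express the transformation with DataFrame constructs; Snowpark compiles it down to SQL
val revenueBySegment = orders
  .join(customers, orders("CUSTOMER_ID") === customers("CUSTOMER_ID"))
  .groupBy(customers("MARKET_SEGMENT"))
  .agg(sum(orders("ORDER_AMOUNT")).as("TOTAL_REVENUE"))

// The result lands directly in a curated table without the data ever leaving Snowflake
revenueBySegment.write.mode(SaveMode.Overwrite).saveAsTable("CURATED.REVENUE_BY_SEGMENT")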

Data preparation and exploration as a pre-step to ML model training

Another set of data interactions is data exploration, profiling, and similar exercises in a more collaborative environment such as a Jupyter Notebook. We have set up an Almond kernel to enable Scala within Jupyter and used the Snowpark API to interact with data on the fly.
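
A notebook cell in the Almond kernel might look roughly like the sketch below; the artifact coordinates and version are illustrative, so check Maven Central for the current Snowpark release:

// Pull the Snowpark library into the Almond session (Ammonite-style import)
import $ivy.`com.snowflake:snowpark:0.6.0`

import com.snowflake.snowpark._
import com.snowflake.snowpark.functions._

val session = Session.builder.configFile("snowflake.properties").create
val txns = session.table("RAW.TRANSACTIONS")

// Lightweight profiling runs in the warehouse; only small result sets reach the notebook
println(s"Row count: ${txns.count()}")
txns.groupBy(col("TXN_TYPE")).count().show()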

Integration with the Hashmap MLOps Framework

We have been able to integrate Snowpark into Hashmap’s MLOps framework, replacing data processing clusters with Snowpark API calls. This generalized pattern applies to financial services use cases such as fraud detection and credit risk scoring.

Within this framework, the Data Discovery and Model Generation phases of the process can now utilize large datasets directly within Snowflake for activities such as data preparation and data quality testing, as well as model training and evaluation. And because this same code does not have to be reworked into another programming language, it can also be carried forward into model deployment and use, whether in batch or real-time transactional mode.

What’s in it for me?

There are many benefits to this approach. Utilizing Snowpark eliminates the need to maintain servers or clusters of servers, so there is no hardware infrastructure to administer. Declarative programming methods can be used instead of hand-coded, complex SQL statements. Functions can be implemented to foster code reuse.

Snowpark aligns with Hashmap’s 7S design principles:

  • Simple: Ease of development and testing via custom packages and intelligent IDE assistance
  • Speedy: Faster time-to-market driven by shorter development time for data applications
  • Sustainable: Leveraging the existing Scala/Java skill base (same tribe, better tools)
  • Self-serve: Analysts, data scientists, & engineers can perform ad hoc processing utilizing familiar tools
  • Secure: Data operations occur inside Snowflake, leveraging its out-of-the-box security features
  • Scalable: Workloads executed on Snowflake compute warehouse for virtually unlimited scaling
  • Savings: Zero infrastructure setup and consumption-based pricing

We are excited for Summit and the release of Snowpark and are looking forward to new opportunities to extend Snowflake workloads across data applications, data engineering, and data science.

Ready to Accelerate Your Digital Transformation?

At Hashmap, we work with our clients to build better, together.

If you’re considering moving data and analytics products and applications to the cloud or if you would like help and guidance and a few best practices in delivering higher value outcomes in your existing cloud program, please contact us.

Hashmap, an NTT DATA Company, offers a range of enablement workshops and assessment services, cloud modernization and migration services, and consulting service packages as part of our Cloud service offerings. We would be glad to work through your specific requirements. Reach out to us here.

George Luft is a Regional Technical Expert and Cloud/Data Engineer at Hashmap, an NTT DATA Company, and provides Data, Cloud, IoT, and AI/ML solutions and expertise across industries with a group of innovative technologists and domain experts accelerating high-value business outcomes for our customers.
