Snowpark melts under the heat of real-world requirements

Scott Eade
6 min read · Jun 6, 2022


Snowflake is a poor choice for a modern data platform; Snowpark does not even come close to ticking the box for data science.

Snowflake is trying to convince us that it is not just The Data Warehouse Built for the Cloud, but rather that it should be adopted as The Data Cloud. This is an ambitious leap as, from their own marketing material, it means they now also need to cover data engineering, data lake, data science, data applications and data sharing. Much of Snowpark represents Snowflake’s effort to extend into these areas, largely by imitating fragments of the powerful and flexible techniques that have existed for some time in open-source Apache Spark™. Let’s examine what Snowpark provides, how it can be used, and consider whether it is fit for purpose under the harsh light of real-world production requirements.

Under the Snowpark banner, Snowflake have so far provided:

  • “programming language constructs for building SQL statements”: a DataFrame construct that uses lazy execution. The approach resembles a small subset of the available Spark APIs. Snowpark DataFrames can be created in Scala and Java, with Python support coming soon. When lazy execution of a DataFrame is triggered, the Snowpark client code converts the accumulated operations to SQL and sends that SQL to the configured virtual warehouse to execute.
  • the ability to create user-defined functions (UDFs) that can be packaged up, pushed over to Snowflake via a stage, and then registered for use in virtual warehouses. Once these steps have been completed the functions can be called from SQL. Snowpark touts UDFs written in Java and Scala, with Python again promised soon. (A short sketch of both capabilities follows this list.)
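
To make this concrete, here is a minimal sketch of both capabilities using the Snowpark Scala client. The table, columns, UDF and connection details are illustrative placeholders rather than anything taken from Snowflake’s own examples.

```scala
import com.snowflake.snowpark._
import com.snowflake.snowpark.functions._

object SnowparkSketch {
  def main(args: Array[String]): Unit = {
    // Connection properties are placeholders.
    val session = Session.builder.configs(Map(
      "URL"       -> "https://<account>.snowflakecomputing.com",
      "USER"      -> "<user>",
      "PASSWORD"  -> "<password>",
      "WAREHOUSE" -> "ANALYTICS_WH",
      "DB"        -> "SALES",
      "SCHEMA"    -> "PUBLIC"
    )).create

    // Building the DataFrame is lazy: nothing has been sent to the warehouse yet.
    val bigOrders = session.table("ORDERS")
      .filter(col("AMOUNT") > lit(100))
      .select(col("ORDER_ID"), col("AMOUNT"))

    // An anonymous scalar UDF: Snowpark packages the code, pushes it to a stage
    // and registers it so that the generated SQL can call it.
    val withTax = udf((amount: Double) => amount * 1.1)

    // Only this action compiles the whole pipeline to SQL and runs it on the
    // configured virtual warehouse.
    bigOrders.withColumn("AMOUNT_WITH_TAX", withTax(col("AMOUNT"))).show()
  }
}
```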

Of course these capabilities have existed for some time in Spark, which itself also supports SQL, so is there now some level of equivalence between the two? The short answer is no. Snowflake have a long way to go before any level of equivalence to Spark could rightly be claimed, and there are architectural differences that make it unlikely they will get there any time soon.

Snowflake have a long way to go before any level of equivalence to Apache Spark could rightly be claimed and there are architectural differences that make it unlikely they will get there any time soon.

Let’s first recall that Spark was originally conceived as a means of processing vast quantities of data at speed by leveraging in-memory processing across clusters of compute resources. SQL is a natural fit for data access, and the development of Spark SQL in many ways echoed the way Hive put SQL in front of Java MapReduce and, later, Spark itself. Let’s take this one step further and suggest that the DataFrame support in Spark was inspired by the underlying processing needed to support SQL. The DataFrame API arrived way back in Spark 1.3, backed by the Catalyst optimiser, which brought new levels of performance for processing data using DataFrames and, as a consequence, for Spark SQL as well. Contrast this with Snowpark, which provides a client API supporting yet another (i.e. incompatible) implementation of DataFrames, along with UDFs, all of which results in SQL that is pushed across to a warehouse to execute.
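
A small Spark sketch in Scala illustrates the Catalyst point: the DataFrame API and Spark SQL are two front-ends over the same optimiser, so the two queries below compile down to essentially the same optimised plan. The path and table names are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()

// Register some data as a view so it can be queried both ways.
spark.read.parquet("s3://bucket/orders").createOrReplaceTempView("orders")

val viaDataFrame = spark.table("orders")
  .filter(col("amount") > 100)
  .groupBy("region")
  .count()

val viaSql = spark.sql(
  "SELECT region, count(*) AS count FROM orders WHERE amount > 100 GROUP BY region")

// Both print essentially the same optimised physical plan produced by Catalyst.
viaDataFrame.explain()
viaSql.explain()
```

Here are just some of the points of differentiation: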

  • Spark can use DataFrames for processing requirements other than SQL, and that processing can be optimised at the DataFrame level or lower. Snowflake has no choice but to invert these abstraction layers: while they do have a reasonably performant SQL processing engine, they are now hacking at it to call out to UDFs written in other languages. Not only does this compromise the SQL engine, it also scatters the opportunities for optimisation across the SQL engine, the interface to other languages, the other-language code itself and the client code.
  • The Spark cluster architecture couples a driver node with a potentially varying number of worker nodes. The driver node is where the Spark application code runs and where tasks for the worker nodes are created, scheduled and controlled. The Snowflake virtual warehouse, and hence the Snowpark architecture, leaves a void as to where client code should execute; their demonstrations and examples mention spinning up additional non-Snowflake infrastructure such as a “small virtual machine”, or running code directly from developers’ laptops. This should not be required of a PaaS offering and is certainly not enterprise scale or production grade. Taking this a step further, where should this client code be created, stored, maintained, tested, versioned, governed and secured? Enabling a DataFrame API and UDFs is one thing, but the omission of robust DevOps processes and code execution environments shows how nascent the Snowpark offering is. Competitors to Snowflake typically provide collaborative code development environments; the introduction of “choice” here means work for their customers.
Snowflake warehouse sizes with credits per hour (table from Overview of Warehouses). This table used to show that there was one server per credit; there is no automatic scaling of warehouse size.
  • Autoscaling in Spark allows the processing capacity of a cluster to track what a data pipeline actually needs, adding worker nodes when processing demands it and removing them when they are no longer required. Autoscaling in Snowflake is only about concurrency: when query volume pushes a warehouse to capacity, additional warehouses of the same size are spun up. This is of no help to a data pipeline with varying processing needs, so a Snowflake warehouse is more likely to be sized for the heaviest portion of processing and sit idle (while still incurring costs) for the rest of the pipeline. (A configuration sketch contrasting the two follows this list.)
Chart: deployed and active worker nodes over time when autoscaling is enabled in an advanced Spark PaaS environment.
  • With respect to scale, it is also relevant to point out that in the Spark API a DataFrame spans the worker nodes in a cluster. On top of this, Spark 3.2 includes a pandas-compatible API, allowing users experienced with pandas to push their processing beyond a single machine without having to learn a new API. The Snowpark demonstrations and examples more often than not limit themselves to data that can be collected back to the machine executing the client code, limiting both data volumes and the overall utility of the approach (see the sketches after this list). Spark is helping data scientists expand to larger data volumes; Snowflake has spent a significant amount of time enabling retrieval of data that would be more easily retrieved via a JDBC driver.
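
On the autoscaling point, the contrast is easiest to see in configuration. The Spark properties below are standard dynamic-allocation settings; the Snowflake statement in the comment is illustrative and only controls how many same-sized clusters a warehouse may run, never the size of those clusters.

```scala
import org.apache.spark.sql.SparkSession

// Spark dynamic allocation: executors are requested while work is pending and
// released when idle, so cluster capacity follows the pipeline's actual needs.
val spark = SparkSession.builder
  .appName("autoscaling-demo")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "2")
  .config("spark.dynamicAllocation.maxExecutors", "50")
  .getOrCreate()

// Snowflake's nearest equivalent is a multi-cluster warehouse (SQL, illustrative):
//   ALTER WAREHOUSE etl_wh SET MIN_CLUSTER_COUNT = 1 MAX_CLUSTER_COUNT = 3;
// Extra clusters of the same size are added only under concurrent query load;
// the warehouse size (XS, S, M, ...) never changes automatically.
```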
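
And on the question of scale, here is a sketch of the difference between keeping work distributed and collecting it back to one machine. A Spark DataFrame is partitioned across the workers, so results can be written out in parallel; pulling everything back to the client, as most Snowpark examples do, caps the data volume at what a single machine can hold. Paths are illustrative, and the pandas-compatible API mentioned above (pyspark.pandas) gives the same distributed behaviour behind a pandas-style interface.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("scale-demo").getOrCreate()

// This DataFrame is partitioned across the cluster's worker nodes.
val features = spark.read.parquet("s3://bucket/events")
  .groupBy("user_id")
  .count()

// Distributed output: each worker writes its partitions in parallel, so the
// result can be far larger than any single machine's memory.
features.write.mode("overwrite").parquet("s3://bucket/features")

// By contrast, features.collect() would pull every row back to the driver,
// the pattern most Snowpark examples lean on, which limits usable data volume.
```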

While imitation is the sincerest form of flattery, and there is no getting away from the fact that Snowpark is imitating Apache Spark, there is also no getting away from the fact that Snowpark is at this point in time a primitive offering that is in no way suitable for enterprise or production deployment.

Don’t let Snowflake fool you into ticking the data engineering and data science boxes when evaluating data platforms.

I have provided an overview of the current Snowpark preview offering and examined just a small number of differentiators that illustrate why it is not fit to be labeled enterprise or production ready. There are myriad additional issues and missing pieces in relation to data science and machine learning; perhaps good food for a follow-up post. In the meantime, don’t let Snowpark fool you into ticking the data engineering and data science boxes.

Snowpark is an attempt by a single company to rebuild something that already exists in a completely open-source form.

Setting aside all of the above, the number one argument against Snowpark is that Snowflake are rebuilding an incompatible version of something that already exists in a completely open-source form. Snowflake’s version is proprietary, and all innovation depends on a single organization rather than crowd-sourced contributions to an open-source implementation. Closed, proprietary platforms stifle the pace of innovation, and Snowflake exemplifies this by working to reproduce what has already been done rather than implementing new, forward-thinking capabilities.

Snowpark is one of many reasons why Snowflake is not a good bet when choosing a data platform.


Scott Eade

I have been in data for more years than I am prepared to admit. My SnowPro Core Certification has lapsed.