Snowflake Solution Anti-Patterns: Spark is My Hammer

Hadoop is Not a Cloud Data Warehouse

John Aven
Hashmap, an NTT DATA Company
3 min read · Jan 30, 2020


Spark has long been the answer for processing big data in memory, but is that still true? In the world of big data, Spark was the data engineer's answer to the processing demands of the Hadoop ecosystem: loading large amounts of data into memory and performing SQL and dataflow-style operations on it. Its predecessor, MapReduce, required data to be read from and written to disk many times between processing stages.

However, as Snowflake has taken on a much larger part of the overall data and analytics market, we need to reevaluate the place of Spark in the stack.

Is Spark a Long-Term Solution?

The question comes down to how transformation logic for Snowflake will be implemented and whether Spark is an appropriate tool. Here are the common reasons Spark is used:

  1. We have developed skillsets around Spark and have existing solutions in this space that we don’t want to rewrite.
  2. Spark developers are in high demand.
  3. We need in-memory processing to get optimal performance.
  4. We can boost performance with query pushdown to Snowflake.
  5. Data is in many locations and we need to use these sources in our transformations.

Well, there are many issues with these assumptions:

  1. Spark requires highly specialized skills, whereas ELT solutions are heavily reliant on SQL skills — much easier to fill these roles.
  2. ELT solutions are also much easier to maintain and are more reliable; they run on Snowflake’s compute and Snowflake manages the run configurations.
  3. The Snowflake Spark connector pipes data through a stage (in and out), which, while temporary, adds an extra step to the processing pipeline.
  4. Query pushdown works only when the query plan contains no Spark UDFs, and anyone who has worked with Spark knows that UDFs are very common. When a UDF appears, the data is copied out of Snowflake and into Spark's memory anyway (see the sketch after this list).
  5. If you have data in multiple systems, then you may need to adjust your cloud data strategy. Consider moving the data that is necessary for the transformations into Snowflake as well.
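
To make points 3 and 4 concrete, here is a minimal PySpark sketch. The table name, connection options, and credentials are placeholders, and it assumes the Snowflake Spark connector is on the Spark classpath; treat it as an illustration, not a reference implementation.

    # Minimal PySpark sketch (placeholder table and credentials) showing how a
    # Python UDF blocks query pushdown to Snowflake. Assumes the Snowflake
    # Spark connector package is available on the Spark classpath.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("pushdown-demo").getOrCreate()

    sf_options = {
        "sfURL": "<account>.snowflakecomputing.com",
        "sfUser": "<user>",
        "sfPassword": "<password>",
        "sfDatabase": "<database>",
        "sfSchema": "<schema>",
        "sfWarehouse": "<warehouse>",
    }

    orders = (
        spark.read.format("net.snowflake.spark.snowflake")
        .options(**sf_options)
        .option("dbtable", "ORDERS")  # hypothetical table
        .load()
    )

    # Filters and aggregates like this can be pushed down and executed inside
    # Snowflake; Spark only receives the (small) result set.
    open_by_region = orders.filter(orders.STATUS == "OPEN").groupBy("REGION").count()

    # Wrapping the same kind of logic in a Python UDF disables pushdown: every
    # row is staged out of Snowflake and copied into executor memory first.
    normalize = udf(lambda s: s.strip().upper() if s else s, StringType())
    with_udf = orders.withColumn("REGION_NORM", normalize(orders.REGION))

Note that the UDF here does nothing that UPPER(TRIM(region)) could not do in SQL, entirely on Snowflake's compute; that trade is exactly what the ELT options below make.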

Our Suggestion for Snowflake

Don’t rely upon Spark workloads as a long-term solution. When you already have significant investment in Spark and are migrating to Snowflake, have a strategy in place to move from Spark to a Snowflake-centric ELT solution.

There are various options here:

  1. Use of Stored Procedures and Scheduling/Orchestration (Tasks, Airflow, etc.).
  2. dbt (data build tool) with orchestration/scheduling; consider their cloud offering.
  3. Orchestration tool with SnowSQL queries (a minimal sketch follows this list).
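
As a sketch of what option 3 can look like, the snippet below uses the snowflake-connector-python package to run a single ELT statement from an orchestration script; the schema, table, and credential names are hypothetical. The same statement could just as easily live in a stored procedure fired by a Snowflake Task (option 1) or in a dbt model (option 2).

    # Minimal ELT sketch using snowflake-connector-python
    # (pip install snowflake-connector-python). All names are placeholders.
    # The transformation itself runs entirely on Snowflake's compute.
    import snowflake.connector

    TRANSFORM_SQL = """
    INSERT INTO analytics.orders_clean
    SELECT order_id,
           UPPER(TRIM(region)) AS region,
           amount
    FROM raw.orders
    WHERE status = 'OPEN'
    """

    conn = snowflake.connector.connect(
        account="<account>",
        user="<user>",
        password="<password>",
        warehouse="<warehouse>",
        database="<database>",
    )
    try:
        # Snowflake does all the work; this script only issues the statement.
        conn.cursor().execute(TRANSFORM_SQL)
    finally:
        conn.close()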

I realize that there are times when Snowflake + Spark is a logical solution and possibly a ‘best’ solution. These cases, however, should be the exception rather than the rule. Know when it is best and don’t use it as a hammer for all of your data transformation needs.

Need Snowflake Cloud Data Warehousing and Migration Assistance?

If you’d like additional assistance in this area, Hashmap offers a range of enablement workshops and consulting service packages and would be glad to work through your specifics.

Feel free to share on other channels, and be sure to keep up with all new content from Hashmap by following our Engineering and Technology Blog.


John Aven, Ph.D., is the Director of Engineering at Hashmap, providing Data, Cloud, IoT, and AI/ML solutions and consulting expertise across industries with a group of innovative technologists and domain experts accelerating high-value business outcomes for our customers. Be sure to connect with John on LinkedIn and reach out for more perspectives and insight into accelerating your data-driven business outcomes.
