Snowflake Solution Anti-Patterns: Spark is My Hammer

Hadoop is Not a Cloud Data Warehouse

John Aven
Hashmap, an NTT DATA Company
3 min read · Jan 30, 2020


Spark has long been the answer for processing big data in memory, but is that still true? In the world of big data, Spark was the data engineer's answer to the processing demands of the Hadoop ecosystem: loading large amounts of data into memory and performing SQL and dataflow-style operations on it. Its predecessor, MapReduce, required data to be read from and written to disk many times between processing stages.

However, as Snowflake has taken on a much larger part of the overall data and analytics market, we need to reevaluate the place of Spark in the stack.

Is Spark a Long-Term Solution?

The question comes down to how transformation logic for Snowflake will be implemented and whether Spark is an appropriate tool. Here are the common reasons Spark is used:

  1. We have developed skillsets around Spark and have existing solutions in this space that we don’t want to rewrite.
  2. Spark developers are in high demand.
  3. We need in-memory processing to get optimal performance.
  4. We can boost performance with query pushdown to Snowflake.
  5. Data is in many locations and we need to use these sources in our transformations.

Well, there are many issues with these assumptions:

  1. Spark requires highly specialized skills, whereas ELT solutions are heavily reliant on SQL skills — much easier to fill these roles.
  2. ELT solutions are also much easier to maintain and are more reliable; they run on Snowflake’s compute and Snowflake manages the run configurations.
  3. The Snowflake Spark connector pipes data through a stage (in and out), which, while temporary, adds an extra step to the processing pipeline.
  4. Query pushdown works only when the query plan contains no Spark UDFs, and anyone who has worked with Spark knows that UDFs are very common. When a UDF appears, the data is copied out of Snowflake and into Spark's memory anyway (see the sketch after this list).
  5. If you have data in multiple systems, then you may need to adjust your cloud data strategy. Consider moving the data that is necessary for the transformations into Snowflake as well.
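
To make points 3 and 4 concrete, here is a minimal PySpark sketch. The table name, connection options, and credentials are placeholders, and it assumes the Snowflake Spark connector is on the Spark classpath; treat it as an illustration, not a reference implementation.

    # Minimal PySpark sketch (placeholder table and credentials) showing how a
    # Python UDF blocks query pushdown to Snowflake. Assumes the Snowflake
    # Spark connector package is available on the Spark classpath.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("pushdown-demo").getOrCreate()

    sf_options = {
        "sfURL": "<account>.snowflakecomputing.com",
        "sfUser": "<user>",
        "sfPassword": "<password>",
        "sfDatabase": "<database>",
        "sfSchema": "<schema>",
        "sfWarehouse": "<warehouse>",
    }

    orders = (
        spark.read.format("net.snowflake.spark.snowflake")
        .options(**sf_options)
        .option("dbtable", "ORDERS")  # hypothetical table
        .load()
    )

    # Filters and aggregates like this can be pushed down and executed inside
    # Snowflake; Spark only receives the (small) result set.
    open_by_region = orders.filter(orders.STATUS == "OPEN").groupBy("REGION").count()

    # Wrapping the same kind of logic in a Python UDF disables pushdown: every
    # row is staged out of Snowflake and copied into executor memory first.
    normalize = udf(lambda s: s.strip().upper() if s else s, StringType())
    with_udf = orders.withColumn("REGION_NORM", normalize(orders.REGION))

Note that the UDF here does nothing that UPPER(TRIM(region)) could not do in SQL, entirely on Snowflake's compute; that trade is exactly what the ELT options below make.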

Our Suggestion for Snowflake

Don’t rely upon Spark workloads as a long-term solution. When you already have significant investment in Spark and are migrating to Snowflake, have a strategy in place to move from Spark to a Snowflake-centric ELT solution.

There are various options here:

  1. Use of Stored Procedures and Scheduling/Orchestration (Tasks, Airflow, etc.).
  2. dbt (data build tool) with orchestration/scheduling; consider their cloud offering.
  3. Orchestration tool with SnowSQL queries (a minimal sketch follows this list).
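
As a sketch of what option 3 can look like, the snippet below uses the snowflake-connector-python package to run a single ELT statement from an orchestration script; the schema, table, and credential names are hypothetical. The same statement could just as easily live in a stored procedure fired by a Snowflake Task (option 1) or in a dbt model (option 2).

    # Minimal ELT sketch using snowflake-connector-python
    # (pip install snowflake-connector-python). All names are placeholders.
    # The transformation itself runs entirely on Snowflake's compute.
    import snowflake.connector

    TRANSFORM_SQL = """
    INSERT INTO analytics.orders_clean
    SELECT order_id,
           UPPER(TRIM(region)) AS region,
           amount
    FROM raw.orders
    WHERE status = 'OPEN'
    """

    conn = snowflake.connector.connect(
        account="<account>",
        user="<user>",
        password="<password>",
        warehouse="<warehouse>",
        database="<database>",
    )
    try:
        # Snowflake does all the work; this script only issues the statement.
        conn.cursor().execute(TRANSFORM_SQL)
    finally:
        conn.close()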

I realize that there are times when Snowflake + Spark is a logical solution and possibly a ‘best’ solution. These cases, however, should be the exception rather than the rule. Know when it is best and don’t use it as a hammer for all of your data transformation needs.

Need Snowflake Cloud Data Warehousing and Migration Assistance?

If you’d like additional assistance in this area, Hashmap offers a range of enablement workshops and consulting service packages and would be glad to work through your specifics.

Feel free to share on other channels, and be sure to keep up with all new content from Hashmap by following our Engineering and Technology Blog.


John Aven, Ph.D., is the Director of Engineering at Hashmap, providing Data, Cloud, IoT, and AI/ML solutions and consulting expertise across industries with a group of innovative technologists and domain experts accelerating high-value business outcomes for our customers. Be sure to connect with John on LinkedIn and reach out for more perspectives and insight into accelerating your data-driven business outcomes.
