Should I use Spark? It’s complicated

Alexander Moshan
skai engineering blog
9 min read · Nov 29, 2022

How do you know when you should use Spark?

Apache Spark has positioned itself as one of the leading tools for large-scale data processing. However, when facing a specific data-processing problem, it’s difficult to know whether Spark would be the most suitable solution. This becomes even more challenging if your company doesn’t have much existing experience or infrastructure around Spark. Using Spark requires adjusting your algorithms, learning the library, handling resource management, managing clusters (or paying for platforms to do it for you) and, most importantly, investing developer time. Is it worth the investment? Will it solve your problem?

Skai has recently shifted its focus towards data. We’ve always had a lot of data at Skai — it’s time to use it to generate more meaningful insights for our customers. Most of our ETL flows are simple Airflow DAGs with steps that run simple or complex SQL queries. Very quickly we reached a point where heavy processing was needed in order to generate specific datasets for all our products. That is where our story begins.

In this post I’ll explain the journey we took to solve a seemingly simple problem (that turned out not to be so simple after all), starting with a naive implementation and gradually moving to the full-blown solution implemented in Spark.

The Problem

As part of building our knowledge base over our data, we faced a common online-retail problem of merging product sets across various retailers — i.e. identifying identical products sold on different platforms (e.g. Amazon, Walmart) even though each platform might identify products using a different subset of globally unique identifiers. In our case, each product has three types of IDs: a mandatory per-source ID (“source ID”) and two additional global IDs that are optional but are shared across sources.

The mission was to create a unique ID for each group of product instances that represent a single product spread across sources.

A simple rule was used to unite products: if two products share the same ID of the same type, they are the same product. This rule can in turn create long chains of IDs that keep on merging product instances, e.g.:

  1. Product1 and Product2 are connected via Source ID SID_111
  2. Product3 and Product1 are connected via Global ID 1 GID1_222
  3. Product4 and Product3 are connected via Global ID 2 GID2_222
  4. All the steps above should result in all four products being merged into a single product with one unique ID

The output we’re looking for has the following structure:
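
Every product instance keeps its own identifiers and gains one shared, unified ID per group. For the four products in the example above, the result would look roughly like this (the unified ID value is illustrative):

  Product1 → UNIFIED_001
  Product2 → UNIFIED_001
  Product3 → UNIFIED_001
  Product4 → UNIFIED_001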

The job had to run on a weekly basis, meaning the process could not take more than a week to complete.

Iteration 1: Draft algorithm in Python

With this requirement came a draft algorithm that was written in a Python notebook. It completed processing a small dataset of about 1 million products in about an hour, but it had two issues:

  1. The algorithm was non-deterministic, meaning a different order of the rows resulted in a different grouping, which was obviously unacceptable.
  2. The algorithm wasn’t really scalable: in the worst case it ran in O(N²) time and required O(N) memory, which would perform terribly on our actual datasets.

The algorithm was just a draft, but it gave us a first step towards seeing the bigger picture. With a lot of thought put into it we came up with a new, improved algorithm which was both deterministic and more scalable — with O(N) complexity and O(3N) memory (which is still O(N)). The improved algorithm was inspired by the standard DFS algorithm: it starts from a single product ID and digs its way through the entire chain of IDs until the group is complete. Now all that was left to do was to implement and use it at scale.
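
To make the idea concrete, here is a minimal single-machine sketch of that DFS-style grouping (the names, types and example data are made up for illustration; this is not the notebook code):

```scala
import scala.collection.mutable

// One product instance with its (up to three) identifiers.
case class ProductRecord(key: String, ids: Set[String])

def groupProducts(products: Seq[ProductRecord]): Map[String, Int] = {
  // Index every identifier value to the products that carry it.
  val productsById = mutable.Map.empty[String, mutable.Buffer[ProductRecord]]
  for (p <- products; id <- p.ids)
    productsById.getOrElseUpdate(id, mutable.Buffer.empty) += p

  val groupOf = mutable.Map.empty[String, Int] // product key -> group number
  var nextGroup = 0

  for (start <- products if !groupOf.contains(start.key)) {
    // Depth-first walk across the whole chain of shared identifiers.
    val stack = mutable.Stack(start)
    groupOf(start.key) = nextGroup
    while (stack.nonEmpty) {
      val current = stack.pop()
      for (id <- current.ids; neighbour <- productsById(id)
           if !groupOf.contains(neighbour.key)) {
        groupOf(neighbour.key) = nextGroup
        stack.push(neighbour)
      }
    }
    nextGroup += 1
  }
  groupOf.toMap
}

// For the four products in the example above, all of them end up in one group:
// groupProducts(Seq(
//   ProductRecord("Product1", Set("SID_111", "GID1_222")),
//   ProductRecord("Product2", Set("SID_111")),
//   ProductRecord("Product3", Set("GID1_222", "GID2_222")),
//   ProductRecord("Product4", Set("GID2_222"))
// ))
```

Indexing the products by identifier up front is what keeps every product and every ID visited only a handful of times, which is where the O(N) behaviour comes from. Which group a product ends up in does not depend on the order of the input rows (only the group numbering does), which is what makes the result deterministic.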

Iteration 2: Plain SQL

Since our data is stored in Snowflake and due to the frequent table access required by the algorithm, our first instinct was to implement the algorithm in plain SQL — keeping the computation as close to the data as possible.

Such an approach would also fit right into our ETL flows as just another SQL step in an Airflow DAG. Unfortunately, due to long ID chains and the “loopy” nature of our algorithm, it would require an extremely complex SQL query which we would probably have a hard time reading and maintaining.
The idea of implementing it in plain SQL was therefore dropped, but the hope of using Snowflake as the processing engine was not lost just yet!

Iteration 3: JavaScript stored procedure — in-memory processing

We knew that Snowflake supports writing stored procedures in multiple languages, and decided that JavaScript would be the quickest way to implement the algorithm. We implemented a naive JavaScript stored procedure and tested it on a small table to verify that it did the job correctly.

After a successful test, we started running it on bigger datasets. Within a few hours we started seeing out-of-memory failures: the stored procedure couldn’t handle all the data in memory.

Iteration 4: JavaScript stored procedure — IO-based processing

We realized relying on memory in the function wasn’t going to cut it. Since we were already running within the database, we could just let Snowflake handle the memory while we focused on the processing.

We rewrote the JavaScript function to repeatedly query Snowflake for the data instead of loading it all into memory up front. After successfully verifying its validity on a small dataset, we moved on to testing it on bigger datasets.

This was one tedious part with a lot of waiting. The solution was so slow that eventually Snowflake automatically terminated the process *insert frustration*. Our Snowflake instance is configured so that no single process can run for more than 48 hours, because such a long run usually indicates a wasteful process and is therefore not recommended.
We could change that configuration for a specific user, but we had no idea whether we were close to being done and needed just a few more hours, or so far away that it would take longer than a week.

Iteration 5: JavaScript stored procedure — Bulk processing

We had to understand how close we were to finishing the job. We decided to try processing the data in bulks, which would reduce the number of queries and give us visibility into the process’s progress. The key was to come up with the right bulk size: if the bulk was too small, the processing would take too long and be terminated; if the bulk was too big, we would run out of memory.
Unfortunately, no bulk size was good enough — we were either too slow or consuming too much memory. We had to figure out a different way. But we learned a lot about the nature of our problem and the requirements of our algorithm.

Iteration 6: Spark POC

Separately from our project, a group of architects was leading an effort to enable Spark adoption in the company. We had a staging account in Databricks (a platform for working with Spark that provides automated cluster management) that was being tested with a few POCs.

I met with them to discuss the algorithm, which fortunately lent itself nicely to being distributed across nodes: the way the algorithm works allows each group of IDs to be discovered separately, without sharing any information between the groups. Spark’s distributed architecture of multiple nodes, each running multiple threads, is already abstracted and ready to use, which worked to our advantage. They suggested that Spark was just the thing that could fit our algorithm and solve our problem. It was decided: we would try Spark.

I started teaching myself Spark and writing a Scala application on a local machine, as a POC to verify that Spark could indeed solve our problem. I probably should have taken a course or two in Spark before diving into writing the application and learning Spark as I went, but time was crucial and things had to be done quickly. The result was a lot of Spark problems I learned how to solve on the fly. I wholeheartedly recommend you complete at least a basic course about Spark before you begin to actually develop an application, because Spark uses a unique distribution model that’s hard to understand without some formal introduction.

Here’s a rough representation of what this algorithm looks like when implemented using Spark:
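
The sketch below is illustrative rather than our production code: it expresses the grouping as connected components using Spark’s GraphX module (which needs the spark-graphx dependency), and the schema, the example values and the hashing of product keys into numeric vertex IDs are all simplifications made for the example:

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

object ProductGroupingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("product-grouping-sketch").getOrCreate()
    import spark.implicits._

    // Illustrative input: one row per (product, identifier) pair, covering the
    // source ID and both optional global IDs. In reality this is read from Snowflake.
    val pairs = Seq(
      ("Product1", "SID_111"),  ("Product2", "SID_111"),
      ("Product1", "GID1_222"), ("Product3", "GID1_222"),
      ("Product3", "GID2_222"), ("Product4", "GID2_222")
    ).toDF("product", "id_value")

    // Two products sharing any ID value become an edge; the connected
    // components of the resulting graph are exactly the merged product groups.
    val edges = pairs.as("a").join(pairs.as("b"), "id_value")
      .select($"a.product".as("src"), $"b.product".as("dst"))
      .where($"src" < $"dst")
      .distinct()

    // GraphX needs numeric vertex IDs; hashing the product key is a shortcut
    // taken for this example (a real job must guard against hash collisions).
    val toVertexId = (key: String) => key.hashCode.toLong & 0xFFFFFFFFL
    val vertices = pairs.select($"product").distinct()
      .rdd.map(r => (toVertexId(r.getString(0)), r.getString(0)))
    val edgeRdd = edges.rdd.map(r => Edge(toVertexId(r.getString(0)), toVertexId(r.getString(1)), ()))

    // Each vertex receives the smallest vertex ID in its component, which
    // serves as the unified product ID for the whole group.
    val unified = Graph(vertices, edgeRdd).connectedComponents()
      .vertices.join(vertices)
      .map { case (_, (groupId, product)) => (product, groupId) }
      .toDF("product", "unified_id")

    unified.show()
    spark.stop()
  }
}
```

The important property is the one the architects pointed out: each group can be discovered independently, so Spark can spread the work across the cluster without the groups having to share any state.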

The local Spark POC app was doing well on a selected subset of data — so once again we went on to test our solution at scale.

Iteration 7: Spark over Databricks

We started testing the application in our staging Databricks account on a selected dataset. We immediately had to do some Databricks-specific tuning, such as configuring the clusters and their supported language versions, size, memory, CPU, external credentials, etc., while trying to keep costs low.

After successfully processing staging datasets, it was time to move to the final stage: full production data. The Spark application completed in roughly 60–90 minutes, a very delightful surprise that came in far below the 1-week limit we started with, which in turn changed the requirement: the process would now run daily instead of weekly.
Was that the end?

Iteration 8: Spark fine tuning

Once we had a more-or-less working solution, new requirements and some last-minute changes pushed the execution time to nearly 2 hours. With more potential future enhancements in mind, we decided this wasn’t good enough, as it didn’t leave us enough headroom to keep the 1-day SLA in case volumes increased further.

We decided to go into another optimization cycle and managed to reduce the execution time to 12 minutes — mostly by moving some of the pre-processing into Spark, tuning partitions, and hashing some long string IDs used as comparison keys. A year later, we still use the same configuration and maintain a 15-minute SLA, even though the data has more than doubled in the meantime.
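
As a rough illustration of the last two tweaks, using Spark’s built-in xxhash64 function (the column name and the partition count below are invented for the example, not our production values):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, xxhash64}

// Illustrative only: column name, hash function and partition count are
// placeholders, not the production configuration.
def prepareForGrouping(products: DataFrame): DataFrame =
  products
    // Long string IDs are expensive to shuffle and compare; a 64-bit hash
    // used purely as a comparison/join key keeps the shuffled rows small.
    .withColumn("gid1_hash", xxhash64(col("global_id_1")))
    // Choosing the partition count explicitly (instead of relying on the
    // default) lets the cluster work on evenly sized slices of the data.
    .repartition(64, col("gid1_hash"))
```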

Conclusion

This long process of trial and error wasn’t easy, but it was very educational when it comes to choosing the right tool for a large-scale data-processing job. Here are some of our conclusions:

What are the requirements?

Everything starts with requirements. How fast does it have to be? How much data is involved, and how fast will it grow? Is it a one-time process or will it run on a daily basis? How much are you willing to spend on resources? How many hours of development time can you spare? Will it need to be maintained?
There are plenty of questions you need to answer before you decide on a course of action.

Measurement is key

Always measure, calculate and make estimations based on those measurements. Start small, scale up and see how things work and whether they fit your needs. Try to test under conditions as close as possible to the real thing. What will the final size be? How long will it run? Is your test set indicative of real-world data, or will you need to change your code entirely? All these measurements help you decide whether what you are using will actually solve the problem.

No premature optimization

Optimization and refactoring consume time and effort. If not done properly, the final result might not even be optimized towards the right goals. Make sure you actually need the optimization. With our application completing successfully within 2 hours, we only went into the last optimization cycle after realizing the datasets would grow fast enough to put us dangerously close to our 1-day limit.

Don’t jump on buzzwords

Your company has its own best practices, its own technology stack and its own experience. Multiple teams have faced multiple challenges before, which is a valuable resource on its own. Before you jump on what’s “hot right now”, consider the assets (knowledge, licenses, infrastructure) that already exist in your company. Have people faced similar issues before? Was the scale similar? Which existing technologies might help? How well does your company handle new technologies, and who will maintain them? Will it help others in the future? Every buzzword has a price.

Draw your own conclusions

Starting small and optimizing gradually teaches you a lot about your problem. Use that knowledge to determine best practices for the future. We tested boundaries, learned the limits of our technologies, and now roughly know what problems each technology can solve, so we can advise on future problems and whether a given solution is worth the effort. We have baked some of that knowledge into our own Spark application “template” that resolves multiple technical issues future teams will no longer need to face. We learned a lot, drew our own conclusions, and improved things for the next people in line, and I hope you did too :).
