We Just Cut 85% of Our Data Streaming Pipelines Cost! (Part 2)

Yigal Pinhasi
AppsFlyer Engineering
6 min read · Jan 31, 2022

It’s the Delta Lake Way or the Highway

To recap my last post: we concluded that we had several major issues with our existing streaming model and that we wanted to come up with a new, improved streaming solution.

So, What Exactly Would We Like to Achieve With This New Model?

Here are a few of the goals we are hoping to accomplish:

  • Generate better Hadoop resource utilization.
  • Use fewer resources, which translates into lower costs.
  • Create a system that is more tolerant of increased traffic rates.
  • Create a Single Source of Truth for the raw data, used to serve both batch and real-time streaming pipelines.
  • Reduce the burden on the team:
  • Less need to be aware of every issue in the pipeline (raw data/Kafka).
  • Fewer PD lag alerts and less complicated handling of streaming issues.

Where Should We Begin With the New Streaming Solution?

In 2020, our Data Infrastructure team started using the Delta Lake framework to handle writing the raw data into the Data Lake.

The new Data Lake traffic was refined with the following mapping changes:

  • Only the attribution dimensions were mapped into the Parquet files.
  • The data footprint was reduced by ~90%.
  • The data structure in the Parquet files was changed from hierarchical to flattened.

And most importantly, we used the Delta Lake framework!

“Delta Lake is an open source storage layer that brings reliability to data lakes …”

Utilizing Data Lake Based on the Delta Lake Framework

The Delta Lake framework is a great package to put on top of Spark, and for our story it brings significant advantages and new capabilities when combined with Spark Structured Streaming.

Delta Lake is deeply integrated with Spark Structured Streaming. It provides reliable streaming, can combine streaming and batch processing, and treats each table as a single source used for both batch processing and streaming.
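As a minimal sketch of that single-source behavior (the table path is a hypothetical placeholder), the same Delta table can be read both as a static batch DataFrame and as a streaming source:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("delta-single-source-sketch")
  .getOrCreate()

// Hypothetical location of the processed raw data in the Data Lake
val rawDataPath = "s3://data-lake/raw-events"

// Batch processing: read the table as a regular, static DataFrame
val batchDf = spark.read
  .format("delta")
  .load(rawDataPath)

// Real-time streaming: the very same table is read as a streaming source
val streamDf = spark.readStream
  .format("delta")
  .load(rawDataPath)
```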

New Model Design

Based on Delta Lake’s capability to serve as a single source for batch processing and reliable streaming, we decided to move forward with the Data Lake, built on the Delta Lake framework, as our streaming source for both batch and streaming processes.

Data Pipeline Error Handling Policy

The new design model holds some significant advantages, the first being the pipeline’s error handling policy.

With the old streaming model, each failure on the front-end side required the attention of every software team that owned streaming products.

Moving forward with the new model, there is only one consumer of Kafka. As such, the only team that should be aware of failures in Kafka and raw data processing is the Data Lake team.

Streaming Service Error Handling

Another issue we had to address was the error handling policy within the streaming service itself.

Each product holds two kinds of processing that are unified into one combined view:

  • Batch processing, which runs at the end of the day, reads the entire previous day and adds it to the historical data.
  • Real-time streaming, which reads the streaming data.

Every morning we close the previous day: we run the batch processing update to add it to the historical data and remove that 24-hour window from the real-time view.

So essentially we read a sliding window of up to 36 hours from the real-time view, and all the rest of the data is fetched from the historical data.
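A rough sketch of those two views, assuming hypothetical paths, an event_time column, and the spark session from the sketch above; the serving layer then merges them, taking everything older than the window from the historical data:

```scala
import org.apache.spark.sql.functions._

// Hypothetical paths and column names, for illustration only
val historicalPath = "s3://data-lake/product/historical"
val realTimePath   = "s3://data-lake/raw-events"

// Historical view: everything up to and including the last closed day
val historicalDf = spark.read.format("delta").load(historicalPath)

// Real-time view: only the sliding window since the start of the previous day
// (up to ~36 hours, depending on the time of day)
val realTimeDf = spark.readStream
  .format("delta")
  .load(realTimePath)
  .where(col("event_time") >= date_sub(current_date(), 1))
```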

This perspective leads us to the conclusion that the streaming service’s main goal is freshness, which dictates the following failure handling policy (sketched in code right after this list):

  • Ignore corrupted files
  • Ignore missing files
  • Restart streaming on known exceptions
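A hedged sketch of what this policy can translate to in Spark terms; the config choices, the restart loop, and the isKnownTransientError helper are illustrative assumptions rather than our exact production code, and spark is the session from the first sketch:

```scala
import org.apache.spark.sql.streaming.{StreamingQuery, StreamingQueryException}

// Freshness over completeness: skip bad or vanished files instead of failing
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")

// Hypothetical classifier of "known" transient failures
def isKnownTransientError(e: Throwable): Boolean =
  Option(e.getMessage).exists(_.contains("FileNotFoundException"))

// Keep the streaming query alive: restart it on known exceptions
def runWithRestarts(startQuery: () => StreamingQuery): Unit = {
  while (true) {
    val query = startQuery()
    try {
      query.awaitTermination()
    } catch {
      case e: StreamingQueryException if isKnownTransientError(e) =>
        println(s"Restarting stream after a known failure: ${e.getMessage}")
    }
  }
}
```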

Single Source of Truth

The new model architecture provides us with a Single Source of Truth, as our processed and filtered raw data serves both the batch and real-time streaming flows.
Any further change to the raw data in the Data Lake phase is transparent to both pipelines, and there is no need to duplicate raw-data changes in the batch and real-time streaming pipelines in order to maintain the SSOT.

One Streaming Source — One Love, One Heart

Moving to Delta Lake streaming could be a game-changer.
We now have one Data Lake streaming source containing all of the raw data topics, and in theory we can consume all of them within a single Spark job, without any restrictions.

Running Structured Streaming over a Kafka cluster is limited to handling only the topics hosted on that specific cluster.
Obviously, handling 150 billion daily events (and still counting :) ) requires distributing the raw data over several Kafka clusters, and that distribution is a work in progress, since once in a while we may need to change the topic layout.
Bottom line: when consuming from Kafka clusters, we have to run several separate Spark Streaming jobs.
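A simplified comparison of the two approaches; the broker addresses, topic names, and Delta path below are hypothetical, and spark is the session from the first sketch:

```scala
// Old model: one Structured Streaming job per Kafka cluster,
// each limited to the topics hosted on that specific cluster
val clusterAStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "kafka-cluster-a:9092")
  .option("subscribe", "installs,clicks")
  .load()

// New model: a single streaming job over the Delta-based Data Lake,
// covering all of the raw-data topics at once
val allTopicsStream = spark.readStream
  .format("delta")
  .load("s3://data-lake/raw-events")
```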

This flexibility, the ability to control the number of Spark Streaming jobs, could make a big difference in our streaming utilization and performance results.

Stay tuned …

So, What Were the Results?

After a lot of hard coding, deploying, and testing of the new streaming model, we measured the minimum amount of streaming resources it needed on the Hadoop cluster and saw a dramatic improvement.

We were quite surprised to see that the new model was much more tolerant of increases in traffic rate and peaks than the old model was.

We noticed that while we kept adding machines for the old streaming model, we used the same 70 machines for the new streaming model without getting any lag alerts.

The only downside was freshness.
Rather than a few minutes of latency, we saw between 5 and 20 minutes of lag, since both the Data Lake writer and the real-time processing use a streaming window of 5–10 minutes.
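For context, this is roughly where those extra minutes come from: each hop runs as a micro-batch with a trigger of several minutes, so the lags add up. A sketch with an illustrative 5-minute trigger, hypothetical paths, and the spark session from the first sketch:

```scala
import org.apache.spark.sql.streaming.Trigger

// Real-time consumer reading the Data Lake (which is itself written in
// micro-batches), so end-to-end freshness is roughly the sum of both intervals
val query = spark.readStream
  .format("delta")
  .load("s3://data-lake/raw-events")
  .writeStream
  .format("delta")
  .option("checkpointLocation", "s3://checkpoints/real-time-view")
  .trigger(Trigger.ProcessingTime("5 minutes"))
  .start("s3://data-lake/real-time-view")
```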

We Obviously Made a Massive Improvement, But What Were the Keys to Success?

We can say that it is a combination of two major changes:

Performance improvement

We saw a massive improvement with the new Data Lake mapping:

  • Clean data: reduced by ~95%.
  • Flattened data: much easier to load than hierarchical data.
  • A move from Kafka JSON to the Parquet columnar format (a 1:10 Spark parsing benchmark); see the sketch after this list.
  • Migration to Spark 3 and Delta Lake.
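To illustrate the Kafka-JSON versus Parquet point: with JSON, every Kafka value has to be parsed row by row, while the columnar Data Lake lets Spark scan only the referenced columns. The schema, topic, and paths below are hypothetical and heavily trimmed, and spark is the session from the first sketch:

```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Hypothetical, heavily trimmed event schema
val eventSchema = new StructType()
  .add("app_id", StringType)
  .add("event_time", TimestampType)

// Old model: each Kafka value is a JSON blob that must be parsed per event
val parsedFromKafka = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "kafka-cluster-a:9092")
  .option("subscribe", "installs")
  .load()
  .select(from_json(col("value").cast("string"), eventSchema).as("event"))
  .select("event.*")

// New model: Parquet/Delta is columnar, so only the needed columns are scanned
val fromDataLake = spark.readStream
  .format("delta")
  .load("s3://data-lake/raw-events")
  .select("app_id", "event_time")
```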

Resource utilization

We used only one Spark job, which gives us full load balancing over the 19 raw topics. This was a key factor in reducing the overall number of resources and achieving much better resource utilization.

Let’s Sum it Up: What We Managed to Achieve

  • A massive performance and resource utilization improvement.
  • A massive cost reduction: we saved up to 85% of our streaming costs.
  • Simplification of the Lambda architecture model.
  • Creation of a Single Source of Truth for the raw data sets.
  • No need for product development teams to be aware of failures in the Kafka/raw data pipelines.
  • A model that tolerates increasing traffic and peak rates.
  • Improved team happiness :)

And finally and perhaps most importantly, we saved the company a lot of money…
