We Just Cut 85% of Our Data Streaming Pipelines Cost! (Part 1)

Yigal Pinhasi
AppsFlyer Engineering
8 min read · Jan 18, 2022

Handling the streaming of 150 billion events per day can be a pretty tough task that raises some complex production issues.

Join me on my streaming journey, which began with some takeaways from our previous streaming model and ended with building a new streaming solution in production that solves most of our issues and cut 85% of our streaming costs along the way.

For the past three years, I’ve been working at AppsFlyer as a Data Engineer — designing and building data pipelines, creating new streaming models, and pretty much eating, drinking, and breathing data.

AppsFlyer has grown from 30 billion to 150 billion new daily incoming events. And we’re still counting… 😁

AppsFlyer is a leading mobile attribution solution for measuring the effectiveness of mobile campaigns, starting with ad impressions and click events and moving on to SDK mobile application events. For our clients, who are mobile app owners, we take this raw data and turn it into unique insights, predictions, and fraud awareness.

In this post, I will walk you through the old streaming solution and examine the classic Lambda architecture’s strengths and flaws. We will go over some performance issues we had, point out their root causes, and see why we could not tackle them successfully within the existing pipeline design.

In a follow-up post, I will share the new streaming solution, based on the Delta Lake framework, and demonstrate how we were able to cut our streaming costs dramatically!

So without further ado, let’s start the show.

What Went Wrong With Our Classic Lambda Architecture Model?

Handling both batch and streaming data is a very common challenge in big data. The classic Lambda architecture is a well-known pattern that handles both batch and streaming processing, providing a unified view over the two data sources.

Having said that, in recent years, as our traffic kept growing, we also came across growing pains, frustrations, and cost issues with our classic streaming model. At some point, we understood that if we wanted to stop the teeth grinding and jaw clenching, we needed to come up with a better way to handle our data streaming pipelines.

Classic Lambda Architecture

In order to show our clients insights from the past 30 minutes, days, weeks, months, and years, we need to combine two kinds of data pipelines — historical data produced by batch processing, and real-time data produced by real-time processing.

The classic Lambda architecture provides a model that handles both batch processing and real-time streaming pipelines. Each pipeline writes its data into a separate data store.

The following diagram shows our old AppsFlyer streaming pipeline, which complies with the classic Lambda architecture model.

All incoming events are filtered, enriched, and classified into Kafka topics and delivered to Kafka consumers.

At the top of the diagram above, we can see the real-time streaming pipeline, which converts the raw data and writes it into the ClickHouse DB cluster.
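As a rough illustration of that real-time branch, here is a minimal Spark Structured Streaming sketch that reads one of the enriched Kafka topics and writes each micro-batch into ClickHouse over JDBC. The topic name, bootstrap servers, JDBC URL, table, and driver class are all placeholders for the example, not our actual configuration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object RealTimeClickhouseWriter {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("realtime-clickhouse-writer")
      .getOrCreate()

    // Read the already filtered/enriched events from one Kafka topic
    // (topic name and bootstrap servers are hypothetical).
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka-cluster-1:9092")
      .option("subscribe", "enriched-events")
      .load()
      .selectExpr(
        "CAST(key AS STRING)   AS event_key",
        "CAST(value AS STRING) AS event_json",
        "timestamp")

    // Write every micro-batch into ClickHouse with the regular JDBC batch writer.
    // URL, table, and driver class are placeholders; adjust them to the ClickHouse
    // JDBC driver version actually in use.
    val query = events.writeStream
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        batch.write
          .format("jdbc")
          .option("url", "jdbc:clickhouse://clickhouse-host:8123/analytics")
          .option("dbtable", "realtime_events")
          .option("driver", "com.clickhouse.jdbc.ClickHouseDriver")
          .mode("append")
          .save()
      }
      .option("checkpointLocation", "s3://my-bucket/checkpoints/realtime-clickhouse")
      .start()

    query.awaitTermination()
  }
}
```

foreachBatch is used here simply because Spark has no built-in streaming ClickHouse sink, so each micro-batch is written with the regular batch JDBC writer.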

At the bottom, we can see the batch pipeline, which is divided into two phases:

  • Data Lake streaming: we write the events into the Data Lake under the corresponding partition day (a minimal sketch follows this list).
  • Product output: every morning we run a daily batch process for each product that reads the entire previous day’s events, runs the product’s business logic, and adds the previous day’s aggregates to the historical data store.
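The Data Lake streaming phase can be sketched along the same lines: a streaming job that appends the raw events to the Data Lake under a day partition, so the next morning’s batch job only has to read the previous day’s partition. The paths, topic name, and partition column below are assumptions, and the plain Parquet sink is just a stand-in for the sketch; the exact file format is not the point of the illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_date}
import org.apache.spark.sql.streaming.Trigger

object DataLakeStreamingWriter {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("datalake-streaming-writer")
      .getOrCreate()

    // Consume the raw events (topic and servers are hypothetical) and derive
    // the day partition from the event timestamp.
    val rawEvents = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka-cluster-1:9092")
      .option("subscribe", "raw-events")
      .load()
      .selectExpr("CAST(value AS STRING) AS event_json", "timestamp")
      .withColumn("partition_day", to_date(col("timestamp")))

    // Append the events to the Data Lake under the corresponding day partition,
    // so the daily batch job only needs to read yesterday's partition.
    rawEvents.writeStream
      .format("parquet") // stand-in file format for the sketch
      .option("path", "s3://my-data-lake/raw-events")
      .option("checkpointLocation", "s3://my-data-lake/checkpoints/raw-events")
      .partitionBy("partition_day")
      .trigger(Trigger.ProcessingTime("5 minutes"))
      .outputMode("append")
      .start()
      .awaitTermination()
  }
}
```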

We use different data stores for batch and real-time streaming because, although we like working with Druid for holding our aggregated historical data, Druid did not handle our real-time raw data ingestion well in terms of performance. We could, however, revisit this decision with future versions of Druid.

As illustrated below, after some time we saw a significant improvement in writing the raw data into the Data Lake, once we started using the Delta Lake framework for those writes.
More on that later in the post…

Lambda Architecture Flaws

So, while it seemed like everything was going smoothly with our streaming model, under the hood several issues kept coming up.

The First Issue: How Do We Handle Errors at the Streaming Pipelines?
Anyone who works on a streaming product quickly understands the urgency factor, especially when an issue arrives at 1 AM. Every issue becomes a Die Hard issue that you need to fix right away; there is no time to wait for tomorrow morning.

On the batch processing side, though, things are typically much smoother. If there’s a problem, you can just restart the process, and issues can wait a few hours to be fixed. So if we try to create a Work-Life Balance (WLB) comparison between living with the streaming pipelines vs. the batch pipelines, it might look something like this:

The error handling policy on the real-time streaming products became a real pain for any development team that consumed raw data from Kafka, since every issue that arose at the front end (raw data processing/Kafka) became an annoying issue for all Kafka consumers.

Whether it was an initiated update, such as deploying a new Kafka cluster version, or a P0 alert on data lag, issues with the Kafka cluster would create a session of several hours dedicated to analyzing the problem: freezing or resuming the data streaming, changing some code, and verifying that the streaming was back to normal.

The Second Issue: Losing Our Single Source of Truth

The Lambda architecture model basically duplicates the raw data stream into a real-time streaming pipeline and a batch pipeline.

Every raw-data processing, validation, or conversion step that is done on the batch side and not on the streaming side creates a situation where we don’t have a Single Source of Truth for the raw data shared by the real-time streaming and batch Data Lake pipelines.

Performance and Cost Issues

As we encountered PagerDuty (PD) lag alerts over the streaming topics, we had to increase the size of the Hadoop cluster that runs the Spark Streaming processing jobs in order to keep the lag within a reasonable limit.

For example, for one major streaming product, we had to increase its Hadoop cluster size over the last year from 295 to 480 machines, as illustrated in the figure below.

What Could Be the Reason for the Sharp Increase in the Hadoop Resource Size?

One reason for the increase was the amount and rate of raw data entering AppsFlyer each day. The incoming raw data throughput kept increasing, as we can see below.

But the daily traffic rate is not the end of the story; sometimes we need to zoom in on the daily traffic rate pattern. This is because a sharp spike in the traffic rate over a window of a few hours could also trigger a PD alert, which is handled by increasing the Hadoop cluster size.

So, How Can We Improve Our Streaming Resource Utilization?

I have some ideas on this, so let’s first go over the following questions and see whether they can help us improve our resource usage.

  1. How many Spark jobs should we use for the streaming product?
    For each Kafka cluster, we have to create a separate Spark job. We are using six Kafka clusters to consume 17 raw data topics, meaning we need six streaming Spark jobs! (See the sketch after this list.)
  2. Can we run auto-scaling over each Spark job?
    Nope. Spark Structured Streaming is resource-hungry and demanding. Our experience shows that whenever we used an auto-scaling policy, Spark Structured Streaming took more machines under pressure but did not release them once the pressure was gone.
  3. Can we use overall load balancing for the six Spark jobs?
    Nope. We can try to run load balancing over the six Spark jobs, but a few of them could get into a starvation state.
    Within one Spark job that consumes a few Kafka topics, we can run local load balancing, but this is not good enough!
  4. Is there a correlation between traffic peaks and resource consumption?
    The short answer is yes. We can draw a straight line between the continuously increasing incoming traffic and its peaks, and the amount of Hadoop resources required to handle them.
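To make questions 1 to 3 a bit more concrete, here is a minimal sketch of what one of those six streaming jobs might look like. It is an illustration only: the topic names, bootstrap servers, and configuration values are assumptions, and the dynamic-allocation settings shown are the standard Spark knobs we experimented with rather than a recommendation, since in our experience the executors grabbed at peak time were not released afterwards.

```scala
import org.apache.spark.sql.SparkSession

object PerClusterStreamingJob {
  def main(args: Array[String]): Unit = {
    // One Spark job per Kafka cluster: kafka.bootstrap.servers points at a single
    // cluster, so the topics hosted on the other five clusters need their own jobs.
    val spark = SparkSession.builder()
      .appName("streaming-job-kafka-cluster-1")
      // Standard dynamic-allocation knobs we experimented with (values are illustrative).
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.shuffle.service.enabled", "true")
      .config("spark.dynamicAllocation.minExecutors", "20")
      .config("spark.dynamicAllocation.maxExecutors", "120")
      .config("spark.dynamicAllocation.executorIdleTimeout", "120s")
      .getOrCreate()

    // "Local" load balancing is limited to the topics this single job subscribes to.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka-cluster-1:9092")
      .option("subscribe", "raw-installs,raw-clicks,raw-impressions") // topics on this cluster only
      .load()

    // The product processing and sinks would go here; a console sink keeps the sketch runnable.
    events.writeStream
      .format("console")
      .start()
      .awaitTermination()
  }
}
```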

The Last Issue: The PD Alert Burden

Over the last year, we have seen an increasing PD alert burden on development teams.

The PD alerts are the consequences of some of the issues that we just discussed, including:

  • Frontend pipeline issues: Any issue at the front end of the pipeline, whether a P0 alert or planned maintenance on a Kafka cluster, demanded the attention of all Kafka consumer products for a time window that could take hours to resolve.
  • Increasing traffic rate and peaks: We also noticed an increasing rate of lag alerts over streaming topics, where the lag exceeded two hours.

Each lag alert required the attention of the PD on-call shift, including:

  • Exploring and figuring out the root cause of the lag issue.
  • Deciding on an action item, i.e., which resource to increase:
    • Code change: the number of executor instances (a minimal sketch follows this list).
    • DevOps: the Hadoop cluster size.
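For completeness, here is what the “code change” option roughly amounts to: bumping the executor-related settings of the lagging job. The values below are illustrative only and are not our production numbers.

```scala
import org.apache.spark.sql.SparkSession

// The "code change" remediation boils down to raising the executor-related settings
// of the lagging streaming job. The numbers below are illustrative placeholders.
object BumpedStreamingJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("streaming-job-kafka-cluster-1")
      .config("spark.executor.instances", "80") // bumped from a lower fixed count
      .config("spark.executor.cores", "4")
      .config("spark.executor.memory", "8g")
      .getOrCreate()

    // ... the streaming pipeline itself stays unchanged ...
    spark.stop()
  }
}
```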

This issue created a burden on the development teams that handle streaming, adding frustration and fatigue on top of it.

Conclusion

In my follow-up post, I’ll show you how our new streaming solution — based on using the Delta Lake framework — helped us cut streaming costs dramatically and solve most of the issues we just described here.
