Yahoo Benchmarks Dataflow vs. Self-Managed Flink Efficiency for Two Streaming Use Cases: Which Is More Cost-Effective?

Ihaffa
Google Cloud - Community
6 min read · May 8, 2024


This blog post is co-authored with Abel Lamjiri, Sr. Principal Software Engineer at Yahoo.

Introduction

While working with Yahoo, we are constantly seeking ways to optimize the efficiency of large-scale streaming data processing pipelines. In a recent project, we benchmarked the cost and performance of two stack choices Yahoo wanted to understand: Apache Flink in a self-managed environment versus Google Cloud Dataflow, on two specific Yahoo use cases. This post details our benchmark setup, methodology, the use cases in scope, key findings, and the Dataflow configurations that helped us streamline performance.

Benchmark Setup

We designed our benchmark as a quick, fair, rough comparison on typical Yahoo use cases. We chose two representative workloads: one compute-heavy and one IO-heavy. The result of this benchmark would indicate which platform to recommend to Yahoo teams for streaming, covering workloads such as writing results to Bigtable and GCS, plus the complex streaming pipeline discussed below.

Test Setup Infrastructure Diagram:

Metric: Our primary focus was compute cost per unit of throughput, i.e. the cost of processing a given volume of data per unit of time. We aimed to understand the cost of running Flink and Dataflow pipelines at a sustained throughput of 20,000 records per second. "Cost" for Flink excludes the operational overhead of setting up and running the job on the respective platform (engineering hours spent).
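To make the metric concrete, here is a minimal sketch of how cost per unit of throughput can be computed. The hourly prices below are entirely hypothetical placeholders, not the benchmark's actual numbers:

```python
# Hypothetical sketch of the benchmark metric: compute cost per unit of
# throughput. Prices here are illustrative, not real GCP or benchmark figures.

def cost_per_unit_throughput(hourly_compute_cost_usd: float,
                             records_per_second: float) -> float:
    """USD per hour spent to sustain one record/second of throughput."""
    return hourly_compute_cost_usd / records_per_second

# Example: two deployments, each sustaining the target 20,000 records/second.
stack_a = cost_per_unit_throughput(4.80, 20_000)  # hypothetical $4.80/hour
stack_b = cost_per_unit_throughput(2.40, 20_000)  # hypothetical $2.40/hour

# At equal throughput, the comparison reduces to comparing hourly cost directly.
print(f"{stack_a:.6f} vs {stack_b:.6f} USD/hour per record/s")
```

Because both platforms are held at the same sustained throughput, the ratio of their costs per unit of throughput equals the ratio of their hourly bills, which is what makes the comparison fair.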

Workload: We simulated a 10TB data stream from Pub/Sub, creating a backlog of 100+ million records to maintain a constant load.

Controlled Environment: To focus on the impact of core configuration, we initially disabled autoscaling by fixing Flink's resource allocation and capping the Dataflow worker count. This let us compare costs at a consistent throughput, i.e. cost per unit of throughput. Autoscaling belongs to the ease-of-operation category, which we did not expand on in this experiment, though we did verify Dataflow's autoscaling separately and found it seamless.

Use-case details:

  1. Write Avro to Parquet: A streaming job that reads Avro messages, windows them into fixed time frames (1-5 minutes), and writes the output to GCS as Parquet.
  2. Data Enrichment and Calculation (Complex): Simulating active user analysis and event enrichment. Involved Beam state management, key reshuffling, and making external calls to another service.
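For intuition on the first use case, the fixed-window grouping can be sketched outside of Beam. The function below assigns event timestamps to 5-minute windows, the same bucketing a Beam `Window.into(FixedWindows.of(...))` transform performs before each window's batch is written out as Parquet (the window size and record shape are illustrative):

```python
from collections import defaultdict

WINDOW_SECONDS = 5 * 60  # 5-minute fixed windows, within the benchmark's 1-5 min range

def window_start(event_ts: int, size: int = WINDOW_SECONDS) -> int:
    """Start of the fixed window containing event_ts (epoch seconds)."""
    return event_ts - (event_ts % size)

def group_into_windows(events):
    """events: iterable of (epoch_seconds, record) pairs.
    Returns {window_start: [records]}; each window's batch would then be
    written to GCS as one set of Parquet files."""
    windows = defaultdict(list)
    for ts, record in events:
        windows[window_start(ts)].append(record)
    return dict(windows)

events = [(0, "a"), (299, "b"), (300, "c"), (601, "d")]
print(group_into_windows(events))
# "a" and "b" share the [0, 300) window; "c" and "d" fall into later windows.
```

In the real pipeline, Beam handles watermarks and late data on top of this bucketing; the sketch only shows the window assignment itself.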

Notes:

  • Operational costs of managing infrastructure are excluded for both platforms.
  • If the benchmark showed similar infrastructure cost for Flink and Dataflow, we were ready to pick Dataflow, because Dataflow is a managed service. We were also ready to benchmark a third use case if the two use cases above proved inconclusive.

Understanding the Results

The table above shows that Dataflow is roughly 1.5-2x more cost-effective than Apache Flink on our test cases, so let's look at how we arrived at these costs. The idea of the benchmark is to calculate the Flink and Dataflow cost required to achieve similar throughput: the number of messages processed per second should be as close as possible across the two streaming applications. In the table above, for the Enrichment use case, the number of provisioned vCPUs on GKE is approximately 13x higher than on Dataflow. This is not because Flink is inefficient, but because with Dataflow, Streaming Engine offloads much of the heavy computation to the Dataflow backend. There was some room to improve Flink utilization, but doing so made the job unstable, so we did not spend further time there and instead calculated the cost for 32 vCPUs (2 x n2d-standard-16 machines) as if utilization were ~75%.
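The utilization adjustment above can be sketched numerically. The per-vCPU hourly rate below is a placeholder, not a real GCP price; the point is only how the assumed ~75% utilization feeds into the Flink cost estimate:

```python
# Sketch of the utilization adjustment: the Flink setup is priced as 32
# provisioned vCPUs (2 x n2d-standard-16) at an assumed best-case ~75% CPU
# utilization. The rate below is hypothetical, for illustration only.

HYPOTHETICAL_RATE_PER_VCPU_HOUR = 0.03  # USD, made up for this sketch

def provisioned_hourly_cost(vcpus: int,
                            rate: float = HYPOTHETICAL_RATE_PER_VCPU_HOUR) -> float:
    """You pay for every provisioned vCPU, busy or idle."""
    return vcpus * rate

def cost_per_useful_vcpu_hour(rate: float, utilization: float) -> float:
    """Effective price of one vCPU-hour of useful work at a given utilization."""
    return rate / utilization

flink_hourly = provisioned_hourly_cost(32)
# At 75% utilization, each useful vCPU-hour costs 1/0.75 = ~1.33x the list rate,
# so lower achievable utilization directly inflates cost per unit of throughput.
print(flink_hourly, cost_per_useful_vcpu_hour(HYPOTHETICAL_RATE_PER_VCPU_HOUR, 0.75))
```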

You can think of the Dataflow backend as a GCP-managed resource that performs heavy computation (e.g. shuffling) instead of that work being done on Dataflow's workers. This naturally means Dataflow requires fewer vCPUs, is more robust, and delivers much more consistent throughput. Leveraging Streaming Engine is critical for Yahoo's use cases.

In the image below, our Dataflow pipeline uses a newly released billing feature that calculates cost based on Streaming Engine Processing Units. In our testing, this new billing feature reduced pipeline cost for our throughput-bound workloads. On the Flink side, we installed telemetry and monitored Pub/Sub throughput to check how much resource the job was using. We did not spend much time tuning the Flink job, and therefore assumed the lowest plausible cost had we improved utilization; i.e. the cost is based on 32 vCores at an assumed best-case ~75% CPU utilization.

To see a detailed cost breakdown for Dataflow, simply go to the Dataflow Cost tab, as shown below:

Here is the Gradle task we use to deploy the pipeline with these settings:

task dataflow(type: JavaExec) {
    mainClass = "com.google.benchmark.BenchmarkDataflow"
    classpath = sourceSets.main.runtimeClasspath
    args = [
        "--enableStreamingEngine",
        "--workerMachineType=n2-standard-8", // or n2-standard-2 with 2 workers
        "--dataflowServiceOptions=enable_streaming_engine_resource_based_billing",
        "--numWorkers=1", // or 2 workers
        "--maxNumWorkers=1", // or 2 workers
        "--usePublicIps=false",
    ]
}

By default, Dataflow Streaming Engine usage is metered and billed based on the volume of data processed by Streaming Engine. With the resource-based billing model, jobs are instead metered on the resources they consume, and users are billed for the total resources consumed. In short, under resource-based billing the cost is calculated from Streaming Engine Compute Units, whereas previously the Streaming Engine cost was calculated from the amount of data sent and processed (total GB of data processed).

How to tell the difference?

  • If you use the resource-based billing model, you are charged against the corresponding SKU (stock keeping unit), not by the amount of streaming data processed
  • In the “Cost” tab, you should see “Streaming Engine Compute Unit” instead of “Processed Data”
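The difference between the two models can be sketched as two billing formulas. The unit prices below are entirely hypothetical; real SKU rates vary by region and are listed on the GCP pricing page:

```python
# Hypothetical comparison of the two Streaming Engine billing models.
# Both unit prices are made up for illustration; real SKU rates differ.

DATA_PROCESSED_RATE = 0.018  # USD per GB processed (hypothetical)
COMPUTE_UNIT_RATE = 0.06     # USD per Streaming Engine Compute Unit (hypothetical)

def data_processed_bill(gb_processed: float) -> float:
    """Legacy model: billed by total GB sent through Streaming Engine."""
    return gb_processed * DATA_PROCESSED_RATE

def resource_based_bill(compute_units: float) -> float:
    """Resource-based model: billed by Streaming Engine Compute Units consumed."""
    return compute_units * COMPUTE_UNIT_RATE

# A throughput-heavy but computationally light job pushes a lot of data while
# consuming relatively few compute units; that profile is where resource-based
# billing can come out cheaper than the data-processed model.
print(data_processed_bill(10_000), resource_based_bill(1_000))
```

Which model is cheaper depends entirely on the job's ratio of data volume to compute consumed, which is why it is worth checking the Cost tab after enabling the flag.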

In addition to basic parameters, further optimization in Dataflow can be achieved by carefully customizing machine types to your specific workload. Dataflow Prime can also enhance both performance and cost-efficiency in some scenarios.

Conclusion

This benchmark highlights the importance of careful configuration and ongoing experimentation when optimizing Dataflow pipelines. The ideal setup for your Dataflow deployment is highly dependent on your specific workload, data velocity, and cost-performance tradeoffs you’re willing to make. Understanding the various optimization tools within Dataflow — from worker configuration to features like autoscaling — is crucial for maximizing efficiency and minimizing cost.

When using resource-based billing, for the two use cases tested in this study, we found Flink compute cost to be on par with Dataflow's. Without this flag, Dataflow was 5x more expensive, again for the two use cases in the scope of this study.

Finally, keep an eye out for new Dataflow features! The recent introduction of at-least-once streaming mode offers greater flexibility for use cases where occasional duplicates are acceptable in favour of lower cost and reduced latency.
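If you want to experiment with it, at-least-once mode is enabled through a Dataflow service option rather than a code change. Assuming the same Gradle task shown earlier, the extra arg would look like the following (option name per the Dataflow documentation at the time of writing; verify against the current docs):

```groovy
args = [
    // ...existing pipeline args from the task above...
    "--dataflowServiceOptions=streaming_mode_at_least_once",
]
```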
