Scaling a Mature Data Pipeline — Managing Overhead

There is often a hidden performance cost tied to the complexity of data pipelines: overhead. In this post, we will introduce the concept and examine the techniques we use to minimize it in our data pipelines.

Zachary Ennenga
Sep 24, 2019 · 11 min read



Background

Technical Stack

A Case Study: The Integration Test Pipeline

The Overhead: A Silent Killer

Scheduler delay: Be it a crontab, Airflow, a task queue, or something else, every complex data pipeline has some system that manages dependencies and job scheduling. Generally, there is some delay between a job completing and the next job starting.

Pre-execution delay: Great, the job has been scheduled — now what? In many cases, before the job is started, there is some pre-work to perform. For example, we have to push our JAR onto the machine that is executing our task, and perform some sanity checks to guarantee that data from previous tasks has landed. Lastly, there is the initialization of the application code, loading any configuration or system library dependencies, and so on.
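
The landing check, for instance, can be sketched roughly like this (a hypothetical helper, not our actual tooling), assuming upstream jobs write the standard _SUCCESS marker:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical pre-execution sanity check: verify that the upstream task's
// output has landed before doing any real work.
object LandingCheck {
  def hasLanded(outputDir: String): Boolean = {
    val fs = FileSystem.get(new Configuration())
    fs.exists(new Path(outputDir, "_SUCCESS")) // marker written by most Hadoop/Spark jobs
  }

  def main(args: Array[String]): Unit = {
    val upstreamPath = args(0) // e.g. the HDFS directory written by the previous task
    require(hasLanded(upstreamPath), s"Upstream data has not landed at $upstreamPath")
    // ... continue with application initialization ...
  }
}
```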

Spark Session instantiation and resource allocation: Spark Sessions take time to set up, so the more often you start and stop your session, the longer everything will take. In addition, your job has to acquire all necessary resources from your cluster. Spark features such as dynamic allocation can speed this up, but there will always be some ramp-up time. This can also impact your pipeline at the cluster level: when using an autoscaling cluster like Amazon EMR, you may theoretically have access to a large pool of resources, but it may take significant time to assign them to your particular cluster via a scaling operation.
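
As a rough sketch of that cost (the app name and configuration are illustrative, not our production setup), a task-per-job design pays something like this before and after its actual work:

```scala
import org.apache.spark.sql.SparkSession

// Each task in the DAG brings up its own session, waits for executors, does a
// small piece of work, and tears everything down again.
object SessionPerTask {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName(s"step-${args(0)}")
      // Dynamic allocation can shorten the ramp-up, but never removes it.
      .config("spark.dynamicAllocation.enabled", "true")
      .getOrCreate()

    // ... one small piece of work ...

    spark.stop() // the next task in the DAG starts this whole dance over again
  }
}
```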

Saving and loading data: In a complex pipeline, you often have intermediate results you need to persist for other pipelines or jobs to consume. It’s often a good idea to break your pipeline into discrete steps, to both manage logical complexity, and limit the impact of job failures on your pipeline’s run time. However, if you make your steps too granular, you end up spending a lot of unnecessary time serializing and deserializing records to HDFS.
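
As a rough illustration of that trade-off (paths and transformations are made up):

```scala
import org.apache.spark.sql.SparkSession

// Sketch of the granularity trade-off: the first version pays a full HDFS
// round trip at the step boundary; the second keeps the intermediate result in
// memory and persists only the output other pipelines actually consume.
object GranularityTradeoff {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("granularity-tradeoff").getOrCreate()

    // Too granular: two tiny steps, one serialize/deserialize round trip between them.
    spark.read.parquet("/warehouse/events/raw")
      .filter("event_type IS NOT NULL")
      .write.mode("overwrite").parquet("/warehouse/events/filtered")
    spark.read.parquet("/warehouse/events/filtered")
      .dropDuplicates("event_id")
      .write.mode("overwrite").parquet("/warehouse/events/deduped")

    // Combined: the same work as one step, with no intermediate write.
    spark.read.parquet("/warehouse/events/raw")
      .filter("event_type IS NOT NULL")
      .dropDuplicates("event_id")
      .write.mode("overwrite").parquet("/warehouse/events/deduped")

    spark.stop()
  }
}
```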

However, all of these impacts were small, on the order of minutes — they did not explain a 2-hour delay.

Sizing Up Your Pipeline

The shape and size of a DAG can be measured by two factors: its depth and its width.

The depth is a measure of how many links there are between any given task and its closest edge node. The width is the number of nodes at a given depth.
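
As a small worked example (the task names are made up, and this assumes depth is measured from tasks with no upstream dependencies), both measures can be computed from a simple dependency map:

```scala
// A toy DAG expressed as a map from each task to its upstream dependencies.
object DagShape {
  val upstream: Map[String, List[String]] = Map(
    "extract"   -> Nil,
    "validate"  -> List("extract"),
    "normalize" -> List("validate"),
    "report_a"  -> List("normalize"),
    "report_b"  -> List("normalize")
  )

  // Depth of a task = number of links back to its closest edge node
  // (a task with no dependencies has depth 0).
  def depth(task: String): Int = upstream(task) match {
    case Nil  => 0
    case deps => 1 + deps.map(depth).min
  }

  // Width at a given depth = number of tasks sitting at that depth.
  def width(d: Int): Int = upstream.keys.count(depth(_) == d)

  def main(args: Array[String]): Unit = {
    upstream.keys.toList.sorted.foreach(t => println(s"$t -> depth ${depth(t)}"))
    val maxDepth = upstream.keys.map(depth).max
    (0 to maxDepth).foreach(d => println(s"depth $d -> width ${width(d)}"))
  }
}
```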

The overhead tends to scale with the depth of your graph, particularly in “linear” segments of the graph, where tasks execute in series. However, wide graphs are not immune to the overhead: you must be cautious about the time spent saving and loading records from HDFS. IO has a fixed overhead, so while the landing time of your tasks may not suffer, the total compute time will, and this can lead to increased compute costs. (Managing cost is outside the scope of this post, so I’ll leave things there, but try to avoid trading a time problem for a cost problem.)

With this framework in mind, we realized the structure of our pipeline was the root cause of our overhead problem. So what did we decide to do?

Simple — scale it down!

Phenomenal Data Processing, Itty Bitty Pipeline

Traditionally, your application code would look something like:
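
A minimal Scala sketch of that shape (class names, paths, and transformations are illustrative, not our actual code): each logical step is its own class with its own entry point, submitted to the cluster as its own job.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

// Illustrative sketch: one class per logical step, each with its own main
// method, each run by the orchestrator as a separate spark-submit.
object ExtractPayments {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ExtractPayments").getOrCreate()
    spark.read.parquet("/warehouse/payments/raw")                  // hypothetical input
      .filter("amount IS NOT NULL")
      .write.mode("overwrite").parquet("/warehouse/payments/extracted")
    spark.stop()
  }
}

object NormalizeCurrencies {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("NormalizeCurrencies").getOrCreate()
    spark.read.parquet("/warehouse/payments/extracted")            // reads the previous step's output back in
      .withColumn("amount_usd", expr("amount * fx_rate"))          // placeholder transform
      .write.mode("overwrite").parquet("/warehouse/payments/normalized")
    spark.stop()
  }
}
```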

In this setup, each class maps to a single logical transformative step.

However, we have architected a Single Entry Point: a contract between the orchestration system and our business logic that has no inherent correlation to the structure of our application code.

The basic usage of it looks something like:
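
A sketch of what that can look like (task names and the dispatch mechanism are illustrative, not our exact implementation):

```scala
import org.apache.spark.sql.SparkSession

// One entry point for the whole pipeline: the orchestrator always submits this
// class and passes a task name; the mapping from task name to business logic
// lives entirely in application code.
object SingleEntryPoint {
  def main(args: Array[String]): Unit = {
    val taskName = args(0)
    val spark = SparkSession.builder().appName(s"pipeline-$taskName").getOrCreate()

    taskName match {
      case "extract_payments"     => ExtractPayments.run(spark)
      case "normalize_currencies" => NormalizeCurrencies.run(spark)
      case other                  => sys.error(s"Unknown task: $other")
    }

    spark.stop()
  }
}

// Business logic, stubbed for brevity.
object ExtractPayments     { def run(spark: SparkSession): Unit = { /* ... */ } }
object NormalizeCurrencies { def run(spark: SparkSession): Unit = { /* ... */ } }
```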

This example, functionally, is pretty similar to the initial setup — we are still mapping tasks in our pipeline to classes in our code on a 1:1 basis.

However, this gives us a few benefits. For example, we can restructure our underlying classes however we want without making any changes to the orchestration system. We could do something like:
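
For instance, continuing the same illustrative sketch, the logic behind the "extract_payments" task can fan out to smaller, composable classes while the orchestration system keeps scheduling exactly one task:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Illustrative only: "extract_payments" is still one task to the orchestrator,
// but its logic now fans out to smaller classes behind the same entry point.
object ExtractPayments {
  def run(spark: SparkSession): Unit = {
    val raw       = ReadRawPayments.run(spark)
    val validated = ValidatePayments.run(raw)
    WritePayments.run(validated)
  }
}

object ReadRawPayments  { def run(spark: SparkSession): DataFrame = spark.read.parquet("/warehouse/payments/raw") }
object ValidatePayments { def run(df: DataFrame): DataFrame       = df.filter("amount IS NOT NULL") }
object WritePayments    { def run(df: DataFrame): Unit            = df.write.mode("overwrite").parquet("/warehouse/payments/extracted") }
```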

The critical goal to achieve here is to completely decouple the orchestration of your pipeline from the structure of your business logic, so that your orchestration-related components can be architected to minimize overhead, and your business logic can be structured to minimize its complexity. Put another way, you want to architect your application so the structure of your DAG is totally independent from the structure of your business logic.

Leveraging the system above, we could easily restructure our pipeline into:
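
Continuing the illustrative sketch, a single DAG node can now cover what used to be several sequential tasks, sharing one Spark session:

```scala
import org.apache.spark.sql.SparkSession

// One DAG node ("payments_core") now runs what used to be several sequential
// tasks inside a single Spark session. The underlying step classes are
// untouched; only the orchestration-facing mapping changed.
object SingleEntryPoint {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName(s"pipeline-${args(0)}").getOrCreate()

    args(0) match {
      case "payments_core" =>          // previously three separate DAG nodes
        ExtractPayments.run(spark)
        NormalizeCurrencies.run(spark)
        BuildPaymentReports.run(spark)
      case other =>
        sys.error(s"Unknown task: $other")
    }

    spark.stop()
  }
}

// The business-logic classes themselves are exactly as before (stubbed here).
object ExtractPayments     { def run(spark: SparkSession): Unit = { /* ... */ } }
object NormalizeCurrencies { def run(spark: SparkSession): Unit = { /* ... */ } }
object BuildPaymentReports { def run(spark: SparkSession): Unit = { /* ... */ } }
```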

In doing so, we have significantly reduced the overhead, without any changes to the structure of our business logic, or the delivery order of our tasks.

An Aside on Fault Tolerance

Conclusion

  • The natural evolution of data pipelines, from monolithic collections of scripts to Spark applications, pushes you to encode your application structure in your pipeline.
  • The overhead is everything your pipeline is doing other than computation. It’s caused by orchestration complexity, and scales with the depth of your pipeline.
  • Encoding your application structure in your pipeline means you intrinsically couple your application logic to your orchestration logic. This means you are often inviting overhead by making your map-reduce tasks too granular.
  • By decoupling your orchestration logic from your application logic you gain tools to fight the overhead, without compromising the quality of your application.
  • When attempting to reduce the run time of a data pipeline, be careful not to miss the forest for the trees. Analyze your whole pipeline’s execution time, not just the obvious factors, like map-reduce computation time.
  • You can’t neglect fault tolerance considerations. Make sure you don’t lose all the time you saved lowering overhead by spending it retrying tasks that fail frequently.

While the rollout of our overhead reduction measures is still ongoing, early tests show that we will be able to decrease our overhead from 2 hours to 15–30 minutes. Not only will this improve the delivery time of our pipeline, it will also allow us to pursue an hourly pipeline in the future.

If you’re interested in tackling pipeline scaling problems like these, the Airbnb Payments team is hiring! Check out our open positions and apply!
