Scaling a Mature Data Pipeline — Managing Overhead

There is often a hidden performance cost tied to the complexity of data pipelines: the overhead. In this post, we introduce the concept and examine the techniques we use to keep it out of our data pipelines.

Background

Technical Stack

A Case Study: The Integration Test Pipeline

The Overhead: A Silent Killer

Sizing Up Your Pipeline

Phenomenal Data Processing, Itty Bitty Pipeline

An Aside on Fault Tolerance

Conclusion

  • The natural evolution of data pipelines, from monolithic collections of scripts to Spark applications, pushes you to encode your application structure in your pipeline.
  • The overhead is everything your pipeline does other than computation. It is caused by orchestration complexity and scales with the depth of your pipeline.
  • Encoding your application structure in your pipeline intrinsically couples your application logic to your orchestration logic, which often invites overhead by making your map-reduce tasks too granular.
  • Decoupling your orchestration logic from your application logic gives you tools to fight the overhead without compromising the quality of your application (a sketch of this follows the list).
  • When attempting to reduce the run time of a data pipeline, be careful not to miss the forest for the trees. Analyze your whole pipeline’s execution time, not just the obvious factors like map-reduce computation time (a rough way to size this up is sketched below).
  • You can’t neglect fault tolerance considerations. Make sure you don’t lose all the time you saved lowering overhead by spending it retrying tasks that fail frequently (see the last sketch below).
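To make the decoupling point concrete, here is a minimal sketch, assuming an Airflow-style orchestrator driving a Spark application; the DAG, task, and function names are illustrative, not the integration test pipeline described above. Instead of encoding every map-reduce step as its own orchestrator task, the orchestrator schedules one coarse-grained task and the application keeps its internal structure to itself.

```python
# A hedged sketch, not the actual DAG from this post: the orchestrator runs ONE
# coarse-grained task, and the application structure (extract -> transform ->
# aggregate) lives inside the Spark job itself, so each step no longer pays its
# own scheduling and startup overhead.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def run_spark_application(**context):
    # Illustrative only: submit a single Spark application that runs all of its
    # stages internally, e.g. via spark-submit or an in-process SparkSession.
    ...


with DAG(
    dag_id="coarse_grained_pipeline",  # hypothetical name
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="run_full_application",
        python_callable=run_spark_application,
    )
```

The fine-grained alternative, one orchestrator task per map-reduce step, multiplies the per-task scheduling cost by the depth of the pipeline, which is exactly the overhead described in the bullets above.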
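To put a number on "analyze your whole pipeline's execution time", here is a back-of-the-envelope formulation, assumed for illustration rather than taken from the sections above: treat everything that is not computation as overhead and look at its share of the wall-clock time.

```python
# A rough, illustrative accounting: overhead is whatever fraction of the
# end-to-end wall-clock time was not spent on real computation.
from datetime import timedelta


def overhead_fraction(wall_clock: timedelta, compute_time: timedelta) -> float:
    """Share of the pipeline's run time spent on anything but computation."""
    total = wall_clock.total_seconds()
    return (total - compute_time.total_seconds()) / total


# Example: a 90-minute pipeline run with only 60 minutes of map-reduce work
# spends a third of its time on scheduling, startup, serialization, and waiting.
print(round(overhead_fraction(timedelta(minutes=90), timedelta(minutes=60)), 2))  # 0.33
```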
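Finally, on fault tolerance: retries are typically a per-task setting in the orchestrator, so coarser tasks mean a failed attempt re-runs more work. Below is a hedged sketch of the kind of knobs involved, written as Airflow-style settings with placeholder values, not the exact configuration used here.

```python
# Illustrative per-task fault-tolerance settings; the values are placeholders.
from datetime import timedelta

default_args = {
    "retries": 2,                             # bounded number of retries per task
    "retry_delay": timedelta(minutes=5),      # back off before re-running
    "execution_timeout": timedelta(hours=2),  # fail fast instead of hanging forever
}
```

If a coarse-grained task fails often, every retry re-runs a large chunk of the application, which is how the time saved lowering overhead gets lost again.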
