Book Review: “Data Pipelines with Apache Airflow” by Harenslak and De Ruiter

Gabriel dos Santos Gonçalves
Plumbers Of Data Science
4 min read · Jul 14, 2022

Why this is probably the best book about Airflow

1. Book overview

It is becoming clearer that a solid data engineering infrastructure is essential for any organization that builds data products. Moving, transforming, and storing data are tasks performed by a wide variety of professionals and tools, and getting them all to work together can be hard.

Data pipeline orchestrators are tools that help companies organize and schedule data processing tasks, and they have gained a lot of attention in the past few years.
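
To make the idea of "organizing tasks" concrete, here is a minimal sketch in plain Python of the core job an orchestrator does: running tasks in an order that respects their dependencies. This is not Airflow's API. The task names and the `run_pipeline` helper are hypothetical, and everything a real orchestrator adds (scheduling, retries, monitoring) is left out.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical task names; each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}


def run_pipeline(deps):
    """Visit every task after the tasks it depends on and return the order used."""
    order = []
    for task in TopologicalSorter(deps).static_order():
        # A real orchestrator would launch the task here; this sketch only records it.
        order.append(task)
    return order


execution_order = run_pipeline(dependencies)
print(execution_order)
```

In this toy pipeline the three tasks form a simple chain, so there is only one valid order: extract, then transform, then load.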

Airflow has become a de facto industry standard among data pipeline orchestrators, and its adoption has grown significantly in recent years.

Airflow is an open source project with many contributors and a wide range of compatible services that can be integrated into the platform. Even though Airflow has decent official documentation, understanding its basic architecture and deploying it to production can be a daunting task.

“Data Pipelines with Apache Airflow” by Bas Harenslak and Julian De Ruiter fills this gap in the literature, offering a…
