Data Pipeline using Apache Airflow

Abin Joy
Feb 22, 2020

Data Pipeline

Data volumes have increased substantially over the years, and as a result businesses need to work with massive amounts of data. A data pipeline is needed to handle this data flow efficiently, because handling the storage, analysis, and visualization of data in the same system is not a good idea.

Moving data between systems requires many steps: extracting data from a source, storing it in cloud systems, reformatting it, merging it with other data sources, and so on.

A data pipeline is the sum of all these steps. Its job is to automate the steps and make sure they all happen reliably for all of the data.
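To see why automation matters, here is a rough illustration of a hand-rolled pipeline that chains these steps in plain Python. The function names and data are hypothetical placeholders; the point is that this approach gives you no scheduling, retries, or monitoring for free:

```python
# A naive hand-rolled pipeline: each step feeds the next.
# Function names and data here are hypothetical placeholders.

def extract():
    """Pull raw records from a source system."""
    return [{"id": 1, "value": 42}]

def transform(records):
    """Reformat or merge records for downstream use."""
    return [{**r, "value": r["value"] * 2} for r in records]

def load(records):
    """Persist the processed records to storage."""
    print(f"Loaded {len(records)} records")

if __name__ == "__main__":
    # Works once, but there is no scheduling, retrying, or monitoring.
    load(transform(extract()))
```

Tools like Apache Airflow exist to take over exactly these missing responsibilities.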

Meet Apache Airflow

Apache Airflow is an open-source workflow management system (WMS) used to manage computational workflows and data processing pipelines. It lets you programmatically author, schedule, and monitor workflows. It was originally developed at Airbnb.

Some pipelines use real-time data while others use batch data. Both approaches have their own benefits. Apache Airflow is a platform for developing and monitoring batch data pipelines.

Airflow workflows are built as Directed Acyclic Graphs (DAGs) of tasks, where each task's output can act as the input for another task in the workflow.
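As a minimal sketch of what such a workflow looks like in code (using the Airflow 1.x API that was current at the time of writing; the task logic here is a placeholder), a two-task DAG that runs daily and passes output from one task to the next might be defined like this:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def extract(**context):
    # Placeholder: pull data from a source system.
    # The return value is pushed to XCom automatically.
    return "raw data"

def transform(**context):
    # Read the upstream task's output via XCom.
    raw = context["ti"].xcom_pull(task_ids="extract")
    print(f"Transforming: {raw}")

with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",  # one batch run per day
) as dag:
    extract_task = PythonOperator(
        task_id="extract",
        python_callable=extract,
        provide_context=True,
    )
    transform_task = PythonOperator(
        task_id="transform",
        python_callable=transform,
        provide_context=True,
    )

    # The arrow defines an edge in the DAG:
    # extract must finish before transform starts.
    extract_task >> transform_task
```

Because the graph is acyclic, Airflow always knows which tasks can run, in what order, and which downstream tasks to skip or retry when something fails.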
