Introduction to Airflow: a tool for creating ETL pipelines

Narongsak Keawmanee
4 min read · Aug 31, 2019


In the data science world, we tend to move data around to calculate and analyze it so that it meets business requirements: how many customers bought this month, which product is the best seller, which type of customer buys from a given supplier. From a business perspective, we need to know as much as possible about this information to predict what will happen in the future and to figure out how to grow the business based on data. That is why data science roles exist, and why the position is so popular among business and technology companies.

What is ETL

ETL is short for Extract, Transform, Load: moving data from one place to another, such as from an application database into a data warehouse. This process helps keep all data in one place, keeps historical data as correct as possible, and combines data from different sources. For example, we can combine data from a web database and a mobile database and process information based on both sources.
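To make the idea concrete, here is a minimal sketch of one ETL step in plain Python. The row data and field names are made up for illustration: we extract rows from two hypothetical sources (a web database and a mobile database) with different shapes, transform them into one common shape, and load them into a list standing in for a warehouse table.

```python
# Extract: raw rows as they might come from two hypothetical sources.
web_orders = [{"user": "alice", "amount": "10.50"}]
mobile_orders = [{"customer": "bob", "total": "4.25"}]

def transform(row, user_key, amount_key, source):
    """Normalize a raw row into one common shape."""
    return {
        "customer": row[user_key],
        "amount": float(row[amount_key]),
        "source": source,
    }

# Transform: unify field names and types across both sources.
rows = [transform(r, "user", "amount", "web") for r in web_orders]
rows += [transform(r, "customer", "total", "mobile") for r in mobile_orders]

# Load: append the normalized rows into one combined "table".
warehouse = []
warehouse.extend(rows)
```

A real pipeline would read from databases and write to a warehouse, but the three stages stay the same.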

What is Airflow

Airflow was first created at Airbnb, the big tech company that provides room and hotel booking services. As data grows larger and larger, it becomes hard to manage, and that is where Airflow comes in. Airflow helps us manage data workflows reliably: we can chain tasks together so that when one task completes, the next one runs immediately, and when a task fails we know about it through the dashboard and email notifications. The tool is based on Python, so it does not have the kind of asynchronous-callback problems you can run into with JavaScript.

Pre-install

Before we install Airflow, we need to install Python first. I am using Python 3, installed with Homebrew by running:

brew install python

Install

We need to set the home path of Airflow. The tutorial recommends setting it to ~/airflow, so we follow that:

export AIRFLOW_HOME=~/airflow

Then install Airflow:

pip install apache-airflow

After that, let's initialize the database so we get the example code and tasks:

airflow initdb

That's it. Start the Airflow webserver with this command:

airflow webserver -p 8080

And run the scheduler to process tasks:

airflow scheduler

After that finishes, open http://localhost:8080 and you will see the Airflow page.

In the DAGs menu, you will see the list of DAGs and their schedules. Click on example_bash_operator to see the features inside, and I will describe them for you.

Core concepts of DAGs

tree view of DAG

DAG stands for Directed Acyclic Graph. Basically, it is a collection of tasks that runs on the schedule you set, executing every task inside based on the conditions you give each task.

In this picture, you can see that one DAG contains multiple tasks, such as runme_1, runme_2, runme_0 and run_after_loop. The tasks have dependencies between each other, which tell us which task runs first. If you want to see a simpler diagram, click Graph View in the top menu.

This graph is simple, right? You can see that after task runme_0 finishes, task run_after_loop runs, and then run_this_last after it. That is the power of Airflow: chaining tasks together and making sure every task finishes.

Code view

Airflow provides an interface where you can easily see what each task does: click Code to see the code inside.

Turning a DAG on and off

Airflow provides an On and Off toggle. If we want to do maintenance on a DAG, or change the code or process inside it, we just turn it off and turn it back on when maintenance is finished, by clicking the toggle on the left side.

Other features

Airflow provides many other features to support us, such as:

  • Ad Hoc Query: a simple query interface for querying data from the Airflow database.
  • Logs: to see what happened while a task was running.
  • Pool: manages slots and queued processes, i.e. how many slots are available for running tasks in parallel.
  • Connection: stores the connection details for the data sources your tasks talk to, as well as the connection Airflow itself uses.
  • Variable: keeps key–value variables in Airflow for easy reference in DAG tasks. You can also encrypt the variables you store by plugging a Python encryption library into Airflow.

That is most of it. I hope this article gave you an idea of how Airflow works and what its core concepts are. In the next article we will create a simple DAG of our own, to better understand how Airflow works and how we can use it in the real world of data science.

See you in the next article.

The next article is here: “Apache Airflow. Create ETL pipeline like a boss”.

I appreciate every cup of coffee. You can donate a coffee with the button below.

https://www.buymeacoffee.com/klogic


Narongsak Keawmanee

Software Engineer at Refinitiv • Highly ambitious • Love working with great people • Never Stop Learning