Introduction to Airflow, a tool for creating ETL pipelines
In the data science world, we tend to move data around to compute and analyze it so we can meet business requirements: how many customers bought something this month, which product sells best, which type of customer buys from this supplier. From a business perspective, we want to know as much as possible about this information so we can predict what will happen in the future and decide how to grow the business based on data. That is why data science roles exist, and why the position is so popular among business and technology companies.
What is ETL
ETL is short for Extract, Transform, Load: moving data from one place to another, such as from an application database into a data warehouse. This process helps keep all the data in one place, keeps historical data as accurate as possible, and combines data from different sources, for example joining data from a web database and a mobile database and processing information based on both.
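To make the idea concrete, here is a minimal ETL sketch in plain Python (no Airflow yet). The order records, the tax rule, and the "warehouse" list are all made-up stand-ins for real application databases and a real data warehouse.

```python
# A toy ETL run: extract raw rows, transform them, load them into a "warehouse".

def extract():
    # Pretend these rows came from the web and mobile application databases.
    web_orders = [{"customer": "alice", "amount": 120}]
    mobile_orders = [{"customer": "bob", "amount": 80}]
    return web_orders + mobile_orders  # combine the two sources

def transform(rows):
    # Hypothetical business rule: add a total column with 7% tax.
    return [{**row, "total": round(row["amount"] * 1.07, 2)} for row in rows]

def load(rows, warehouse):
    # In real life this would be an INSERT into the data warehouse.
    warehouse.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
print(warehouse)  # both sources combined, with the new total column
```

Airflow does not replace this kind of code; it schedules it, chains the steps together, and tells you when one of them fails.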
What is Airflow
Airflow was originally created at Airbnb, the big tech company behind the room and hotel booking service. As data grows larger and larger, it becomes hard to manage, and that is where Airflow comes in. Airflow helps us manage data workflows reliably: we can chain tasks together so that when one task completes, the next one runs immediately, and when a task fails we know about it through the dashboard and email notifications. The tool is based on Python, so our pipelines are defined as ordinary Python code.
Prerequisites
Before we install Airflow, we need to install Python first. Here I use Python 3, installed with Homebrew by running this command:
brew install python
Install
We need to set Airflow's home path. The tutorial recommends setting it to ~/airflow, so we follow that:
export AIRFLOW_HOME=~/airflow
Then install Airflow:
pip install apache-airflow
After that, initialize the database so we get the example code and tasks:
airflow initdb
That's it. Start the Airflow webserver with this command:
airflow webserver -p 8080
And run the scheduler to process tasks:
airflow scheduler
Once everything is running, open http://localhost:8080 and you will see the Airflow page.
In the DAGs menu you will see the list of DAGs and their schedules. Click on example_bash_operator to see what is inside, and I will walk you through it.
Core concept: DAGs
DAG stands for Directed Acyclic Graph. Basically, it is a collection of tasks that runs on the schedule you set, executing every task inside according to the conditions you give each task.
In this picture you can see multiple tasks inside one DAG, such as runme_0, runme_1, runme_2, and run_after_loop. Each task has dependencies on the others, which tell us which task runs first. If you want to see a simple diagram, click Graph View in the top menu.
This graph is simple, right? You can see that once runme_0 finishes, run_after_loop runs, followed by run_this_last, each waiting for the previous task to finish. That is the power of Airflow: chaining tasks together and making sure every task finishes.
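To show what this chaining looks like in code, here is a minimal DAG of our own, modeled on the example above. This is a sketch, not the code of the bundled example_bash_operator: the dag_id, schedule, and echo commands are made up, and the import paths are the Airflow 1.x ones that match the initdb command we used.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # Airflow 1.x path

# Hypothetical DAG: runs once a day starting from the given date.
dag = DAG(
    dag_id="hello_chain",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)

# Three independent tasks, like runme_0..runme_2 in the example DAG.
runme = [
    BashOperator(task_id="runme_%d" % i, bash_command="echo step %d" % i, dag=dag)
    for i in range(3)
]

run_after = BashOperator(task_id="run_after_loop", bash_command="echo after", dag=dag)
run_last = BashOperator(task_id="run_this_last", bash_command="echo done", dag=dag)

# Chain the tasks: every runme_* task must finish before run_after_loop,
# and run_after_loop must finish before run_this_last.
for task in runme:
    task >> run_after
run_after >> run_last
```

Dropping a file like this into ~/airflow/dags is enough for the scheduler to pick it up; the Graph View would then show the same fan-in shape as the example.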
Code view
Airflow provides an interface where you can easily see what each task does. Click Code to see the code behind the DAG.
Turning a DAG on and off
Airflow provides an On/Off toggle. If we want to do maintenance on a DAG, or change the code or process inside it, we just turn it off with the toggle on the left side and turn it back on when maintenance is finished.
Other features
Airflow provides many other features to support us, such as:
- Ad Hoc Query: a simple query interface for querying data from the Airflow database.
- Logs: see what happened while a task was running.
- Pools: manage slots and queued processes, i.e. how many slots are available for running tasks in parallel.
- Connections: manage connection details for the databases and other sources that your tasks talk to.
- Variables: store key-value variables in Airflow for easy reference in DAG tasks. You can also encrypt the variables you store by plugging a Python encryption library into Airflow.
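As a small illustration of Variables, a task's Python code can read one like this. The variable name report_email is made up for this example; you would create it first under the Admin menu in the UI.

```python
from airflow.models import Variable

# "report_email" is a hypothetical variable created in the Airflow UI.
# default_var is returned when the variable has not been set yet,
# so this line does not crash on a fresh install.
email = Variable.get("report_email", default_var="team@example.com")
print(email)
```

Keeping values like this in Variables means you can change them from the UI without editing and redeploying the DAG code.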
That's most of it. I hope this article gives you an idea of how Airflow works and the concepts behind it. In the next article we will create a simple DAG of our own, to understand more about how Airflow works and how we can use it in the real world of data science.
See you in the next article.
The next article is here: “Apache Airflow. Create ETL pipeline like a boss”.
I appreciate every cup of coffee; you can donate one with the button below.