Apache-Airflow : A practical guide
“Data really powers everything that we do.” — Jeff Weiner
In the era of Artificial Intelligence, Big Data has become the raw material for solving problems. Gone are the days when data was collected and processed only in batches; real-time data processing is the key for a modern business.
Beyond Volume, Velocity and Variety, managing data pipelines plays a significant role in availability. The complex part is developing multiple data pipelines to address different use cases and managing them all in one place.
This article explains the usage of an open-source platform which can programmatically author, schedule and monitor workflows for data pipelines.
Yes, it is Apache-Airflow, developed by Airbnb Engineering.
“We chose it because we deal with huge amounts of data. Besides, it sounds really cool.” — Larry Page
Contents
- Introduction
- Setup
- Development
- Deployment
- Common shortfalls
- Conclusion
Introduction
The Airflow framework can be used to build workflows. A workflow can be anything from a simple Linux command to a complex Hive query, from a Python script to a Docker container. A workflow comprises one or more tasks connected as a Directed Acyclic Graph. A workflow, called a DAG in Airflow, can be executed manually or automated with schedulers like cron. Success and failure of a DAG can be monitored, controlled and re-triggered, and DAG state can be alerted through SMTP, Slack and other systems.
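The "acyclic" part is what guarantees a workflow can always finish: following task dependencies never loops back to an earlier task. As a minimal pure-Python sketch of that property (task names are made up for illustration; this is not Airflow code):

```python
# Tasks and their downstream dependencies, as in a workflow DAG.
# A valid workflow graph must contain no cycles.
graph = {
    'extract': ['transform'],
    'transform': ['load'],
    'load': [],
}

def has_cycle(graph):
    """Depth-first search; returns True if any path revisits a task."""
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True  # back edge found -> cycle
        visiting.add(node)
        if any(visit(nxt) for nxt in graph.get(node, [])):
            return True
        visiting.remove(node)
        done.add(node)
        return False

    return any(visit(node) for node in graph)

print(has_cycle(graph))  # False: extract -> transform -> load is acyclic
```

Airflow performs a similar validation when it parses a DAG file and refuses a workflow whose dependencies form a loop.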
“With data collection, ‘the sooner the better’ is always the best answer.” — Marissa Mayer
“The world is one big data problem. There’s a bit of arrogance in that, and a bit of truth as well.” — Andrew McAfee
Setup
This section explains how to install Airflow on macOS High Sierra.
Install Python
Download and install Python from the official website:
https://www.python.org/downloads/release/python-370/
I am using Python 3.6.
To check that the installation succeeded:
python -V
Python 3.6.X
Now install Airflow and its dependencies using pip:
pip install apache-airflow==1.9.0
pip install apache-airflow[celery]
pip install mysqlclient
Install RabbitMQ
The RabbitMQ website recommends installation via Homebrew:
https://www.rabbitmq.com/install-homebrew.html
The default username and password for the RabbitMQ server are both guest.
brew update
brew install rabbitmq
Add the Homebrew sbin path to your environment:
$ vi .bash_profile
PATH=$PATH:/usr/local/sbin
To start the RabbitMQ server in the background:
rabbitmq-server -detached
To check the status of the RabbitMQ server:
rabbitmqctl status
Install MySQL
Download MySQL for Mac (.dmg) here:
https://dev.mysql.com/downloads/mysql/
During the installation process, set a password for the root user (I selected the legacy weak password policy).
After installation, follow these steps to create the airflow database:
mysql -u root -p
mysql> CREATE DATABASE airflow CHARACTER SET utf8 COLLATE utf8_unicode_ci;
mysql> CREATE USER 'airflow'@'localhost' IDENTIFIED BY 'airflow';
mysql> GRANT ALL PRIVILEGES ON *.* TO 'airflow'@'localhost';
mysql> FLUSH PRIVILEGES;
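The database, user and password created above are exactly what goes into Airflow's sql_alchemy_conn setting later on. As a small sketch (values from this guide; adjust to your own setup), the connection URI is assembled like this:

```python
# Credentials created in the MySQL steps above
user, password = 'airflow', 'airflow'
host, port, db = 'localhost', 3306, 'airflow'

# SQLAlchemy-style URI used by Airflow's sql_alchemy_conn setting
sql_alchemy_conn = 'mysql://{}:{}@{}:{}/{}'.format(user, password, host, port, db)
print(sql_alchemy_conn)  # mysql://airflow:airflow@localhost:3306/airflow
```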
Initialize Airflow
First, we have to initialize Airflow with the command below:
airflow initdb
This creates an airflow directory in the home path with the airflow.cfg file and a logs folder. Create a dags folder in the airflow directory:
mkdir -p ~/airflow/dags/
We have to set these configurations before running our first Airflow code:
vi ~/airflow/airflow.cfg
executor = CeleryExecutor
sql_alchemy_conn = mysql://airflow:airflow@localhost:3306/airflow
broker_url = amqp://guest:guest@localhost:5672/
Here SequentialExecutor is the default executor, which runs only one task at a time. We use CeleryExecutor to execute tasks from multiple DAGs in parallel.
The default database is SQLite, which is not scalable, so we use MySQL.
CeleryExecutor requires a message broker, here the RabbitMQ server configured in broker_url.
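To see what a parallel executor buys us, think of tasks as grouped into waves: every task whose upstreams have all finished can run at the same time. A rough pure-Python illustration (hypothetical task names, no Airflow imports):

```python
# upstream dependencies: task -> tasks it must wait for
deps = {
    'download_a': [],
    'download_b': [],
    'join': ['download_a', 'download_b'],
}

def waves(deps):
    """Group tasks into waves; tasks in the same wave have no
    unfinished upstreams and could run in parallel."""
    done, order = set(), []
    while len(done) < len(deps):
        ready = [t for t in deps
                 if t not in done and all(u in done for u in deps[t])]
        order.append(sorted(ready))
        done.update(ready)
    return order

print(waves(deps))  # [['download_a', 'download_b'], ['join']]
```

With SequentialExecutor these three tasks run one after another; with CeleryExecutor the first wave's two downloads can run on separate workers.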
Now initialize Airflow again to use MySQL as the primary database:
airflow initdb
This creates the necessary tables in the MySQL airflow database.
Now start the Airflow webserver, worker and scheduler:
airflow webserver
airflow worker
airflow scheduler
Hit the URL
http://localhost:8080/
Congrats!! Airflow is up and running 😆
“Data is the new science. Big Data holds the answers.” — Pat Gelsinger
Development
It's time to get our hands dirty. We will write our first DAG in Python using PythonOperator.
from airflow.operators.python_operator import PythonOperator
from airflow.models import DAG
from datetime import datetime, timedelta

args = {'owner': 'srivathsan',
        'start_date': datetime(2018, 10, 1),
        'retries': 2,
        'retry_delay': timedelta(minutes=1)}

dags = DAG('test_dag', default_args=args)

def print_context(val):
    print(val)

def print_text():
    print('Hello-World')

t1 = PythonOperator(task_id='multitask1',
                    op_kwargs={'val': {'a': 1, 'b': 2}},
                    python_callable=print_context,
                    dag=dags)

t2 = PythonOperator(task_id='multitask2',
                    python_callable=print_text,
                    dag=dags)

t2.set_upstream(t1)
Let us walk through it line by line:
- args is a dict of key-value pairs passed as default arguments to the DAG
- dags is an instance of DAG with the DAG name 'test_dag' and args as parameter
- print_context() and print_text() are the two Python functions executed by the DAG tasks
- t1 and t2 are PythonOperators, with task_id as the task name, python_callable as the function to run and op_kwargs as the keyword arguments passed to it
- t2.set_upstream(t1) treats t1 and t2 as vertices and connects them as a directed graph, so t1 runs before t2 (the shorthand t1 >> t2 is equivalent)
Deployment
Deploying a DAG is the easiest task:
cp airflow_test.py ~/airflow/dags/
Copy the Airflow Python file into the airflow/dags/ folder.
Airflow recognises changes in the dags folder and shows the new DAG in the web UI.
To execute test_dag, click the play (>) button in the Links column.
After a DAG execution, the UI shows the number of times the DAG has run in the DAG Runs column and the count of executed tasks in the Recent Tasks column, circled in green to indicate success.
We can view the task dependency graph and the code by selecting the DAG name in the DAG column.
We can see the output of the test_dag tasks by selecting the task name -> View Log.
Now we have seen a basic example of Airflow, but there are many more features to explore. Some of the notable ones are:
- XComs — cross-communication between tasks
- Authentication and authorisation — user management for logging in and accessing DAGs
- Schedulers — schedule a DAG to execute like a cron job
- SMTP, Slack — alert with automated messages on success and failure
- Retries and retry delay — automatic retry in case of failure, with a delay between two retry attempts
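The retries and retry_delay values used in our DAG (2 retries, one minute apart) translate into a simple schedule of attempts. A sketch in plain datetime arithmetic (the failure timestamp is invented; this is not Airflow's internal code):

```python
from datetime import datetime, timedelta

retries = 2
retry_delay = timedelta(minutes=1)

# Hypothetical timestamp of the first failed attempt
first_failure = datetime(2018, 10, 1, 12, 0)

# Each retry fires retry_delay after the previous failed attempt
retry_times = [first_failure + retry_delay * n for n in range(1, retries + 1)]
for t in retry_times:
    print(t)  # 12:01, then 12:02
```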
“There were 5 Exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days.” — Eric Schmidt
Common shortfalls
These are the most common pitfalls worth noticing before you start.
pip install apache-airflow==1.9.0
Notice it is apache-airflow. The package was named airflow until version 1.8 and was renamed to apache-airflow after the project was released to the Apache community.
args = {'owner': 'srivathsan', 'start_date': datetime(2018, 10, 1), 'retries': 2, 'retry_delay': timedelta(minutes=1)}
These are a few of the arguments passed to DAG() while creating the DAG instance. Avoid using datetime.now() as start_date, because the Airflow scheduler requires a datetime in the past, not the present.
Make sure to start airflow webserver, airflow worker and airflow scheduler before executing a DAG. You can also run these services in the background:
nohup airflow webserver >> ~/airflow/logs/webserver.log &
nohup airflow worker >> ~/airflow/logs/worker.log &
nohup airflow scheduler $* >> ~/airflow/logs/scheduler.log &
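The start_date advice above has a related gotcha: the scheduler triggers a DAG run only after its schedule interval has fully elapsed, so the run for a given interval starts at the end of that interval. A quick sketch with plain datetime (assuming a daily schedule_interval):

```python
from datetime import datetime, timedelta

start_date = datetime(2018, 10, 1)
schedule_interval = timedelta(days=1)  # i.e. a daily DAG

# The first run covers [start_date, start_date + interval)
# and is only triggered once that interval has passed.
first_run_triggered_at = start_date + schedule_interval
print(first_run_triggered_at)  # 2018-10-02 00:00:00
```

This is why a start_date of datetime.now() never fires: its first interval has not yet elapsed, and keeps moving forward every time the file is parsed.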
“Everything is going to be connected to cloud and data… All of this will be mediated by software.” — Satya Nadella
Conclusion
Simple and scalable, with scheduling, monitoring and fault-tolerance capabilities in one platform: that is Apache-Airflow in a nutshell. Credit goes to Airbnb for the open-source contribution and to the AirbnbEng team for designing and architecting a fabulous framework.
Apache-Airflow — https://airflow.apache.org/
Airflow github — https://github.com/apache/incubator-airflow
Airbnb Engineering — https://airbnb.io/