Apache Airflow: A practical guide

“Data really powers everything that we do.” — Jeff Weiner

“We chose it because we deal with huge amounts of data. Besides, it sounds really cool.” — Larry Page

Contents

  • Introduction
  • Insight
  • Setup
  • Development
  • Deployment
  • Common shortfalls
  • Conclusion

Introduction

“With data collection, ‘the sooner the better’ is always the best answer.” — Marissa Mayer

Insight

“The world is one big data problem. There’s a bit of arrogance in that, and a bit of truth as well.”–Andrew McAfee

Setup

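Check the Python version (this guide uses Python 3.6):

    python -V
    Python 3.6.X

Install Airflow with Celery support, plus the MySQL client library:

    pip install apache-airflow==1.9.0
    pip install apache-airflow[celery]
    pip install mysqlclient

Install RabbitMQ, the message broker used by Celery (macOS with Homebrew):

    brew update
    brew install rabbitmq

Add the RabbitMQ binaries to PATH by editing ~/.bash_profile:

    vi ~/.bash_profile
    export PATH=$PATH:/usr/local/sbin

Start RabbitMQ in the background and check that it is running:

    rabbitmq-server -detached
    rabbitmqctl status

Create the Airflow metadata database and user in MySQL:

    mysql -u root -p
    mysql> CREATE DATABASE airflow CHARACTER SET utf8 COLLATE utf8_unicode_ci;
    mysql> CREATE USER 'airflow'@'localhost' IDENTIFIED BY 'airflow';
    mysql> GRANT ALL PRIVILEGES ON *.* TO 'airflow'@'localhost';
    mysql> FLUSH PRIVILEGES;

Run initdb once so that Airflow creates ~/airflow with a default airflow.cfg, then create the DAGs folder:

    airflow initdb
    mkdir -p ~/airflow/dags/

Edit airflow.cfg so that Airflow uses the Celery executor, the MySQL database and the RabbitMQ broker:

    vi ~/airflow/airflow.cfg
    executor = CeleryExecutor
    sql_alchemy_conn = mysql://airflow:airflow@localhost:3306/airflow
    broker_url = amqp://guest:guest@localhost:5672/

Initialise the database again, this time against MySQL, and start the three services (each in its own terminal):

    airflow initdb
    airflow webserver
    airflow worker
    airflow scheduler

Open the web UI in a browser:

    http://localhost:8080/

[Figure: Airflow home page]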
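
The two connection strings set in airflow.cfg are ordinary URLs of the form scheme://user:password@host:port/path. As a rough sketch of how they could be sanity-checked before restarting Airflow, the snippet below splits them with Python's standard urllib.parse. The helper name describe_conn is made up for illustration and is not part of Airflow.

```python
from urllib.parse import urlsplit

# Values copied from the airflow.cfg settings in this guide.
SQL_ALCHEMY_CONN = "mysql://airflow:airflow@localhost:3306/airflow"
BROKER_URL = "amqp://guest:guest@localhost:5672/"


def describe_conn(url):
    """Split a connection URL into its parts (hypothetical helper, not an Airflow API)."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port,
        "path": parts.path.lstrip("/"),
    }


for name, url in [("sql_alchemy_conn", SQL_ALCHEMY_CONN), ("broker_url", BROKER_URL)]:
    print(name, describe_conn(url))
```

A typo in the host, port or database name shows up immediately in the printed parts, which is usually quicker than reading a failed worker log.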

“Data is the new science. Big Data holds the answers.” — Pat Gelsinger

Development

from datetime import datetime, timedelta

from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator

# Default arguments applied to every task in the DAG.
args = {
    'owner': 'srivathsan',
    'start_date': datetime(2018, 10, 1),
    'retries': 2,
    'retry_delay': timedelta(minutes=1),
}

dags = DAG('test_dag', default_args=args)


def print_context(val):
    print(val)


def print_text():
    print('Hello-World')


t1 = PythonOperator(task_id='multitask1', op_kwargs={'val': {'a': 1, 'b': 2}},
                    python_callable=print_context, dag=dags)
t2 = PythonOperator(task_id='multitask2', python_callable=print_text, dag=dags)

# multitask2 runs only after multitask1 has succeeded.
t2.set_upstream(t1)
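
Before deploying, it can help to check the task callables on their own. PythonOperator calls python_callable with op_kwargs unpacked as keyword arguments, so the same call can be reproduced in plain Python without a running Airflow. The call_like_operator helper below is a hypothetical stand-in for illustration, not Airflow code.

```python
import io
from contextlib import redirect_stdout


def print_context(val):
    print(val)


def print_text():
    print('Hello-World')


def call_like_operator(python_callable, op_kwargs=None):
    """Call a task function with op_kwargs as keyword arguments and capture its output.

    Hypothetical test helper: it only mimics python_callable(**op_kwargs).
    """
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        python_callable(**(op_kwargs or {}))
    return buffer.getvalue()


# Same arguments as multitask1 and multitask2 in the DAG.
print(call_like_operator(print_context, {'val': {'a': 1, 'b': 2}}), end='')
print(call_like_operator(print_text), end='')
```

If a callable's signature does not match the keys in op_kwargs, this raises a TypeError locally instead of failing later inside a worker.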

Deployment

Save the DAG code as airflow_test.py and copy it into the DAGs folder:

    cp airflow_test.py ~/airflow/dags/
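
A DAG file with a syntax error will not load in the UI, so a quick check before copying can save a round trip. The sketch below is one possible way to do that in Python; deploy_dag is a hypothetical helper, and it assumes the DAGs folder is the ~/airflow/dags directory created during setup.

```python
import shutil
from pathlib import Path


def deploy_dag(src, dags_dir):
    """Copy a DAG file into dags_dir after checking that it is valid Python.

    Hypothetical helper: it only catches syntax errors, not Airflow-level mistakes.
    """
    src = Path(src)
    # compile() raises SyntaxError on invalid code without executing the file.
    compile(src.read_text(), str(src), "exec")
    dags_dir = Path(dags_dir).expanduser()
    dags_dir.mkdir(parents=True, exist_ok=True)
    # shutil.copy returns the path of the copied file inside dags_dir.
    return Path(shutil.copy(str(src), str(dags_dir)))


# Example call (assumes airflow_test.py is in the current directory):
# deploy_dag("airflow_test.py", "~/airflow/dags")
```
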
[Figure: test_dag in the Airflow UI]
[Figure: test_dag after execution]
[Figure: Graph View of test_dag showing its two tasks]
[Figure: code of test_dag]
[Figure: selecting multitask1]
[Figure: View Log showing the output of multitask1 highlighted in the log]

Other Airflow features worth exploring:

  • XComs: cross-communication between tasks
  • Authentication and authorisation: user management for logging in and accessing DAGs
  • Scheduler: run a DAG on a schedule, like a cron job
  • SMTP and Slack: automated email or Slack alerts for successes and failures
  • Retries and retry delay: automatic retries on failure, with a delay between two retry attempts
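
To make the retry settings concrete, here is a minimal plain-Python sketch of what 'retries': 2 with a 'retry_delay' of one minute means: one initial attempt plus up to two more, with a pause between them. This is a simplified illustration, not how Airflow implements retries internally; run_with_retries and flaky_task are made-up names.

```python
import time
from datetime import timedelta


def run_with_retries(task, retries, retry_delay, sleep=time.sleep):
    """Run task(); on failure wait retry_delay and try again, up to `retries` extra times.

    Simplified illustration only. Returns (result, number_of_attempts).
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return task(), attempts
        except Exception:
            if attempts > retries:
                raise  # retries exhausted: give up and surface the error
            sleep(retry_delay.total_seconds())


# Example: a task that fails twice, then succeeds on the third attempt.
calls = {"count": 0}


def flaky_task():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return "done"


# sleep is replaced with a no-op so the example does not actually wait.
result, attempts = run_with_retries(
    flaky_task, retries=2, retry_delay=timedelta(minutes=1), sleep=lambda seconds: None
)
print(result, attempts)
```

With retries set to 2, a task that keeps failing is attempted three times in total before it is reported as failed.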

“There were 5 Exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days.” — Eric Schmidt

Common shortfalls

  • pip install apache-airflow==1.9.0
    Note the package name, apache-airflow. The package was called airflow up to version 1.8 and was renamed to apache-airflow when the project moved to the Apache community.
  • args = {'owner': 'srivathsan', 'start_date': datetime(2018, 10, 1), 'retries': 2, 'retry_delay': timedelta(minutes=1)}
    These are a few of the arguments passed to DAG() when creating the DAG instance. Avoid datetime.now() as the start_date: the scheduler triggers a run only once start_date plus one schedule interval has passed, so the start date must be a fixed datetime in the past, not the present.
  • Make sure airflow webserver, airflow worker and airflow scheduler are all running before executing a DAG. You can also run these services in the background:
    nohup airflow webserver >> ~/airflow/logs/webserver.log &
    nohup airflow worker >> ~/airflow/logs/worker.log &
    nohup airflow scheduler >> ~/airflow/logs/scheduler.log &
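
The start_date point above is easier to see with dates. The sketch below is a simplified model of the scheduling rule, not Airflow's actual code: a run becomes due only after the first full interval following start_date has ended. The function name first_run_due and the fixed "current time" are assumptions for the example.

```python
from datetime import datetime, timedelta


def first_run_due(start_date, schedule_interval, now):
    """Return True once the first schedule interval after start_date has fully elapsed.

    Simplified model of the scheduling rule, not Airflow's implementation.
    """
    return now >= start_date + schedule_interval


now = datetime(2018, 10, 3, 12, 0)  # pretend "current time" for the example
daily = timedelta(days=1)           # Airflow's default schedule_interval

# Fixed start_date in the past: the first interval has already ended.
print(first_run_due(datetime(2018, 10, 1), daily, now))

# start_date equal to the current time: the interval end is always a day ahead.
print(first_run_due(now, daily, now))
```

Because datetime.now() is re-evaluated every time the DAG file is parsed, the second case never catches up, which is why the DAG appears to do nothing.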

“Everything is going to be connected to cloud and data… All of this will be mediated by software.” — Satya Nadella

Conclusion

Happy Learning 😃
