Introduction to Apache Airflow

ODSC - Open Data Science
Mar 16, 2020

Apache Airflow is a tool created by the community to programmatically author, schedule, and monitor workflows. The biggest advantage of Airflow is the fact that it does not limit the scope of pipelines. Airflow can be used for building Machine Learning models, transferring data, or managing infrastructure. Let’s take a closer look at this trending workflow management tool.

Pure Python

Apache Airflow is one of a few Apache projects that are written in Python.

Taking into account that our users are mostly data scientists and data engineers, this gives us a huge advantage: most of them are familiar not only with the Python language but with the whole ecosystem they use on a daily basis. And it is pure Python work (snakes, snakes everywhere)! Airflow is not only written in Python, it also expects you to write your workflows in the language. Yes, you got it: Airflow workflows are written in pure Python!

No more declarative XML or YAML. This approach has a lot of pros, including:

  • Dynamic workflows using `for` and `while` loops and other Python programming constructs
  • Schedules expressed with standard Python `datetime`/`timedelta` objects or cron syntax (see the short sketch after this list)
  • Easy to add new integrations, because they are also written in Python!
  • Contrary to a number of “cloud-native” workflow engines, there is no need to know about Docker images, containers, or registries. You still can use them (Airflow has excellent Docker and Kubernetes integrations), but you only need Python to do your job.
  • Since Airflow is not bound to OCI/Docker containers, it can easily integrate with any container approach (for example Singularity, which is popular amongst bioinformatics researchers)
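For example, a schedule can be expressed as a plain Python `timedelta`, a cron string, or an Airflow preset. Here is a minimal sketch (the DAG names below are made up purely for illustration):

import datetime

from airflow.models import DAG
from airflow.utils.dates import days_ago

# Run every 6 hours, expressed as a plain Python timedelta
every_six_hours = DAG(
    "schedule_with_timedelta",
    default_args={"start_date": days_ago(1)},
    schedule_interval=datetime.timedelta(hours=6),
)

# The same cadence expressed as a cron string
every_six_hours_cron = DAG(
    "schedule_with_cron",
    default_args={"start_date": days_ago(1)},
    schedule_interval="0 */6 * * *",
)

# Or simply use one of Airflow's presets
daily = DAG(
    "schedule_daily",
    default_args={"start_date": days_ago(1)},
    schedule_interval="@daily",
)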

Here is an example of a simple Airflow workflow:

import time
import datetime

from airflow.models import DAG
from airflow.operators.email import EmailOperator
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago


def task_function(random_base):
    """This function will be called during workflow execution"""
    time.sleep(random_base)


with DAG(
    "example_workflow",
    default_args={"start_date": days_ago(1)},
    schedule_interval=datetime.timedelta(minutes=40),
) as dag:
    # Notification tasks that bracket the actual work
    start_email = EmailOperator(
        task_id="start_email",
        to="name@example.com",
        subject="The pipeline has started",
        html_content="<p>Your pipeline has started</p>",
    )
    end_email = EmailOperator(
        task_id="end_email",
        to="name@example.com",
        subject="The pipeline has finished",
        html_content="<p>Your pipeline has finished</p>",
    )

    # Dynamically generate five sleeping tasks in a plain Python loop
    for i in range(5):
        task = PythonOperator(
            task_id="sleep_for_" + str(i),
            python_callable=task_function,
            op_kwargs={"random_base": float(i) / 10},
        )
        # Each sleep task runs after start_email and before end_email
        start_email >> task >> end_email

In the example, we define two tasks using `EmailOperator`: `start_email` and `end_email`. Then, in a `for` loop, we add five more tasks that execute the Python callable `task_function`, which simply sleeps.

The interesting part is the line `start_email >> task >> end_email`, because it defines a relationship between each loop task and both the `start_email` and `end_email` tasks. In this way Airflow will first run the `start_email` task, then execute all the `sleep_for_*` tasks, and finally send another email via the `end_email` task. In the Airflow web UI, this workflow is rendered as exactly that graph of dependencies.
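As a side note, the `>>` operator also accepts lists of tasks, so the same fan-out and fan-in shape can be declared more compactly. Here is a minimal sketch reusing the names from the example above (the `sleep_tasks` list is introduced only for illustration):

# Inside the same `with DAG(...)` block as above, instead of the for loop:
sleep_tasks = [
    PythonOperator(
        task_id="sleep_for_" + str(i),
        python_callable=task_function,
        op_kwargs={"random_base": float(i) / 10},
    )
    for i in range(5)
]

# The >> operator accepts a list: fan out after start_email, fan back in before end_email
start_email >> sleep_tasks >> end_email

# The same relationships can also be declared with explicit methods:
# start_email.set_downstream(sleep_tasks)
# end_email.set_upstream(sleep_tasks)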

One tool to rule them all

The most important thing about Airflow is the fact that it is an “orchestrator.” Airflow does not process data on its own; it only tells other systems what has to be done and when. And that is its biggest advantage. While it is designed to support data processing, and it is excellent at it, workflows are not limited to it. Airflow workflows can mix a lot of different tasks. For example, a workflow can create a data processing infrastructure, use it to run the data workload, generate and send reports, and finally tear down the infrastructure to minimize costs.
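To illustrate that last point, here is a rough sketch of such a workflow using operators from Airflow’s Google provider package. The project, region, bucket, and cluster settings below are placeholders, and a production cluster configuration would need more fields:

from airflow.models import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
    DataprocDeleteClusterOperator,
    DataprocSubmitJobOperator,
)
from airflow.utils.dates import days_ago

with DAG(
    "ephemeral_dataproc_workflow",
    default_args={"start_date": days_ago(1)},
    schedule_interval="@daily",
) as dag:
    # Spin up the processing infrastructure only for the duration of the run
    create_cluster = DataprocCreateClusterOperator(
        task_id="create_cluster",
        project_id="my-project",          # placeholder
        region="europe-west1",            # placeholder
        cluster_name="ephemeral-cluster",
        cluster_config={"worker_config": {"num_instances": 2}},
    )

    # Run the actual data workload on that cluster
    run_job = DataprocSubmitJobOperator(
        task_id="run_job",
        project_id="my-project",
        region="europe-west1",
        job={
            "placement": {"cluster_name": "ephemeral-cluster"},
            "pyspark_job": {"main_python_file_uri": "gs://my-bucket/job.py"},  # placeholder
        },
    )

    # Tear the cluster down even if the job failed, to minimize costs
    delete_cluster = DataprocDeleteClusterOperator(
        task_id="delete_cluster",
        project_id="my-project",
        region="europe-west1",
        cluster_name="ephemeral-cluster",
        trigger_rule="all_done",
    )

    create_cluster >> run_job >> delete_cluster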

And what is more, Airflow already has plenty (literally several hundred!) of integrations that will handle all the work. That makes the process of creating pipelines really easy. All you have to do is pick some already existing operators and mix them in the desired order. And even if an integration you need is missing, it is very easy to add a new one! All you have to do is create a simple Python class.
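For instance, a minimal custom operator is just a Python class that subclasses `BaseOperator` and implements an `execute` method. The `HelloOperator` below is a made-up placeholder rather than an existing integration:

from airflow.models import BaseOperator


class HelloOperator(BaseOperator):
    """A made-up example operator that only logs a greeting."""

    def __init__(self, name, **kwargs):
        super().__init__(**kwargs)
        self.name = name

    def execute(self, context):
        # Whatever this method does is what the task does at runtime
        self.log.info("Hello %s!", self.name)
        return self.name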

Here is a short list of integrations Airflow currently supports:

  • Bash command
  • Python callable
  • Kubernetes (e.g. running pods)
  • Docker (e.g. running containers)
  • Google Cloud Platform (e.g. Dataproc jobs, GCS uploads)
  • Amazon Web Services (e.g. Redshift queries, S3 uploads)
  • Azure services
  • As of recently, Singularity containers (popular among people with a bioinformatics background)
  • … many, many, many more …

Data intervals

There’s one thing that many new Airflow users find confusing at first, but also eye-opening once you wrap your head around it: Airflow works best for processing data intervals. It is not a streaming data processing solution (and it does not pretend to be one). Airflow works best when there are batched, fixed intervals of data to process. While this seems limiting at first, it allows you to control the quality of your data. Imagine that you realize that the daily data you processed over the last few weeks has to be reprocessed because you found and fixed a bug in your processing pipeline. Or your metadata changed and you can now better categorize your data for the last few weeks.

One of the great examples from Airflow users is processing telecom data, where you map the [IMEI](https://pl.wikipedia.org/wiki/International_Mobile_Equipment_Identity) of a phone to a particular phone model. Mappings from IMEI to phone model are usually delayed by days or weeks, and when you get new metadata you would like to reprocess a few weeks of already processed data with minimal processing time. Airflow, with its data-interval-centric approach, backfill capabilities, and ability to reprocess only the part of a data pipeline that is needed, is a perfect solution to that!
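To make the interval idea concrete, here is a minimal sketch of a daily task that receives its data interval through Airflow’s built-in template variables (the DAG id and script name are placeholders), together with a rough note on how a backfill over past weeks would be triggered:

from airflow.models import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago

with DAG(
    "imei_mapping_workflow",  # placeholder name
    default_args={"start_date": days_ago(7)},
    schedule_interval="@daily",
) as dag:
    # {{ ds }} and {{ next_ds }} are built-in template variables that hold
    # the boundaries of the data interval being processed
    process_interval = BashOperator(
        task_id="process_interval",
        bash_command="python process_imei.py --from {{ ds }} --to {{ next_ds }}",  # placeholder script
    )

# Reprocessing already processed days (e.g. after a metadata fix) is then a
# matter of backfilling the affected date range, roughly:
#   airflow dags backfill --start-date 2020-02-01 --end-date 2020-02-21 imei_mapping_workflow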

If you want to learn more about Airflow and its internals, join us at our upcoming talk at ODSC East, “Introduction to Apache Airflow”!

Jarek Potiuk, Principal Software Engineer & Apache Airflow PMC Member, Polidea | Apache Software Foundation

As the CTO, Jarek grew the software house Polidea ten-fold: from 6 to 60 people. After a few years of being the CTO, he decided to go back to a full-time engineering role, and he now works as a Principal Software Engineer at the company (and is super happy about it). Jarek has worked as an engineer in many industries: telecoms, mobile app development, Google, robotics and artificial intelligence, cloud, and open-source data processing. Jarek is currently a PMC member of the Apache Airflow project.

https://www.linkedin.com/in/jarekpotiuk/
https://github.com/potiuk

Tomasz Urbaszek, Software Engineer & Apache Airflow Committer, Polidea | Apache Software Foundation

Tomek is a software engineer at Polidea, an Apache Airflow committer, and a book lover. He fancies functional programming because he is a mathematics graduate. Every day he tries to make the world a better place.

https://www.linkedin.com/in/tomaszurbaszek/
https://github.com/nuclearpinguin
