Airflow, the easy way

Julien Kervizic
Dec 1, 2018

What is Airflow

Airflow is a data orchestration and scheduling platform; in layman’s terms, it is a tool to manage your data-flows and data operations. It enables better management of what would otherwise have been created through a cron job. Airflow revolves around the concept of directed acyclic graphs (DAGs), collections of tasks organized in a directional manner that handles their dependencies.
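
As a rough illustration, a DAG is just Python code; the minimal sketch below (Airflow 1.x-style imports, with a hypothetical dag_id and task) declares a daily DAG with a single task:

```python
# Minimal DAG sketch -- the dag_id, schedule and task are illustrative only.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # import path differs in Airflow 2.x

dag = DAG(
    dag_id="example_dataflow",         # hypothetical DAG name
    start_date=datetime(2018, 11, 1),
    schedule_interval="@daily",        # run once a day, instead of a cron entry
)

# A single task wrapped in an operator; real DAGs chain several of these.
extract = BashOperator(
    task_id="extract_data",
    bash_command="echo 'pulling data'",
    dag=dag,
)
```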

Airflow offers a management interface showcasing the status of every DAG run: whether it succeeded, failed, is running, or is stuck in a retry cycle.

It is possible to dive into the status of the different tasks of a DAG; above, for instance, are tasks pulling data on sponsored products from Amazon’s Ads API for a few European marketplaces. Each marketplace has its own set of tasks run periodically. Airflow also provides the possibility to get alerted on failure or on a missed SLA.
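
As a sketch of how such alerting is commonly wired up, failure emails and an SLA can be set through the DAG’s default_args; the address, SLA and task below are placeholders, and an SMTP backend has to be configured separately for the emails to actually be sent:

```python
# Hedged sketch of failure/SLA alerting via default_args (addresses and ids are placeholders).
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "email": ["alerts@example.com"],    # placeholder address
    "email_on_failure": True,           # mail out when a task fails
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
    "sla": timedelta(hours=1),          # flag task runs that miss a one-hour SLA
}

dag = DAG(
    dag_id="sponsored_products_pull",   # illustrative name
    start_date=datetime(2018, 11, 1),
    schedule_interval="@daily",
    default_args=default_args,
)

pull_de = BashOperator(
    task_id="pull_marketplace_de",      # e.g. one task per marketplace
    bash_command="echo 'pulling DE sponsored products'",
    dag=dag,
)
```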

Setting up Airflow

The easiest way to set up Airflow is through one of the Docker images from puckel; you can use one of the docker-compose YAML files to set up an environment.

With docker-compose, setting up an environment for development purposes is as easy as running “docker-compose up”.

Airflow allows for a choice of executor, Local or Celery. There are some limitations in terms of what a local executor can do.

The Celery executor allows tasks to be dispatched across multiple “workers”, instances meant to process the different DAGs and tasks.

The local executor, on the other hand, provides an integrated solution that lets you run the different Airflow components (front end, scheduler, worker, …) on a single instance.

It is possible to set up Airflow on AWS based on the docker-compose files mentioned before and Amazon Elastic Container Service. In essence, what is needed is to set up container instances of puckel/Airflow with different environment variables and different commands to execute.

Rather than using the default Redis and Postgres images provided within puckel’s docker-compose file, it is better to use Amazon’s managed services for Redis and RDS for Postgres.

On Azure, it is possible to host your own version of the Airflow container and launch it as part of a container instance, an app service, or within Azure Kubernetes Service.

Developing for Airflow (DAGs)

DAGs in Airflow are built around three main concepts: operators, sensors, and dependencies. Together they allow you to programmatically build sets of tasks along with their relationships and interdependencies.

Operators are wrappers around the specific code of the tasks that you wish to execute. They can be used to wrap plain code in different languages such as Python and PHP, or execution steps such as fetching data from an FTP server or moving files to a data store such as HDFS.
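
A minimal sketch of two common operators, one wrapping a shell step and one wrapping plain Python code (the DAG name, task ids and commands are illustrative only):

```python
# Illustrative operators wrapped in a minimal DAG; names and commands are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

dag = DAG("operator_examples", start_date=datetime(2018, 11, 1), schedule_interval="@daily")


def transform():
    # placeholder for the actual task code
    print("transforming data")


fetch_from_ftp = BashOperator(
    task_id="fetch_from_ftp",
    bash_command="echo 'fetching file from FTP'",  # placeholder shell step
    dag=dag,
)

transform_data = PythonOperator(
    task_id="transform_data",
    python_callable=transform,
    dag=dag,
)
```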

Sensors are a specific type of operator whose role is to check whether certain conditions have been met: for instance, that a file has been placed in an FTP folder, or that a partition in a database has been created.
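
A sketch of a sensor holding downstream work until a file lands (the file path is a placeholder, and the FileSensor import path varies between Airflow versions):

```python
# Sketch of a sensor waiting on a file; the path below is a placeholder.
# In Airflow 1.x FileSensor lives in airflow.contrib.sensors.file_sensor
# (airflow.sensors.filesystem in Airflow 2.x).
from datetime import datetime

from airflow import DAG
from airflow.contrib.sensors.file_sensor import FileSensor

dag = DAG("sensor_example", start_date=datetime(2018, 11, 1), schedule_interval="@daily")

wait_for_file = FileSensor(
    task_id="wait_for_incoming_file",
    filepath="/data/incoming/orders.csv",  # placeholder location
    poke_interval=60,                      # re-check every minute
    timeout=60 * 60,                       # give up after an hour
    dag=dag,
)
```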

Airflow allows for defining dependencies between tasks and only executes a task once its upstream dependencies have been met. This is done through the set_upstream and set_downstream functions or through the bitshift operators (<< and >>).
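
A short sketch of both styles of dependency wiring, using placeholder tasks:

```python
# Sketch of dependency wiring with three placeholder tasks; both forms are equivalent.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG("dependency_example", start_date=datetime(2018, 11, 1), schedule_interval="@daily")

wait = DummyOperator(task_id="wait_for_file", dag=dag)
fetch = DummyOperator(task_id="fetch_from_ftp", dag=dag)
load = DummyOperator(task_id="load_to_hdfs", dag=dag)

# Bitshift form: wait runs first, then fetch, then load.
wait >> fetch >> load

# Equivalent explicit form:
# wait.set_downstream(fetch)
# load.set_upstream(fetch)
```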

Airflow allows for managing what should be done if a dependency is not fully met. This can be done by setting the trigger rule of the different tasks (all_success, all_failed, all_done, one_success, …) within an operator.
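
A sketch of a trigger rule, assuming a hypothetical cleanup task that should run once all its upstream tasks have finished regardless of success or failure:

```python
# Sketch of a trigger rule: cleanup runs once all upstream tasks are done,
# whether they succeeded or failed (task ids are placeholders).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator

dag = DAG("trigger_rule_example", start_date=datetime(2018, 11, 1), schedule_interval="@daily")

pull_de = DummyOperator(task_id="pull_marketplace_de", dag=dag)
pull_fr = DummyOperator(task_id="pull_marketplace_fr", dag=dag)

cleanup = BashOperator(
    task_id="cleanup_tmp_files",
    bash_command="echo 'cleaning up'",
    trigger_rule="all_done",  # alternatives: all_success (default), all_failed, one_success, ...
    dag=dag,
)

cleanup.set_upstream([pull_de, pull_fr])
```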

Wrapping up

Airflow provides tools that make it easier to manage data-flows and data processing steps in an integrated manner. It is built with a strong engineering mindset, and pipeline definitions are written as code, which makes it possible to handle the generation of pipelines programmatically.

A set of Docker containers exists to make it easy to set up, both as a development environment and in the cloud. Creating DAGs for data flows and processing purposes does require some experience and knowledge, but it is fairly easy to pick up and start developing with.
