Apache Airflow — Plugins, SubDAGs and SLAs

M Haseeb Asif
Big Data Processing
3 min read · Apr 6, 2020

Airflow Plugins

Airflow was built with the intention of allowing its users to extend and customize its functionality through plugins. The most common types of user-created plugins for Airflow are Operators and Hooks. These plugins make DAGs reusable and simpler to maintain.

Custom hooks allow Airflow to integrate with non-standard data stores and systems. Operators and hooks for common data tools like Apache Spark and Cassandra, as well as vendor-specific integrations for Amazon Web Services, Azure, and Google Cloud Platform, can be found in Airflow contrib. If the functionality exists but is not quite what you want, that’s a great opportunity to add it through an open-source contribution.
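
As a rough, Airflow 1.10-style sketch (the class, connection id, and client are hypothetical), a custom hook typically extends BaseHook and wraps the connection handling for the external system:

from airflow.hooks.base_hook import BaseHook

class MyServiceHook(BaseHook):
    """Hypothetical hook for a non-standard data store (illustration only)."""

    def __init__(self, my_conn_id="my_service_default"):
        super().__init__(source=None)
        self.my_conn_id = my_conn_id

    def get_conn(self):
        # Pull host/login/password from the Airflow connection named my_conn_id
        conn = self.get_connection(self.my_conn_id)
        # Build and return a client for the external system here, e.g.
        # return MyServiceClient(host=conn.host, user=conn.login, password=conn.password)
        return conn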

To create a custom operator, we first identify operators that perform similar functions and can be consolidated. Then it is easy to create an operator in the plugins folder. Every custom operator needs to extend BaseOperator and implement the execute method; Airflow calls execute to trigger the operator's execution.

from airflow.models.baseoperator import BaseOperator
from airflow.utils.decorators import apply_defaults

class HelloOperator(BaseOperator):

    @apply_defaults
    def __init__(self, name: str, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        self.name = name

    def execute(self, context):
        message = "Hello {}".format(self.name)
        print(message)
        return message

For the import to work, we have to place the file in a directory that is on the PYTHONPATH environment variable. By default, Airflow adds the dags/, plugins/, and config/ directories under the Airflow home to PYTHONPATH.
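
For illustration, a DAG could then use the operator like this (the module name, DAG id, and schedule are placeholders, assuming the class above is saved as plugins/hello_operator.py):

from datetime import datetime

from airflow import DAG
from hello_operator import HelloOperator  # hypothetical module path

with DAG("hello_dag",
         start_date=datetime(2020, 4, 1),
         schedule_interval="@daily") as dag:
    hello_task = HelloOperator(task_id="say_hello", name="Airflow")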

Best Practices

Every task in your DAG should perform only one job. If you revisit a pipeline you wrote after a six-month absence, you should be able to understand easily how it works and what the lineage of the data is, provided the boundaries between tasks are clear and well defined. This is true both in the code itself and within the Airflow UI. Tasks that do just one thing are also often easier to parallelize, and this parallelization can offer a significant speedup in the execution of our DAGs. Hence, DAG tasks should be designed so that they (a short sketch follows this list):

  • Are atomic and have a single purpose
  • Maximize parallelism
  • Make failure states obvious
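
As a hedged sketch (the DAG id, task names, and callables are made up), here are three single-purpose tasks where two independent transforms run in parallel after an extract:

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def extract():
    print("pull raw data from the source")  # placeholder logic

def transform_orders():
    print("transform the orders table")  # placeholder logic

def transform_users():
    print("transform the users table")  # placeholder logic

with DAG("atomic_tasks_example",
         start_date=datetime(2020, 4, 1),
         schedule_interval="@daily") as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    orders_task = PythonOperator(task_id="transform_orders", python_callable=transform_orders)
    users_task = PythonOperator(task_id="transform_users", python_callable=transform_users)

    # The two independent transforms can run in parallel once extract succeeds
    extract_task >> [orders_task, users_task]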

SubDAGs

Commonly repeated series of tasks within DAGs can be captured as reusable SubDAGs. For example, when copying data from different tables and then cleansing them, we can create a SubDAG for similar data types. Benefits include:

  • Decrease the amount of code we need to write and maintain to create a new DAG
  • Easier to understand the high-level goals of a DAG
  • Bug fixes, speedups, and other enhancements can be made more quickly and distributed to all DAGs that use that SubDAG

On the other hand, some of the downsides of SubDAGs are:

  • Limit the visibility within the Airflow UI
  • Abstraction makes understanding what the DAG is doing more difficult
  • Encourages premature optimization

We can also nest SubDAGs. However, you should have a really good reason to do so, because it makes it much harder to understand what is going on in the code. Generally, SubDAGs are not necessary at all, let alone SubDAGs within SubDAGs.
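
A minimal sketch of the pattern (the factory function, DAG ids, and table name are hypothetical), using the 1.10-era SubDagOperator:

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.subdag_operator import SubDagOperator

def copy_and_cleanse_subdag(parent_dag_id, child_task_id, start_date, schedule_interval, table):
    """Factory returning a SubDAG that copies and then cleanses one table."""
    subdag = DAG(
        dag_id="{}.{}".format(parent_dag_id, child_task_id),
        start_date=start_date,
        schedule_interval=schedule_interval,
    )
    copy = PythonOperator(task_id="copy", python_callable=lambda: print("copy " + table), dag=subdag)
    cleanse = PythonOperator(task_id="cleanse", python_callable=lambda: print("cleanse " + table), dag=subdag)
    copy >> cleanse
    return subdag

START = datetime(2020, 4, 1)

with DAG("copy_pipeline", start_date=START, schedule_interval="@daily") as dag:
    copy_orders = SubDagOperator(
        task_id="copy_orders",
        subdag=copy_and_cleanse_subdag("copy_pipeline", "copy_orders", START, "@daily", "orders"),
    )

The same factory can be reused for other tables, which is where the reduction in copied code comes from.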

SLAs

Emails and Alerts

Airflow can be configured to send emails on DAG and task state changes, including successes, failures, and retries. Failure emails allow you to trigger alerts easily; it is common for alerting systems like PagerDuty to accept emails as a source of alerts. If a mission-critical data pipeline fails, you need to know as soon as possible so you can get online and fix it.
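
For example, failure emails can be enabled per DAG through default_args, as sketched below (the address and DAG id are placeholders; SMTP settings themselves live in airflow.cfg):

from datetime import datetime

from airflow import DAG

default_args = {
    "owner": "airflow",
    "email": ["oncall@example.com"],  # placeholder address
    "email_on_failure": True,
    "email_on_retry": False,
    "retries": 1,
}

with DAG("critical_pipeline",
         default_args=default_args,
         start_date=datetime(2020, 4, 1),
         schedule_interval="@daily") as dag:
    pass  # tasks defined here inherit the alerting defaults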

Metrics

Airflow comes out of the box with the ability to emit system metrics through a metrics aggregator called statsd. Statsd can be coupled with metrics visualization tools like Grafana to give you and your team high-level insights into the overall performance of your DAGs, jobs, and tasks. These metrics can also be fed into your alerting system, such as PagerDuty, so that problems are dealt with immediately. Watching long-term trends in these system-level metrics lets you and your team stay ahead of issues before they even occur.
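
As a minimal sketch, statsd emission is turned on in airflow.cfg; in Airflow 1.10 these options live under the [scheduler] section (the host and prefix below are illustrative):

[scheduler]
statsd_on = True
statsd_host = localhost
statsd_port = 8125
statsd_prefix = airflow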

References:
1. Notes from Udacity course
2. https://airflow.apache.org/docs/stable/howto/custom-operator.html
