Building a resilient Anomaly Detection pipeline with Apache Airflow

Manish Shambu
4 min read · Mar 16, 2020


Airflow is known for its powerful scheduling and orchestration capabilities. This article assumes that you already have some idea about Airflow; if not, this is a great read.

Visa Inc. processes millions of transactions every single day and moves trillions of dollars every year. Reliability plays an important role, and Visa does everything it can to keep the network up and running. It is crucial to detect transaction failures and raise alerts in near real time.

Disclaimer: Due to security constraints at Visa, this article omits many of the actual implementation details. However, the information provided is generic enough to understand and build a resilient, scalable data pipeline using Airflow!

Although Airflow supports many features, I discuss only a small portion of them here.

Airflow supports multiple types of executors, such as Local, Sequential, Celery, and Kubernetes. The executor can be chosen based on the requirements of your data pipeline; I will focus on the Celery executor to keep things simple. Celery is an open-source distributed asynchronous task queue. Tasks are executed by a number of distributed workers, called Celery workers, that can listen on specific task queues. Airflow integrates the Celery open-source project in order to distribute workloads across multiple machines.
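
To make this concrete, here is a minimal airflow.cfg sketch for enabling the Celery executor. The broker and result-backend URLs are illustrative placeholders, not values from the actual pipeline:

    [core]
    executor = CeleryExecutor

    [celery]
    # Message broker the scheduler uses to hand tasks to workers.
    broker_url = redis://redis-host:6379/0
    # Backend where workers record task state and results.
    result_backend = db+postgresql://airflow:airflow@pg-host:5432/airflow
    # Number of tasks each worker process can run at once.
    worker_concurrency = 16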

Typically, every data pipeline job has three phases: Extract, Transform, and Load.
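
In Airflow, these three phases map naturally onto three chained tasks. Below is a minimal sketch of such a DAG; the callables are empty placeholders rather than the actual implementation:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    def extract(**context):
        # Pull raw transactions from the source database.
        pass

    def transform(**context):
        # Clean and aggregate the extracted records.
        pass

    def load(**context):
        # Write the transformed records to the target store.
        pass

    dag = DAG(
        dag_id="etl_pipeline",
        start_date=datetime(2020, 3, 1),
        schedule_interval="*/15 * * * *",  # run every 15 minutes
        catchup=False,
    )

    extract_task = PythonOperator(task_id="extract", python_callable=extract, dag=dag)
    transform_task = PythonOperator(task_id="transform", python_callable=transform, dag=dag)
    load_task = PythonOperator(task_id="load", python_callable=load, dag=dag)

    extract_task >> transform_task >> load_task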

Building a highly available extract phase

Extract is one of the most important steps in any ETL job. Without data it is impossible to do anything further, so making the data available is crucial.

Extract phase using Airflow's branch operator

Airflow's branch operator works just like an if-else statement. Based on which of the primary and secondary databases is active, you can branch into one of the available databases and perform the extract phase. Airflow's operators such as MySqlOperator, SqliteOperator, PostgresOperator, MsSqlOperator, OracleOperator, and JdbcOperator provide an interface to connect to the corresponding databases.
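
Here is a sketch of what the failover branching could look like. The primary_is_healthy() check and the extract callables are hypothetical placeholders; in practice the health check might run a lightweight query through the matching hook (e.g. MySqlHook):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python_operator import BranchPythonOperator, PythonOperator
    from airflow.operators.dummy_operator import DummyOperator

    dag = DAG("extract_failover", start_date=datetime(2020, 3, 1),
              schedule_interval="@hourly", catchup=False)

    def primary_is_healthy():
        # Hypothetical check, e.g. a "SELECT 1" against the primary database.
        return True

    def choose_database(**context):
        # Return the task_id of the branch Airflow should follow.
        return "extract_primary" if primary_is_healthy() else "extract_secondary"

    def extract_primary_fn(**context):
        pass  # placeholder: read from the primary database

    def extract_secondary_fn(**context):
        pass  # placeholder: read from the secondary database

    choose = BranchPythonOperator(task_id="choose_database", python_callable=choose_database, dag=dag)
    extract_primary = PythonOperator(task_id="extract_primary", python_callable=extract_primary_fn, dag=dag)
    extract_secondary = PythonOperator(task_id="extract_secondary", python_callable=extract_secondary_fn, dag=dag)

    # The join task must tolerate one skipped branch, hence the trigger rule.
    join = DummyOperator(task_id="extract_done", trigger_rule="one_success", dag=dag)

    choose >> [extract_primary, extract_secondary] >> join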

Building fast transform and load phases

Since Airflow is a DAG scheduler, it can schedule independent tasks in parallel. We can exploit pipelining to process these phases concurrently, as illustrated below.

ETL Pipelining
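
A sketch of that fan-out, assuming the extracted data can be split into independent partitions (the three region names are purely illustrative): while one partition is loading, another can still be transforming, and Celery workers pick up the parallel tasks concurrently.

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    dag = DAG("etl_fanout", start_date=datetime(2020, 3, 1),
              schedule_interval="@hourly", catchup=False)

    def extract(**context):
        pass  # placeholder: pull and partition the raw data

    def transform(partition, **context):
        pass  # placeholder: transform one partition

    def load(partition, **context):
        pass  # placeholder: load one partition

    extract_task = PythonOperator(task_id="extract", python_callable=extract, dag=dag)

    # One independent transform -> load chain per partition; Airflow runs them in parallel.
    for partition in ["amer", "emea", "apac"]:
        t = PythonOperator(task_id="transform_" + partition, python_callable=transform,
                           op_kwargs={"partition": partition}, dag=dag)
        l = PythonOperator(task_id="load_" + partition, python_callable=load,
                           op_kwargs={"partition": partition}, dag=dag)
        extract_task >> t >> l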

Building a highly available and fast anomaly detection phase

Millions of transactions are processed at hundreds of merchant terminal outlets, and detecting anomalies in near real time can be challenging. Failures at each merchant terminal can be unique, so we should support a dedicated anomaly detection model for each merchant. Building a near-real-time anomaly detection pipeline for hundreds of merchants on a single machine is impossible. Airflow's Celery executor orchestrates and distributes this workload across multiple machines. It also supports SLAs and retries on failure. There can be multiple Celery worker processes, and each one can execute tasks independently. We can elastically scale up or down by adding or removing Celery workers on the queue.
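
One way to express this in Airflow is to generate a detection task per merchant inside the DAG file, route the tasks to a dedicated Celery queue, and attach retries and an SLA. The merchant list and detect_anomalies() below are hypothetical placeholders:

    from datetime import datetime, timedelta
    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    dag = DAG("anomaly_detection", start_date=datetime(2020, 3, 1),
              schedule_interval="*/5 * * * *", catchup=False)

    # In practice this list would come from a metadata store.
    merchants = ["merchant_001", "merchant_002", "merchant_003"]

    def detect_anomalies(merchant_id, **context):
        # Placeholder: load the merchant-specific model and score recent transactions.
        pass

    for merchant_id in merchants:
        PythonOperator(
            task_id="detect_" + merchant_id,
            python_callable=detect_anomalies,
            op_kwargs={"merchant_id": merchant_id},
            queue="anomaly_detection",   # picked up by Celery workers listening on this queue
            retries=3,                   # retry on transient failures
            sla=timedelta(minutes=10),   # flag the task if it misses its SLA
            dag=dag,
        )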

Worker pools from multiple data centers

To build a disaster-resilient pipeline, we have to consider pooling workers from multiple data centers. If one data center goes down, workers from another data center can still process the tasks and keep the pipeline running. All these workers and queues can be monitored from Celery's Flower UI. Airflow's scheduler creates tasks and distributes the workload to these workers through a message broker such as Redis or RabbitMQ.
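
A sketch of how that routing could look, with one hypothetical queue per data center; a worker pool in each data center is then started against its own queue (in Airflow 1.10, airflow worker -q <queue>):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    dag = DAG("multi_dc_detection", start_date=datetime(2020, 3, 1),
              schedule_interval="*/5 * * * *", catchup=False)

    def detect(**context):
        pass  # placeholder detection task

    # Queue names are illustrative. Start a worker pool in each data center:
    #   airflow worker -q dc1_tasks    (machines in data center 1)
    #   airflow worker -q dc2_tasks    (machines in data center 2)
    detect_dc1 = PythonOperator(task_id="detect_dc1", python_callable=detect, queue="dc1_tasks", dag=dag)
    detect_dc2 = PythonOperator(task_id="detect_dc2", python_callable=detect, queue="dc2_tasks", dag=dag)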

Performance Tuning

  1. Scheduler threads — increasing max_threads in airflow.cfg lets the scheduler process multiple DAG files in parallel and can help decrease scheduling latency (see the sample airflow.cfg sketch after this list).
  2. Scheduler heartbeat — the more frequent the scheduler heartbeat (scheduler_heartbeat_sec), the faster DAGs get scheduled, at the cost of extra load on the metadata database.
  3. Celery worker concurrency — set worker_concurrency to a balanced value; too small leaves workers idle, too large overloads them.
  4. parallelism, dag_concurrency, max_active_runs_per_dag — again, these variables should be tuned carefully. Raising max_active_runs_per_dag past an optimal value can create bottlenecks and cause tasks to queue up.
  5. https://medium.com/handy-tech/airflow-tips-tricks-and-pitfalls-9ba53fba14eb covers some of the common errors you might encounter while working with Airflow.
  6. run_duration — set this scheduler option to -1 in airflow.cfg if you want the scheduler to run indefinitely. Also consider restarting the scheduler every 10–15 days to keep it healthy.
  7. catchup — set this to False unless you want to backfill missed runs.
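
Putting those knobs together, here is an illustrative airflow.cfg sketch; the values are starting points to tune against your workload, not recommendations:

    [core]
    # Max task instances running across the whole installation.
    parallelism = 64
    # Max running tasks per DAG.
    dag_concurrency = 16
    # Cap concurrent runs of a single DAG to avoid task pileup.
    max_active_runs_per_dag = 2

    [scheduler]
    # Number of processes the scheduler uses to parse DAG files in parallel.
    max_threads = 4
    # More frequent heartbeats mean faster scheduling (at some database cost).
    scheduler_heartbeat_sec = 5
    # -1 keeps the scheduler running until it is restarted.
    run_duration = -1
    # Do not backfill past runs unless explicitly requested.
    catchup_by_default = False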

Some amazing Airflow blogs

  1. https://eng.lyft.com/running-apache-airflow-at-lyft-6e53bb8fccff
  2. https://medium.com/handy-tech/airflow-tips-tricks-and-pitfalls-9ba53fba14eb
  3. https://medium.com/airbnb-engineering/airflow-a-workflow-management-platform-46318b977fd8
  4. https://blog.clairvoyantsoft.com/tagged/airflow

Without Airflow, orchestrating these pipelines and processing the data would have been a nightmare. Thanks to the Airflow open-source project and its community!

Have any questions or thoughts? Let me know in the comments below! 👇
