Airflow at MOMO

Hai Nguyen
7 min read · Jun 28, 2020


Building a big data platform for a fintech company with a small team, we faced many of the problems common to this industry, and we had to invent solutions that dealt with them in a sustainable way.

The problems:

  • We had a bulk of jobs (service deployment, starting and shutting down clusters in the cloud, launching a variety of jar files, bash scripts, and Python scripts) that ran periodically and integrated with cloud services (AWS, Google Cloud Platform, …).
  • Our job dependencies were complex: they depended on scheduled hours, data availability, job status in cloud services, and the interval between two jobs.
  • Lack of human resources: we were just a small team that owned the full data platform, so creating our own scheduler framework was not realistic. We needed an existing framework that could be easily installed, extended, and maintained.

Thus, we had to find the right tool for the job.

Our requirements:

  • Workflows that can be easily created and version controlled.
  • Easy to understand for other teams, such as the Data Analytics and Data Science teams.
  • A good UI for managing workflows.
  • Easy to extend/customize for our specific requirements.
  • Good support; open source preferred.
  • The basic features of a scheduler tool: a backfill mechanism, manual/automatic triggering of any task, dependency management, and failure handling.

There were many options available.

The Options:

  • Quartz Framework: the core library behind many scheduler frameworks (Oozie, …), but we would have had to build many things ourselves from the start.
  • Oozie: an open-source workflow scheduling system for Hadoop, written in Java. It uses XML and Java to define workflows. In our experience, Oozie takes a long time to set up (unless you already have HDP installed), and understanding the logic needed to create a DAG with complex dependencies in XML is hard. Its community is small and not very active.
  • We finally picked Apache Airflow as our workflow scheduler at MOMO, since it met our requirements and resolved our issues.

What is Airflow?

You can find detailed information about Airflow on its website, https://airflow.apache.org/. Below, we summarize the key concepts:

  • DAG (directed acyclic graph):

A DAG represents a workflow. Each node in the graph is a specific task, and edges define the dependencies between tasks. The DAG is used to organize tasks and set their execution context; the DAG itself does not perform any computation, the tasks do.

  • Operator (task):

Specifies the type of computation to be executed once a task's dependency requirements are met.
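
A minimal sketch of these two concepts together, using Airflow 1.10-era import paths; the dag_id, schedule, and bash commands are illustrative:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# The DAG organizes tasks and sets their execution context; it runs nothing itself.
dag = DAG(
    dag_id="example_etl",              # illustrative name
    start_date=datetime(2020, 6, 1),
    schedule_interval="0 2 * * *",     # every day at 02:00
)

# Operators define the actual computation; each instance is a node in the graph.
extract = BashOperator(task_id="extract", bash_command="echo extract", dag=dag)
load = BashOperator(task_id="load", bash_command="echo load", dag=dag)

# An edge: load depends on extract.
extract >> load
```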

  • Sensor:

A blocking, checking task. It polls the status of a process or a piece of data on a supported platform (for example, checking the state of a query in Google BigQuery; on success, the sensor is marked as successful and its downstream task is triggered).
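
A sketch of the sensor pattern by subclassing BaseSensorOperator (the import path varies by Airflow version, and the query check is a placeholder, not a real BigQuery call):

```python
from airflow.sensors.base_sensor_operator import BaseSensorOperator  # Airflow 1.10 path
from airflow.utils.decorators import apply_defaults


def check_query_state(query_id):
    """Placeholder: in practice this would call the BigQuery API."""
    return "DONE"


class QueryDoneSensor(BaseSensorOperator):
    """Hypothetical sensor that blocks until an external query finishes."""

    @apply_defaults
    def __init__(self, query_id, **kwargs):
        super(QueryDoneSensor, self).__init__(**kwargs)
        self.query_id = query_id

    def poke(self, context):
        # Called every poke_interval seconds until it returns True; only then
        # is the sensor marked success and its downstream task triggered.
        return check_query_state(self.query_id) == "DONE"
```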

How we used Airflow at MOMO

Deployment:

We use an Airflow Docker image (github) on our servers. As mentioned before, for a small team, Docker makes it easy to deploy and experiment with the Airflow scheduler on any server that has a Docker daemon installed. We deploy Airflow in Docker in “Celery” mode, separating the Airflow components from one another; each component is deployed in its own container. This brings us many advantages:

  • Deploying in Docker lets us easily shut Airflow down, change some configuration, and bring it back online.
  • Celery mode lets us scale up the number of workers when needed and execute any task in our DAGs without problems (as long as Celery supports it).

Testing DAG files:

We usually run some smoke tests on DAGs before deploying them to production and pushing them to the master branch (source). The tests use the Airflow library to verify the DAG file syntax and run some Python checks on each DAG's owner and email.
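
A minimal version of such a smoke test, using Airflow's DagBag; the DAG folder path and the owner/email policy are assumptions, not a standard:

```python
from airflow.models import DagBag

# Load every DAG file in the repository; import_errors collects syntax and
# import failures per file.
dag_bag = DagBag(dag_folder="dags/", include_examples=False)
assert not dag_bag.import_errors, dag_bag.import_errors

for dag_id, dag in dag_bag.dags.items():
    args = dag.default_args or {}
    # Illustrative policy checks: every DAG must declare an owner and an alert email.
    assert args.get("owner"), "%s has no owner" % dag_id
    assert args.get("email"), "%s has no alert email" % dag_id
```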

Applied Projects:

BigEyes Project (Transaction Reconciliation): MOMO is a fintech company working with many banks in Vietnam (about 10–12 banks). The purpose of the project is to check the state of each transaction as soon as it enters MOMO. Each bank has a different method for checking status and usually operates only during certain hours of the day and on certain days of the week, so Airflow suits us well: we define a specific workflow for each bank with a predetermined schedule.
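
A sketch of how such per-bank workflows can be generated in one file; the bank names, cron expressions, and reconciliation callable are all illustrative, not our real configuration:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

# Hypothetical per-bank operating windows.
BANK_SCHEDULES = {
    "bank_a": "0 9-17 * * 1-5",  # weekdays, office hours only
    "bank_b": "30 8 * * *",      # once a day at 08:30
}


def reconcile(bank):
    """Placeholder: call the bank-specific API to check transaction states."""
    print("reconciling transactions with %s" % bank)


for bank, schedule in BANK_SCHEDULES.items():
    dag = DAG(
        dag_id="reconcile_%s" % bank,
        start_date=datetime(2020, 6, 1),
        schedule_interval=schedule,
    )
    PythonOperator(
        task_id="check_transactions",
        python_callable=reconcile,
        op_kwargs={"bank": bank},
        dag=dag,
    )
    # Airflow discovers DAGs by scanning module-level globals.
    globals()[dag.dag_id] = dag
```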

Data Platform Project (Data Migration, ETL): Airflow completely suits our data platform project. We batch-load data from many sources, such as Cassandra, Oracle, MSSQL, Kafka, text files, and Elasticsearch, into our data lake on GCP. Because of the size of the migration, we deployed Airflow on our Spark cluster's servers, and we had to apply some tricks to make full use of the combination of the Spark cluster and Airflow. Airflow can also integrate with many Google Cloud Platform components, such as BigQuery and Dataflow, so our data analytics team can define their analytic jobs in Airflow to ETL and analyze data in BigQuery.
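
As one example of how a batch-load task can be wired up, here is a sketch using the contrib SparkSubmitOperator from Airflow 1.10; the jar path, connection id, and arguments are assumptions:

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

dag = DAG(
    dag_id="oracle_to_gcs",            # illustrative
    start_date=datetime(2020, 6, 1),
    schedule_interval="@daily",
)

# Submits a Spark job to the cluster configured under conn_id.
load_oracle = SparkSubmitOperator(
    task_id="load_oracle_transactions",
    application="/jobs/oracle_to_gcs.jar",        # hypothetical jar
    conn_id="spark_default",
    application_args=["--table", "transactions"],
    dag=dag,
)
```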

Here come the pros of Airflow.

Pros:

Large contributor base & support: as of 29 June 2020, Airflow already had 55 plugins, 1,204 contributors, and 8,848 commits. Apache Airflow also powers Cloud Composer, the workflow service Google rolled out on 1 May 2018. This shows that Airflow is a popular open-source project with a strong and growing community.

Failure handling & monitoring: Airflow allows us to configure retry policies, alert settings, and SLA durations; a sketch of this configuration follows the list below. Airflow also provides a web UI for us to manage all the DAGs deployed in Airflow:

  • Home page: all DAGs are displayed here
  • DAG management: view the status and dependencies of DAG runs in history
  • DAG management: view a DAG's code
  • DAG management: view task logs
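
A sketch of what that retry/alert/SLA configuration looks like in a DAG's default_args; the owner, email address, and values are illustrative:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

default_args = {
    "owner": "data-platform",             # illustrative
    "retries": 3,                         # retry policy
    "retry_delay": timedelta(minutes=5),
    "email": ["oncall@example.com"],      # alert setting (hypothetical address)
    "email_on_failure": True,
    "sla": timedelta(hours=1),            # SLA: alert when a task runs past this
}

dag = DAG(
    dag_id="monitored_dag",
    default_args=default_args,
    start_date=datetime(2020, 6, 1),
    schedule_interval="@daily",
)

DummyOperator(task_id="placeholder", dag=dag)
```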

Extensible: Airflow operators (tasks) are built in Python and open source, and it is easy for us to create custom operators suited to our field.
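
A sketch of such a custom operator; the class name and business logic are placeholders:

```python
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class MomoReconcileOperator(BaseOperator):
    """Hypothetical domain-specific operator."""

    @apply_defaults
    def __init__(self, partner, **kwargs):
        super(MomoReconcileOperator, self).__init__(**kwargs)
        self.partner = partner

    def execute(self, context):
        # execute() is what Airflow calls when the task instance runs;
        # any Python code can go here.
        self.log.info("running reconciliation for %s", self.partner)
```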

A variety of integrations: Airflow can integrate tasks (submitting jobs, checking job status, manipulating data, deploying services) with AWS (EMR, S3, Athena, …) and Google Cloud Platform (Dataflow, BigQuery, GCS, …).

DAGs defined in Python: with the power of Python, we can create anything from a simple to a complex DAG. For example, in an ETL process at MOMO (the dependency definition is sketched after the list below):

  • Workflow graph
  • DAG dependency definition code in Python
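
The original post shows the graph and the definition code as images. As a stand-in, here is a minimal sketch of what such a dependency definition looks like; the task names are illustrative:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG(dag_id="etl_example", start_date=datetime(2020, 6, 1),
          schedule_interval="@daily")

# Illustrative stand-ins for the real extract/transform/load tasks.
start = DummyOperator(task_id="start", dag=dag)
extract_oracle = DummyOperator(task_id="extract_oracle", dag=dag)
extract_cassandra = DummyOperator(task_id="extract_cassandra", dag=dag)
transform = DummyOperator(task_id="transform", dag=dag)
load_bigquery = DummyOperator(task_id="load_bigquery", dag=dag)

# The >> operator draws the edges of the graph.
start >> [extract_oracle, extract_cassandra]      # fan-out: run in parallel
[extract_oracle, extract_cassandra] >> transform  # fan-in
transform >> load_bigquery
```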

Next up, we have the cons.

Some issues that we have faced so far:

Log files: the Airflow scheduler and workers produce a lot of log files, and Airflow has no log rotation or log cleanup mechanism, so you have to clean up the log files manually. At MOMO we are using this tool, with some customization, for scheduled log cleaning.
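
A minimal sketch of such a cleanup; the log directory and the 30-day retention are assumptions, and the tool we actually use does more than this:

```python
import os
import time

LOG_DIR = os.path.expanduser("~/airflow/logs")  # assumed $AIRFLOW_HOME/logs
MAX_AGE_SECONDS = 30 * 24 * 3600                # keep 30 days of logs

cutoff = time.time() - MAX_AGE_SECONDS
for root, _dirs, files in os.walk(LOG_DIR):
    for name in files:
        path = os.path.join(root, name)
        if os.path.getmtime(path) < cutoff:
            os.remove(path)  # delete log files older than the retention window
```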

Airflow UI: the Airflow UI (the web server component) picks up DAG file updates more slowly than the worker and scheduler components. Sometimes the Airflow web UI takes 5–10 minutes to show the latest status of the DAGs.

Updating a DAG's schedule: Airflow has no update mechanism for a DAG's scheduled time. If we change the schedule (cron expression, start time, execution interval), the DAG gets corrupted. The only way to update it is to change the data in the Airflow backend database, so we created a simple tool to modify it.

Stopping or killing a bulk of running tasks: it is hard for us when we accidentally start several tasks; there is no way to stop them except killing their threads on the worker machines, and then you have to manually update the task status in the Airflow backend database.

Timezone issue: Airflow relies on the system timezone (instead of UTC; Airflow 1.9 introduced timezone-independent UTC handling) for scheduling. This requires the entire Airflow setup to run in the same timezone.

Trigger mechanism: Airflow was built for batch data processing, so its designers decided to always schedule jobs for the previous interval. Hence, a job scheduled to run daily at midnight passes the execution date “2018–07–07 00:00:00” to the run that starts at “2018–07–08 00:00:00”. This can get confusing, especially for jobs running at irregular intervals.
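
A small illustration of this previous-interval semantics; the dag_id is illustrative, and provide_context is the Airflow 1.x way of receiving the execution date:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG(dag_id="interval_demo", start_date=datetime(2018, 7, 7),
          schedule_interval="@daily")


def show_date(execution_date, **context):
    # The run that actually starts at 2018-07-08 00:00:00 receives
    # execution_date 2018-07-07 00:00:00: the start of the interval it covers.
    print("processing data for %s" % execution_date)


PythonOperator(task_id="show_date", python_callable=show_date,
               provide_context=True, dag=dag)
```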

Summary

Apache Airflow is growing fast in popularity in the tech community, and it even powers a service for Google. At the moment, Airflow is one of the most powerful tools at MOMO. Airflow still has many features and abilities that we have not discovered yet, and it keeps being developed by the community. There are many things our team needs to do to improve our deployment and security processes, but we hope this short brief about Airflow at MOMO helps you consider another option when choosing a new technology for your project.
