Data Pipeline Infrastructure at Meetup with Fewer Nightmares: Running Apache Airflow on Kubernetes
Meetup Data Platform Infrastructure — a Starting Point
Making sure the right data is in the right place at the right time is critical to data processing at Meetup. Recommendations, search results and internal dashboards tracking product health all depend on up-to-date and reliable data pipelines.
At Meetup, the Data Platform team puts data where it should be via the twin pillars of ETL (Extract, Transform, Load) processing and job orchestration. The system has evolved over time. Previously, our ETL process consisted of a series of custom Scala scripts scheduled via cron. That worked well with simple pipelines with tens of tables. As hundreds of new data tables were introduced and more complicated dependencies emerged, our homespun tool became fragile and a nightmare to maintain.
Why a nightmare?
- Complicated dependencies. As pipelines grew, dependencies between different parts of our ETL process became increasingly complicated. We didn’t have an easy way to visualize or modify the dependencies between different jobs. As a result, bugs were inadvertently (and frequently) introduced.
- Hard to identify the point of failure. We hadn’t designed a straightforward way to identify the point of failure in data pipelines consisting of dozens of jobs — each with many potential points of failure.
- Difficult to rerun jobs. With so many dependencies, it became harder to check dependencies and rerun jobs. Thus, our data engineering and data science teams spent more time maintaining and repairing data pipelines than growing our capabilities.
The time had come to rethink our commitment to our homegrown tools — not just for the nightmares of today but to prevent the nightmares of tomorrow: the simple cron no longer met our requirements. Our needs had evolved; we needed something to schedule and manage our data pipelines at a larger scale.
Why Meetup Chose Apache Airflow
We evaluated a few different options for job orchestration. At the time, Luigi relied on cron for scheduling but we wanted to use something for that was able to handle more complex logic in our data pipeline scheduling. Oozie was tied to Hadoop but we had use cases other than Hadoop. Azkaban used config files and wasn’t easy for the team to use. Airflow seemed like our best option for the following highlights:
- Workflows are defined as code. Drag & drop tools are easy to use, but code is more maintainable, testable and collaborative.
- Airflow has its own scheduler which monitors and schedules everything related to Airflow and the scheduler is configurable by use cases.
- Easy to use UI. Many tasks can be achieved in Airflow UI itself, for example, backfilling, kicking off a new job, visualizing the DAG and job status by various views. The UI makes it possible for users from different teams to see the progress of their jobs easily.
- Scalability is easily achieved by adding or removing worker nodes. For example, a local development or QA environment can use the sequential executor to test code easily while production can use Celery, Dask or Kubernetes as multi-worker nodes.
- Support for running many kinds of jobs and interacting with many kinds of datastores is baked into Airflow. For example, Airflow provides HiveOperator to allow users to connect and run hql. Meanwhile, it is easy to add user-defined code to add functionality to the built-in operators.
Although orchestrating our ETL data pipeline was the initial use case, we quickly found out Airflow could do more than that. For example, our data science and machine learning teams are using Airflow to schedule model training jobs. We also use Airflow to schedule imports of customer data from 3rd party vendors like Stripe. As a result, Airflow soon became a key framework for Meetup.
Airflow Use Case
Now that we have covered why we chose Airflow, let us go over three specific use cases here at Meetup.
1. Use case: Analytics Pipeline (ETL Pipeline)
We use Airflow to orchestrate nightly ETL jobs. All pipeline instructions are defined in YAML format and Airflow parses the files to perform extract, transform and load for all tables. In this way, without any code changes, adding new tables, transformations or dependencies is as simple as adding a few lines to the configuration files. We also include some metadata to simplify the process like source, destination and priority of the execution order, so that users can easily specify those metadata. When failures happen, Airflow first retries, and if the issue still persists, a pagerduty alert will be triggered. Airflow also sends slack messages about the ETL progress; users can easily follow the status and check the health of the ETL pipeline.
2. Use case: Machine Learning
The machine learning team at Meetup is using Airflow to schedule spark jobs for feature generation, model training and model scoring. Airflow provides an easy way to submit spark jobs to a cloud provider, visualize job status and performance, restart failed jobs automatically and define dependencies straightforwardly. For example, Airflow orchestrates model feature generation and scoring jobs to serve recommendations to Meetup members. Airflow submits spark jobs on both GCP Dataproc and AWS EMR, and runs multiple feature generation jobs in parallel. Once all feature generation jobs finish successfully, the prediction job then is triggered.
3. Use case: Multiple Cloud Provider Data Sharing
Meetup uses Airflow to share data between AWS and GCP. For example, to make sure that we have an up-to-date mirror of the dataset, we use Airflow to schedule an rsync process to transfer the binlog of mysql data from S3 to GCS every hour.
Airflow Infrastructure Architecture and Deployment Process
Meetup currently has two cloud providers: AWS (Amazon Web Services) and GCP (Google Cloud Platform). One of the major reasons we initially chose to deploy Airflow in GCP was Kubernetes (GCP was the only cloud provider of Kubernetes at that time) giving us scalability, immutability and self-healing “for free” allowing developers to focus on data pipeline logic instead of babysitting the infrastructure.
We are currently using the following components to run Airflow:
Airflow Webserver provides a feature-rich UI for both technical and non-technical users to interact with DAGs easily and find logs quickly enabling fast debugging.
Airflow Scheduler schedules tasks for the executor to run. It also handles system overhead (e.g. tasks retries, max concurrency, etc.)
Airflow Workers pull tasks from the queue and executes the scheduled jobs.
Airflow DAGs define how to run specific jobs.
GKE (Google Kubernetes Engine) manages Docker containers easily, it enables us to automatically restart the container if the Airflow web server or scheduler hasn’t responded for a given time.
Cloud SQL is our backend of Airflow. We also query metadata to a get a summary view. For example, you can get average task duration in the last 30 days for a DAG.
StackDriver streams logs in near real-time for searching and debugging.
Continuous Integration/Continuous Deployment (CI/CD) Pipeline
Developing Airflow DAGs took some special care at Meetup. First, each commit or pull request affecting Airflow needs to pass a set of unit tests to make sure all processing doesn’t break. Second, once code has been reviewed and merged to master branch, Travis will perform continuous integration and continuous deployment.
We created two repositories for Airflow at Meetup:
- Airflow Server includes server deployment files, python packages, credentials and Airflow configurations.
- Airflow DAGs includes the DAG files and utilities class. Since the deployment of Airflow server pauses everything and kubernetes creates a new pod for all containers, there will be a short downtime for those type of changes. To avoid downtime, we put DAG files into a different repository. When code is merged to the Airflow DAGs repository, we simply copy those files to $AIRFLOW_HOME/dags. Since Airflow continuously refreshes DAGs under this folder, when the next time DAGs refresh, the new ones will be picked up. However, there was one issue we noticed: if the code merge changes a DAG while it is running, Airflow will respect those changes as well. If there are conflicts, such as currently running tasks being removed or updated, some unexpected results may occur.
Fragile spot: Airflow System DAGs
After using Airflow in production for a few months, we noticed two cases when Airflow became unstable and would stop scheduling tasks. (Note, this is for Airflow 1.8.1 and may not apply to newer versions.)
- Airflow webserver is not reachable or the Airflow scheduler hangs.
- Airflow puts tasks in the queued state and never schedules them.
To handle the first issue, we used livenessProbe of Kubernetes and created a system DAG at Meetup. If the livenessProbe for the web server container fails, the kubelet kills the container and starts a new one.
The livenessProbe handles the case of an unreachable Airflow webserver well. However, livenessProbe for Airflow scheduler is a bit different. By actually scheduling any DAG, we could know whether Airflow scheduler is healthy or not. To use livenessProbe for scheduler properly, we created a dummy DAG which touches a dummy file every 5 mins. Then we enabled the livenessProbe to see if the file has been touched within a certain period.
The second issue — tasks remain in queued state forever — is due to overloaded workers on Airflow 1.8.1 (more detail can be found here). We have another system DAG to query Airflow meta storage to check if the same task has been in a “queued” state for over 30 minutes. If that happens, the clear task function will be performed.
- Run Airflow on Kubernetes to reduce the amount of effort required to manage an Airflow cluster.
- Split Airflow DAGs and server repositories to avoid deployment downtime.
- Provide an easy way to show the ETL progress to users.
- Sometimes the default Airflow operators cannot meet your requirements. Fortunately, Airflow makes it very easy to create custom operators. For example, we created our own EMR operator to be able to use spot pricing.
- Keep a copy of Airflow logs in a storage for easy for searching and debugging.
With Kubernetes Executor and Operator to be released in Airflow 1.10, we are thrilled to explore those options to make the data platform at Meetup more scalable and reliable to help everyone build a better Meetup.