At Groupon’s scale, over 800 microservices serve more than 50M customers with products, activities, travel, and services. These services generate hundreds of terabytes of data daily. In a quest to improve our customers’ 360-degree experience, our engineering and data science teams leverage this data every day for marketing analytics and predictive model development.
Operating a complex marketplace like ours requires crunching big data so that our business can take advantage of the patterns observed on our platform. And since big data pipelines don’t run by themselves, we utilize Apache Airflow to schedule, monitor, and orchestrate them.
Hunt for the best Workflow Management System
Today at Groupon, we rely heavily on Airflow as our primary workflow automation tool for multiple systems. But prior to 2018, we used Cron to schedule our jobs, and job failure alerts due to upstream unavailability were common.
While the Cron solution was suitable early on, its limitations became more pronounced as our systems grew. Some prominent shortcomings we noticed were:
- Failed jobs had to be retriggered manually on an ad hoc basis
- Downstream jobs had to be rescheduled manually when an upstream job failed
- Jobs were time-bound rather than event-bound
- No built-in mechanism for managing dependencies or triggering alerts
- Changes to the crontab file were not versioned
- Lack of a well-defined user interface
Speaking of numbers, our Consumer Data Authority Platform (the platform that governs everything we know about our consumers) at that time had 100+ jobs and 35+ workflows. At that scale, it was nearly impossible for an on-call engineer to manage the system when failures occurred. This is when we started looking for alternative open-source solutions.
Placing our bet on Airflow
As part of the CDAP Team, we carried out proofs of concept on some of the best solutions available in the market and evaluated the KPIs listed below to find the best fit for our use case.
In the comparison chart below, we have listed the KPIs relevant to our team and the product we were developing.
After performing the exercise, it was evident that Apache Airflow was the right contender for our use case. Being battle-tested at big companies and having a rich open source community made it a compelling choice.
Airflow Architecture at Groupon
Apache Airflow currently provides five types of executors.
- Sequential Executor — a single-process executor that runs one task instance at a time.
- Debug Executor — a single-process executor meant as a debugging tool, typically run from an IDE.
- Local Executor — a single-node executor that runs each task in a separate local process, with a configurable degree of parallelism.
- Celery Executor — a multi-node executor that distributes tasks to a pool of Celery workers and scales out by adding workers/nodes.
- Kubernetes Executor — a Kubernetes-based executor that launches a new pod for every task instance, scaling elastically with the workload.
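The executor is selected in `airflow.cfg` (or via the `AIRFLOW__CORE__EXECUTOR` environment variable). A minimal sketch:

```ini
[core]
# One of: SequentialExecutor, LocalExecutor, CeleryExecutor, KubernetesExecutor
executor = LocalExecutor

# Upper bound on task instances running concurrently in this installation
parallelism = 32
```

Note that LocalExecutor and the distributed executors require a metadata database that supports concurrent connections, such as MySQL or PostgreSQL, rather than the default SQLite.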
Easing the monitoring and management of our Apache Spark jobs was the primary reason for migrating to Airflow. These jobs run in a Hadoop cluster with 550 nodes and 28 PB of storage; they run for 3–4 hours on average and process multiple terabytes of data daily. We proceeded with a Local Executor-based architecture since the Spark pipelines did most of the heavy lifting. Our architecture is multi-tenant, and all the services in our team use the same deployment.
Takeaways from this migration
Airflow helped us optimize the entire CDAP data pipeline without worrying about job dependencies. It played a vital role in breaking our monolithic codebase into a modular one. It also helped us integrate multiple independent in-house applications such as the Data Quality Framework, Email Alerting, SLA monitoring, and Daily Health Status jobs. Overall, our prominent achievements in the migration were:
- Reduced code duplication and lines of code.
- An interactive, easy-to-use UI for job monitoring.
- Fewer pager alerts, thanks to automatic retries.
- Big monolithic workflows broken into smaller workflows (i.e., DAGs in Airflow) with dependencies.
- Workload spread across the day based on business needs.
- Complex workflows built with dynamically generated DAGs.