How Quizlet uses Apache Airflow in practice
Part Four of a Four-part Series
In Part III of Quizlet’s Hunt for the Best Workflow Management System Around, we highlighted some of Airflow’s key concepts by implementing a hypothetical example workflow introduced in Part I of the series. In this final installment, we’ll go over Quizlet’s initial deployment of Airflow and highlight some practical learnings we gathered along the way.
Quizlet’s Airflow Deployment
Airflow’s design allows it to be deployed at various levels of complexity. For example, in a fully partitioned deployment (Figure 4.1, Left Subpanel), each of the main components has their own dedicated machine or file systems. This advanced setup is appropriate for organizations operating at scale that need to run a large number of tasks distributed across a network of machines having various hardware configurations. This setup also offers a clean separation between processes, making the system more robust and fault tolerant. However, this setup can be overkill if a) the number of workflows is relatively small, b) the majority of tasks being executed are lightweight, and c) those tasks can be executed on homogeneous hardware configurations.
Quizlet first adopted Airflow to execute an array of analytics ETLs. These ETLs extract events data from Google BigQuery (for details on how we stream events data to BigQuery, see this very cool blog post), perform standard analytics logic in SQL (e.g. calculating daily active users, retention rates, study behavior metrics, etc.) and store the results back in BigQuery tables that are to be used for reports and dashboards. All of the ETL operations are lightweight HTTP requests to the BigQuery API, and can be run with fairly little overhead.
Given these considerations, we opted to initially deploy Airflow to a single large instance (8 cores, 50Gb of memory — setup is depicted in Figure 4.1, Right Subpanel). The deployment is configured to run a local MySQL database to store metadata and execution logs are stored on the local file system. We use the
LocalExecutor to parallelize task execution on local processes. Additionally, because the web server traffic is generally low the size of our team, we also run the web server along with the scheduler as
systemd services on the same instance. All DAG definitions are managed from a Git repository and synced to the the instance’s local file system. We use Puppet to manage Airflow’s setup and the configuration of the scheduler and web server daemons.
In general, getting Airflow up and running was straight-forward. The following examples of configuring Airflow with Puppet were helpful guides for our deploying with Puppet, and there was ample documentation and conversations within the community that highlighted some of the known gotchas we should be aware of. That said, every Airflow setup will highlight different idiosyncrasies of the framework, and we wanted to contribute some learnings specific to our experiences that may inform any reader who is considering working with Airflow.
Airflow Pros & Cons
Airflow was able to check off nearly every item on our wish list, so the positives essentially speak for themselves. Although Airflow has a lot going for it, it’s not perfect. In particular, we found issue with the following:
- Substantial setup is required. In order to run Airflow, you need at a minimum, a scheduler service and database to be running. To get full functionality, you also need to run a web server service. Deploying and maintaining these services may be a substantial barrier to entry for some organizations.
- The project is still pretty green. Though Airflow is already quite feature-rich, and has a vibrant community of developers, it’s still going through some growing pains. We’re excited to see how it develops as more organizations adopt and develop the project.
- The scheduler is not fully data-aware. Unlike Luigi, which has a built-in notion of the state of data sources and their targets, Airflow does not. Thus, Airflow may repeatedly process tasks that have already affected the state of a target artifact. This functionality can be implemented with custom Sensor and Operator classes, but it would be nice if Airflow was more data aware out of the box.
- Airflow is a poor solution for handling streaming data. Airflow was built to primarily perform scheduled batch jobs, which makes it a poor solution for tasks that fall outside of a batch-processing model. Thus if you’re dealing with continuously streaming data, you may need to look for another solution like Google DataFlow.
Execution logs and scheduler hangs
There are a number of online issues reporting that Airflow’s scheduler can hang during operation. Some explanations include problems with concurrency and queue size, external task priority when using the ExternalTaskSensor, and black magic. When using Airflow for the first time, we also experienced scheduler hangs. We tried a number of the solutions posed to remedy hanging issues, but with no success. Eventually we realized that because of an inconsistency with group permissions in our Puppet configuration, execution logs were not writable by the
airflow user that runs the scheduler and worker processes. Though the expected behavior would be that the scheduler fails with some verbosity, instead only the worker processes fail, and they do so silently. Thus tasks never complete, leaving them in a broken queued state. This behavior was not obviously accessible by inspecting any of the Airflow execution logs or stdout, but rather by checking in the
systemd logs. Thus, some key lessons learned from this experience were:
- Inadequate permissions of task execution logs can cause the scheduler to hang without any diagnostic information.
- When the Airflow scheduler is run as a service, task execution logs are insufficient to inspect the operation of the scheduler. Here,
journalctlis your friend!
“skipped” != “success”
Airflow offers the
LatestOnly operator, which provides a means for only processing those tasks whose execution dates are near the current date. This is helpful for use cases where an entire table will be truncated and replaced each time a task is run, in which case historical executions cause unneeded calculations (this is a particular concern when using BigQuery, which accrues charges per query). The
LatestOnly operator skips historical tasks, updating their status in the database to “skipped”. However, if a task depends on a historical task that has been skipped, that dependent task will often hang the scheduler. This is because the default behavior is for the dependent task to wait until the upstream task has a “success”, rather than a “skipped” state. Because the upstream task is complete, it will never update its state to “success”, at which point the dependent task never gets run.
DAG is active, avoid changing
Changing the name of a
DAG seems innocent enough, but beware! The operation of Airflow is strongly coupled to the metadata database, which holds the history of all
TaskInstances. Because the scheduler, workers, and web server use unique text identifiers to identify records in the database, changing these identifiers can cause collisions and strange scheduler behavior. The same goes for changing execution times and schedule intervals.
After starting the scheduler, you may need to turn it “On”
It turns out that just starting the scheduler is not enough. You also need to update a DAG’s state in the database so that the scheduler knows to include it in the pool of available DAGs to keep track of. You can do this via the CLI by running
airflow unpause DAG_ID,
or by clicking the little
On/Off switch in the Web UI. This step seems trivial, but you’d be surprised how often it’s overlooked!
The web server is a great debugging tool!
Each time the web server is restarted, it attempts to load all available DAGs from disc, compiling each DAG file along the way. This is a nice, fast check for any compilation errors when developing new DAGs, Operators, or Sensor. Additionally, once all DAGs have loaded without error, the Graph View provided by the Web UI (e.g. Figure 3.1, Top Subpanel) gives you a fast visual verification that you’ve correctly implemented task dependencies in your DAGs. Furthermore, although the CLI offers commands like
airflow run and
airflow trigger_dag, which update state of
DagRuns in the database, we generally find that it’s easier to manipulate these records through the Web UI while the scheduler is running. For example.
TaskInstances state in the Web UI makes Airflow “forget” the task was run, thereby causing the scheduler to add the it to the worker queue.
Speed up development with IPython
Another benefit of having DAGs defined as code is that it allows you to import Airflow objects into the IPython REPL. This way you can iteratively develop and try out things outside of the scheduler. In conjunction with the
%autoreload IPython magic — which reloads an object class in place each time it’s called — using IPython can dramatically speed up development time.
Don’t be afraid to touch the database
When developing new DAGs or Operators, you may find yourself in a scenario where mysterious states in the metadata database cause the scheduler behave strangely or hang. This can be frustrating, because you’re at the mercy of Airflow’s SQLAlchemy logic, which may or may not be editing the information or states that you expect. We found that in these scenarios manually editing the database records using good ole’ SQL can be quite effective.
Quizlet has been using Airflow for a few months now, and we couldn’t be happier. Given our current single-instance deployment, we’re able to accomplish an astonishing amount, supplying metrics on acquisition, retention, engagement, study habits, and revenue streams, just to name a few. And all this with only a couple of data scientists behind the scenes. The current success of Airflow has empowered us to further expand our deployment to handle more advanced tasks such as training our machine learning classifiers, calculating search indexes, running A/B tests and user targeting. Our current efforts involve the following milestones:
- migrating the metadata database to its own dedicated instance
- uploading and reading execution logs from GCS
- switching over from
LocalExecutorto a distributed queuing system such as the
- implementing resource queues to execute different types of tasks on dedicated hardware.
We’re excited to implement these updates to our WMS and can’t wait to see how expanding our data processing infrastructure improves Quizlet’s ability to provide the best learning experiences for all our users.