Going with the Flow: How Quizlet uses Apache Airflow to execute complex data processing pipelines
A Four-part Series
Data is the currency of the modern business. An increasingly online world, connected by a seemingly limitless number of APIs, has created a sea of data. Businesses now have the opportunity to glean deep insights from that data about their customers’ needs, behaviors, and motives, if only they could wrangle, organize, and understand it all!
As more businesses become data-driven, there is a growing need to run complex data processing pipelines that regularly extract data from various sources, transform that data into formats that facilitate business logic, and store the resulting artifacts in a way that supports product improvements and stakeholder decision making.
Quizlet’s relationship with data is no exception. With a fast-growing community of 25 million active users who log hundreds of millions of events daily, Quizlet needed a framework for orchestrating complex data processing pipelines, or data workflows as they’re often called. In particular, we needed a workflow management system (WMS) that could carry out a robust set of operations while also scaling with Quizlet’s future growth.
Choosing and deploying a great WMS is a complex project, and we wanted to share our experiences searching for, and eventually deploying, Apache Airflow as our WMS. The process is so multifaceted that we found it difficult to fit it all into a single blog post! Thus, we introduce a four-part series of blog posts aimed at sharing our insights, organized as follows.
- Part I of the series introduces and motivates the need for WMSs with an example data processing problem similar to the ones we often encounter here at Quizlet. We’ll refer back to this example throughout the series, as we believe that an end-to-end demonstration is helpful for explaining key concepts.
- In Part II, we present a wish list of features that we at Quizlet believe are essential for a WMS to meet our data processing needs. We then describe how we used this wish list to guide us through the landscape of available WMS projects, leading us to adopt Apache Airflow.
- Part III gives a detailed technical background on Airflow, including its key concepts and architecture, as we work through the example workflow introduced in Part I.
- In Part IV, the final post, we describe the initial Airflow deployment used here at Quizlet and share some key lessons we learned along the way. We then wrap up with Quizlet’s future plans for Airflow and data workflows in general.
We hope that the series provides useful material for a wide range of readers, from those just beginning their research into WMS projects to those who want an in-depth understanding of Airflow’s operation.
Many high fives go out to all the members of the Quizlet team who helped research and evaluate multiple workflow managers, deploy Airflow, and provide thoughtful comments on this series of posts. I’m looking at you, Shane Mooney, Karen Sun, Amanda Baker, Miguel Flores, Tim Miller, Laura Oppenheimer, Amalia Nelson, and Andrew Sutherland!