Beyond CRON: an introduction to Workflow Management Systems

Part One of a Four-part Series

Dustin Stansbury
May 2, 2017

Data-driven companies like Quizlet often hinge their business intelligence and product development on the execution of complex data pipelines. These pipelines are often referred to as data workflows, a term that can be somewhat opaque: workflows are not limited to one specific definition, nor do they perform any one specific set of functions. To better demonstrate workflows in practice, and to motivate the need for workflow management systems (WMS), we pose an example data-processing problem much like the ones we often encounter here at Quizlet.

An Example Data Processing Problem

Imagine that you work for a US online gaming company that offers a number of different games, each of which can be installed from multiple international online app stores (App Stores A, B, and C). The company operates on a “freemium” revenue model: each game is free to download and play, but free play is supported by advertisements. Additional upgrade features like the removal of advertisements, infinite lives, etc., can be purchased from each of the app stores. Thus your revenue is a combination of funds brought in by both ads and upgrades. In order to better understand the efficacy of this revenue model, your company’s financial department requests a dashboard that provides the following daily metrics:

  • total daily revenue
  • a revenue forecast for the following day

Seems easy enough. You simply need to combine daily revenue reports from your partner advertising network and each of the app stores. It also turns out you've developed a fancy forecasting model that can give accurate forecasts of the next day's revenue; the only requirement is that the model makes its predictions based on revenues reported from the previous 7 days.

Figure 1.1: An Example Workflow: Reporting and Predicting Online Gaming Revenue
Each node in the graph represents a task that needs to be performed. Tasks are performed in a particular order, organized roughly from left to right. Arrows indicate task dependencies, with the direction indicating parent-child relationships. The dashed outline highlights independent tasks that can be executed simultaneously.

After you do some research, you realize that obtaining the total revenue metric is surprisingly involved: although each of the app stores provides an API that returns revenue reports in nicely-formatted JSON, each app store reports revenue in its local currency (based on the location of the app store's data warehouse), rather than in US dollars (USD). Thus, for each app store you must execute tasks that a) extract revenue data from the store's API, b) transform the resulting data from JSON into your company's data format, and c) perform a daily currency conversion from the local currency to USD. You decide to add a set of tasks that pull currency conversion rates from another external API and use those rates to convert the foreign-currency revenue amounts to USD. It turns out, however, that the conversion rate API supplies rates for different countries at different times of day. Therefore, you'll also need to execute a task that blocks the currency conversion tasks until the conversion rates needed for all app stores become available.
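
To make those per-store tasks concrete, here is a minimal Python sketch of the extract, transform, and convert steps. The endpoints, field names, and response shapes are hypothetical stand-ins, not any real app store's API:

```python
import datetime

import requests  # assumes the third-party `requests` package is installed

# Hypothetical endpoints -- the real app store and FX-rate APIs will differ.
APP_STORE_APIS = {
    "app_store_a": "https://appstore-a.example.com/v1/revenue",
    "app_store_b": "https://appstore-b.example.com/v1/revenue",
    "app_store_c": "https://appstore-c.example.com/v1/revenue",
}
FX_RATES_API = "https://fx-rates.example.com/v1/usd_rates"


def extract_store_revenue(store: str, report_date: datetime.date) -> dict:
    """a) Pull one day's JSON revenue report for a single app store."""
    resp = requests.get(APP_STORE_APIS[store], params={"date": report_date.isoformat()})
    resp.raise_for_status()
    return resp.json()  # e.g. {"currency": "EUR", "revenue": 1234.56}


def transform_store_report(store: str, raw: dict, report_date: datetime.date) -> dict:
    """b) Reshape the raw JSON into the company's (hypothetical) record format."""
    return {
        "date": report_date.isoformat(),
        "source": store,
        "currency": raw["currency"],
        "revenue": float(raw["revenue"]),
    }


def convert_to_usd(record: dict, usd_rates: dict) -> dict:
    """c) Convert a local-currency record to USD, given that day's FX rates."""
    rate = usd_rates[record["currency"]]  # USD per one unit of local currency
    return {**record, "currency": "USD", "revenue": record["revenue"] * rate}
```

A separate task would poll the rates API and hold off the convert step until a rate exists for every store's currency.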

On the ads revenue side of the equation, you discover that although your ads partner does provide revenue data in USD (woohoo!), it only offers reports in the form of daily spreadsheets uploaded to a secure file store (gah!). Thus you'll need to execute a set of tasks that a) extract the spreadsheets from the file store and b) transform the data from the spreadsheets into your company's data format.
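
The spreadsheet side might reduce to something like the following sketch, assuming the report can be exported as CSV; fetching the file from the secure store (e.g. over SFTP) is omitted, and the column names are hypothetical:

```python
import csv
import io


def transform_ads_spreadsheet(raw_bytes: bytes) -> list[dict]:
    """Parse one daily ads-revenue spreadsheet (already in USD) into records."""
    reader = csv.DictReader(io.StringIO(raw_bytes.decode("utf-8")))
    return [
        {
            "date": row["date"],  # hypothetical column names
            "source": "ads_network",
            "currency": "USD",
            "revenue": float(row["revenue_usd"]),
        }
        for row in reader
    ]
```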

Once all the JSON and spreadsheet data are extracted, converted, and transformed, you'll need to execute some tasks that combine the resulting data sets and append the results to a historical revenue dataset. The historical data can then be used for forecasting future revenue and, finally, building that darn dashboard (phew!).
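
The final steps are mostly bookkeeping. In the sketch below, the forecasting function is a deliberately dumb stand-in (a 7-day mean) for the "fancy" model described earlier; it is only meant to show the interface: the previous 7 days of revenue in, one prediction out:

```python
def combine_daily_revenue(store_records: list[dict], ads_records: list[dict],
                          report_date: str) -> dict:
    """Sum per-source USD revenue into a single record for the day."""
    total = sum(rec["revenue"] for rec in store_records + ads_records)
    return {"date": report_date, "total_revenue_usd": total}


def append_to_history(history: list[dict], daily_record: dict) -> list[dict]:
    """Append today's record to the historical revenue dataset."""
    return history + [daily_record]


def forecast_next_day(history: list[dict]) -> float:
    """Stand-in forecast: the mean of the previous 7 days of total revenue."""
    last_week = [rec["total_revenue_usd"] for rec in history[-7:]]
    return sum(last_week) / len(last_week)
```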

This is a typical example of a data workflow problem, visualized in Figure 1.1. In workflows like this you need to regularly execute multiple tasks that perform a multitude of different functions on various types of data. All of these tasks can vary wildly in their start times and execution durations, and, in general, the workflow will require complex dependency relationships amongst the tasks.

To complicate matters further, all workflows must operate in the presence of Murphy’s Law: whether it be due to servers dying, APIs failing, loss of network connectivity, unforeseen errors in the data, or sharks attacking the internet, there will always be unforeseen failures at each step of the workflow.

Now, we could run each of these tasks by hand, verifying their success and accuracy at each step. However, not only would that quickly become quite boring, it would also be an inefficient use of an analyst's time when we have computers that could be doing all of this for us. This is where a WMS comes in. We want a system that can regularly execute each of the tasks in a workflow, respecting task dependency and scheduling constraints, while also handling unexpected malfunctions along the way.
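
At its core, such a system does the bookkeeping sketched below: tracking which tasks are ready to run (all of their parents have finished) and retrying the ones that fail. Everything here is a toy illustration; the task names loosely mirror Figure 1.1, `tasks` maps each name to a zero-argument callable, and real systems add scheduling, logging, and far richer state:

```python
import time

# Toy dependency graph: each task lists the parent tasks it depends on.
DEPENDENCIES = {
    "extract_fx_rates": [],
    "extract_app_store_a": [],
    "extract_ads_spreadsheet": [],
    "convert_app_store_a_to_usd": ["extract_app_store_a", "extract_fx_rates"],
    "combine_revenue": ["convert_app_store_a_to_usd", "extract_ads_spreadsheet"],
    "forecast_revenue": ["combine_revenue"],
}


def run_workflow(tasks: dict, dependencies: dict, max_retries: int = 3) -> None:
    """Run task callables in dependency order, retrying a few times on failure."""
    completed = set()
    while len(completed) < len(dependencies):
        for name, parents in dependencies.items():
            if name in completed or not all(p in completed for p in parents):
                continue  # already ran, or its parents haven't finished yet
            for attempt in range(max_retries):
                try:
                    tasks[name]()  # execute the task's callable
                    completed.add(name)
                    break
                except Exception:
                    time.sleep(2 ** attempt)  # crude exponential backoff
            else:
                raise RuntimeError(f"task {name!r} failed after {max_retries} retries")
```

Even this toy version has to answer the hard questions: when is a task ready, what counts as a failure, and how often do we retry. A WMS answers those for you.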

“Why not just use CRON?”

The tried-and-true method for executing regularly-scheduled tasks like the ones described above is to use CRON. The CRON approach is straightforward, battle-tested, and works well for simple workflows that have few task dependencies. However, anyone using CRON for a workflow like the one in our example would quickly realize their folly.

First of all, CRON does a poor job of handling task dependencies like those represented by the arrows in Figure 1.1. Furthermore, if a task fails, CRON provides no strategy for retrying that task, or, for that matter, retrying any of the tasks that depend on it. CRON also generates limited metadata about the landing times, execution durations, and failures of scheduled tasks. Thus, workflow debugging and maintenance require engineering time and knowledge of the company's tech stack that grow quickly with the complexity of the workflow. In addition, simple management operations like inspecting and interacting with the CRON processes are difficult for any data stakeholder who lacks a system administration skill set. CRON is an open framework with fairly loose guidelines around its use, which can be nice in that it's unrestricted. However, a lack of standard guidelines can also limit the efficiency of code review, make it harder to determine ownership of workflows, and result in heterogeneous methodologies across engineers and teams. Thus, for a reasonably-sized workflow project, we determined that CRON is an inadequate solution.
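
To make that concrete, below is the kind of glue script a CRON-based setup tends to accumulate around each entry: check (via a success-marker file) that the parent task finished, retry the command a few times, then write a marker of its own. The file paths, marker convention, and task names are all hypothetical, and every line of it is logic a WMS provides out of the box:

```python
#!/usr/bin/env python3
"""Hypothetical wrapper invoked by a CRON entry for the `combine_revenue` task."""
import pathlib
import subprocess
import sys
import time

MARKER_DIR = pathlib.Path("/var/run/workflow_markers")  # hypothetical location


def parent_succeeded(parent: str) -> bool:
    """Did the parent task leave a success-marker file?"""
    return (MARKER_DIR / f"{parent}.success").exists()


def run_with_retries(cmd: list, retries: int = 3) -> None:
    """Re-run the command a few times before giving up."""
    for attempt in range(retries):
        if subprocess.run(cmd).returncode == 0:
            return
        time.sleep(60 * (attempt + 1))
    sys.exit(f"{cmd} failed after {retries} attempts")


if __name__ == "__main__":
    if not parent_succeeded("convert_app_store_a_to_usd"):
        sys.exit("parent task not finished; giving up until the next CRON tick")
    run_with_retries(["python", "combine_revenue.py"])
    MARKER_DIR.mkdir(parents=True, exist_ok=True)
    (MARKER_DIR / "combine_revenue.success").touch()
```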

Beyond CRON

Quizlet has historically used CRON for a variety of regularly-scheduled tasks, but we eventually ran into limitations like the ones highlighted above. As we planned our transition beyond CRON, we found ourselves asking the age-old startup question: "do we build something better in house, or adopt an existing technology?" We had a lot of ideas about what a great workflow management system would look like, and were excited about the prospect of building a new set of custom tools in house. However, being a 60-person, Series A start-up, we wanted to avoid spending limited resources over what could be multiple quarters to develop a new technology stack when good options were already available. Additionally, many of Quizlet's data needs were immediate. So, we set out to find the best existing workflow management technology.

The next post in this series covers how we at Quizlet developed a set of guidelines that we call our WMS “wish list”, and how we used this list to choose Apache Airflow as our WMS of choice.
