AWS Simple Workflow to Step Functions: Serverless Orchestration for a Business-Critical Workflow
At Course Hero, students upload hundreds of thousands of documents to our platform every day. Before each document can be processed and published onto our site, we run an automated workflow that must:
- Convert the document to several standard file types
- Extract metadata and features from the document content
- Run ML models to score and classify attributes such as gibberish, semantic quality, and document type
- Pass compliance checks
- … and more!
This workflow is often referred to internally as the “lifeblood” of Course Hero: it is necessary for processing content onto our platform, feeds a multitude of downstream processes, and is essential to aggregating user-generated content. My team is responsible for this workflow and each of its components, ensuring that it is scalable, parallelized, and orchestrated properly. Just as importantly, we often update this workflow to meet new business objectives or to incorporate new ML models. When doing so, the maintainability and testability of the workflow are critical in order to move fast and deploy changes safely.
In the past year, we planned and executed the migration of this workflow from AWS Simple Workflow (SWF) to AWS Step Functions (SFN). This project aimed to improve our local developer setup, increase automated test capabilities, and simplify the maintenance of the workflow's orchestration. So far, we estimate roughly a 20–30% improvement in workflow performance and about a 30% improvement in developer velocity and quality of life since the migration. We measured workflow performance through statistics on workflow execution time, developer velocity using Agile metrics such as lower story-point complexity of tickets, and developer quality of life as the level of satisfaction developers reported after completing various tasks related to changes to the workflow. Of course, as we will discuss, there are always trade-offs in every technological decision, but most of the metrics we prioritize on our team favor SFN by far.
SWF (and its problems)
What were the driving reasons behind this project, and what specifically was it about our architecture in SWF that necessitated the migration? At a high level, in SWF, there are a few important components:
- Activity workers — An activity worker is a program that receives activity tasks, performs them, and provides results back (a minimal worker sketch follows this list).
- Decider — The decider effectively coordinates the workflow. It schedules activity tasks, provides input data to the activity workers, processes events that arrive while the workflow is in progress, and ultimately ends (or closes) the workflow when the objective has been completed.
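To make this more concrete, here is a minimal, hypothetical sketch of an activity worker (using the AWS SDK for Go; the domain, task list, worker identity, and extractText function are placeholders, not our production code) that polls SWF for tasks and reports results back:

package main

import (
    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/swf"
)

func main() {
    client := swf.New(session.Must(session.NewSession(aws.NewConfig().WithRegion("us-east-1"))))

    for {
        // Long-poll SWF for the next activity task on this worker's task list.
        task, err := client.PollForActivityTask(&swf.PollForActivityTaskInput{
            Domain:   aws.String("document-processing"),              // hypothetical SWF domain
            TaskList: &swf.TaskList{Name: aws.String("extract-text")}, // hypothetical task list
            Identity: aws.String("extract-text-worker-1"),
        })
        if err != nil || task.TaskToken == nil || *task.TaskToken == "" {
            continue // no task available (or a transient error); poll again
        }

        // Do the actual work, e.g. extract text from the document referenced in task.Input.
        result := extractText(aws.StringValue(task.Input))

        // Report the result back so the decider can schedule the next step.
        _, _ = client.RespondActivityTaskCompleted(&swf.RespondActivityTaskCompletedInput{
            TaskToken: task.TaskToken,
            Result:    aws.String(result),
        })
    }
}

// extractText stands in for the real text-extraction logic.
func extractText(input string) string { return "" }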
Let’s focus mainly on these two major components and walk through an example of how this works. I mentioned before that Course Hero runs various ML models on the documents that are uploaded. Let’s consider the gibberish detection model for this example. The gibberish detection model, like most of our ML models, needs text (features) as its primary input. We can define the task of “run the gibberish detection model on the document’s text” as an activity task (or step) in our workflow. This means that having the document text available before this step runs is a required dependency. To represent this in our workflow, we must have a previous “parent” step that performs the activity task of “extract text from the document”. The result is that our decider will contain logic that interacts with the SWF API and effectively says “schedule the gibberish detection step after the extract-text step has completed successfully”.
While this gibberish detection example seems simple enough, there is a lot of hidden complexity already. What if the extract-text step fails because the file cannot be converted to text? What if the step times out? What if the step completes, but there happens to be no text in the document? Each of these edge cases requires logic in the decider to “decide” what to do at the workflow level in each scenario. The resulting logic may be: “retry the extract-text step 5 times and, if they all fail, fail the workflow. However, if the step merely times out, run a fallback step that has more memory/CPU and a longer configured timeout, and if that succeeds, proceed to gibberish detection as if the original step had succeeded”. Even with just two simple steps, the orchestration logic starts to become complex and easy to break, and our workflow has 20+ steps, where some can run in parallel, some have multiple dependencies before they can run, some have fallback routes, and so on.
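To give a flavor of what that decider logic looks like in practice, here is a heavily simplified, hypothetical sketch (same AWS SDK for Go imports and placeholder names as the worker sketch above, not our production decider) that schedules gibberish detection once extract-text completes and hints at the failure branches that also have to be handled:

// pollAndDecide handles one decision task: it walks the workflow's event
// history and responds with the next decisions to make.
func pollAndDecide(client *swf.SWF) {
    decisionTask, err := client.PollForDecisionTask(&swf.PollForDecisionTaskInput{
        Domain:   aws.String("document-processing"),                      // hypothetical SWF domain
        TaskList: &swf.TaskList{Name: aws.String("doc-workflow-deciders")}, // hypothetical task list
    })
    if err != nil || decisionTask.TaskToken == nil {
        return
    }

    var decisions []*swf.Decision
    for _, event := range decisionTask.Events {
        switch aws.StringValue(event.EventType) {
        case swf.EventTypeActivityTaskCompleted:
            // extract-text finished: schedule gibberish detection with its output as input.
            // (A real decider must also check which activity actually completed.)
            decisions = append(decisions, &swf.Decision{
                DecisionType: aws.String(swf.DecisionTypeScheduleActivityTask),
                ScheduleActivityTaskDecisionAttributes: &swf.ScheduleActivityTaskDecisionAttributes{
                    ActivityId:   aws.String("gibberish-detection-1"),
                    ActivityType: &swf.ActivityType{Name: aws.String("gibberish-detection"), Version: aws.String("1")},
                    Input:        event.ActivityTaskCompletedEventAttributes.Result,
                },
            })
        case swf.EventTypeActivityTaskFailed, swf.EventTypeActivityTaskTimedOut:
            // Every retry count, fallback step, and "fail the whole workflow" rule
            // must be hand-written here; this is where the complexity piles up.
        }
    }

    _, _ = client.RespondDecisionTaskCompleted(&swf.RespondDecisionTaskCompletedInput{
        TaskToken: decisionTask.TaskToken,
        Decisions: decisions,
    })
}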
Another issue with SWF is that even when running this workflow locally, we still had a dependency on SWF to interact with their API. This required our local setups to be configured with the right permissions and AWS environment, and it also prevented us from mocking out the AWS API in an easy way.
We represented our SWF workflow as a directed acyclic graph (DAG), where the edges represented dependencies and the vertices represented activity tasks.
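For illustration, a dependency graph like this can be modeled with something as simple as the following hypothetical Go sketch (not our actual data model); a step with multiple entries in DependsOn is exactly the kind of fan-in a DAG expresses naturally:

// A vertex is an activity task; edges are expressed as the names of the
// steps that must complete before a step can be scheduled.
type Step struct {
    Name      string
    DependsOn []string
}

// Hypothetical fragment of the document workflow graph.
var workflow = []Step{
    {Name: "extract-text"},
    {Name: "gibberish-detection", DependsOn: []string{"extract-text"}},
    {Name: "semantic-quality", DependsOn: []string{"extract-text"}},
    {Name: "compliance-check", DependsOn: []string{"gibberish-detection", "semantic-quality"}},
}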
SFN
In SFN, we fundamentally changed how we model the workflow: instead of a DAG, SFN requires workflows to be expressed as finite state machines. This means that each step is represented by a “state”, and each state’s output is used as the input to the next state. Each state must either transition to a single next state or be an end state (this is a bit of an oversimplification for reasons we will cover shortly). AWS handles retries, triggers and tracks each workflow step, and ensures that steps are executed in the correct order. Remember that decider logic we talked about in SWF? It is now abstracted away by AWS; instead, you write a few simpler configurations in what is called a “state machine definition”. On top of that, AWS provides a Step Functions Local Docker image, meaning you can point your API calls to a local container and eliminate the dependency on AWS when running automated tests.
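As a quick illustration of that last point, here is a minimal, hypothetical sketch of a helper (using the AWS SDK for Go; the region and dummy credentials are placeholder values) that points a Step Functions client at a locally running Step Functions Local container instead of AWS:

import (
    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/credentials"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/sfn"
)

// newLocalSFNClient returns a Step Functions client that talks to a local
// Step Functions Local container (e.g. docker run -p 8083:8083 amazon/aws-stepfunctions-local)
// instead of AWS. The credentials are dummies; nothing leaves the container.
func newLocalSFNClient() *sfn.SFN {
    sess := session.Must(session.NewSession(&aws.Config{
        Region:      aws.String("us-east-1"),
        Endpoint:    aws.String("http://localhost:8083"),
        Credentials: credentials.NewStaticCredentials("dummy", "dummy", ""),
    }))
    return sfn.New(sess)
}

With this in place, the same code can run against the local container in automated tests or against the real AWS endpoint simply by swapping the endpoint configuration.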
Let’s dive into how workflows are represented in SFN versus in SWF. You might be wondering what the decider logic now looks like. I mentioned before that the orchestration is now abstracted away and represented by a state machine definition, written in the Amazon States Language (a JSON-based language). Here’s a simple example of a state machine definition:
{
  "Comment": "Gibberish Detection Example",
  "StartAt": "Extract Text",
  "States": {
    "Extract Text": {
      "Type": "Task",
      "Resource": "arn:aws:states:us-east-1:123456789012:activity:extract-text",
      "Retry": [
        {
          "ErrorEquals": ["States.TaskFailed"],
          "MaxAttempts": 5
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["States.TaskFailed"],
          "Next": "Fallback"
        }
      ],
      "Next": "Gibberish Detection"
    },
    "Gibberish Detection": {
      "Type": "Pass",
      "End": true
    },
    "Fallback": {
      "Type": "Pass",
      "End": true
    }
  }
}
We can already see that the retry/fallback logic is simplified into a few lines of configuration, and that the dependency management is captured directly in the definition as well.
The differences between a DAG and a finite state machine may not be entirely clear, so let’s illustrate some of the differences with a visual walkthrough.
As diagram 3 illustrates, a DAG is more powerful and flexible than a state machine for defining dependencies, which highlights the trade-off between capability and complexity in the two technologies.
Next, let’s walk through the considerations in the decision-making process we used to compare and contrast SWF and SFN.
Almost all of the factors we considered favored SFN, with the exception of the lower expressiveness for configuring workflow logic that we covered above.
Migration
In order to support this migration, we decided to:
- Build a new Go microservice to own the state machine definitions and the API to start new executions, which aligns with our goals of agile development and improved automated testability. With this setup, we can easily write end-to-end workflow tests that mock out activities to ensure an execution proceeds as expected and that the correct steps run in the right order (a sketch of such a test appears after this list).
- Integrate the Step Functions Local Docker image into our GitLab CI/CD build pipeline, requiring the automated tests to pass in the build. Furthermore, in our local developer setup, we run the same image in a k8s pod and configure the code to point to that pod with port-forwarding, allowing us to easily change a local state machine definition and test out the changes.
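To make that concrete, here is a rough, hypothetical sketch of what one of those end-to-end tests can look like against Step Functions Local. It reuses the newLocalSFNClient helper from the earlier sketch; the definition path, state machine name, role ARN, and input are placeholders, and the mocked activity workers that feed the execution are omitted for brevity.

func TestDocumentWorkflow(t *testing.T) {
    client := newLocalSFNClient()

    defBytes, err := os.ReadFile("document_workflow.json") // hypothetical path to the state machine definition
    if err != nil {
        t.Fatal(err)
    }

    // Register the state machine with the local container; no real IAM
    // evaluation happens locally, so the role ARN can be a dummy value.
    sm, err := client.CreateStateMachine(&sfn.CreateStateMachineInput{
        Name:       aws.String("document-workflow-test"),
        Definition: aws.String(string(defBytes)),
        RoleArn:    aws.String("arn:aws:iam::012345678901:role/DummyRole"),
    })
    if err != nil {
        t.Fatal(err)
    }

    // Kick off an execution with a sample document payload.
    exec, err := client.StartExecution(&sfn.StartExecutionInput{
        StateMachineArn: sm.StateMachineArn,
        Input:           aws.String(`{"documentId": "doc-123"}`),
    })
    if err != nil {
        t.Fatal(err)
    }

    // Poll until the execution finishes, then assert on the final status.
    for {
        out, err := client.DescribeExecution(&sfn.DescribeExecutionInput{ExecutionArn: exec.ExecutionArn})
        if err != nil {
            t.Fatal(err)
        }
        if aws.StringValue(out.Status) == sfn.ExecutionStatusRunning {
            time.Sleep(200 * time.Millisecond)
            continue
        }
        if aws.StringValue(out.Status) != sfn.ExecutionStatusSucceeded {
            t.Fatalf("execution ended with status %s", aws.StringValue(out.Status))
        }
        break
    }
}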
As part of our deployment plan, we started with a feature flag that migrates 10% of traffic from SWF to SFN, and we are closely monitoring performance and metrics before gradually increasing the load. With the feature flag, we can disable the feature almost instantaneously, diverting traffic back to SWF. We use DataDog dashboards and monitors to compile, display, and alert on AWS SFN metrics such as the number of failed activities, average execution time, and schedule-to-start time. So far post-migration, we have seen only two notable downsides:
- The limitations of the state machine paradigm compared to a DAG, which required us to slightly refactor the workflow dependencies so that certain steps run sooner and others run in parallel later, as illustrated in diagram 3.
- Minor discrepancies in the Step Functions Local Docker image, where some behavior deviates from the AWS production environment and causes errors. We have opened AWS support cases for these issues; one has been confirmed by their team and will be fixed in the next release, and the other is still being investigated. Neither has been a blocker to our release, although they were obstacles in local development and caused sporadic issues in the GitLab pipelines (which use the local Docker image for testing).
In terms of maintainability, developer productivity, automated testing capability, cost, and code complexity, we have seen fairly significant improvements all around.
Conclusion
The journey from SWF to SFN has been a large-scale improvement to our tech stack in one of Course Hero’s most important workflows. While the workflow was written in SWF, developers (especially those on other Course Hero teams) often treated it as a black box, intimidated by making changes that could impact an overly complex system. With SFN, we hoped to demystify the orchestration and inner workings of the flow. So far, it has been a definitive success, and we are always looking for engineers to join us in improving our tech stack and architecture!