The Good, The Bad, and The Ugly: Reflections on AWS Step Functions After One Year in Production

Mario Bittencourt
Published in SSENSE-TECH · Dec 10, 2021 · 11 min read

For over a year now, the SSENSE Tech Team has been employing AWS Step Functions as part of our software development strategy. For the most part, this has rolled out without issue. As a Principal Architect, I have spent the past year gathering first-hand feedback from teams throughout the company on what they encountered while using the service.

In this article, I’ll provide a summary of our learnings at SSENSE; what worked, what didn’t, and — in my opinion — what should be improved in the ecosystem of the service itself.

The Good

AWS presents Step Functions as a “microservices orchestrator” or a “low-code visual workflow service”. I started recommending it as a viable candidate whenever at least one of the following characteristics is present:

  • The functionality we want to achieve is spread among several distinct distributed components (services, systems);
  • The functionality depends on human interaction at some point in time;
  • Some of the services we depend on can take a long time (hours, days) to complete their execution.

I already wrote other articles[1][2] on using Step Functions as an orchestrator for a Saga, and this AWS service delivers on what has been promised.

Retry on Errors

One of the simplest out-of-the-box features we have been leveraging is the retry capability on failures. Anyone working with distributed systems should be aware of the fallacies of distributed computing, especially the first one: the network is reliable.

This is more relevant today than ever with cloud computing where computing nodes come and go all the time for various reasons.

Developers, even experienced ones, can easily fall into the trap of considering only the happy path, leaving error handling as an afterthought, which is harder to add later.

If you leverage the Step Functions approach, you model the flow of execution explicitly, so adding a retry is just a matter of updating the workflow and introducing a Retry statement based on the error type encountered.

Using built-in retry capabilities.

No more changes to your application code to control the number of retries or handle the backoff mechanism. Just don’t forget that not all errors should be retried, and even if they are, you have to plan what to do when the acceptable number of retries has been exhausted.
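To make this concrete, here is a minimal sketch of what such a Retry (and Catch) block could look like in the Amazon States Language, written as a Python dictionary so it can be rendered with json.dumps. The state names, function ARN, and retried errors are illustrative assumptions, not taken from our actual workflows:

```python
import json

# Sketch of a Lambda-backed Task state with a Retry policy. The resource ARN,
# state names, and error list are placeholders for illustration only.
reserve_inventory_state = {
    "Type": "Task",
    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:reserve-inventory",
    "Retry": [
        {
            # Retry only transient failures; permanent errors fall through to Catch.
            "ErrorEquals": ["States.Timeout", "Lambda.ServiceException"],
            "IntervalSeconds": 2,   # wait before the first retry
            "MaxAttempts": 3,       # plan for what happens when these are exhausted
            "BackoffRate": 2.0      # double the wait between attempts
        }
    ],
    "Catch": [
        {
            # Retries exhausted (or a non-retried error): route to compensation.
            "ErrorEquals": ["States.ALL"],
            "Next": "HandleReservationFailure"
        }
    ],
    "Next": "ConfirmOrder"
}

print(json.dumps(reserve_inventory_state, indent=2))
```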

Concurrent Execution

In some of our use cases, we reach a point where we have to execute the same operation multiple times, each for a subset of the input data. For example, imagine that when a customer is placing an order we have to secure the inventory for all items before continuing with the checkout.

Figure 1. Sequential execution.

Doing that sequentially increases the time until the customer receives feedback on whether the operation succeeded or failed.

If you are already using Step Functions, adding concurrent executions can be as easy as isolating the parts in a Map state.

Figure 2. Concurrent execution.

This is even more powerful if you consider pairing it with a custom retry mechanism for error handling.

Figure 3. A Map State with Retries.
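As a rough sketch of that combination, the fragment below shows a Map state that fans out over the order items and retries each iteration independently. The item path, concurrency limit, and function names are assumptions made for the example:

```python
# Sketch of a Map state that reserves inventory for each item concurrently,
# with a retry policy applied inside each iteration. Names are illustrative.
reserve_all_items_state = {
    "Type": "Map",
    "ItemsPath": "$.order.items",   # the array of items to fan out over
    "MaxConcurrency": 10,           # cap parallelism to protect downstream services
    "Iterator": {
        "StartAt": "ReserveItem",
        "States": {
            "ReserveItem": {
                "Type": "Task",
                "Resource": "arn:aws:lambda:us-east-1:123456789012:function:reserve-item",
                "Retry": [
                    {
                        "ErrorEquals": ["States.TaskFailed"],
                        "IntervalSeconds": 1,
                        "MaxAttempts": 2,
                        "BackoffRate": 2.0
                    }
                ],
                "End": True
            }
        }
    },
    "Next": "ConfirmCheckout"
}
```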

Execute Only Once

The Standard (default) Step Functions workflow type gives you 90 days of idempotency, as long as the execution name and the input of a given workflow are the same.

Imagine the case where your client (another service) starts the execution of a specific workflow, and before it gets the confirmation that the execution has begun, the connection drops.

Figure 4. A successful execution that was not acknowledged by the client.

Without this feature, you would either have to detect the additional attempt yourself — like the example illustrated in figure 5 — or end up with a duplicate execution.

Figure 5. A simple duplication detection scheme.

With Step Functions, if you start the workflow with a name that reflects the unique context of the execution, you get this control without having to add any more code. Figure 6 illustrates the first execution attempt, where we start the workflow using the name create-shipping-order-X, where X is the order identifier.

Figure 6. The first invocation executes.

A second attempt with the same payload and the same name would generate the error illustrated in figure 7.

Figure 7. Using the same name for the execution and the same input would trigger the error. No duplicate execution.
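The caller side can stay very small. Below is a minimal sketch using boto3, assuming a hypothetical create-shipping-order state machine; the ARN and payload shape are placeholders:

```python
import json
from typing import Optional

import boto3

sfn = boto3.client("stepfunctions")

STATE_MACHINE_ARN = (
    "arn:aws:states:us-east-1:123456789012:stateMachine:create-shipping-order"  # illustrative
)

def start_create_shipping_order(order_id: str, payload: dict) -> Optional[str]:
    """Start the workflow with a deterministic name so a retried call
    cannot spawn a duplicate execution for the same order."""
    try:
        response = sfn.start_execution(
            stateMachineArn=STATE_MACHINE_ARN,
            name=f"create-shipping-order-{order_id}",  # unique per order
            input=json.dumps(payload),
        )
        return response["executionArn"]
    except sfn.exceptions.ExecutionAlreadyExists:
        # A previous attempt already started this execution; treat the call
        # as a duplicate instead of creating a second workflow run.
        return None
```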

The Bad

As I interacted with the teams or tried to model some of the use cases, I found aspects that felt limiting, were not clear at first, or required some course correction in the developers’ mindset.

Limit of 25K Events Per History

One of the nice aspects of the standard Step Functions is that it stores the execution history of a workflow for 90 days. The execution history is a collection of “events” like the example below:

Figure 8. Standard events for an execution history with lambda state.

As you can see, for a lambda-backed state you can expect 5 events consisting of TaskStateEntered, LambdaFunctionScheduled, LambdaFunctionStarted, LambdaFunctionSucceeded, and TaskStateExited.

This means that the 25K limit actually translates to fewer than 5K state executions, especially if you factor in that some executions may be retried, effectively multiplying the number of events generated. While for most use cases this is still enough not to be a concern, I found that in situations where we had to process a large number of items (more than 1,000) or run a workflow with 15+ steps, we came dangerously close to the limit.

If this happens, the execution of the Step Function is stopped with an error and you have to recover manually, as some steps will have been executed but not all.

To preemptively handle a situation like this, you can:

  • Inquire about the maximum number of items that will need to be iterated;
  • Add a monitor to allow you to fail before even beginning the execution of the workflow, if this number is bigger than the maximum acceptable;
  • Break the work so that it continues as a new execution.

Imagine you have to execute 5 steps for a list of 1,000 items. You could model your workflow like the one illustrated in figure 9.

Figure 9. Breaking an execution with multiple items into different Step Functions.
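One possible shape for the “start a new execution” step in figure 9 is a continuation Lambda: the earlier states of the workflow handle the first batch of items, and this final step hands whatever remains to a brand-new execution of the same workflow. This is only a sketch under assumed field names (items, stateMachineArn, executionName, page), not our production code:

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

BATCH_SIZE = 500  # keep each execution comfortably under the 25K event limit

def continue_as_new_handler(event, context):
    """Hypothetical final step of the workflow: everything beyond BATCH_SIZE
    is forwarded to a fresh execution so no single history grows too large."""
    remaining = event["items"][BATCH_SIZE:]
    if not remaining:
        return {"done": True}

    next_page = event.get("page", 0) + 1
    sfn.start_execution(
        stateMachineArn=event["stateMachineArn"],      # the same workflow
        name=f'{event["executionName"]}-{next_page}',  # keep execution names unique
        input=json.dumps({**event, "items": remaining, "page": next_page}),
    )
    return {"done": False, "forwardedItems": len(remaining)}
```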

As you can see, there are solutions, but they make you go out of your way to satisfy this hard limit. At this point, I wish the 25K limit were higher, or that it at least counted only the state executions, as opposed to all the pre- and post-activities.

Everything is a State

“I suppose it is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail.” — Abraham Maslow

When a new tool, language, or pattern appears, we get excited about all the possible ways it can be leveraged. All of a sudden, we may start abusing it, trying to use it where it is not needed and bringing additional complexity and/or cost. Step Functions are no exception to this rule!

One way I approach modeling a Step Function is to first look at a diagram that describes the process we need to comply with. In most cases, this means looking at a simplified BPMN diagram like the one illustrated in figure 10.

Figure 10. Process for adding a product to the Warehouse. Simplified for space reasons.

A very straightforward way of looking at it is to translate the BPMN in a 1:1 fashion, generating the following workflow:

Figure 11. Workflow for the process using a 1:1 mapping.

This approach typically has the following characteristics:

  • Higher Cognitive Load

As a developer, you tend to keep a mental model of what is going on so you can understand the execution flow and where to make changes. Once you start breaking it down into small(er) pieces, there comes a point where this plays against you because of the overhead added by the “artificial” boundaries represented by each state’s input/output.

  • Higher Execution Cost

Those steps are backed by Lambdas, which means that having many small Lambdas can start to cost more, as the warm-up time outweighs the time spent running your actual code. Add to that the fact that each state transition costs something, and those amounts can add up pretty quickly.

  • Higher Execution Time

If you look at the execution history of a Step Function, there is a set of activities that happen before and after the execution of the Lambda itself. All of those activities add to the total execution time. For long-running processes this may not be critical, but it should not be forgotten, especially for synchronous executions.

To be clear, there is nothing wrong with the workflow as is, especially in this contrived example, but as you have more and more tasks in your process the overhead can start to offset the benefits. My advice is to take a step back, start with a 1:1 model based on the BPMN, and group the states after asking yourself the following questions:

  • Is the state performing a write operation?
  • Is the state performing a read-only operation?
  • Is the state performing a read-only operation that is costly (time/money) or prone to too many failures?

My rule of thumb is that if a state performs a write operation, it should be in its own state, because you want to leverage the retry capabilities in case the resource is temporarily unavailable, for example when a database is under load or a service is unresponsive.

Now, if you have read-only activities that precede that write, you may combine them in the same state, especially if they are cheap to execute. If the write operation fails and you have to repeat the entire execution, the impact should not be significant. See the example below:

Figure 12. Grouping the read operations with the writes that depend on it.

The Add Product task depends on enriching the input with size information that comes from an external Catalog service. I combined the two because obtaining the size is idempotent and, at least in this example, a reliable and cheap operation to execute.
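For illustration, the Lambda behind that combined state could look roughly like the sketch below. The Catalog call, table name, and payload fields are hypothetical; the point is simply that the cheap, idempotent read lives in the same state as the write, so a retry repeats both together:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
products_table = dynamodb.Table("warehouse-products")  # illustrative table name

def add_product_handler(event, context):
    """Hypothetical handler for the combined 'Add Product' state:
    the enrichment read and the write happen in one state."""
    size = fetch_size_from_catalog(event["sku"])  # cheap, idempotent read
    products_table.put_item(                      # the write we want retried as a unit
        Item={"sku": event["sku"], "size": size, "quantity": event["quantity"]}
    )
    return {"sku": event["sku"], "size": size}

def fetch_size_from_catalog(sku: str) -> str:
    # Placeholder for the call to the external Catalog service.
    return "M"
```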

On the other hand, if one or more of those read operations are expensive, for example reaching a legacy system or third-party service that imposes a cap on the number of requests, you may want to break it down into separate states. This way if the write operation fails you would not have to perform the entire set of calls again.

Figure 13. The duty classification is provided by an expensive operation so it was kept separate.

In the end, make sure that you are using Step Functions if your use case can benefit from the features it provides.

Out-of-Order Execution

A common workflow has states backed by Lambdas that perform the compute tasks you expect: contacting additional services, manipulating the result, and saving to some persistence medium. Since Lambda functions have a cold/warm cycle, an out-of-order execution can happen when two events associated with the same resource arrive too close to each other.

Figure 14. Out-of-order execution when the cold start of a Lambda inside Step Functions happens for events of the same entity.

To be fair, this is not exclusive to Step Functions but it is an aspect that is not so evident and can be easily overlooked.

Unfortunately, there is no universal solution, but one approach is to serialize the executions based on the entity that would be mutated. For example, if your process handles changes to customer orders, you want changes to the same OrderId to happen one after the other.

You could achieve this by using SQS FIFO with the OrderId as the MessageGroupId and only moving to the next event if no other modification is underway to that same order.

Figure 15. A simple concurrency control to limit the execution of an update to the same entity.
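On the publishing side, that serialization boils down to sending every change for a given order to the same message group. Here is a minimal sketch with boto3, assuming a hypothetical order-updates.fifo queue and a version field on the change; the consumer would then only start the next execution for a group once the previous one has finished:

```python
import json
import boto3

sqs = boto3.client("sqs")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/order-updates.fifo"  # illustrative

def publish_order_update(order_id: str, change: dict) -> None:
    """All updates for the same OrderId share a MessageGroupId, so SQS FIFO
    delivers them in order, one group at a time."""
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"orderId": order_id, **change}),
        MessageGroupId=order_id,                                   # one group per order
        MessageDeduplicationId=f'{order_id}-{change["version"]}',  # or enable content-based dedup
    )
```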

This solution does not come without its drawbacks, so make sure to evaluate your use case and decide whether it is worth adding this complexity, or whether it is enough to add a detection/compensation action for when an out-of-order execution happens.

An improvement for AWS would be to provide this concurrency control as a feature, enabling a startExecution call to receive additional parameters to determine if it should reject the request because another logical operation is still in progress for the same workflow.

The Ugly

Here are the parts that I feel are suboptimal when using Step Functions and hopefully can be addressed as the service continues to evolve and receive new features.

Error Handling

Your state executions will fail and adding retries or catches for those errors is something that you are going to be doing, or at least should be doing, as you develop your workflow.

If, on one hand, the error handling primitives are very powerful, on the other hand adding them is a tedious job, especially if you want to have common handling for the same types of errors.

Figure 16. Repetitive blocks of error handling.

It would be helpful if there were a way to create default and named policies that could be referenced throughout your workflow allowing you to define them in one place and re-use them as needed.
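Until something like that exists, one workaround is to generate the definition and stamp a shared policy onto every Task state from a single template. This is only a sketch of the idea (it does not descend into Map or Parallel branches), not a Step Functions feature:

```python
import copy

# Shared retry policy, defined once and reused everywhere.
TRANSIENT_RETRY = [
    {
        "ErrorEquals": ["States.Timeout", "Lambda.ServiceException"],
        "IntervalSeconds": 2,
        "MaxAttempts": 3,
        "BackoffRate": 2.0,
    }
]

def with_default_retry(definition: dict) -> dict:
    """Return a copy of an ASL definition where every top-level Task state
    without an explicit Retry block receives the shared policy."""
    result = copy.deepcopy(definition)
    for state in result.get("States", {}).values():
        if state.get("Type") == "Task" and "Retry" not in state:
            state["Retry"] = TRANSIENT_RETRY
    return result
```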

Developer Experience

As a developer, having a robust and responsive development environment is key. Traditionally this means being able to run code locally, creating a short feedback cycle. To help you achieve this, tooling has evolved a lot over the years, and container-based solutions, such as Docker, attempt to address the infamous “works on my machine” scenario.

Developing for the Cloud comes with a mentality shift as you can’t run the entire AWS ecosystem locally. Leveraging mocks for those AWS services generally helps, especially if your code uses some abstractions and can make use of dependency injection to replace real calls with mocked ones. So, what does a local development environment for Step Functions look like?

Essentially you have two solutions:

  • Serverless Framework + AWS Localstack
  • AWS Local Step Functions + SAM

My first complaint is that neither solution provides a robust experience; both require significant setup and maintenance work to keep them running.

Serverless Framework

Anyone working with the Serverless Framework has probably gone through having a broken environment because the newest release had a backward-incompatible change. Additionally, running Step Functions locally depends on third-party plug-ins that do not offer feature parity with their AWS counterparts.

AWS Local

AWS offers a way to run Step Functions locally, but it is not the straightforward, 1-step process you would expect — or deserve. Similar to the Serverless alternative, it also does not have full parity, missing a way of running synchronous Express workflows.

Ideally, we would see AWS providing an integrated toolset that takes your SAM- or CDK-based infrastructure definition and lets you run the workflow + Lambdas locally, with an option to deploy to a cloud sandbox for workflows that need to interact with other AWS services directly.

Keep Walking

Despite the not-so-great items listed in this article, I still believe that Step Functions are a good option and should be considered whenever their core features can help you shift some responsibilities from your code to the infrastructure.

AWS frequently releases new features to its services and Step Functions is no exception, so I believe things are going to get better over time.

This article would not have been possible without the contributions of the SSENSE TECH teams that shared their experiences throughout this journey.

Editorial reviews by Deanna Chow, Liela Touré & Pablo Martinez

Want to work with us? Click here to see all open positions at SSENSE!
