Implementing Sagas using AWS Step Functions: Advanced Concepts

Mario Bittencourt
Published in SSENSE-TECH · Apr 23, 2021

AWS Step Functions provide a managed solution to orchestrate microservices. Part I of this SSENSE-TECH series presented the Saga pattern as the recommended approach to handle the complexity of distributed transactions, and Part II showed how to leverage Step Functions to model an orchestrated Saga with our retail domain example.

The next steps, covered in this article, address more advanced use cases and discuss important aspects of Step Functions you should be aware of when you decide to leverage them in your application.

Not All Sagas Are the Same

AWS Step Functions come in two flavors: Standard and Express. While the Amazon States Language (ASL) is the same for both, some features are only available in the Standard type or carry different limits than in the Express type.

A detailed comparison can be found here, but the main differences revolve around the maximum duration of the state machine, pricing, and execution guarantees.

If my Saga has the following characteristics, I choose the default Standard type:

  • Can last longer than 5 minutes
  • Needs to execute tasks that on their own are not idempotent
  • Benefits from or demands audit capabilities

On the other hand, if your Saga does not have those characteristics you are free to choose the Express type. At the very least, it will provide you with the orchestration features I have been discussing, and at a lower cost.

Idempotent Execution

A common concern for any message-based distributed system is how to make it idempotent. In this pursuit, one problem to be addressed is what to do when you receive the same message twice. One approach is to detect that the message is a duplicate and simply drop it before it gets processed.

Step Functions help you with that by providing, at least with the Standard type of state machine, a mechanism to prevent the same request from being processed twice. This comes as an exactly-once execution guarantee.

How Does it Work?

When you want to execute a Step Function you call the StartExecution API, providing the ARN of the state machine you created, an input, and, optionally, a name for that execution. The code snippet below shows an example of starting the state machine defined in Part II with the cart contents.
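A minimal sketch of that call using boto3; the state machine ARN and the cart payload are illustrative:

```python
import json

import boto3

sfn = boto3.client("stepfunctions")

# Start the Saga passing the cart contents as input. No name is given,
# so AWS will generate a unique one for this execution.
response = sfn.start_execution(
    stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:PlaceOrderSaga",
    input=json.dumps({"cartId": "2f9c0a1e", "items": [{"sku": "SKU-123", "quantity": 1}]}),
)
```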

Successful execution of the StartExecution API call will return to you the execution ARN and the start date of the execution.

If you do not provide a name, AWS will automatically create a unique one and associate it with the execution. This is what you see in the console when you browse previous executions and inspect what happened.

Figure 1. Name automatically created after StartExecution call.

StartExecution is idempotent with respect to the combination of name and input. This means that if you try to start the same state machine with the same name and input you will not get a duplicate execution. The code below shows the use of the name parameter to define the execution name.
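Continuing the earlier sketch, the same call now passes the optional name parameter (the value is illustrative):

```python
# Retrying this exact call with the same name and input will not
# create a second execution.
response = sfn.start_execution(
    stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:PlaceOrderSaga",
    name="my-first-execution",
    input=json.dumps({"cartId": "2f9c0a1e"}),
)
```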

The way AWS reacts to this depends on whether the original execution is still running or has already finished.

If the state machine execution has begun but not yet finished, and you attempt to start a new execution with the same name and input, AWS will return the same response it returned when the first execution began, without starting a new one.

If the initial execution has already finished and you attempt to start it again, you will receive an ExecutionAlreadyExists error.
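One way to treat that error as a benign duplicate, reusing the client from the earlier sketch:

```python
try:
    sfn.start_execution(
        stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:PlaceOrderSaga",
        name="my-first-execution",
        input=json.dumps({"cartId": "2f9c0a1e"}),
    )
except sfn.exceptions.ExecutionAlreadyExists:
    # The execution already ran under this name: drop the duplicate
    # request instead of failing the caller.
    pass
```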

How to Leverage the Idempotent Execution?

How you can leverage this will be influenced by your exact domain and use case, but one way is to craft a name that is unique to the context of your execution. That way, a duplicate request produces the same name and input, and you are protected from duplication.

In our retail example, imagine that as customers interact with the site, adding and removing items from their shopping cart, there is a single identifier associated with the cart. You could use that CartId to build the execution name, as sketched below.
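The change amounts to deriving the name from that identifier; the value below is illustrative:

```python
cart_id = "2f9c0a1e"  # hypothetical identifier coming from the shopping cart

# A duplicated checkout request for the same cart maps to the same
# execution name and is therefore not processed twice.
sfn.start_execution(
    stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:PlaceOrderSaga",
    name=f"place-order-{cart_id}",
    input=json.dumps({"cartId": cart_id}),
)
```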

With a simple change like this you prevent duplicate executions of the same request. If you can, try to model your Saga to leverage this functionality, as it spares you from failures or more complex compensating actions caused by processing a duplicate request.

Limitations

There are a few limitations to this exactly-once execution guarantee:

  • It is only available with the Standard type; Express Workflows do not offer this capability.
  • It is only guaranteed for 90 days from the first execution. If you attempt the same name and input after this period, AWS will accept the request and start a new execution.
  • It is restricted to the same AWS account, region, and state machine. If any of those change you will be able to have multiple executions. Normally, this should only be a concern if you have an active-active multi-region setup but it is important to be aware no matter what.

Handling Long Processes

Sagas can be used to model any process that benefits from or demands a transactional context, and while most examples tend to reference processes expected to finish in a few seconds, sometimes that is not the case.

If you have a situation that may require processing large amounts of data or that involves human interaction, you may have to handle a process that will take a long time to conclude.

The Standard Step Function supports executions that can take up to one year to conclude, so the service itself supports long processes without any additional change. What is left is to model your process to account for a state whose execution can take a long time, pausing the state machine while you wait for it to finish.

Let’s look at one example related to our domain. As part of placing an Order we have to locate the items in the warehouse and package them before actually shipping them. These tasks involve human interaction and may take quite some time from beginning to end. In this case, you will want to notify the Warehouse of the new Order but then stop the processing of the Saga and wait for news from the Warehouse. Figure 2 illustrates this.

Figure 2. A Saga using a callback pattern to allow a long process and a stop/resume approach.

In this approach, you create a state that contacts the Shipping service from the Warehouse and stops the transition to the next state until it receives a response. This response can indicate success or failure, the latter triggering the appropriate compensation as before.

Step Functions support an integration pattern referred to as callback with a task token. To use it, you configure your task by adding the suffix .waitForTaskToken to its resource. This generates a token that you can later use to resume the execution of the state machine when appropriate.
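A sketch of such a task definition, as a fragment of the state machine's ASL; the state name, function name, and payload shape are assumptions for this example:

```json
"ShipOrder": {
  "Type": "Task",
  "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
  "Parameters": {
    "FunctionName": "notify-shipping",
    "Payload": {
      "orderId.$": "$.orderId",
      "taskToken.$": "$$.Task.Token"
    }
  },
  "Next": "RecordShippingResult"
}
```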

In this configuration, you see that we have access to the token via the $$.Task.Token syntax. The $$ refers to a special context object created by AWS for you.

In our example, the function referred to in the task definition uses SQS to send a message (a command) to the Shipping service to ship the Order, passing the token with it. When the Shipping service finishes processing the Order, it sends the reply back via SQS as well, effectively implementing a reply channel.
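A sketch of what that function could look like; the queue URL and message shape are assumptions:

```python
import json
import os

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = os.environ["SHIPPING_COMMAND_QUEUE_URL"]  # hypothetical command queue

def handler(event, context):
    # 'event' is the Payload from the task definition: the order id plus
    # the task token injected through $$.Task.Token.
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({
            "command": "ShipOrder",
            "orderId": event["orderId"],
            "taskToken": event["taskToken"],
        }),
    )
```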

At this moment, the consumer of this reply resumes the execution of the Saga by calling SendTaskSuccess or SendTaskFailure. In either case you must specify the task token, and the state machine will transition to the next state. The following snippet shows a simplistic implementation of a lambda that receives the reply and continues the execution.
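A sketch of that consumer, assuming the reply arrives through an SQS event source and carries status, taskToken, and an optional reason field:

```python
import json

import boto3

sfn = boto3.client("stepfunctions")

def handler(event, context):
    for record in event["Records"]:
        reply = json.loads(record["body"])
        if reply["status"] == "SHIPPED":
            # Resume the Saga: the state machine moves to the next state.
            sfn.send_task_success(
                taskToken=reply["taskToken"],
                output=json.dumps({"shipped": True}),
            )
        else:
            # Trigger the error path, and with it the compensations.
            sfn.send_task_failure(
                taskToken=reply["taskToken"],
                error="ShippingFailed",
                cause=reply.get("reason", "unknown"),
            )
```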

Please note that if you have full control over the downstream service, you could even consider calling SendTaskSuccess or SendTaskFailure directly from that service. I would recommend doing so only if the services are in the same domain, to avoid unnecessary coupling between them.

Special Consideration for Handling Input

I want to discuss two aspects that we have to be mindful of when using Step Functions:

Exposing sensitive data:

Standard Step Functions retain the execution history of each state machine for 90 days. This is very handy because it allows you to not only have a visual representation of the actual execution, but also to inspect what each state received as input and returned as output.

This facility comes with a potential problem: the data returned from a state is visible to anyone who can see the execution trace.

Imagine you have the situation illustrated in Figure 3.

Figure 3. The output from one state contains sensitive information.

Input bigger than 256KB:

Currently, all types of Step Functions are limited to a 256KB payload as input. This means you cannot start the execution of a state machine if the input is bigger than that. The same limitation applies if the output from a state, and hence the input of the next, exceeds that limit.

While for many use cases this is not a problem, it is still something to be mindful of, as some workloads can indeed exceed it.

Imagine you have to place a B2B order that could contain hundreds, if not thousands, of different items.

Both situations have similar solutions. The simplest way to avoid hitting the limit, or exposing the data, is to store the sensitive or large portion of the payload in a secure and equally available managed service.

If you are concerned more with security than with size, you could leverage DynamoDB to store the data, encrypted or not, and pass only the key used to retrieve the information as the output of the state.

Figure 4. Using S3/DynamoDB to store big or sensitive data.

This way, if you need the information in the next state, or any subsequent one, you simply use the key to retrieve the actual data to be manipulated. If that information is ephemeral, you can even leverage DynamoDB's TTL feature to have it removed automatically, as sketched below.
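A sketch of a state's lambda that stores the sensitive part in DynamoDB and outputs only the key; the table name, key schema, and payload field are assumptions:

```python
import json
import time
import uuid

import boto3

# Hypothetical table with 'pk' as partition key and TTL enabled on 'expiresAt'.
table = boto3.resource("dynamodb").Table("saga-sensitive-data")

def handler(event, context):
    key = f"payment#{uuid.uuid4()}"
    table.put_item(Item={
        "pk": key,
        "payload": json.dumps(event["paymentDetails"]),
        # DynamoDB will remove the item roughly one day from now.
        "expiresAt": int(time.time()) + 24 * 60 * 60,
    })
    # Only the key appears in the execution history.
    return {"paymentDetailsKey": key}
```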

On the other hand, if you are more concerned with size, S3 is the service I would consider; similarly to the previous solution, you pass the bucket and key needed to access the file. You can establish a lifecycle policy to automatically remove the S3 objects after a predetermined time.
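The equivalent sketch for the size concern, assuming a hypothetical bucket with an expiration lifecycle rule:

```python
import json
import uuid

import boto3

s3 = boto3.client("s3")
BUCKET = "saga-large-payloads"  # hypothetical bucket

def handler(event, context):
    key = f"orders/{event['orderId']}/items-{uuid.uuid4()}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(event["items"]))
    # The next state receives a pointer well under the 256KB limit.
    return {"orderId": event["orderId"], "bucket": BUCKET, "key": key}
```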

Divide and Conquer

I have encountered the need to execute the same set of tasks multiple times, as part of a loop. Normally, this would mean encapsulating those tasks in a Lambda and having a simple loop inside it.

While there is no problem with that, Step Functions have a state type called Map that allows you to effectively define a sub-state machine that is executed once per item of an input array.

This facility comes in handy because each iteration has all the power of a state machine, including handling failures with retry and catch capabilities. To make things more interesting, Map can execute the iterations in parallel and combines the output from all iterations for you at the end of the execution. This means you can offload this logic from your code to AWS.

Imagine as part of our retail example that you can have the order shipped in separate packages. As part of your Saga, this means that each package needs to get a separate tracking number from the carrier. Because the carrier is a third-party service it is currently limited to one package per API call.

Figure 5. Executing a loop.

The snippet below sketches the definition of this state. You pass the list of packages to the lambda that interacts with the carrier service.
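The state names, paths, and function ARNs below are illustrative:

```json
"GetTrackingNumbers": {
  "Type": "Map",
  "ItemsPath": "$.packages",
  "MaxConcurrency": 3,
  "ResultPath": "$.trackingNumbers",
  "Iterator": {
    "StartAt": "GetTrackingCode",
    "States": {
      "GetTrackingCode": {
        "Type": "Task",
        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:get-tracking-code",
        "Retry": [
          {
            "ErrorEquals": ["States.TaskFailed"],
            "IntervalSeconds": 2,
            "MaxAttempts": 3,
            "BackoffRate": 2
          }
        ],
        "Next": "AddLocalCode"
      },
      "AddLocalCode": {
        "Type": "Task",
        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:add-local-code",
        "End": true
      }
    }
  },
  "Next": "NotifyCustomer"
}
```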

The configuration sets a concurrency level of 3, meaning that at most 3 iterations will execute at the same time. By default (MaxConcurrency: 0), Map will try to execute as many iterations concurrently as possible, so it is advisable to evaluate which downstream resources you will be contacting to avoid putting unnecessary load on them.

In the end, the result of each iteration will be added to an array that will be available for the next state outside the Map.

Note that the Map definition follows the same structure as any state machine. This means it can contain several other states inside, including other Map states! In the example provided, we contact the external service to retrieve the tracking code and proceed to add a local code based on the response.

That’s Not All Folks

I have been using Step Functions to address some of the complexities inherent in distributed systems. While the results have been positive there are still areas I want to investigate further or that can still be improved by AWS.

Much like serverless, adopting Sagas/Step Functions requires adjusting how you approach modeling and developing your systems. Right now, the developer experience is not as mature or well polished as I would like it to be. The Serverless Framework, LocalStack, and the AWS Serverless Application Model (SAM) are some of the alternatives, but as of today each one seems to fall short on feature parity, debugging capability, or time to deploy.

Leveraging feature flags can be trickier with state machines, so handling this effectively is also on my to-do list.

Finally, I want to evolve the modeling to leverage a combination of Express and Standard workflows to optimize cost and latency without jeopardizing the visibility or idempotent guarantees.
