Implementing Sagas using AWS Step Functions

Mario Bittencourt
SSENSE-TECH
Published in
10 min readFeb 26, 2021

Distributed systems are naturally complex. Part I of this SSENSE-TECH series demonstrated how transactions that were taken for granted, require a specific approach to be achieved in such systems.

I presented some of the challenges associated, advocated the use of Sagas to help us to bring back transaction support, and presented the two flavors of Sagas: Choreographed and Orchestrated.

In this article, I will defend the case on why AWS Step Functions (SF) can be a good option and introduce how to implement our sample domain using it. So let’s begin by taking a closer look at the inner workings of an orchestrated Saga.

Orchestrated Saga Building Blocks

At the heart of the orchestrated Saga is the SEC (Saga Execution Coordinator), which like the name implies, is responsible for coordinating the execution of the actions involved to deliver the expected functionality (ex. place an order).

An SEC maintains a definition of which actions should be executed, in which order they will be executed, and the compensating actions in case of failures. This can be represented as an acyclic directed graph (ADG) and implemented using a state machine pattern. Figure 1 represents one simple Saga.

Figure 1. A simple Saga.

The SEC persists the current state and uses that state to determine what to do next when receiving the replies from the commands it sends.

In the example above, the SEC instructs the system to allocate the item, keeps track of this, and upon receiving the response determines whether I ship the item or cancel the order.

Due to its central role, any implementation of an SEC should be:

  • Scalable

You should be able to start as many SEC executions as necessary to accommodate the business needs.

  • Resilient

Problems will happen and the compute node responsible for the SEC may need to be interrupted (ex. hardware problems, maintenance). You want to be able to resume from where it stopped without any additional issues.

  • Multiplexed

The implementation must be able to route the replies to previously sent commands to the right SEC instance as you will have many executions in parallel, for the same or different process.

  • Versioned

A Saga is a representation of a business process, and those evolve over time. It is important that the SEC respects the version of the process it started executing.

  • Ignorant of the business logic

The SEC is supposed to do the coordination, and must not contain domain logic, which should still be hosted at the services.

Although necessary, none of the aforementioned requirements provides direct business value. Developing and maintaining them can consume considerable effort that could be directed to delivering new customer-facing features.

When this happens, I usually recommend looking if there is a better alternative in the form of a non-intrusive framework or managed service that you can leverage. In the case of managed services, not only do they provide you with the inner workings of an SEC but also with the infrastructure needed to support your application.

In the cloud space, there are options available by the major players, such as AWS Step Functions, Google Workflows, or Azure Logic Apps. Since SSENSE currently uses AWS, I decided to leverage it in our projects.

Step Function Basics

Before we can define our Saga, let’s look at the basics of AWS Step Functions.

At its core, the Step Function provides a way to orchestrate tasks that need to be executed. Start by defining a workflow that will be followed.

There are two basic elements of any workflow:

  • States

Each state represents an action that will be performed by the Step Function. There are several types of states to provide flexibility for your workflow.

  • Transitions

After you perform an action defined by the state you will proceed to the next state. Some states allow the definition of loops, so the next state can be the same in a different iteration.

A workflow is defined using JSON following the Amazon State Language (ASL) specification. You can see there are many types of states that you can combine in your workflow, with the main ones listed below:

  • Task

Execute an action, which can be carried by integrating with a number of AWS services. Common examples can be invoking a Lambda function (which in turn can do pretty much anything), or sending a message via SQS, or calling an API backed by the API Gateway service.

AWS is periodically increasing the list of services that can be directly called. For the most up-to-date list, please check here.

In the previous fragment, we see there is one state called “Create Order in Pending” of type “Task” and it will call a Lambda function by specifying which one in the “Resource” field.

  • Choice

Allow you to perform branching in your workflow based on some condition — usually the result of a previous state. You can have many possible choices, including a default one if no direct condition matches.

The fragment shows a state that will evaluate the contents of a carrier variable and based on the values select what is the next state to transition to. The default option indicates if no specific options match the criteria we should transition to the “Canada Post” state.

  • Map

Allow iterating over a list performing a series of actions on each element. It is particularly useful when you need to implement a scatter-gather approach within your workflow.

For example, imagine your process needs to know how much to charge for the delivery and it depends on the weight of each package.

You could receive the list of packages and for each calculate the amount. In the end, you aggregate all the information received.

The fragment below illustrates how to do so with the Map state.

This state is particularly powerful because in practice what gets executed is another workflow that can be as complex as needed. You can see that the Iterator object follows the same structure as the workflow, with a StartAt element, a Next element, and so on.

If one or more states are reaching external resources it is important to control the concurrency of the iteration. If left unchecked you can cause a Denial of Service (DoS) on your own services or reach some sort of throttling cap.

By default, the concurrency level is set to 0, which means the Step Function will try to execute as many iterations in parallel as possible. If you want to have a sequential approach you can specify the concurrency to 1.

AWS will execute each state and you have control over the input it receives and the output it generates. In its simplest form, the output from a previous state will serve — or at least be available — for the next state as its input.

In practice, you have a high degree of freedom to manipulate the input and output through a series of modifiers that can be applied to the raw inputs or outputs produced by the execution of tasks. Figure 2 shows how the input, referred to as Raw Input, is manipulated before is actually passed to the AWS service. In a similar fashion, the output of the AWS service can be manipulated and passed to the next state.

Figure 2. Input and output manipulation.

In a simplified way, the InputPath and OutputPath act as filters, allowing you to select portions of the data to be used (InputPath) or exposed (OutputPath). The Parameters and ResultSelector can create new data structures by rearranging the portions of the data as needed.

For more details, this page contains an explanation of each of the modifiers. It is important to understand that there is a limit of 256KB for the input. If you happen to have a bigger payload you will have to resort to different ways to execute your workflow.

Since failure can be the result of the execution of any given Task, you can — and should — define what to do in this case. A common approach is to determine if you want to retry the execution or transition to a specific state in the case of failures.

You can specify what types of failures can trigger retries, how long to wait between failures, and even how many retries are to be attempted.

In the previous example, if you receive the error CarrierTimeout you will wait for 5 seconds before the first re-attempt. If it continues to fail, it will retry up to 4 more times, with a delay of 10, 20, 40, 80 seconds respectively.

For all other types of errors — States.ALL — you will transition to a different state to perform some compensating actions.

Our Domain

With the basics of Step Functions covered, let’s look back at our domain and start building a solution using it. In order to create your Saga using Step Function you first have to focus on defining what is the process to be followed.

In figure 3 we have a simplified flow illustrated, based on the domain first presented in our previous article.

As you can see we have the happy path, but also states that explicitly focus on when problems occur.

Figure 3. The workflow for our sample domain.

The ASL used to generate that can be seen here.

Let’s break this into smaller sections to discuss:

  1. Start the execution

The API receives the input from the outside system — the one placing the order — and starts the execution of the workflow.

The only relevant part is the specification of the first state, which is defined by the StartAt element.

2. Create the order in a pending state

Here this task interacts with a Lambda that represents the operations to be performed, such as validating the request and persisting the data in some sort of database. Note the output generated is a representation of the order created.

The focus here is on three parts:

  • We specify which parameters from the input are to be passed to the actual task. See the Parameters element.
  • The actual task will be carried out by the execution of a Lambda function specified by the Resource element which contains the ARN of the function.
  • The result of the state will consist of the original input and a new element order that will contain the output generated from the current state. This is specified by the ResultPath element. We will use this information in future states, for example, if we need to cancel or confirm the order.

3. Authenticate the card

This task is a Lambda that interfaces with the payment service, which processes the credit card’s information. The output of this is some token returned that will be used later.

This illustrates how since this service can be unresponsive, we have established a retry mechanism.

Here we have established two retry conditions, one based on standard errors (Lambda.ServiceException, Lambda.AWSLambdaException, and Lambda.SdkClientException) and another one using custom errors (Payment.GatewayUnavailable). Each one can have different retry details, as defined by the IntervalSeconds, MaxAttempts, and BackoffRate.

The Catch element informs us that if an error not listed in the previous Retry element happens, or we exhaust the number of retries we should transition to the state responsible for canceling the order.

4. Cancellation of the order

In the previous state, we can see the first branching in action. If the authentication was declined — for example, due to insufficient funds — instead of continuing with processing, we finish by marking the order with a proper state.

Note that we use the name order that was created in the first state, when it was in pending as input and replace it with an updated version, now canceled.

5. Reserve the item

This task will remove from the stock the item being requested. If we are unable to secure the item from the stock we will have to perform a compensating action to release the funds authorized in a previous state.

6. Capture the funds

This will interact again with the payment service to now capture the funds.

7. Dispatch the order

It will interact with the shipping service that is responsible for making sure the order is shipped to the customer.

Successful execution of the workflow looks like Figure 4.

Figure 4. Successful execution of the Saga.

Compensating actions

Any of the previous tasks could have an “unexpected” negative outcome. For example, if during the execution of the dispatching of the order something prevents it from happening — the item is damaged — you have to undo some of the tasks, such as refunding the customer the amount and removing the reserve from the item to the order.

We carried these by specifying the Catch elements in a way that the first one that fails will trigger a cascade of actions that will semantically undo the previous ones.

Figures 5,6,7, and 8 show examples of unsuccessful executions at various stages and their consequences.

Figure 5. Failure to authenticate the payment.

If we fail to authenticate the payment we proceed directly to mark the order as canceled.

Figure 6. Failed to reserve the item.

If we successfully authenticated the payment but were unable to reserve the item we have to release the authentication and then mark the order as canceled.

Figure 7. Capture the funds failed.

If the failure happens at the capture payment we proceed to release the item reservation and cancel the order.

Figure 8. Failed to ship the order.

Finally, if there is something that prevents the shipment we have to refund the item and mark the order as canceled.

It is important to stress that this example has been simplified as there are many other actions — and their corresponding compensating ones — in a real process. The main purpose of this is to stress the thought process and how to translate this into an implementation.

Scenes of the Next Chapter

So far, we have been able to see how our sample domain could be implemented using Step Functions. In the next and final part of this SSENSE-TECH series, I will revisit this implementation and present more advanced strategies that can be leveraged when dealing with large inputs, how idempotency is handled, and processes that depend on asynchronous third-party services and/or contain human intervention.

Editorial reviews by Deanna Chow, Liela Touré, and Pablo Martinez.

--

--