Handling Complexity: Using Sagas to Provide Transactional Support for Distributed Systems

Mario Bittencourt
SSENSE-TECH
11 min read · Jan 29, 2021

Distributed systems, especially with the use of the Microservice Architecture, are the new normal. Long gone are the days when we could take for granted the atomic nature of using a single relational database and a monolithic system to provide our users with the functionality they expect.

In this microservice world, you trade the atomicity of a single database and a single execution context for a distributed execution that relies on remote procedure calls or message exchanges.

Unfortunately, those who have taken this path have often overlooked that transactional support is lost in the process, and ultimately had to deal with it the hard way: applying fixes in production to address strange, hard-to-debug inconsistencies.

This is the first of a series of articles that aims to highlight the challenges associated with complex distributed systems, how Sagas can help solve some of those problems, and how to implement a solution using AWS’ Step Functions.

Anatomy of a Distributed System

To serve as the backdrop for this article, let’s define a fictitious retail system. Illustrated in figure 1, you can see it is made of multiple components and services.

Figure 1. Components in our fictitious e-commerce solution.

In this system, you have a user-facing segment responsible for all activities up until the order is placed. Up to this point, the decision has been to make synchronous calls, but once the request to place the order is received by the Order Service, all subsequent interactions between the involved services will be asynchronous, leveraging a messaging solution.

Figure 2. Using an event bus to enable asynchronous communication between the components.

There are usually three main reasons to justify this decision:

1. User involvement is no longer necessary

Once the order has been placed there is no more user interaction, so the site can be ready to serve the next user request.

2. Using messaging can improve resilience

Since the services communicate via messages, if one of them is under heavy load or unavailable, the originating service is not affected, as the message will be picked up once the affected service is back online.

3. Communicating through messages (events) makes the system more flexible

The asynchronous aspect already favors a temporal decoupling between services. However, here I am referring to the reactive nature where the Order Service, for example, does not know who will consume those messages, allowing you to add more services as requirements change.

Losing Transactional Guarantees

In the days of monolithic systems, especially those backed by a relational database, once you performed an operation, there was the certainty that it either succeeded or failed, without anything in between. You could leverage transactions to commit your changes or roll them back, allowing the system to return to its previous state in case of a failure.
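To make that guarantee concrete, here is a minimal sketch of the monolithic behavior using Python's built-in sqlite3 module (the table and failure are invented for illustration): a failure mid-operation rolls everything back, leaving the database exactly as it was.

```python
import sqlite3

# In-memory database standing in for the monolith's relational store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (sku TEXT PRIMARY KEY, stock INTEGER)")
conn.execute("INSERT INTO inventory VALUES ('shirt-01', 5)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE inventory SET stock = stock - 1 WHERE sku = 'shirt-01'")
        raise RuntimeError("payment failed")  # simulate a mid-operation failure
except RuntimeError:
    pass

# The rollback restored the previous state: stock is still 5.
stock = conn.execute("SELECT stock FROM inventory WHERE sku = 'shirt-01'").fetchone()[0]
print(stock)  # 5
```

It is exactly this commit/rollback behavior, provided for free by the database, that disappears once the operation spans several services.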

In our distributed system, this is no longer possible. The persistence is likely segregated and there is no longer the commit/rollback equivalent provided at a system level.

Imagine, in our example, that a message is lost after the order is placed. The items, although paid, would not be delivered to the user.

Figure 3. A message lost in transit or during execution.

It is important to highlight a common mistake made by those who adopt a distributed model: assuming there won't be problems, either in the infrastructure or in the execution of a given use case. Figure 3 illustrates a situation where the Order Approved message never makes it to the destination system (a problem in the event bus), or is consumed but the execution is aborted.

Figure 4. A part of the use case can’t be fulfilled.

Equally important is the case where a part of the promise can't logically be fulfilled, such as seen in figure 4, where the item can't be shipped because of a problem (e.g., damaged during inspection or handling).

So how do we address this problem? The more traditional solution to tackle this is by leveraging a technique called two-phase commit. Let’s see the basic concepts behind it and why we chose a different approach.

Two-phase Commit

The two-phase commit (2PC) is a consensus protocol, and its operating principle is simple: you have to reserve — “lock” — the resources you want to use in their corresponding services before actually executing the desired operation.

Figure 5. The successful scenario in a 2PC.

Above we see a successful execution scenario. The reserve and commit phases can have different meanings depending on the resource. For example, within the Inventory component, reserving could mean removing one item from the available stock, and committing could mean assigning it to the customer.
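The protocol above can be sketched in a few lines. This is a deliberately simplified coordinator (the `Participant` class and its method names are invented for illustration; real 2PC also needs timeouts and durable logs): commit happens only if every participant votes yes in the prepare phase, otherwise everything already reserved is rolled back.

```python
from dataclasses import dataclass

@dataclass
class Participant:
    """A hypothetical 2PC participant, e.g. Inventory or Payment."""
    name: str
    can_commit: bool = True          # the vote it will return in the prepare phase
    state: str = "idle"

    def prepare(self) -> bool:       # phase 1: reserve ("lock") the resource
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:        # phase 2a: make the change permanent
        self.state = "committed"

    def rollback(self) -> None:      # phase 2b: release the reservation
        self.state = "rolled_back"

def two_phase_commit(participants) -> bool:
    # Phase 1: every participant must vote yes before anyone commits.
    prepared = []
    for p in participants:
        if p.prepare():
            prepared.append(p)
        else:
            for q in prepared:       # any "no" vote aborts the whole transaction
                q.rollback()
            return False
    # Phase 2: all voted yes, so the coordinator tells everyone to commit.
    for p in participants:
        p.commit()
    return True

inventory = Participant("inventory")
payment = Participant("payment", can_commit=False)  # e.g. the card was declined
ok = two_phase_commit([inventory, payment])
print(ok, inventory.state)  # False rolled_back
```

Note how the inventory item stays locked from its `prepare` until the outcome of every other participant is known, which is the root of the shortcomings below.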

There are some shortcomings associated with 2PC:

  • Managing it can be challenging once you factor in failures

You can have failures in the component requesting all those reservations and commits (known as the coordinator) or in any of the participants.

  • It does not scale well

The number of messages exchanged grows quickly with the number of participants: with n participants involved, it can reach O(n²) messages.

  • Resources are locked for a long time

Since it needs to guarantee that all participants can commit before actually doing so, with a greater number of participants, a single one that can't commit leaves all the other resources unavailable to customers who could otherwise succeed.

A better approach comes by applying a pattern called Sagas, which I will cover next.

Sagas

Origins

The term Saga was introduced in a 1987 Princeton University research paper by Hector Garcia-Molina and Kenneth Salem. This paper discussed the problems of long-lived transactions (LLT) that exist in database systems and defined the Saga as one way to alleviate the performance bottleneck of locking the resources involved in a transaction for the entire duration of the operation.

In a database context, a Saga breaks the original long-lived transaction into a series of small operations, each of which still guarantees its atomic aspect, and in the case of failures, compensating transactions can be executed.

Figure 6. LLT vs Sagas resource locking differences.

In figure 6, I illustrate the difference between LLT and Sagas when considering the locking of the resources involved over time. During the LLT all resources are locked for the duration of the entire transaction to guarantee its atomic nature. This usually means that those resources can’t be accessed by other requests during that length of time.

In a Saga, you break down the transaction T into smaller ones (T1, T2, …Tn) and corresponding compensating actions (C1, C2, …Cn-1) that revert the changes from each Ti. The resources are only locked for the duration of each smaller transaction, still guaranteeing the atomic promise of each Ti.

In the case of an issue while executing a given transaction Tj, you apply the compensating actions Ci (where i < j) in reverse order (Cj-1, …, C1), semantically reverting the changes on the affected resources. This reduces the lock duration, which tends to increase throughput while still guaranteeing the atomic promise of T.
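The mechanics above can be sketched as a simple list of (Ti, Ci) pairs. The step names are hypothetical stand-ins for our retail example; the point is only the control flow, forward through the Ti and, on failure, backward through the already-registered Ci.

```python
# Each small transaction Ti is paired with a compensating action Ci.
# The last step has no compensation: if Tn succeeds, the Saga is complete.
steps = [
    ("reserve_item",  "release_item"),   # (T1, C1)
    ("capture_funds", "refund_funds"),   # (T2, C2)
    ("ship_item",     None),             # (T3)
]

def run_saga(steps, fail_at=None):
    """Run T1..Tn; on a failure at Tj, run Cj-1..C1 in reverse order."""
    completed_compensations = []
    log = []
    for action, compensation in steps:
        if action == fail_at:
            # Tj failed: semantically revert every completed Ti, newest first.
            for comp in reversed(completed_compensations):
                log.append(comp)
            return log
        log.append(action)
        if compensation:
            completed_compensations.append(compensation)
    return log

print(run_saga(steps))                       # happy path: all three Ti run
print(run_saga(steps, fail_at="ship_item"))  # T1, T2, then C2, C1
```

Notice there is no global lock anywhere: each Ti commits on its own, and consistency is restored after a failure rather than prevented up front.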

This served as an inspiration to address a similar problem presented in the previous section that affects distributed systems.

How It Works

The basic concept is simple: break the operation into smaller parts, usually at the service level, where you can guarantee the transactional context. Then connect them via messages and define what should happen if a problem arises during the execution of each of those parts.

Looking at our example domain, the division can be illustrated in figure 7.

Figure 7. The Saga on our e-commerce solution for placing an order.

We have two possibilities to implement Sagas:

  • Orchestration
  • Choreography

Choreography

With choreographed Sagas, the initial message is published and each service that is part of the expected process will consume it and trigger local actions. The result of such actions is communicated in the form of messages as well.

Figure 8. A choreographed approach to coordination.

For simplicity imagine that the choice was to model the process depicted in figure 8 and that a single item was ordered. In this happy path scenario we have the following chronology:

  1. The Order Placed event is published and the Inventory consumes it, reserving the item from the stock.
  2. The Item Reserved event is published and both Order and Payment consume it. The Payment captures the funds.
  3. The Payment Approved event is sent and both Order and Shipment consume it. The Shipment dispatches the package to the customer.
  4. The Item Shipped event is published and the Order consumes it. The Saga ends.
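This chronology can be sketched with a toy in-memory event bus (the bus and the event names here are simplifications for illustration; a real system would use durable messaging). Each handler stands in for one service and only knows which event it consumes and which it publishes next:

```python
from collections import defaultdict

# A minimal in-memory event bus following the figure-8 chronology.
subscribers = defaultdict(list)
trace = []  # records every event as it flows through the bus

def subscribe(event, handler):
    subscribers[event].append(handler)

def publish(event):
    trace.append(event)
    for handler in list(subscribers[event]):
        handler()

# Each service reacts locally and publishes the result as a new event.
subscribe("order_placed",     lambda: publish("item_reserved"))     # Inventory
subscribe("item_reserved",    lambda: publish("payment_approved"))  # Payment
subscribe("payment_approved", lambda: publish("item_shipped"))      # Shipment
subscribe("item_shipped",     lambda: trace.append("saga_done"))    # Order ends the Saga

publish("order_placed")
print(trace)
```

No component holds the overall picture: the process emerges from the chain of subscriptions, which is exactly the strength and the weakness of choreography.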

But what happens when something goes wrong? In the case of a negative outcome in a service, it will revert its state locally and publish a message that triggers the same behavior in the affected services. Using the same simplified example, imagine that when preparing the package, the item is damaged. Figure 9 illustrates what happens.

Figure 9. A choreographed approach to coordination handling compensation.

The Shipment publishes an Item Could Not Be Shipped event, and it is consumed by the Order, Inventory, and Payment. Payment will refund the customer, Inventory will mark the item as damaged in the stock, and the Order will notify the customer of the issue and inform them that the order will not be delivered.
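Compensation in a choreographed Saga is just more subscriptions: every affected service registers its own local compensating action against the same failure event. A minimal sketch (the event name and the action strings mirror the scenario above and are purely illustrative):

```python
from collections import defaultdict

subscribers = defaultdict(list)
compensations = []  # records each service's local compensating action

def subscribe(event, handler):
    subscribers[event].append(handler)

def publish(event):
    for handler in list(subscribers[event]):
        handler()

# Each service reverts its own state in reaction to the same failure event.
subscribe("item_could_not_be_shipped", lambda: compensations.append("payment: refund customer"))
subscribe("item_could_not_be_shipped", lambda: compensations.append("inventory: mark item as damaged"))
subscribe("item_could_not_be_shipped", lambda: compensations.append("order: notify customer"))

publish("item_could_not_be_shipped")
print(compensations)
```

The publisher of the failure event never learns who compensated, which keeps services decoupled but makes the full recovery path hard to see in one place.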

The Pros:

  • No single point of failure
  • No significant technology/concept changes (if you are already using an event-driven architecture)
  • No bottleneck (associated with the adoption of the Saga)

The Cons:

  • No simple way to know what state the process is in
  • No easy way to know who is involved in the execution of a process
  • Updates in the process may require changes in other services

The last point is worth stressing. Imagine you are now required to track loyalty points for each order that has an approved payment. This means a new consumer not only awards points on approval, but must also consume the event published when the Payment is refunded, to deduct the equivalent amount of points from the customer's balance.
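A tiny sketch of that hypothetical Loyalty service makes the trap visible (handler and event names are invented; the one-point-per-dollar rule is arbitrary): the second handler is easy to forget, yet without it the compensation path leaves the system inconsistent.

```python
# Hypothetical Loyalty service: points per customer.
balance = {"customer-1": 0}

def on_payment_approved(customer, amount):
    # Earn one point per dollar captured (an arbitrary illustrative rule).
    balance[customer] += amount

def on_payment_refunded(customer, amount):
    # Without this handler, a refunded order would leave points the
    # customer never actually paid for.
    balance[customer] -= amount

on_payment_approved("customer-1", 120)
on_payment_refunded("customer-1", 120)
print(balance["customer-1"])  # 0
```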

Orchestration

With orchestrated Sagas the initial message is published but consumed by one single component, known as Saga Execution Coordinator (SEC). The SEC is responsible for deciding which service(s) should be contacted, sending commands to those services, receiving the replies, and maintaining the state of the execution.

Figure 10. An orchestrated approach.

Figure 10 illustrates the happy path using an orchestrated Saga. The chronology is as follows:

  1. The Order Placed event is published and the SEC consumes it, starting the Saga execution and determining that the next step is to tell Inventory to reserve the item.
  2. SEC sends a command Reserve Item to the Inventory, which receives and handles it.
  3. Inventory replies to the SEC with a successful answer. The SEC receives the reply and executes the next step, which is to tell Payment to approve the payment. It also persists the current state (Item has been reserved).
  4. SEC sends a command Approve Payment to the Payment service, which receives and handles it.
  5. Payment replies to the SEC with a successful answer. Like in step 3, the SEC will receive it, evaluate the next step, and persist the current state.
  6. SEC sends a command Ship Item to the Shipment service, which receives and handles it.
  7. Shipment replies to the SEC with a successful answer. The SEC determines that the next step is to inform the Order service that it has completed the Saga.
  8. The previously placed Order now is moved to a Shipped state.

The SEC maintains the current state of the Saga at all times and can also store the responses it has received. This helps in case the SEC crashes and a new instance needs to be created: it can resume from the last known step. For simplicity, the example only sends one command at a time, but in practice nothing prevents you from sending simultaneous commands to different parts of the system.
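The chronology above boils down to a table-driven state machine. This is a minimal sketch of an SEC, not a production design (the command and service names are ours, the journal is an in-memory list standing in for durable storage, and the compensation branch is left out): it sends one command per step, records the reply, and stops advancing on the first failure.

```python
# The Saga definition the SEC walks through, one (command, service) per step.
STEPS = [
    ("reserve_item",    "inventory"),
    ("approve_payment", "payment"),
    ("ship_item",       "shipment"),
]

class SagaExecutionCoordinator:
    def __init__(self, handlers):
        self.handlers = handlers   # service name -> callable returning True/False
        self.journal = []          # persisted state; a real SEC needs durable storage

    def run(self):
        for command, service in STEPS:
            ok = self.handlers[service](command)   # send the command, await the reply
            self.journal.append((command, "ok" if ok else "failed"))
            if not ok:
                return "compensating"              # would trigger compensations (omitted)
        return "shipped"

sec = SagaExecutionCoordinator({
    "inventory": lambda cmd: True,
    "payment":   lambda cmd: True,
    "shipment":  lambda cmd: True,
})
print(sec.run())        # shipped
print(sec.journal[-1])  # ('ship_item', 'ok')
```

Because every step and reply lands in the journal before the next command goes out, a replacement SEC instance could replay it and resume from the last known step, which is precisely what the choreographed approach cannot offer.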

The Pros:

  • A simple way to know the state of the process (check the SEC state)
  • Easy way to know who is involved in the execution of a process (check the SEC implementation)
  • Updates in the process require changes mainly in the SEC itself

The Cons:

  • The SEC can be a single point of failure (SPOF)
  • New concepts (SEC) to be understood and implemented
  • Potential bottleneck as the SEC decides the sequence of steps

The potential SPOF is very important to consider when deciding to use the orchestrated approach, as you have to make sure the SEC implementation is resilient. Making it scalable and efficient also deserves attention, to reduce the chances of it throttling your throughput or adding latency to your process.

Which One to Use?

Each implementation flavor comes with its own merits, and the literature, often in a contradictory way, only offers hints as to which one should be selected given your context. In practice, you will likely be mixing both.

The choreographed option usually has the lowest entry barrier, assuming you are already using an Event-Driven Architecture, as the main change is modeling the compensating actions for each non-happy path.

Since it does not provide an easy way to know the participants and the current execution state, I find it better suited when the number of services involved is small and they belong to the same domain. It also helps if the process involving those services does not change often.

The orchestrated approach has some challenges due to the implementation of the SEC and the need to overcome the potential problems of it becoming either a single point of failure or the bottleneck of the execution.

Because the SEC knows the state of the process execution and can decide what to do next, I find it better suited when the number of services is large, the services belong to different domains, or the process tends to change frequently. This visibility is very helpful, for example when you need to understand which services would have to change to satisfy a new use case, or when you need to debug the execution.

In either case, the suggestion is to keep the process small and if necessary, break it down into smaller sub-processes.

Figure 11. A bigger process is broken down into smaller ones.

To Be Continued…

Now that we have a common understanding of the problem we are facing and its possible solutions, the next article will dive deeper and focus on how to leverage AWS’ Step Functions to implement our Saga Execution Coordinator.

Editorial reviews by Deanna Chow, Liela Touré, and Pablo Martinez.
