Saga Pattern and Microservices architecture

Feb 11, 2018

Introduction

The Saga Pattern was originally published in 1987 as a solution for long-lived transactions in database systems. At the core of the idea lies the assumption that a typical rollback is not possible for complex, distributed transactions. The reason is that we can’t lock entities for the whole transaction lifecycle, which means some artifacts will be visible to other threads before the Saga finishes. In such a case, instead of doing a rollback, the Saga Pattern requires that it be possible to semantically go back to the state from before the transaction was executed. A Saga itself is a set of actions that have to be executed in a predefined order. For each action we must specify a compensation action that semantically restores the state from before the action was executed.

Microservices are a relatively new concept that has been around for less than 10 years but has already had a big influence on the way we produce software. Mature projects based on this approach can run on a network of hundreds of microservices. One microservice can have multiple dependencies, which can be very challenging to maintain. Such a network of dependencies looks very similar to a problem we had in the 1970s that was called ‘spaghetti code’. Touching a single microservice can have an unpredictable impact on the whole network if not done carefully. To solve the ‘spaghetti code’ problem we introduced sophisticated structures and methodologies that helped us put some constraints on the code. Now we need to do the same for microservices. Surprisingly, an answer to this problem can be found in the past: the Saga Pattern, which was designed about 30 years ago.

Before we try to connect those two concepts and work out a solution, let’s take a closer look at a typical microservices architecture.

Microservices architecture

Let’s consider a hypothetical microservices architecture that contains 16 loosely coupled microservices (named A to P) that connect to each other. On this hypothetical architecture we define three different requests called ‘Red’, ‘Blue’ and ‘Green’, and each request involves a different set of microservices to get fulfilled. To simplify the flow, we assume that each request can be represented as a flow between microservices where the output of one microservice is the input for the next.

With such a connection map visualized, we know exactly all the dependencies between microservices. In this example it was pretty easy to build. In the real world we would face hundreds of microservices with thousands of connections between them. Still, we can divide microservices into two groups: the first group is used only by a single path (like microservice A), whereas the second group contains microservices involved in multiple paths (like microservice H). Usually the second group is much bigger.
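The split into the two groups can be computed directly from the connection map. Below is a minimal Python sketch of that idea. Only the ‘Red’ path follows the example used in this article; the ‘Blue’ and ‘Green’ paths and the function name `group_by_usage` are made up for illustration.

```python
from collections import defaultdict

# Request paths through the network. 'Red' follows the article's example;
# 'Blue' and 'Green' are hypothetical paths invented for this sketch.
PATHS = {
    "Red": ["A", "G", "H", "K", "I", "C"],
    "Blue": ["B", "F", "H", "L", "P"],
    "Green": ["E", "J", "K", "O", "N"],
}

def group_by_usage(paths):
    """Split microservices into those used by one path and those shared by several."""
    used_by = defaultdict(set)
    for path_name, services in paths.items():
        for service in services:
            used_by[service].add(path_name)
    single = {s for s, users in used_by.items() if len(users) == 1}
    shared = {s for s, users in used_by.items() if len(users) > 1}
    return single, shared

single, shared = group_by_usage(PATHS)
print("single-path:", sorted(single))
print("shared:", sorted(shared))
```

With these assumed paths, A ends up in the single-path group while H and K are shared, so an update to H or K has to be checked against every path that uses it.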

Updating microservices from the first group is relatively easy and affects only one path. As for the second group, imagine that one of those services has to be updated because one of the paths requires new functionality: such an update also affects all the other paths!

Before we try a different model, let’s agree on a slightly different naming convention. From now on, we are going to call each microservice a statement, which is considered atomic, indivisible and meant to execute a single command. On the higher level, let’s consider a request flow an action that connects multiple statements in a predefined order and represents a measurable value for the user. An action is considered successful only if all of its statements finished without errors.

For example, action ‘Red’ is defined as ‘A, G, H, K, I, C’, which means that those statements must run in exactly that order. Only if all of them finish without errors is the action considered successful.
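To make the definition concrete, here is a minimal Python sketch of an action as an ordered chain of statements, where the output of one statement is the input of the next. The statements are hypothetical stand-ins that only tag the payload with their name; a real statement would call a microservice.

```python
from typing import Callable, Dict, List

# A statement takes the previous statement's output and returns its own output.
Statement = Callable[[str], str]

def make_statement(name: str) -> Statement:
    """Hypothetical stand-in for a microservice call: it only tags the payload."""
    def statement(payload: str) -> str:
        return f"{payload}>{name}"
    return statement

STATEMENTS: Dict[str, Statement] = {name: make_statement(name) for name in "AGHKIC"}

def run_action(order: List[str], payload: str) -> str:
    """Run statements strictly in the given order; any exception fails the whole action."""
    for name in order:
        payload = STATEMENTS[name](payload)
    return payload

RED = ["A", "G", "H", "K", "I", "C"]
result = run_action(RED, "request")
print(result)  # request>A>G>H>K>I>C
```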

At this point we could consider actions as ingredients for a distributed transaction and try to implement the Saga Pattern based on the above assumptions, but before we do that let’s try to organize things a little bit. First of all, we are going to try to get rid of the dependencies between the statements. The easiest way to do that would be to duplicate those statements that are used by more than one action. This way we would have fully independent actions.

The drawback of this approach is that we would have to duplicate some of the microservices, which in most cases is not a good idea.

Action Polls

Until now we were looking at actions and statements on a very low level. Although a statement could be considered a function, actions are more complex and can be described, for example, as “Persist data in DB”, “Send notification to the user” or “Transform file in schema A into file in schema B”. We can imagine that actions persisting data in a DB and sending a notification to the user don’t have much in common and are probably using completely separate statements. On the other hand, actions described as “Serialize data and send using TCP” or “Serialize data and save on HDD” have much in common and may share some of the statements. With this observation we can try to form a new concept called action polls.

An action poll groups actions organized around a common topic. Inside a poll actions may share statements, but an action is not allowed to share a statement with an action from a different poll. To minimize the number of dependencies, an action poll should be relatively small. Communication inside the poll is not part of the concept: statements can communicate directly through a REST API or indirectly through a message engine (like Kafka).
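The rule that statements may be shared inside a poll but never across polls is easy to check mechanically. The following Python sketch shows one way to do it; the poll names, action names and statement names are all hypothetical and only serve the illustration.

```python
from itertools import combinations

# Hypothetical polls: poll name -> action name -> statements used by that action.
POLLS = {
    "serialization": {
        "send_over_tcp": ["serialize", "open_socket", "send"],
        "save_on_disk": ["serialize", "write_file"],  # sharing inside a poll is fine
    },
    "notification": {
        "notify_user": ["render_message", "send_email"],
    },
}

def statements_of(poll):
    """Collect every statement used by any action of a single poll."""
    return {s for statements in poll.values() for s in statements}

def cross_poll_violations(polls):
    """Return (poll_a, poll_b, shared_statements) for every pair of polls sharing statements."""
    violations = []
    for (name_a, poll_a), (name_b, poll_b) in combinations(polls.items(), 2):
        shared = statements_of(poll_a) & statements_of(poll_b)
        if shared:
            violations.append((name_a, name_b, shared))
    return violations

print(cross_poll_violations(POLLS))  # []
```

A check like this could run whenever a new action is registered, so that an accidental dependency between polls is caught early.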

Although the concept of action polls is interesting, the actions themselves are atomic and don’t bring real functionality to the users. To make them usable we need to be able to chain them into more complex processes. Communication between polls is a problem similar to communication between statements inside a poll and may have a similar solution. Personally, I recommend using a message engine as the implementation of the integration bus in that case.

Saga Pattern

Now that we have action polls, let’s try to design something more complex: a request that requires invoking multiple actions, possibly from different polls. The request is considered fulfilled only if all of its actions finished successfully. In case of any failure our solution should either roll back to the previous state or redo the work, depending on the strategy.

Before we go further, let’s take a closer look at failures. When an action is invoked it should go through a predefined flow and send back the expected response; any deviation from that is considered a failure. We could design our actions in a way that detects at least some of those failures and reacts accordingly, but such an approach makes our actions far more complex. Predicting all possible errors and designing remediations for them is a cumbersome process. Instead of avoiding failures, let’s try to embrace them and build them into our solution. As we’ll see in the next paragraphs, the Saga Pattern is the answer to such problems too.

The first thing we notice when we look at the diagram above is its simplicity. It describes a generic atomic action that invokes a process based on a given input. If the request was transformed into a response and sent back, it’s considered a success. In case of any failure during request transformation, the action reports an error and stops processing the request. From now on we are going to consider this the high-level description of atomic actions inside the action polls.

Before we try to form a Saga based on the atomic actions we defined, we need to introduce a new concept: the compensation action. The definition is quite simple: if we assume that invoking an action creates some side effects, then invoking its compensation action removes those side effects. Invoking a compensation action isn’t equivalent to a typical rollback, because before it was run the side effects existed and were visible in the system, and given the nature of the system we design such behavior is inevitable. A compensation action can’t prevent those effects from existing, but it can semantically bring back the state of the system from before the original action was executed.

To manage a Saga through its lifecycle we need a new component called the Saga Execution Component (SEC), which is responsible for running actions based on the defined Saga, monitoring those actions for failures and triggering compensation actions if necessary.

The Saga Execution Component, the action and the compensation action are the three most important entities in the Saga Pattern, and having them we can try to build a simple Saga as a set of actions (T1, T2, T3, T4, T5) and compensation actions (C1, C2, C3, C4, C5). We consider the Saga completed if all five actions finished successfully. If any of the actions fails to execute, compensation actions are fired in the opposite order, starting from the failed one. For example, let’s assume that action T3 fails; then, in order to bring back the state from before the Saga was executed, the SEC fires compensation actions C3, C2 and C1, in that order. Such an approach, where we execute all compensation actions, is called backward recovery.
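The backward recovery described above can be sketched in a few lines of Python. This is a minimal illustration, not a production SEC: the names `Step`, `make_step` and `run_saga` are assumptions of this sketch, the actions do nothing except optionally raise an error, and the compensations are empty placeholders.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Step:
    name: str                         # e.g. "T3"
    compensation_name: str            # e.g. "C3"
    action: Callable[[], None]
    compensation: Callable[[], None]

def make_step(n: int, fail: bool = False) -> Step:
    """Hypothetical step factory: the action optionally raises to simulate a failure."""
    def action() -> None:
        if fail:
            raise RuntimeError(f"T{n} failed")
    def compensation() -> None:
        pass  # a real compensation would semantically undo the action's side effects
    return Step(f"T{n}", f"C{n}", action, compensation)

def run_saga(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run actions in order; on failure compensate in reverse, starting from the failed step."""
    log: List[str] = []
    for index, step in enumerate(steps):
        try:
            step.action()
            log.append(step.name)
        except Exception:
            log.append(f"{step.name} failed")
            for done in reversed(steps[: index + 1]):
                done.compensation()
                log.append(done.compensation_name)
            return False, log
    return True, log

saga = [make_step(n, fail=(n == 3)) for n in range(1, 6)]
completed, log = run_saga(saga)
print(completed, log)
# False ['T1', 'T2', 'T3 failed', 'C3', 'C2', 'C1']
```

Note that the failed action itself is compensated too (C3), because a failing action may already have created part of its side effects.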

In the above example we assumed that any of the actions can fail, but we didn’t say anything about failures of compensation actions. Since both types of actions are defined in the same way and are part of the same action polls, compensation actions can fail as well. Another question is whether all actions require compensation actions. To make the design more bulletproof, let’s require that every action that creates side effects has a compensation action and that every compensation action is idempotent, which means that compensation actions can be safely run multiple times.
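Idempotency is what makes it safe to simply retry a compensation action that failed. The Python sketch below illustrates this with a made-up compensation, `cancel_reservation`, that fails on its first call after it has already removed the side effect (for example because the acknowledgement got lost). The retry helper and all names are assumptions of this sketch.

```python
reservations = {"order-42"}   # side effect created earlier by the original action
attempts = {"count": 0}

def cancel_reservation(order_id: str) -> None:
    """Idempotent compensation: removing an already removed reservation is a no-op."""
    attempts["count"] += 1
    reservations.discard(order_id)
    if attempts["count"] == 1:
        # Simulated failure that happens after the side effect was already removed.
        raise ConnectionError("acknowledgement lost")

def compensate_with_retry(compensation, arg, max_attempts: int = 3) -> int:
    """Retry a compensation until it succeeds; return the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        try:
            compensation(arg)
            return attempt
        except Exception:
            continue
    raise RuntimeError("compensation did not succeed")

used = compensate_with_retry(cancel_reservation, "order-42")
print(used, reservations)  # 2 set()
```

Because the second call finds nothing left to remove and still succeeds, the retry does no harm. A compensation that was not idempotent (for example one that decrements a counter) could corrupt the state when retried.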

After a failure the SEC restores the previous state of the system, and we would probably want to try to execute the Saga again. For small Sagas this may be a good approach, but in most cases it’s a waste of resources. Instead of compensating all successfully executed actions, let’s introduce the concept of a save point: a point in the Saga at which the compensation process stops and the SEC tries to invoke the following actions again. Let’s change our example of a Saga with five actions by putting a save point (P1) after action T2, and this time assume that action T4 failed.

T1, T2, P1, T3, T4, T5

Once the SEC detects a failure of action T4 it executes compensation actions C4 and C3. Then the SEC notices the save point P1 and stops the compensation process. Once the compensation process is stopped, the SEC executes the actions again in the given order, starting from action T3. Such an approach is called backward/forward recovery.
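The backward/forward recovery described above can be sketched as follows. This is a simplified Python illustration under several assumptions of its own: save points are plain strings in the plan, a step fails a fixed number of times and then succeeds, and when the retry limit (`max_retries`, a made-up parameter) is exhausted the sketch falls back to full backward recovery.

```python
class Step:
    def __init__(self, n, failures=0):
        self.name = f"T{n}"
        self.comp_name = f"C{n}"
        self.failures_left = failures  # hypothetical: fail this many times, then succeed

    def action(self):
        if self.failures_left > 0:
            self.failures_left -= 1
            raise RuntimeError(f"{self.name} failed")

    def compensation(self):
        pass  # a real compensation would undo the action's side effects

def run_saga(plan, max_retries=3):
    """Run a plan of Steps and save points (strings); return (completed, log)."""
    log, retries, i = [], 0, 0
    while i < len(plan):
        item = plan[i]
        if isinstance(item, str):      # save point marker, nothing to execute
            log.append(item)
            i += 1
            continue
        try:
            item.action()
            log.append(item.name)
            i += 1
        except Exception:
            log.append(f"{item.name} failed")
            give_up = retries >= max_retries
            j = i
            while j >= 0:              # walk backwards from the failed step
                if isinstance(plan[j], str):
                    if not give_up:
                        break          # stop compensating at the save point
                else:
                    plan[j].compensation()
                    log.append(plan[j].comp_name)
                j -= 1
            if j < 0:                  # compensated everything: backward recovery
                return False, log
            retries += 1
            i = j + 1                  # resume with the action after the save point
    return True, log

plan = [Step(1), Step(2), "P1", Step(3), Step(4, failures=1), Step(5)]
completed, log = run_saga(plan)
print(completed, log)
# True ['T1', 'T2', 'P1', 'T3', 'T4 failed', 'C4', 'C3', 'T3', 'T4', 'T5']
```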

If we define a save point after each action, or build a Saga only from idempotent actions that can be safely rerun in case of failure, we have pure forward recovery.

Saga log as a system health check

Although we defined a solution that can react to any failure and automatically apply a remediation action, we still need a way to monitor the state of the system. Monitoring every statement would be a start, but it’s a very low-level approach. Since we operate on a much higher level of abstraction (the Saga), monitoring can be done on the same level of abstraction. Until now we didn’t mention any logging capability, which in fact is part of the original Saga publication. Every Saga creates a log that contains every step taken during the Saga execution, from its beginning until its end. The log contains those steps (actions, save points, compensation actions) in the order they were taken.

Looking from the top level, we can monitor the success rate of each Saga and detect broken Sagas when the failure rate reaches an arbitrarily defined threshold. Going to a lower level, we can monitor the success rate of each statement to detect unstable statements.
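As a rough illustration of both monitoring levels, the Python sketch below computes failure rates per Saga and counts failures per statement from a list of log records. The record format, the Saga names and the threshold value are all assumptions made for this example; the original Saga log is a list of steps, which would first have to be summarized into records like these.

```python
from collections import Counter

# Hypothetical summarized log records:
# (saga name, outcome, name of the failed statement or None).
RECORDS = [
    ("checkout", "completed", None),
    ("checkout", "compensated", "H"),
    ("checkout", "compensated", "H"),
    ("checkout", "completed", None),
    ("signup", "completed", None),
    ("signup", "completed", None),
]

def failure_rates(records):
    """Failure rate per saga: share of executions that did not complete."""
    totals, failures = Counter(), Counter()
    for saga, outcome, _ in records:
        totals[saga] += 1
        if outcome != "completed":
            failures[saga] += 1
    return {saga: failures[saga] / totals[saga] for saga in totals}

def broken_sagas(records, threshold=0.25):
    """Sagas whose failure rate exceeds the (arbitrarily chosen) threshold."""
    return sorted(s for s, rate in failure_rates(records).items() if rate > threshold)

def unstable_statements(records):
    """Lower-level view: how often each statement caused a failure."""
    return Counter(stmt for _, _, stmt in records if stmt is not None)

print(failure_rates(RECORDS))        # {'checkout': 0.5, 'signup': 0.0}
print(broken_sagas(RECORDS))         # ['checkout']
print(unstable_statements(RECORDS))  # Counter({'H': 2})
```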

Summary

I suspect the Saga Pattern may not be the cure in every case, and there are systems where this pattern may be very hard to implement. After all, it was designed in the previous century! In my experience it helped me to organize and clean up the processes I had to design on our platform. When you code microservices you in fact code for errors. Before we considered Saga, we tried to create bulletproof services, which were anything but micro. Implementing error handlers for endless error possibilities brought unnecessary complexity into the services and slowed down the development process.

Once we moved error handling outside the services themselves and let the Saga handle the errors, the services became lightweight again. If we assume that each process can fail, let it fail instead of trying to predict every failure. Such an approach shortens the development process because developers don’t have to design complicated test cases.

Working on the Saga Pattern deeply changed the way I look at the development process. Solutions brought by the DevOps movement, like canary deployment, change the way we think about software delivery. Although deployment and testing strategies are outside the scope of this publication, all those topics are tightly connected and influence each other.