Keep consistent state across micro-services with the Saga pattern

Aymeric Bouzy · Tech Cubyn · Aug 26, 2021



When developers who have experience with a monolith architecture discover micro-services, one of the first doubts that comes to their mind is “how are we going to keep our data consistent?”

Think about it: in a monolith stack it’s quite simple, since you can rely on the ACID properties of your database. Even if you need to update 10 different things in your database at once to keep a consistent state, you can simply wrap them in a transaction, and you’re done!
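For instance, with node-postgres and some hypothetical tables for the example that follows, the whole operation fits in a single transaction:

```typescript
import { Pool } from "pg";

const pool = new Pool();

// In a monolith, one ACID transaction (table names are hypothetical)
// keeps all three updates consistent as a single unit
async function makeReservation(resourceId: string, userId: string) {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    await client.query("UPDATE resources SET locked_by = $1 WHERE id = $2", [userId, resourceId]);
    await client.query("INSERT INTO payments (user_id, amount_cents) VALUES ($1, $2)", [userId, 1000]);
    await client.query("INSERT INTO billing_items (user_id, label) VALUES ($1, $2)", [userId, "reservation"]);
    await client.query("COMMIT"); // either all three updates happen...
  } catch (err) {
    await client.query("ROLLBACK"); // ...or none of them does
    throw err;
  } finally {
    client.release();
  }
}
```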

With micro-services, it gets a lot trickier, since each service has its own database. Take for example the following story:

When I make a reservation, the resource is locked, my payment method is used and a new billing item is added to my next bill.

If each of these 3 actions is handled by a separate micro-service, what do you do if any of the three fails?

The naive approach if you don’t plan for failure is to simply do all three actions in sequence. But if the payment method fails, the resource will be locked forever 😕

So maybe we can add a specific case for that? Or a cron that detects such locked resources and releases them? Or maybe we can keep the transaction open on the first service, and commit it only once the whole distributed transaction succeeds? What is the best approach?

If the payment method succeeds, but then adding the new billing item fails, it’s yet another case to cover 😞

The pattern

The general advice to deal with these issues is the saga pattern. In simple terms: if something goes wrong at some point in the sequence of actions, you run compensating actions in reverse order.

For instance, if locking the resource succeeds and then the payment fails, you should unlock the resource.

If locking the resource succeeds and the payment succeeds, but then updating the bill fails, you should first cancel the payment, and then unlock the resource.
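To make the sequence concrete, here is what handling every failure case by hand could look like. This is only a sketch: the service calls are hypothetical stand-ins for requests to the three micro-services.

```typescript
// Hypothetical clients for the three services involved
declare function lockResource(resourceId: string): Promise<void>;
declare function unlockResource(resourceId: string): Promise<void>;
declare function processPayment(userId: string): Promise<{ id: string }>;
declare function cancelPayment(paymentId: string): Promise<void>;
declare function addBillingItem(userId: string, paymentId: string): Promise<void>;

async function reserve(resourceId: string, userId: string) {
  await lockResource(resourceId);
  try {
    const payment = await processPayment(userId);
    try {
      await addBillingItem(userId, payment.id);
    } catch (err) {
      await cancelPayment(payment.id); // compensate the payment first...
      throw err;
    }
  } catch (err) {
    await unlockResource(resourceId); // ...then release the resource
    throw err;
  }
}
```

Each additional step adds another nesting level and another failure case to keep in mind.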

Does that mean that clients can still see an inconsistent state across services for a short amount of time?

Yes.

In practice, it’s very often not a big issue. It’s called “eventual consistency”, and it’s perfectly accepted by most humans 😁

Let’s take our example of resource reservation: what is the worst that could happen? If a concurrent request asks to reserve the same resource during the brief window when it was locked, the request will be rejected. It’s a bit annoying, because if it had come a bit earlier, or a bit later, it would have succeeded 😞

But there is a good chance that you will either publish an event about the resource being available again, or have some polling mechanism to retry (or you rely on your user to retry a bit later 😁). Since you need to have this anyway, it’s not a big problem.

This is going to be annoying to implement, no?

At Cubyn, we have 80 micro-services, and we’ve run into countless situations where sagas have helped us keep consistent state across them.

We’ve found this pattern really simple to read:
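As a rough sketch of the idea (not our exact code), reusing the same hypothetical service calls as above:

```typescript
declare function lockResource(resourceId: string): Promise<void>;
declare function unlockResource(resourceId: string): Promise<void>;
declare function processPayment(userId: string): Promise<{ id: string }>;
declare function cancelPayment(paymentId: string): Promise<void>;
declare function addBillingItem(userId: string, paymentId: string): Promise<void>;

async function reserve(resourceId: string, userId: string) {
  const compensations: Array<() => Promise<void>> = [];
  try {
    await lockResource(resourceId);
    compensations.push(() => unlockResource(resourceId));

    const payment = await processPayment(userId);
    compensations.push(() => cancelPayment(payment.id));

    await addBillingItem(userId, payment.id);
  } catch (err) {
    // Undo everything that succeeded, most recent action first
    for (const compensate of compensations.reverse()) {
      await compensate();
    }
    throw err;
  }
}
```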

By registering a compensating action after each successful action, it’s really hard to get it wrong 😅

Previously, we had a very imperative way of writing which compensating actions should run and in which order, much like the nested version sketched earlier, and it was very error-prone. It got even more complex with sagas that have optional steps, for instance! Having this array of closures that is run declaratively in reverse order makes the pattern very easy to implement.

Why do the compensating actions in reverse order?

If you don’t run the compensating actions in a specific order, your system can react in an unpredictable manner, because the other services might observe combinations of state that never occur during a normal run.

By doing the actions in reverse order, you generally run into far fewer issues, because every intermediate state is one that can also be observed while the saga is still running forwards.

[Diagram: these two temporary intermediate states are indistinguishable]

Surely you don’t want to cancel the payment if inserting the billing item fails; you can probably fix that billing issue later

Yes! That’s a very good point. Some inconsistent states are acceptable if they don’t last forever, and our users prefer them to reverting everything. Making that call is up to you; it really depends on your business.

Sagas have the concept of a pivot transaction, the step from which there is no going back. In this case, we could consider processing the user’s payment method to be the pivot transaction: from then onwards, we simply have to take the update into account, as we cannot undo it anymore.

Typically, we implement this at Cubyn by publishing a topic (a message), with listeners for that topic in the various services that need to update themselves but have no business reason to fail. Since we have automatic retries (read our previous article!), we know we’ll ultimately reach a consistent state.
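Sketched with a hypothetical message bus (retries are assumed to be handled by the bus), the post-pivot flow reduces to publishing a topic and letting each listener catch up:

```typescript
// Hypothetical message bus, reduced to a publish/subscribe interface
declare const bus: {
  publish(topic: string, payload: unknown): Promise<void>;
  subscribe(topic: string, handler: (payload: any) => Promise<void>): void;
};
declare function addBillingItem(userId: string, paymentId: string): Promise<void>;

// In the orchestrating service, once the pivot transaction has succeeded
async function afterPaymentProcessed(userId: string, paymentId: string) {
  await bus.publish("reservation.payment.processed", { userId, paymentId });
}

// In the billing service: no business reason to fail, so a crash or a
// transient error is simply retried until the bill is updated
bus.subscribe("reservation.payment.processed", async ({ userId, paymentId }) => {
  await addBillingItem(userId, paymentId);
});
```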

If all the different steps have no business reason to fail, you might even consider not using a saga at all, and simply publishing a topic for all services to update themselves!

In our previous example, that’s not the case: processing the payment method can fail, for instance if the client no longer has enough credit.

We could consider doing the first 2 steps in the opposite order: first process the payment, then lock the resource.

The issue with this ordering is that cancelling a payment is more costly than unlocking a resource. So we prefer to start with the action that is cheapest to undo, and keep the most costly one for later.

[Diagram: processing the payment first means we would have to cancel payments very often]

I heard there are 2 types of sagas, “orchestration” and “choreography”?

Yes indeed! The implementation suggested above is an orchestration: one service drives the saga, and the other services don’t even necessarily know that what they are currently doing is part of a saga.

[Diagram: a more accurate representation of the messages exchanged between services in an orchestrated saga]

With choreography, each action publishes an event after having run, which triggers the next action in turn.

[Diagram: in a choreographed saga, all services know they are part of the saga]
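In code, a choreographed version might look like this sketch, where even compensations travel as events (same hypothetical bus and service calls as in the earlier sketches):

```typescript
declare const bus: {
  publish(topic: string, payload: unknown): Promise<void>;
  subscribe(topic: string, handler: (payload: any) => Promise<void>): void;
};
declare function lockResource(resourceId: string): Promise<void>;
declare function unlockResource(resourceId: string): Promise<void>;
declare function processPayment(userId: string): Promise<{ id: string }>;

// Resource service: reacts to the request, emits the next event
bus.subscribe("reservation.requested", async ({ resourceId, userId }) => {
  await lockResource(resourceId);
  await bus.publish("resource.locked", { resourceId, userId });
});

// Payment service: reacts to the previous event in the chain
bus.subscribe("resource.locked", async ({ resourceId, userId }) => {
  try {
    const payment = await processPayment(userId);
    await bus.publish("payment.processed", { resourceId, userId, paymentId: payment.id });
  } catch (err) {
    // The compensation is also an event: ask the resource service to undo
    await bus.publish("payment.failed", { resourceId });
  }
});

// Resource service, compensating
bus.subscribe("payment.failed", async ({ resourceId }) => {
  await unlockResource(resourceId);
});
```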

This approach is very resilient to failure, because the saga will always eventually finish, once all the events have been processed.

That resilience highlights a big drawback of the orchestration-based implementation suggested above: if the pod orchestrating the saga is killed, the saga is interrupted midway, and we stay in an inconsistent state.

This drawback hasn’t been an issue for us so far, but we definitely keep in mind that we’ll have to fix it some day.

The main reason we are turning away from choreography for now is that it’s quite hard to grasp the full sequence from reading the code, whereas it can be made very explicit with the orchestration pattern. It’s also quite difficult to evolve a saga implemented via choreography: if you want to swap the order of 2 steps of the saga, it’s quite obvious how to do it with no breaking change and no downtime with the orchestration pattern, and much less so with choreography.

One area of improvement would be to design a way to get the best of both worlds: a centralised orchestration, but entirely event-driven, so that any pod can continue it from an intermediate state.

Maybe it could look something like this?
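Here is one possible sketch, with everything hypothetical: a message bus, a store persisting the saga’s progress, and the saga declared statically as an ordered list of steps.

```typescript
type StepFn = (data: Record<string, unknown>) => Promise<void>;

declare const bus: {
  publish(topic: string, payload: unknown): Promise<void>;
  subscribe(topic: string, handler: (payload: any) => Promise<void>): void;
};
declare const store: {
  load(sagaId: string): Promise<{ step: number; data: Record<string, unknown> }>;
  save(sagaId: string, state: { step: number; data: Record<string, unknown> }): Promise<void>;
};
// The saga is a static, ordered list of steps
declare const reservationSagaSteps: StepFn[];

// Any pod receiving this event can continue the saga from where it stopped
bus.subscribe("saga.step.completed", async ({ sagaId }) => {
  const state = await store.load(sagaId);
  const nextStep = reservationSagaSteps[state.step];
  if (!nextStep) return; // the saga is finished

  await nextStep(state.data);
  await store.save(sagaId, { ...state, step: state.step + 1 });
  await bus.publish("saga.step.completed", { sagaId });
  // (failure handling and compensation events are left out for brevity)
});
```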

It would work in this simple case, but our codebase has sagas where the next steps depend on the results of the previous ones, so static code like this probably wouldn’t be enough. Our next challenge?

We’re currently looking for 30 more engineers to join our team. We have a lot of very challenging projects on the roadmap, so please consider sending us your application!
