Introducing Sage — a Sagas pattern implementation in Elixir

Distributed transactions are hard and expensive. If you wonder how to handle them pragmatically in a mid-size project, this article is for you. We will discuss how to use the Sagas pattern to run a distributed transaction from Elixir, with examples that leverage the Sage package. As a bonus, you will see how to use Sagas to organize your domain contexts.

What problem are we trying to solve?

Most projects I’ve built are integrated with external systems. That’s what modern development looks like: you implement your domain logic and outsource the rest to SaaS services, because nobody likes to reinvent the wheel.

Good examples of those services are payment processors and CRMs. Microservices are another one. One can argue that every time you perform more than one state change that is not covered by a single ACID database, you run a distributed transaction. And we are not talking only about large projects that are distributed because of scale, or about research projects; it’s pretty much any small- or mid-size project that uses Stripe (or anything else) to outsource the billing system.

Sage itself was built while I was integrating Stripe with one of our projects. With Stripe you create a customer first and then create a subscription for that customer. But when we fail to create the subscription, we should not keep that customer in Stripe and need to delete it to get rid of the side effects.

Booking website example

To make the problem more approachable, imagine we are building a trip booking website that charges the customer only once, when the request is fulfilled.

Here is the happy-case code that leverages the with expression syntax:

Happy-case code for our trip booking website
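
A minimal sketch of what such happy-case code could look like. The module names, the stubbed book_hotel/1 and charge_card/2 functions and the shape of attrs are all hypothetical stand-ins for real integrations:

```elixir
defmodule BookingApp.Stubs do
  # Hypothetical stand-ins for real integrations; each call always succeeds.
  def book_hotel(attrs), do: {:ok, %{id: "booking-1", hotel: attrs.hotel}}
  def charge_card(_card, amount), do: {:ok, %{id: "charge-1", amount: amount}}
end

defmodule BookingApp.HappyCase do
  import BookingApp.Stubs

  # Book first, charge second; any {:error, reason} falls through unchanged.
  def book_trip(attrs) do
    with {:ok, booking} <- book_hotel(attrs),
         {:ok, charge} <- charge_card(attrs.card, attrs.amount) do
      {:ok, %{booking: booking, charge: charge}}
    end
  end
end
```

Note that nothing here cleans up the booking when the charge fails, which is exactly the gap the next requirement exposes.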

Another requirement is that we should not hold any bookings if we fail to charge the card; otherwise it would be bad business for us, because we would still pay for those bookings. So our code should be extended to handle that failure:

Example where we use named stages to catch where we got an error

Here we used a simple trick: we wrapped the charge call in a tuple that holds the stage name and the list of side effects we must take care of if that stage fails.
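
The trick can be sketched roughly like this. The NamedStages module and its stubs are hypothetical; the charge is hard-coded to fail so that the cleanup branch runs, and cancel_booking/1 just sends a message to the caller so the effect is observable:

```elixir
defmodule NamedStages do
  # Hypothetical stubs: the booking succeeds, the charge is declined.
  def book_hotel(_attrs), do: {:ok, %{id: "booking-1"}}
  def charge_card(_card, _amount), do: {:error, :card_declined}
  def cancel_booking(booking), do: send(self(), {:cancelled, booking.id})

  def book_trip(attrs) do
    with {:ok, booking} <- book_hotel(attrs),
         # Tag the result with the stage name and the effects created so far.
         {:charge, _effects, {:ok, charge}} <-
           {:charge, [booking], charge_card(attrs.card, attrs.amount)} do
      {:ok, %{booking: booking, charge: charge}}
    else
      {:charge, effects, {:error, reason}} ->
        # The charge failed: cancel everything we booked before it.
        Enum.each(effects, &cancel_booking/1)
        {:error, reason}

      {:error, reason} ->
        # The booking itself failed, so there is nothing to clean up.
        {:error, reason}
    end
  end
end
```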

So when the charge fails, the booking is cancelled. But how would it look if we wanted to book a car, a flight and a hotel within the same trip? If one of the bookings fails, we need to cancel the other ones, and if we fail to charge the card, we should cancel all of them:

Now we are creating multiple bookings within the same trip
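
A rough sketch of how this grows, again with hypothetical stubs. Here the flight booking is hard-coded to fail, so the car and the hotel have to be cancelled by hand:

```elixir
defmodule MultiBooking do
  # Hypothetical stubs; the flight booking fails to show the cleanup path.
  def book_car(_attrs), do: {:ok, %{type: :car, id: "car-1"}}
  def book_hotel(_attrs), do: {:ok, %{type: :hotel, id: "hotel-1"}}
  def book_flight(_attrs), do: {:error, :no_seats}
  def charge_card(_card, _amount), do: {:ok, %{id: "charge-1"}}
  def cancel(booking), do: send(self(), {:cancelled, booking.type})

  def book_trip(attrs) do
    # Every stage carries its name and the effects created before it.
    with {:car, _, {:ok, car}} <- {:car, [], book_car(attrs)},
         {:hotel, _, {:ok, hotel}} <- {:hotel, [car], book_hotel(attrs)},
         {:flight, _, {:ok, flight}} <- {:flight, [hotel, car], book_flight(attrs)},
         {:charge, _, {:ok, charge}} <-
           {:charge, [flight, hotel, car], charge_card(attrs.card, attrs.amount)} do
      {:ok, %{bookings: [car, hotel, flight], charge: charge}}
    else
      {_stage, effects, {:error, reason}} ->
        # Cancel every booking made before the failing stage, newest first.
        Enum.each(effects, &cancel/1)
        {:error, reason}
    end
  end
end
```

Each new stage means another hand-maintained list of effects, and it is easy to get one of them wrong.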

So we get more named stages and collect side effects manually, which makes error handling large and error-prone. And this example doesn’t even handle scenarios where we made a successful booking request but did not receive the response (e.g. because of a timeout), so we hold the booking without knowing about it.

To handle this edge case we can’t treat each booking as a single stage; we must split them and duplicate the error handling.

A common question here is how we can delete something if we got a timeout. One of the most common ways is to search for the entity by the data we have and then delete it. This is simpler when the service you are using allows searching on metadata: we can generate a transaction ID and attach it to the created entities to make the lookup easier.
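
The lookup-by-metadata idea can be sketched with an in-memory fake. FakeBookingService, its functions and the lose_response flag are invented for this illustration; a real service would expose its own search API, and a real transaction ID would more likely be a UUID:

```elixir
defmodule FakeBookingService do
  # In-memory stand-in for a remote API that lets us search by metadata.
  use Agent

  def start_link(_opts \\ []), do: Agent.start_link(fn -> [] end, name: __MODULE__)

  # Always stores the booking; optionally pretends the response was lost.
  def create(attrs, metadata) do
    booking = %{attrs: attrs, metadata: metadata}
    Agent.update(__MODULE__, &[booking | &1])
    if attrs[:lose_response], do: {:error, :timeout}, else: {:ok, booking}
  end

  def search(metadata) do
    Agent.get(__MODULE__, fn bookings ->
      Enum.filter(bookings, &(&1.metadata == metadata))
    end)
  end

  def delete(booking), do: Agent.update(__MODULE__, &List.delete(&1, booking))
  def count, do: Agent.get(__MODULE__, &length/1)
end

defmodule TimeoutCleanup do
  def book_with_cleanup(attrs) do
    # We generate the transaction ID ourselves, so we know it even
    # when the response never arrives.
    metadata = %{transaction_id: "txn-#{System.unique_integer([:positive])}"}

    case FakeBookingService.create(attrs, metadata) do
      {:error, :timeout} ->
        # Find whatever was created under our ID and remove it.
        metadata
        |> FakeBookingService.search()
        |> Enum.each(&FakeBookingService.delete/1)

        {:error, :timeout}

      other ->
        other
    end
  end
end
```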

This code looks too complex, doesn’t it? Now imagine how much bigger it would be if we wanted to explicitly release the authorization so that customers don’t have to wait for a timeout to get their money back.

I believe the code gets worse because we are dealing with a distributed transaction in an ad-hoc fashion. And those are not all the downsides; let’s name a few:

  • Duplication makes the code error-prone. We may refactor it, but it would still be easy to update the logic in one place and forget about the others;
  • We can’t book concurrently, which is bad for our latency;
  • The syntax for tracking the step on which the failure occurred is ugly;
  • To cover this code with tests you would need a lot of stubs that inject errors based on attribute values.

One of the solutions would be to use two-phase commits, but they don’t scale: in the best case O(2n) messages are spawned (and up to O(n²) with retries), and availability suffers because of the locks involved. More importantly, the vast majority of services simply don’t support them.

In my opinion, a good tool should not only address those issues, but take an additional step forward by giving you a new mental model. And if it makes the code better organized, even better.

What is a Saga?

Saga is a very simple failure management pattern that originates from a 1987 paper on long-running transactions for databases. Its original use case was implementing long-lived transactions without locking the database. Those transactions, as the name suggests, can take a while because they have to go through a large dataset, and they make the system unavailable due to the various locks that need to be placed, which is usually not desirable.

A long lived transaction is a Saga if it can be written as a sequence of transactions that can be interleaved with other transactions.
The database management system guarantees that either all the transactions in a Saga are successfully completed or compensating transactions are run to amend a partial execution.

What does that mean in practice? A saga is a distributed transaction that takes care of the overall consistency of a collection of steps, each of which internally performs an atomic transaction. Each step consists of a subtransaction and a compensation that amends its effects.

Compensations semantically undo the transaction’s effects. For example, if you sent an email confirmation you cannot “unsend” it; instead, you can send a follow-up email with an apology for the error.
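
To make the definition concrete, here is a tiny hand-rolled executor in plain Elixir. It is only an illustration of the pattern and not how Sage is implemented; the MiniSaga name and the {name, transaction, compensation} step shape are made up for this sketch:

```elixir
defmodule MiniSaga do
  # A step is {name, transaction, compensation} where
  #   transaction.(effects_so_far) returns {:ok, effect} or {:error, reason}
  #   compensation.(effect) semantically undoes that effect
  def execute(steps), do: run(steps, [])

  # All steps succeeded: return the effects in execution order.
  defp run([], done) do
    {:ok, done |> Enum.reverse() |> Enum.map(fn {name, effect, _undo} -> {name, effect} end)}
  end

  defp run([{name, transaction, compensation} | rest], done) do
    effects_so_far = Map.new(done, fn {n, e, _undo} -> {n, e} end)

    case transaction.(effects_so_far) do
      {:ok, effect} ->
        run(rest, [{name, effect, compensation} | done])

      {:error, reason} ->
        # `done` is newest-first, so compensations run in reverse order.
        Enum.each(done, fn {_name, effect, undo} -> undo.(effect) end)
        {:error, reason}
    end
  end
end

steps = [
  {:hotel, fn _effects -> {:ok, "hotel-1"} end, fn id -> send(self(), {:cancelled, id}) end},
  {:charge, fn _effects -> {:error, :card_declined} end, fn _charge -> :ok end}
]

# The charge fails, so the hotel booking is compensated.
{:error, :card_declined} = MiniSaga.execute(steps)
```

The failed step’s own compensation is not called in this sketch, because the step produced no effect to undo.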

Getting back to our booking website, here is a visualization showing how it would work with Sagas:

After an error occurs, compensations are run to amend the partial execution

While collecting feedback I received a very good quote from @jayjun in the #elixir-lang Slack channel:

It’s like Ecto.Multi but across business logic and third-party APIs.

What are the tradeoffs? With a Saga you trade atomicity for availability.

Introducing Sage

Sage is a dependency-free, pure Elixir library inspired by the Sagas pattern. It provides a set of additional features on top of the steps, transactions and compensations defined by the original paper. Here is how a Saga for our booking app might look:

Saga for booking site with Sage
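
A sketch of such a Saga, assuming the callback shapes used by recent Sage releases (0.6 here): transactions take (effects_so_far, attrs) and compensations take (effect_to_compensate, effects_so_far, attrs). Older releases may use different arities. The BookingApp.Saga module and its stubbed callbacks are hypothetical:

```elixir
# Assumption: the script has network access to fetch Sage from Hex.
Mix.install([{:sage, "~> 0.6"}])

defmodule BookingApp.Saga do
  import Sage

  def book_trip(attrs) do
    new()
    |> run(:car, &book_car/2, &cancel_booking/3)
    |> run(:hotel, &book_hotel/2, &cancel_booking/3)
    |> run(:flight, &book_flight/2, &cancel_booking/3)
    |> run(:charge, &charge_card/2, &refund/3)
    |> execute(attrs)
  end

  # Transactions return {:ok, effect} or {:error, reason}. Stubbed here.
  defp book_car(_effects, _attrs), do: {:ok, %{id: "car-1"}}
  defp book_hotel(_effects, _attrs), do: {:ok, %{id: "hotel-1"}}
  defp book_flight(_effects, _attrs), do: {:ok, %{id: "flight-1"}}
  defp charge_card(_effects, %{card: :declined}), do: {:error, :card_declined}
  defp charge_card(_effects, attrs), do: {:ok, %{id: "charge-1", amount: attrs.amount}}

  # Compensations return :ok once the effect is amended. Stubbed here.
  defp cancel_booking(_booking, _effects, _attrs), do: :ok
  defp refund(_charge, _effects, _attrs), do: :ok
end
```

On success, execute/2 returns the last effect together with a map of all effects keyed by stage name; when the charge is declined, the compensations run and the error is returned.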

Transactions and compensations should implement these callbacks:

Sage callbacks specs

Asynchronous transactions

Sage allows you to run stages asynchronously; the next synchronous stage will await their results. This is because we want synchronous operations to have access to the effects created by their predecessors, while asynchronous ones don’t have access to each other’s effects.

However, they do have access to the effects created by the previous synchronous operation and its predecessors.

Whenever there is an error in one of the async stages, compensations run sequentially to amend their executions.

Run three stages asynchronously
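
A sketch of three asynchronous bookings followed by a synchronous charge, assuming Sage 0.6 and its run_async function with default options. The module and the stubbed callbacks are hypothetical:

```elixir
# Assumption: the script has network access to fetch Sage from Hex.
Mix.install([{:sage, "~> 0.6"}])

defmodule BookingApp.AsyncSaga do
  import Sage

  def book_trip(attrs) do
    new()
    |> run_async(:car, &book_car/2, &cancel_booking/3)
    |> run_async(:hotel, &book_hotel/2, &cancel_booking/3)
    |> run_async(:flight, &book_flight/2, &cancel_booking/3)
    # Synchronous stage: Sage awaits the three bookings before running it.
    |> run(:charge, &charge_card/2, &refund/3)
    |> execute(attrs)
  end

  defp book_car(_effects, _attrs), do: {:ok, %{id: "car-1"}}
  defp book_hotel(_effects, _attrs), do: {:ok, %{id: "hotel-1"}}
  defp book_flight(_effects, _attrs), do: {:ok, %{id: "flight-1"}}

  # The charge can rely on all three bookings being present in the effects.
  defp charge_card(%{car: _, hotel: _, flight: _}, attrs) do
    {:ok, %{id: "charge-1", amount: attrs.amount}}
  end

  defp cancel_booking(_booking, _effects, _attrs), do: :ok
  defp refund(_charge, _effects, _attrs), do: :ok
end
```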

Retries

By default, transactions are executed at most once and compensations are executed at least once, but retries allow you to define save points and execute forward recovery for a limited number of attempts.

Retry execution from a savepoint

To implement a save point you write a compensation that still does its job and then tells Sage to apply forward recovery with a limited number of retries, optionally with exponential backoff and jitter:
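
A sketch of such a compensation, assuming Sage 0.6 and its {:retry, opts} return value with the retry_limit, base_backoff, max_backoff and enable_jitter options. The RetrySketch module and the flaky charge (which fails twice and then succeeds, counted in an Agent) are invented for the illustration:

```elixir
# Assumption: the script has network access to fetch Sage from Hex.
Mix.install([{:sage, "~> 0.6"}])

defmodule RetrySketch do
  import Sage

  def charge_with_retries(counter) do
    new()
    |> run(:charge, fn _effects, _attrs -> flaky_charge(counter) end, &retry_charge/3)
    |> execute(%{})
  end

  # Hypothetical flaky call: fails on the first two attempts, then succeeds.
  defp flaky_charge(counter) do
    case Agent.get_and_update(counter, fn n -> {n + 1, n + 1} end) do
      attempt when attempt < 3 -> {:error, :timeout}
      attempt -> {:ok, %{attempt: attempt}}
    end
  end

  # A real compensation would first amend the effect (e.g. void the charge),
  # then ask Sage to retry forward from this stage. Backoff is in milliseconds.
  defp retry_charge(_effect, _effects, _attrs) do
    {:retry, [retry_limit: 5, base_backoff: 10, max_backoff: 1_000, enable_jitter: true]}
  end
end

{:ok, counter} = Agent.start_link(fn -> 0 end)
result = RetrySketch.charge_with_retries(counter)
```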

This puts an additional requirement on your transaction and compensation callbacks: they must be idempotent.

Internally Sage persists the retry count for the whole Sage execution, because we don’t want to forward-retry indefinitely. So when another step has already retried the execution 5 times, the retry is a no-op and we continue running compensations.

Circuit breaking

Whenever a stage’s transaction fails, the compensation executed at that stage can let the execution continue by providing some default effect. For example, you can cache the response on successful executions and use that data when it’s not possible to retrieve it immediately.

Keep executing Saga with cached currency exchange rates
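
A sketch of a circuit breaker, assuming Sage 0.6 and its {:continue, effect} compensation return value. The module, the always-failing rates lookup and the cached rate of 0.5 are made-up illustration values, not real exchange data:

```elixir
# Assumption: the script has network access to fetch Sage from Hex.
Mix.install([{:sage, "~> 0.6"}])

defmodule CircuitBreakerSketch do
  import Sage

  # Hypothetical cache; in practice it would be filled on successful lookups.
  @cached_rates %{"USD/EUR" => 0.5}

  def quote_price(attrs) do
    new()
    |> run(:rates, &fetch_rates/2, &rates_circuit_breaker/3)
    |> run(:price, &convert/2)
    |> execute(attrs)
  end

  # Stub: the exchange rate service is down.
  defp fetch_rates(_effects, _attrs), do: {:error, :service_unavailable}

  # Instead of aborting, continue the Saga with the cached rates as the effect.
  defp rates_circuit_breaker(_effect, _effects, _attrs), do: {:continue, @cached_rates}

  defp convert(%{rates: rates}, attrs), do: {:ok, attrs.amount_usd * rates["USD/EUR"]}
end
```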

Final callback

It’s possible to attach callbacks that are called whenever the Sage execution finishes successfully or after all effects are amended. For more details see Sage.finally/2.

Tracing

You can also attach a module that receives instrumentation events. See the Sage.Tracer behaviour.

Works with Ecto

You don’t need to write compensations for a local database that supports atomic transactions. Instead, wrap the Sage execution in a transaction that is committed when the execution succeeds and rolled back otherwise.

Organizing your business logic in Domain Driven Design contexts

Lately we have seen DDD practices promoted in the Elixir community, particularly since the Phoenix Framework introduced contexts. When a lot of business logic is spread across contexts, the code responsible for orchestrating them becomes pretty complex.

One way to organize them is to rely heavily on Ecto.Multi and compose transactions in high-level functions. Where Ecto.Multi doesn’t solve the original problem, Sage can be used in a similar way.

Probably oversimplified example 😅

Error handling

Let’s see what can go wrong.

Sage transaction abort. This is the most basic case: when a transaction cannot be completed, we start backward recovery and amend all created effects.

Compensation failure. This is something Sage does not handle generically for you. Whenever a compensation cannot amend the transaction, Sage will raise an error and expect the developer to investigate the issue manually. However, it’s possible to write an adapter that handles those cases, e.g. by retrying compensations indefinitely.

Bugs in Sage. Currently not handled, but..

Things to come

  • Saga execution log. Persistence for Saga execution to tolerate failures of the process or of the node where the execution coordinator runs. Additionally, it would make it possible to recover when there was a bug in the Sage executor.
  • Plug-style module callbacks. Instead of passing anonymous functions you would be able to pass modules that implement a common behaviour, which should make composing Sagas even more fun!
  • Compile-time type checking. Dialyzer should warn you when you compose Sagas in a way that violates their dependencies on other stages.
  • Event collector. No events should be emitted from within the transaction; instead, they can be returned from it and fired only from the final callback when the transaction is committed.

Short recap

Sage is particularly useful when dealing with:

  • Microservices (maybe not the best case, because you have control over them and there are many more options to consider).
  • Data which is split between multiple databases or partitions, which is somewhat similar to microservices.
  • Communication with external services (I think this is where Sagas fit best, because here we are interacting with a system that is out of our control, has a limited set of features exposed over its API and is unlikely to change on our demand).

Sage can help to elegantly organize logic in your domain contexts.

It’s well-documented and covered with tests.

Related materials