Distributed Transactions in a Cloud-Native, Microservice World

Yesterday I was having a great conversation with a bunch of super smart people (if you haven’t tried surrounding yourself with geniuses, you should try it sometime), and the topic of dealing with distributed transactions in cloud-native, microservice architectures came up.

First, let’s take a look at the problem. When we build the big enterprise apps and servers that we’ve been building for years, sometimes we need to coordinate multiple activities in a transaction. The classic example of a transaction that everyone uses is transferring money in a bank: you withdraw from one account and deposit into the other. If either of these activities fails, you need to be able to restore the state of your account to what it was prior to the transaction, or someone’s going to lose money. The acronym ACID (Atomic, Consistent, Isolated, Durable) comes up often when talking about these kinds of transactions. The problem becomes even more involved when the activities being coordinated as part of a transaction are distributed — they don’t all take place in the same memory space, on the same server, or even in the same data center.

Many of us have spent years of our lives drowning in the quagmire of distributed transaction coordinators, compensating resource managers, and the rest of the nightmare induced by distributed transactions. Transactions are a hard problem, even when isolated to a single process on a single box. Companies make their entire fortune making and selling transaction management libraries, components, servers, etc.

So now we’ve decided to embrace the cloud and we’re all-in on the concept of microservices and we love our 12-factor applications — now what? Unfortunately, there’s no easy fix, no salve one can apply to stop the bleeding on this wound.

Single Responsibility Principle

I’ve seen this particular problem a number of times: we decide to migrate a service to the cloud. The service uses a transaction manager to coordinate transactions, so how do we make this work cloud-native? One reason why a service might need a transaction coordinator is that it is doing more than one thing. If you’re transferring funds, then the service might be able to deal with transactions internally. There’s nothing wrong with hiding transactions inside your service, the problem comes when you need a transaction to span across multiple services.

Before trying to figure out how to do that (discussed next), ask yourself if the existing service is violating SRP. If it is, refactor it so that it is a true microservice in every sense of the word. Once you’ve got multiple microservices that don’t violate SRP, you may find that some of your presumed need for transactions falls away, leaving you with a smaller subset of transaction problems to deal with.

You may also want to avoid using the money transfer analogy when talking about this problem. A single service that transfers funds from one account to another is not necessarily a candidate for distributed transactions across microservice boundaries. This is ugly, difficult work, so before trying to do it the hard way, see if there aren’t other designs or refactoring patterns you can adopt to avoid the need for a transaction to cross service boundaries.

Natively Transactional APIs

A pretty common mistake people make when designing RESTful services and APIs is to directly map the resource model (the URLs exposed by the service) to the internal data model. This problem continues to proliferate because tools and scaffolding systems often default to this approach (I’m looking at you Rails, and you too, MEAN). While this is handy for a simple hello world just to get things up and running, it is rarely ever the best approach to exposing a public API for interacting with your system. This is especially true in the case of transactional systems.

Modeling interactions this way is often a sore spot with REST purists who claim that this is where the beauty and purity of REST breaks down and people end up “hacking” resource models to bend to their will. I will leave that philosophical discussion for another time.

Let’s say you have a service exposed that allows you to buy something. We don’t normally think about it like this, but this is a natively transactional API. The act of buying something is either the start of or in the middle of some kind of purchasing workflow, where all activity is correlated to an transaction (an Order ID, for example). The stuff we bought, how much we paid for it, when we paid, where it’s going, and when it ships all belong to a single transaction. This transaction may spawn other back-end transactions, but you get the idea.

When you POST to a resource collection, say /orders, you get a new order and a new transaction has been created implicitly. When you then interact with that order (check status? GET /orders/ABCXYZ) you are interacting with a long-lived system transaction. This is usually a very different feeling than the short-lived database transactions we often use to justify the use of a distributed transaction coordinator.

I realize there are always exceptions to every rule, but one to really try and live by is this: if you need a DTC you’re doing it wrong. There is almost always a better way and, in many cases, refactoring out the need for a DTC provides huge ROI in terms of other benefits — elegant design, performance, scalability, reliability, etc.

Assumptive Failure and Eventual Consistency

In a classic DTC (distributed transaction coordinator)/two-phase commit scenario, some bit of code starts a transaction. Other bits of code in other locations then perform their pieces of the larger task and they vote or elect on the outcome of the transaction. If everyone voted thumbs up then the transaction commits. If some piece of the transaction was unable to complete, it fails outright or votes thumbs down, and the transaction rolls back.

This is all well and good when you’re running a monolith on a single box, the DTC is under your control, and you have a synchronous process where you can sit and wait for all the work to finish. This is not the case in the cloud-native world.

In today’s applications, a transaction can start, some work takes place in parallel, other pieces of work may be advanced asynchronously from a mobile application triggered by the user, and further work moves the transaction asynchronously from back-end processes. Now remember that this is running in the cloud — some of the transaction participants may be temporarily out of network contact with each other as they move from one virtual data center to another. Others might be dynamically scaling, so you could start a transaction with 2 instances of a participant and finish it with 30.

If you try and apply the “DTC” mindset to this problem, your head will very likely burst into flames.

To give a concrete sample of this that we’re all likely familiar with, take a look at the process of buying tickets online. This could be for a movie, could be for a concert, or for a flight. In a flight scenario, you start by searching flights, but at some point you start the process of booking. You then see some kind of message that says that your seats are reserved for the “next 5 minutes” or something like that. If you don’t complete your ticket purchase successfully in that time frame, you lose the seats.

This is where the pattern of eventual consistency comes in. If you’re not familiar with it as an architectural pattern, you should definitely look it up. It will save your life, will make you coffee in the morning, and tuck you in at night.

We know that our systems are spread out across the cloud, that our services are being invoked in unpredictable sequences, and sometimes some of them might need to be re-invoked to deal with random failures throughout the system. Instead of being afraid of this, we need to accept this as our new reality and embrace it.

In the case if booking a flight, you get a transaction when you go beyond some point online. The transaction then starts a countdown timer where, if it doesn’t get completed successfully, failure is assumed and the seats you reserved are released.

This type of transaction is designed specifically to deal with the unpredictable, eventually consistent nature of things. In an old enterprise app, you might perform a single distributed transaction to book a flight or rollback a flight booking. In this new design, you embrace the fact that the state of a booking is eventually consistent and that a transaction is considered a potential failure until proven otherwise.

Again, this isn’t the only solution to the problem, but it’s an example of how embracing the unpredictable, cloud-native nature of things can give you benefits above and beyond simply porting a DTC-dependent process to the cloud.

There are a number of concrete patterns for dealing with failure in and outside of transactions, but I think that’s a larger topic better covered in another post.

So, in summary: if you are looking for a strategy to port a DTC-dependent process to the cloud, try thinking about it differently. Instead, ask yourself how you can embrace the spirit of microservices and cloud-native architectures to refactor and redesign away the need for a DTC. You are likely to find that in that new future, there are other great benefits.