The Ack problem — Part 5

The acks that you send

Philippe Detournay
Xendit Engineering
4 min read · Jun 15, 2023


In the previous post, we saw that an error response from an API call should not be treated as “not done”, but should be understood as an “undefined, maybe partially done” situation. In this post, we are going to dig deeper into this.

When an API does nothing but update a local database, it is easy to keep atomicity guarantees. After all, the database does it for us: all we need to do is execute the call within a transaction.
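As a minimal sketch of this easy case, here is a transfer between two rows using Python’s `sqlite3` as a stand-in database (the table and amounts are invented for illustration). The connection’s context manager commits on success and rolls back on any exception, so both updates land together or not at all:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

def transfer(conn, src, dst, amount):
    # "with conn" commits if the block succeeds and rolls back if it
    # raises, so either both rows change or neither does.
    with conn:
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                     (amount, src))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                     (amount, dst))

transfer(conn, "alice", "bob", 30)
```

The database engine provides the atomicity for free here; the trouble only starts once a second system enters the picture.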

But if your API needs to interact with more than one external entity (so not just your database, but maybe a message broker or a downstream API), then atomicity quickly becomes a challenge. To illustrate the problem, let’s take two examples:

Database is committed before downstream API is called

Let’s say that our API must create a new local record, and must also call a downstream service. If we commit the local DB first and then call the downstream service, the following could happen:

Downstream service has “undefined” status

Because we got a network or technical error, or perhaps a 5xx from the downstream service, we will also return an error back to our upstream service. In reality, everything completed successfully, so unless we check our database when the upstream service calls us again, we will likely create a duplicate transaction.

The good news is, since we committed our DB first, we will be able to detect the duplicate situation before calling the downstream service. That looks good, right?

Except that this is not the only possible scenario. Let’s consider this scenario instead:

As far as your service is concerned, there is no difference at all: the call to the downstream service is in an undefined state. You can’t tell the difference. But the reality is that the downstream call didn’t proceed at all! So if you check your local DB before proceeding with a potential duplicate transaction, then the downstream service will never be called, even if the upstream service does the right thing and calls you again!
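The commit-first flow and its blind spot can be sketched as follows. All names here (`save` via a dict, `call_downstream`, `handle`) are hypothetical placeholders, not a real API. The duplicate check correctly suppresses a retry when the downstream call did succeed, but it equally suppresses it when the downstream call never happened:

```python
class DownstreamError(Exception):
    pass

db = {}  # stand-in for the local database, keyed by request id

def call_downstream(request_id, should_fail):
    # Simulates a downstream call whose outcome is unknown to us on failure.
    if should_fail:
        raise DownstreamError("timeout: outcome unknown")

def handle(request_id, downstream_fails=False):
    if request_id in db:
        # Duplicate detected: we skip the downstream call. But if the
        # earlier downstream call never actually happened, it will now
        # never happen -- the gap described above.
        return "duplicate-skipped"
    db[request_id] = "created"                      # step 1: commit locally
    call_downstream(request_id, downstream_fails)   # step 2: may fail after commit
    return "ok"

# First attempt: local commit succeeds, downstream outcome is unknown.
try:
    handle("tx-1", downstream_fails=True)
except DownstreamError:
    pass

# Retry: the duplicate check suppresses the downstream call entirely.
print(handle("tx-1"))  # -> duplicate-skipped
```

From inside `handle`, the two scenarios are indistinguishable: the same `DownstreamError` is raised whether the downstream side did the work or not.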

Downstream API is called before database is committed

So maybe the solution is to do it the other way around: call the downstream service before committing the database. Let’s consider some scenarios again:

Here, the database commit got completed successfully, but you still got an “unknown” state from it. This means that, like before, you will return a 500 or some sort of error to the upstream service. The upstream service will retry, and if you perform a DB lookup before calling the downstream service then all is good! Since the DB got committed, you have a record of the existing transaction and you don’t call the downstream service again. Finally, we solved it, right?

Except that no, let’s consider this final scenario:

In this case, the commit did not succeed. Again, let us stress that your service cannot tell the difference between the two scenarios. When the upstream service calls you again to retry the failed transaction, you will have no record for the previous call, and you will call the downstream service again, generating a duplicate call.
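The downstream-first flow can be sketched the same way (again with hypothetical names). When the commit fails after the downstream call, the retry finds no local record and calls the downstream service a second time:

```python
class CommitError(Exception):
    pass

db = {}                 # stand-in for the local database
downstream_calls = []   # records every downstream invocation

def handle(request_id, commit_fails=False):
    if request_id in db:
        return "duplicate-skipped"
    downstream_calls.append(request_id)  # step 1: call downstream first
    if commit_fails:
        # step 2 fails: the downstream work is done, but we have no record of it
        raise CommitError("commit outcome unknown")
    db[request_id] = "created"
    return "ok"

# First attempt: downstream succeeds, local commit fails.
try:
    handle("tx-2", commit_fails=True)
except CommitError:
    pass

# Retry: no local record exists, so downstream is called again.
handle("tx-2")
print(downstream_calls.count("tx-2"))  # -> 2
```

Swapping the order only moves the blind spot; neither ordering alone can close it.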

Oh my…

But… then what?

A lot has been written and discussed over the last twenty years on this topic. For a while, two-phase commits and distributed transaction coordinators were considered: two-phase commits are fundamentally a solution to extend the transaction across several services. But these required a lot of effort and still had blind spots, eventually ending in “unknown” states. And performance was generally not that good.

An interesting discussion of why two-phase commits should be avoided in microservice infrastructure can be found here.

In some cases, we still need some level of “higher-level transactional support”, where a full API call involves several sub-calls such as “reserve this” followed by “commit this”, with built-in timeouts and so on. This generally requires orchestration services, which we typically learn about in our first introduction to microservices and then immediately forget. These approaches are essentially the modern equivalent of distributed transaction coordinators.
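A very rough sketch of the reserve/commit idea with a built-in timeout follows. This is an invented toy, not any real orchestrator’s API: production systems also persist reservation state, retry, and run compensations when the window expires.

```python
import time

class Reservation:
    """Toy reservation: commit must arrive before the TTL expires."""

    def __init__(self, ttl_seconds):
        self.expires_at = time.monotonic() + ttl_seconds
        self.committed = False

    def commit(self):
        if time.monotonic() > self.expires_at:
            # Past the window: the reserved resources are released,
            # so the commit is refused rather than left half-done.
            raise TimeoutError("reservation expired; resources released")
        self.committed = True

res = Reservation(ttl_seconds=30)
res.commit()  # within the window: the reservation becomes permanent
```

The timeout is what bounds the “unknown” state: if the commit never arrives, the reservation self-cancels instead of lingering forever.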

But while orchestration can be a solution in complex cases, you don’t want to create such an infrastructure for each new API. In the next post, we’ll discuss a possible solution that, while requiring a bit of rigor and care when designing your APIs and messaging schemas, is quite cheap to put in place and easy to test once it is used consistently across all your services.
