The Ack problem — Part 2

The acks you get from the DB

Philippe Detournay
Xendit Engineering
3 min readJun 6, 2023

--

The atomic uncertainty

In the previous article, we learnt that if we receive a network error (or any low-level unexpected technical error) while waiting for a response to a request, then this request is likely to have been processed anyway.

From a network point of view, a database call is just another request-response interaction. This means the Ack problem is relevant. In most cases, it occurs during database commits.

Almost all databases offer some level of atomicity guarantee. What differs is the scope and the level of control that you can have over it, but at the end of the day they all offer a very important statement: “this operation is atomic, i.e. it is either done or not done, but won’t be stuck half-way”.

For instance, you can request PostgreSQL to execute these two statements in an atomic way (i.e. in a transaction):

INSERT INTO my_table(id) VALUES('id1');
INSERT INTO my_table(id) VALUES('id2');

Once the call returns, either both values will be inserted, or none will be. ONCE the call returns.

But what if the call DOESN’T return? Or what if the return is an internal error, or a network error, or something similar?

Committed? Uncommitted?

The database guarantees you that the changes are either done fully or not done at all. But the database doesn’t guarantee that you get to know which road was taken. If the database gives you an explicitly response, then all is good. But if you don’t receive a proper response then you are facing the Ack problem, and you just don’t know.

And as we’ve discussed before, the database has likely committed your transaction anyway…

Don’t refresh your browser!

A lot of developers got first introduced to this problem when developing APIs for Web applications. If the database returns something unexpected, then the application will almost always “forward” this error in one form or another to the API caller.

When this happens, it is generally accepted that the caller (in this case, a human being) will perform investigative actions like refreshing the browser page and check whether the transaction was made or not before trying to submit the call again.

If the Web application was poorly designed, refreshing the page would automatically submit the call, leading to a risk of duplicates. The browsers are aware of this risk and tries to be helpful, sometimes:

But nowadays, most APIs are meant to be consumed in an unattended, fully automated way. This means that either:

  • The APIs provide enough “investigative” support, like query or list APIs to inquire whether a transaction took place or not, and make it easy to reconcile the data in an automatic way (i.e. by taking a caller-generated reference and allowing to query from it). This also assumes that the caller goes through the effort of implementing these recovery steps;
  • Or the API is designed to be idempotent, i.e. it will automatically enquire, upon retry from the client, whether a previous equivalent transaction already exist, and will react accordingly. This is not always easy to do and we will discuss possible technical solutions and patterns in a future post.
Oopsie…

We’ve seen that just returning a “500” error when we receive an unexpected error from the DB may have unintended consequences. In the next article, we’ll discuss about a similar case about the message broker.

--

--