The Ack problem — Part 1

Setting the scene

Philippe Detournay

Published in

Xendit Engineering

5 min read · May 23, 2023

Introduction

This is the first article in a series on the Ack problem. The series will comprise 9 articles in total, posted in the days following this initial publication.

Sending a letter

Communication is hard. This is certainly true for communication between two sentient beings. It turns out it is also true for communication between computers.

Nowadays, even the most trivial problems will involve some kind of communication between computers. These computers may be in the same room or on the other side of the planet (or even on a different planet, if you have a very interesting job). No matter the distance, they will face the inherent “Ack” problem of any form of communication.

I’m old enough to remember a time when physical mail was used for something other than home delivery. You would write some stuff on a sheet of paper, fold it, put it in an envelope, write a destination address on the envelope and drop the envelope into a mailbox. And once you did all that, the “Ack” problem would naturally come to your mind:

Did I write the address correctly?
Did I put it in the right mailbox?
What if nobody collects the mail in this mailbox?
What if the postman drops my letter during collection?
What if the letter gets lost somehow during sorting or transit?
What if the receiver doesn’t bother replying to me?
And of course, even if the receiver gets it and sends a reply to me, what if something happens to that reply on its way back to me?

With all this in mind, it feels miraculous that physical mail was used at all…

I’m sorry Mario, your letter is in another mailbox

The biggest issue was not really that any of this could happen, but rather that you, as the sender, would be left totally in the dark: you had no way of knowing the outcome until you got a reply. Of course, today you could opt for some sort of package tracking, but I’m referring to a pre-Internet time (and yes, we still had electricity and no, there were no dinosaurs roaming the streets).

Synchronous communication

The physical mail example above seems to imply that this mechanism was asynchronous. With the improvement of networking and the reduction of latency, a lot of protocols these days are meant to be synchronous. At first glance, this seems to address most of the issues: we know that the communication worked because we get an acknowledgement or a response from the receiver. Problem solved!

Except that, fundamentally, any communication is asynchronous. What we are doing with the query-response paradigm is that we send the letter and sit idle for several days of computer-equivalent time until we get a response letter back. The problem remains exactly the same: until and unless we get a response, there is no way to know what happened to our request.
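To make this concrete, here is a minimal sketch of a "synchronous" call built on top of two asynchronous mailboxes. Everything in it is hypothetical and for illustration only: the names `synchronous_call`, `run_receiver` and `demo`, the `pay-42` request and the half-second timeout are my own assumptions, and in-memory queues stand in for the network.

```python
import queue
import threading


def synchronous_call(request, inbox, outbox, timeout):
    """Send a request, then block until a reply arrives or the timeout expires."""
    inbox.put(request)                      # drop the letter in the mailbox
    try:
        return outbox.get(timeout=timeout)  # sit idle, waiting for the reply letter
    except queue.Empty:
        return None                         # no reply: the outcome is unknown


def run_receiver(inbox, outbox, processed, reply_is_lost):
    """Hypothetical receiver: handles one request, replies unless the reply is lost."""
    request = inbox.get()
    processed.append(request)               # the side effect happens either way
    if not reply_is_lost:
        outbox.put(f"ack:{request}")


def demo(reply_is_lost):
    """Run one call and return (what the sender saw, what the receiver did)."""
    inbox, outbox, processed = queue.Queue(), queue.Queue(), []
    worker = threading.Thread(
        target=run_receiver, args=(inbox, outbox, processed, reply_is_lost)
    )
    worker.start()
    reply = synchronous_call("pay-42", inbox, outbox, timeout=0.5)
    worker.join()
    return reply, processed
```

Calling `demo(reply_is_lost=True)` returns `None` to the sender even though the receiver did record the request: the blocking call only hides the waiting, it does not remove the uncertainty.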

Request-Response errors

For now, we are going to ignore problems that can happen on the receiver side (like, what if the receiver did receive my letter but then it got lost in a fire before they could reply to me?) and only focus on problems occurring during transport. Don’t worry, we’ll get back to a more complete picture in later posts.

The typical way the ack problem is introduced is via the following picture:

Your typical request-response network error lecture

This picture is… not wrong. It highlights that a network error leaves you in uncertainty: it can be caused by either an issue on the request flow or on the response flow, and you can’t decide which one it is.

But I don’t like it, as it misses the bigger picture. Here is an alternative:

The timing of a request-response interaction

On the left, we have the sending service, and on the right we have the receiving service that will process your request. The “request-response call” starts at the very top and ends when the reply is fully received. The sending service has some certainty in the “green” areas:

Before it has fully emitted the request, it knows for sure nothing will be done: the receiving service would not act on an incomplete or truncated request;
After it has fully received the response, it knows for sure the request has been completed.
During the time between these two “green” areas, the sending service will be in the “uncertainty” area: the request may or may not have been processed.
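The three cases above can be written down as a tiny classifier of what the sender is entitled to conclude. This is only a sketch: the `Knowledge` enum and the `sender_knowledge` function are names I made up for illustration, and the two booleans assume the sender can reliably tell whether it finished emitting the request and whether it finished reading the response.

```python
from enum import Enum


class Knowledge(Enum):
    NOT_PROCESSED = "certain: nothing was done"
    UNKNOWN = "uncertain: may or may not have been processed"
    COMPLETED = "certain: the request was completed"


def sender_knowledge(request_fully_sent: bool, response_fully_received: bool) -> Knowledge:
    """Return what the sender can conclude when the call stops at a given point."""
    if response_fully_received:
        return Knowledge.COMPLETED      # second "green" area
    if not request_fully_sent:
        return Knowledge.NOT_PROCESSED  # first "green" area
    return Knowledge.UNKNOWN            # everything in between
```

For example, `sender_knowledge(True, False)` yields `Knowledge.UNKNOWN`, which is exactly the situation of a caller whose connection drops while it is waiting for the reply.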

But now let’s look at it from the receiving service: it will start the processing as soon as the request has been fully received. In most service implementations, the network is not monitored during request execution. This means that even if the network were to collapse during execution, this execution would proceed until completion. And for any non-trivial operation, the processing time may be longer than the parsing and replying time. As a result, any network issue occurring during the receiver’s “yellow” area will not prevent the operation from completing. And it is clear that the “yellow” area is greater than the “green” area. In other words:

If the caller of a request-response call observes a network error before receiving the response, the most likely outcome is that the request came to completion
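A rough back-of-the-envelope model shows why. The sketch below rests on a simplifying assumption of mine that is not part of the pictures above: the network failure strikes at a moment chosen uniformly at random over the duration of the call. The function names and the timings used in the usage note are hypothetical.

```python
import random


def completion_probability(send_s, process_s, reply_s):
    """Chance that the request completed, given a failure at a uniformly random moment.

    A failure while the request is still being sent means nothing was done.
    A failure during processing or replying does not stop the receiver.
    """
    total = send_s + process_s + reply_s
    return (process_s + reply_s) / total


def simulate(send_s, process_s, reply_s, trials=100_000, seed=0):
    """Monte Carlo cross-check of completion_probability under the same assumption."""
    rng = random.Random(seed)
    total = send_s + process_s + reply_s
    completed = 0
    for _ in range(trials):
        failure_at = rng.uniform(0.0, total)  # when the network breaks
        if failure_at >= send_s:              # request had already fully arrived
            completed += 1
    return completed / trials
```

With, say, 5 ms to send, 200 ms to process and 5 ms to reply, `completion_probability(0.005, 0.2, 0.005)` is 0.205 / 0.21, so under this model the request has completed in the vast majority of the cases where the caller sees an error. Real failures are not uniformly distributed, so treat the number as an intuition pump, not a measurement.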

The above has far-reaching consequences. We will explore them in more detail in subsequent posts.

Once done, we’ll zoom out and discuss what this all means in practice for more complex setups (where there are additional “hops” between services), and we will extend the discussion to failures during request processing.