The Ack problem — Part 6

Idempotency

Philippe Detournay
Xendit Engineering
5 min read · Jun 15, 2023

--

In the previous post, we observed that non-trivial APIs cannot be made atomic without significant effort. As a consequence, any network or technical error (like a 5xx error code in a REST API) can mean anything from “nothing was done” to “everything was done”, including any partially updated state in between.

If at first you don’t succeed…

…then skydiving isn’t for you.

Idempotency is typically put forward as a solution to the Ack problem. The proposal rests on the valid assumption that retrying a failed call against an idempotent operation lets us resolve the uncertainty left by an “indeterminate state” result, without the risk of a duplicate transaction. But we typically don’t spend enough time defining the term properly.
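To make the retry assumption concrete, here is a minimal client-side sketch. Every name in it (`IndeterminateError`, `call_with_retries`, `FlakyServer`) is hypothetical and only illustrates the idea: the idempotency key is generated once per logical transaction and reused unchanged on every retry.

```python
import uuid


class IndeterminateError(Exception):
    """Hypothetical stand-in for a 5xx response or a network timeout."""


def call_with_retries(operation, payload, max_attempts=3):
    # One key for the whole logical transaction: created once, reused on each retry.
    idempotency_key = str(uuid.uuid4())
    last_error = None
    for _ in range(max_attempts):
        try:
            return operation(idempotency_key, payload)
        except IndeterminateError as exc:
            last_error = exc  # Outcome unknown: retry with the SAME key.
    raise last_error


class FlakyServer:
    """Toy server (assumption): does the work, but loses the first response."""

    def __init__(self):
        self.payments = {}
        self.calls = 0

    def pay(self, key, amount):
        self.calls += 1
        self.payments.setdefault(key, amount)  # recorded at most once per key
        if self.calls == 1:
            raise IndeterminateError("response lost")
        return {"status": "done", "amount": self.payments[key]}
```

Calling `call_with_retries(FlakyServer().pay, 100)` succeeds on the second attempt while the payment is recorded only once. A fresh key on each attempt would instead look like a new transaction to the server.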

A layman’s definition would be: “Idempotency is when calling an API twice with the same parameters produces the same result”. At first this definition makes sense, but it fails to address a very important question: what does it mean when a call leaves the server in an undetermined or partially updated state? By this definition, calling the service a second time with the same parameters should leave the service in that same partial state. That does not help at all!

Let us try a slightly more detailed definition:

An operation is idempotent when:

- It has a well-defined and documented idempotency condition, under which it exposes the idempotency behavior;
- It has a well-defined and documented expected positive and negative outcome;
- Subsequent invocations of the operation within the idempotency condition will:
    1. Either not alter the state or bring the state further towards the expected outcome of the initial call;
    2. And will not expose more visible outcomes than those expected out of the initial call.

Let us explore each element of this definition separately:

  • It has a well-defined and documented idempotency condition: the caller can specify whether it wants to “retry” a previous call or to initiate a new one, typically via a specific “transaction reference” or “idempotency key” that is part of the operation parameters;
  • It has a well-defined and documented expected positive and negative outcome: “done” and “not done” must be defined, i.e. what exactly is expected to have happened when, for example, an HTTP 201 or an HTTP 403 is returned. This includes the returned values or bodies as well as the side effects;
  • Subsequent calls may leave the state unchanged: if an operation is retried (with the same transaction reference) after a call that was fully “done” or fully “not done”, the service may opt to just return the previously returned value without affecting the state. Moreover, if the previous call left the state “partially done” and the problem persists (e.g. the DB is still unavailable), it may keep returning an “indeterminate” code (e.g. 5xx);
  • Subsequent calls may bring the state further towards the expected outcome: if a previous call to a downstream service could not be completed, that call may be retried, provided it is itself idempotent. Similarly, the DB can be updated further, and so on;
  • Will not expose additional or different user-visible outcomes than those expected out of the initial call: a subsequent call made within the idempotency condition must not cause user-impacting duplicate transactions. Duplicates that have no user-visible effect (such as extra log entries) are allowed, however. So if the initial call is expected to create a new object for the caller and return its identifier, then subsequent calls must either:
    * Create the object if it does not already exist, and return its identifier;
    * Or complete the partial creation that was aborted during a previous call, and return its identifier;
    * Or return the identifier of the object that was previously fully created under the same idempotency condition.

Since the return of the object identifier is part of the expected operation outcome, it must be returned in any case (except if another technical error is returned instead). Returning an error like “this object already exists” is NOT an idempotent behaviour (see below).
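The three cases above can be sketched as a single in-memory operation. This is only an illustration under stated assumptions: `OrderService`, `TechnicalError`, the `PENDING`/`DONE` states and the `fail_next_charge` switch are hypothetical names, and the "charge" step stands in for any user-visible side effect. In a real system that step would itself need to be idempotent, since a crash could occur between the side effect and the state update.

```python
import itertools


class TechnicalError(Exception):
    """Hypothetical stand-in for an 'indeterminate' result such as a 5xx."""


class OrderService:
    def __init__(self):
        self._ids = itertools.count(1)
        self.by_key = {}   # idempotency key -> {"id": ..., "state": ...}
        self.ledger = []   # user-visible side effect: one entry per order
        self.fail_next_charge = False  # test switch simulating an outage

    def _charge(self, order_id):
        if self.fail_next_charge:
            self.fail_next_charge = False
            raise TechnicalError("ledger unavailable")
        self.ledger.append(order_id)

    def create_order(self, idempotency_key):
        record = self.by_key.get(idempotency_key)
        if record is None:
            # Nothing exists yet for this key: start the creation.
            record = {"id": next(self._ids), "state": "PENDING"}
            self.by_key[idempotency_key] = record
        if record["state"] == "PENDING":
            # A previous call stopped half-way (or we just started): complete it.
            self._charge(record["id"])  # may raise TechnicalError again
            record["state"] = "DONE"
        # In every non-error case, the same identifier is returned.
        return record["id"]
```

Retrying with the same key after a `TechnicalError` completes the pending order instead of creating a second one, and any further retry simply returns the same identifier. At no point does the caller receive an “already exists” error.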

…try, try again!

We will see typical patterns of idempotency implementations in both API servers and message consumers in a subsequent post. For now, let us observe examples of non-idempotent implementations.

Caching the response body of successful calls and serving it on retries is not sufficient to make an implementation idempotent: it does not address how to progress the state during retries after a partial-state result, nor does it help prevent unwanted duplicates when retrying over a partial state.
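A small sketch of why a response cache alone falls short. The wrapper and the `charge_card` operation are hypothetical; the operation is deliberately written to fail once after its side effect has already happened, which is the partial-state situation described above.

```python
class TechnicalError(Exception):
    """Hypothetical stand-in for an 'indeterminate' result such as a 5xx."""


def with_response_cache(operation):
    cache = {}

    def wrapper(key, *args):
        if key in cache:
            return cache[key]           # replay the stored successful response
        result = operation(key, *args)  # if this raises, nothing gets cached
        cache[key] = result
        return result

    return wrapper


charges = []        # user-visible side effects
fail_once = [True]  # simulate one crash right after the side effect


def charge_card(key, amount):
    charges.append(amount)  # the side effect happens first
    if fail_once[0]:
        fail_once[0] = False
        raise TechnicalError("crashed after charging")
    return "receipt-%d" % len(charges)


cached_charge = with_response_cache(charge_card)
```

If the first `cached_charge("key-1", 50)` raises and the caller retries with the same key, the card is charged a second time: the cache only ever sees successful responses, so it knows nothing about the partial state left behind by the failed attempt.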

Rejecting duplicate calls (e.g. via an HTTP 429) is an invalid idempotency implementation: the 429 error is not part of the initial expected outcome, hence the subsequent call exposes a different user-impacting outcome.

Generally speaking, any logic/lib/layer/middleware that aims to handle idempotency without requiring any effort from the operation’s core logic is inherently flawed. As we have seen, idempotency requires the ability to move a partial state further towards the expected initial outcome, which cannot be done independently of the operation logic.

It is worth observing that “shutting down the server” IS a valid, if extreme, idempotent behaviour (since it leaves the state unchanged and “returns” a technical error). There are cases where this behaviour is the only practical one, such as a Kafka consumer finding itself unable to process a critical message.
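The consumer case can be illustrated with a toy loop. It does not use any real Kafka client: `OffsetStore`, `run_consumer` and `DownstreamUnavailable` are hypothetical names, and the list of messages stands in for a partition. The point is only that stopping without committing the offset leaves the state unchanged, so the message is seen again on restart.

```python
class DownstreamUnavailable(Exception):
    """Hypothetical error: the message cannot be processed right now."""


class OffsetStore:
    """Stands in for the committed offset kept outside the consumer process."""

    def __init__(self):
        self.committed = 0


def run_consumer(messages, store, handler):
    # Resume from the last committed offset, as a restarted consumer would.
    while store.committed < len(messages):
        # If the handler raises, the exception propagates and the consumer stops.
        # The offset is NOT committed, so the same message is handled on restart.
        handler(messages[store.committed])
        store.committed += 1  # commit only after the handler succeeded


processed = []
downstream_up = [False]  # simulated dependency that is initially down


def handler(message):
    if message == "critical" and not downstream_up[0]:
        raise DownstreamUnavailable(message)
    processed.append(message)
```

In this sketch the handler fails before producing any side effect. A handler that can fail half-way would additionally need to be idempotent itself, for the reasons discussed above.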

Generally speaking, if “shutting down the server” is not a valid XXX strategy, then XXX is problematic to start with: any computer can and will be shut down at any time without prior notice.

Drake will get it right the next time he tries!

In the next two posts, we will (finally!) explore potential patterns and solutions, first in a typical API server, then in a message consumer.
