Architecting Distributed Systems: API Failures

Taking a simple client-server interaction and exploring all the ways it could go wrong

Robert Konarskis
Geek Culture
Dec 3, 2021



In this article, I try to explain how one should reason about integrating software using synchronous remote calls (HTTP, RPC, etc.), and how a simple one-line remote call can take a significant amount of effort to get right, depending on the resilience requirements of the overall system and the characteristics of its individual components.

const result = await remoteService.doSomething('withThis');

It is aimed at software engineers who want to dig deeper and understand how their code can fail and why they should handle errors mindfully, as well as engineering managers and product owners who want a better idea of the complexity behind “let’s just call that API”.

By the end of the article, you will form a solid framework for error handling, and have a reason to carefully read the docs of the APIs you are consuming.

While a remote call seems like any other function invocation, it is typically vastly different from performing calculations locally, rendering a UI, querying a local database, or reading from a filesystem. The fact that there is a network between your client and the remote server, and that they run on different physical hardware, makes all the difference.

Replacing the binary thinking

The one-liner mentioned above feels pretty binary: it can either succeed or fail. Let’s represent it as two distinct components with a network between them and see what else we can learn.

Note that this is a highly simplified diagram; you’ll see a more complete one towards the end.

Now that we have a simple visual representation of the components involved, we can identify the individual parts that can go wrong. Knowing that both the hardware and the network can fail, it seems we have 3 distinct pieces:

  • Client
  • Server
  • Network: request, response

Great, it no longer feels like a simple one-liner. Now, let’s take a moment and simulate a fault: let’s say we receive an error response. The first thing that comes to mind is “the operation must have failed”, but are we really sure it failed? Is it perhaps possible that the remote server actually executed the operation successfully, even though we didn’t get a 2xx response back? What if the client died while waiting for a response…?

The above question brings us to the following three possible outcomes of any remote operation from the client’s point of view:

  1. The operation has succeeded and we know about it: we received a 2xx response back.
  2. The operation has failed and we know about it: we received a documented 4xx or 5xx response, explaining what went wrong.
  3. Something went wrong and we don’t know if the operation succeeded: perhaps we got a timeout or an undocumented error response back, or crashed while waiting for a response.
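The three outcomes above can be sketched in code. This is a minimal, hypothetical illustration: the shape of the `result` object (`ok`, `status`, `documented`) is an assumption made for this example, not a real API.

```javascript
// Hypothetical sketch: mapping a remote call's result to the three outcomes.
// The `result` object shape (ok, status, documented) is assumed for illustration.
function classifyOutcome(result) {
  // Outcome 1: we received a 2xx response — the operation succeeded.
  if (result.ok) return 'succeeded';

  // Outcome 2: a documented 4xx/5xx error — the operation failed, and we know it.
  if (result.status >= 400 && result.status < 600 && result.documented) {
    return 'failed';
  }

  // Outcome 3: timeout, crash, or an undocumented error — outcome unknown.
  return 'unknown';
}
```

The crucial point is that the third branch is not a failure: it is genuine uncertainty, and it deserves its own handling path rather than being lumped in with known failures.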

While the first two outcomes seem pretty clear, there is a healthy amount of uncertainty around the situation where we don’t really know what happened. In such cases, we need to dig deeper and understand the behavior of the server in order to make a good error handling decision:

  • Are we expected to perform the operation at least once or at most once?
  • If it’s at least once, is the remote call idempotent, i.e. can we safely retry several times until we receive a successful response?
  • If the remote call is not idempotent, do we have a way to check the state of the operation before retrying, and abort the retry if it’s complete?

If you’re new to the concept of idempotence: an idempotent operation produces the same result whether it is performed once or several times, which is what makes it safe to retry.
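To make the “check before retrying” pattern concrete, here is a hypothetical sketch. The `remoteService` object and its `doSomething`/`isAlreadyDone` methods are assumptions for illustration; the sketch assumes at-least-once semantics and a status-check endpoint for a non-idempotent call.

```javascript
// Hypothetical retry sketch for a non-idempotent remote call.
// `remoteService.doSomething` performs the operation; `remoteService.isAlreadyDone`
// is an assumed status-check endpoint — both are placeholders for illustration.
async function performWithRetry(remoteService, payload, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await remoteService.doSomething(payload);
    } catch (err) {
      // Before retrying, check whether the previous attempt actually
      // went through on the server — we may have lost only the response.
      if (await remoteService.isAlreadyDone(payload)) {
        return { alreadyDone: true };
      }
      if (attempt === maxAttempts) throw err;
    }
  }
}
```

Without the status check, retrying a non-idempotent call after an “unknown” outcome risks performing the operation twice.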

Adding another dimension: Time

So far, we’ve made good progress going from a binary one-liner approach to understanding what distinct pieces can fail. However, we can’t draw a full picture without adding a time dimension to our diagram, since knowing when a component can fail is as crucial as knowing that it can fail.

That’s better. In the diagram above, we have a representation of the client’s work (sending a request and waiting for a response), the server’s work (performing the operation), and the network in between. I’ve also marked several possible error scenarios that I’d like to briefly expand on, again, from the client’s perspective:

  1. The network fails when sending a request: the operation has failed and we know about it: the connection wasn’t established.
  2. The client fails while waiting for a response: something went wrong and we don’t know if the operation succeeded: the server might have received the request and performed it successfully, or crashed — we don’t know.
  3. The server fails while performing the operation: the operation has failed and we know about it: we received one of the expected error responses.
  4. The server fails right after successfully performing the operation and before sending the response back: something went wrong and we don’t know if the operation succeeded: the server received the request and might have performed the operation successfully, or crashed — we don’t know.
  5. The network fails when sending the response: same as 4.

Note that the server operation part typically encapsulates other remote calls: database operations, requests to other services, or even calls to third-party systems. This is all part of the “operation” for the sake of this article.

Hopefully, by this time, the puzzle is starting to get more complete. Let’s add one more missing scenario to the diagram above: timeouts!

Clients typically won’t wait for a response forever, but will close the connection after a reasonable time interval: this is application-specific, typically anywhere between 5s and 30s, depending on how patient you think your users are. Also, most cloud providers have a hard limit on how long a single request can take, and the connection will be dropped anyway. And yep, beware of the cold starts of serverless functions!

Now imagine that the operation on the server took a little longer than expected by the client, and the client timed out, closing the connection. The server might successfully complete the operation, but won’t be able to send the response back, because the client isn’t waiting for it anymore. Tricky.

Summary

In this article, we expanded a simple one-liner remote call and learned the many ways it can go wrong. We worked with a simplified model: in reality, there are cables, routers, switches, firewalls, load balancers, and other devices in between the client and server, and any of these can fail. Luckily, it doesn’t happen very often unless you operate on a large scale.

Take the time to appreciate user-friendly error handling when using your favorite apps, because it can be quite tricky to get right.

This is just the tip of the iceberg, subscribe to read more about resilient distributed systems in the upcoming pieces.

Want to Connect With the Author?

Check out konarskis.com.
