Stop Retrying API Calls!

Shonn Lyga · Published in PoZeCode · Jun 19, 2019 · 2 min read

TL;DR — Recently I attended a design review where retrying API calls was treated as an obvious best practice. In this short article I want to explain why that approach should always be questioned, and how it can lead to a large outage of your whole stack.

When designing a component in a distributed system, the question I always ask when someone mentions their retry approach is:

“What is the root service in that call chain?”

There are 3 possible answers:

  1. I don’t know (never a good answer)
  2. My service is the root service
  3. Some other service is the root service

If you are blindly adding retries on calls to your dependencies, the only acceptable answer here is 2: your service is the one initiating the call chain, and it is the one that should own the retry policy. A minimal sketch of that rule follows.
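Here is one way that could look in Python. This is an illustrative sketch, not code from the original article: the call_with_retries helper, the URL, and the retry counts are all made up for the example.

```python
import random
import time

import requests  # any HTTP client works; requests is used here for illustration


def call_with_retries(url, max_attempts=3, base_delay=0.2):
    """Retry policy that belongs in the ROOT service only (answer #2)."""
    for attempt in range(1, max_attempts + 1):
        try:
            response = requests.get(url, timeout=2)
            response.raise_for_status()
            return response.json()
        except requests.RequestException:
            if attempt == max_attempts:
                raise
            # Exponential backoff with jitter so retries don't arrive in bursts.
            time.sleep(base_delay * (2 ** attempt) * random.random())


# Service A (root of the call chain) may retry:
#     data = call_with_retries("https://service-b.internal/api/items")
#
# Service B (middle of the chain) should NOT wrap its own call to C in retries;
# it should fail fast and let A's policy decide what happens next.
```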

“What is so bad about retrying if I am not the root service?”

I’ll explain with an example.

Let’s assume you have 3 services where A calls B and B calls C (a simple A → B → C chain).

Let’s also assume that service C starts experiencing internal errors.

Service C is in fact a fleet of 5 machines behind a load balancer, and each machine can service 5 API calls per second. So the fully functional fleet can service 25 requests per second.

Now what happens if service A makes a call to service B, which makes a call to C, and the call fails due to an internal error in C? Let’s assume both A and B implemented a 5-retries strategy. Now A will retry 5 times, and B will retry 5 times for every call from A. Do you see where this is going?

Let’s look at the flow of events:

  1. A calls B, B calls C, and the call to C fails with an internal error
  2. B retries 5 times; every retry fails, so B returns an error to A
  3. A retries 5 times; each of those retries triggers another 5 retries from B
  4. B ends up calling C 5x5 = 25 times for a single original request
  5. C tries to recover and browns out

Service A will retry 5 times, and for each of those calls B will retry 5 times. This adds up to 25 calls to C, which cannot even recover from its failure because we keep hammering it with a flood of retries. Inevitably you get a brownout of service C.
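The amplification is multiplicative: the leaf service sees the product of the retry counts of every hop above it. A tiny sketch of the arithmetic, using the numbers from this example:

```python
# Illustrative arithmetic: per-hop retries multiply down the call chain.
def calls_to_leaf(attempts_per_hop):
    total = 1
    for attempts in attempts_per_hop:
        total *= attempts
    return total


print(calls_to_leaf([5, 5]))     # 25 calls to C for one original request
print(calls_to_leaf([5, 5, 5]))  # 125 if a third retrying hop is added
```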

When the fleet of service C tries to recover, it brings up 1 machine that can take 5 calls per second, and that machine immediately goes down again because it is being DoSed with 25 calls from service B. The loop goes on and on until your whole stack is browned out.

So what do you conclude, Dr. Watson?

If you have a distributed system, and your service is in the middle of the call-chain, you should think twice before automatically retrying failed calls.
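For a mid-chain service like B, one common alternative is to fail fast and shed load instead of retrying, for example with a circuit breaker. The sketch below is illustrative only; the class, thresholds, and the fetch_from_service_c helper are assumptions for the example, not something prescribed by the article.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: fail fast instead of retrying a sick dependency."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                # Circuit is open: reject immediately instead of hitting C again.
                raise RuntimeError("circuit open: downstream dependency is unhealthy")
            # Half-open: allow a single probe call; one more failure re-opens it.
            self.failures = self.failure_threshold - 1
            self.opened_at = None
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()
            raise
        self.failures = 0
        return result


# In service B, instead of a retry loop:
#     breaker = CircuitBreaker()
#     breaker.call(fetch_from_service_c, request)  # fetch_from_service_c is hypothetical
```

This way B reports the failure upward quickly, C gets breathing room to recover, and the root service A stays in charge of whether and how to retry.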

Shonn.
