Be a good client: retries

Júlio Zynger
4 min read · Mar 11, 2020


When a request to a server fails, it is very tempting to issue another try to the same route, as a way to increase the reliability of systems or mask transient problems from end-users.

How often is this a good idea? Never. Well, almost never.

Today, many RPC libraries provide developers with built-in functionality for retrying requests, making it really easy to hammer servers with more and more load. Thus, retrying requests by default is not a good idea; instead, one should consider the situation and only then decide whether a retry is worth it.

More jitter

In the event of a transient outage or full unavailability, retrying clients increase the load on the server, potentially turning what would otherwise be a regular number of requests into an overload, which in turn causes even more clients to issue retries. This snowball effect is alleviated by introducing jitter into the system: a random delay applied not only before the initial request, but also before each subsequent retry.
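As a minimal sketch of the idea (the 60-second interval and the helper name are assumptions for illustration), imagine a periodic sync that would otherwise fire at the exact same moment on every client; scaling the base interval by a random factor spreads those requests out over time:

import kotlin.random.Random

// Sketch only: each client scales the base interval by a random factor in
// [0.5, 1.5), so clients that started at the same time don't all hit the
// server at the same instant.
fun jitteredDelaySeconds(baseIntervalSeconds: Double = 60.0): Double {
    val jitter = Random.nextDouble(0.5, 1.5)
    return baseIntervalSeconds * jitter
}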

Exponential backoff

There are several techniques an engineer can employ to further relieve stress on their servers; of these, exponential backoff is the most popular. In a nutshell, it means increasing the amount of time between retries instead of using a fixed time window.

func sync() {
  val jitter = random(0.5, 1.5)
  var back_off_wait = 0.5 // waiting time between requests, in seconds
  wait(jitter.to_seconds) // jitter before the initial request
  var success = perform_network_request()
  while !success {
    wait((back_off_wait * jitter).to_seconds) // jitter before each retry too
    success = perform_network_request()
    back_off_wait *= 2 // increase waiting time per attempt
  }
}

Circuit-breaking

Making the waiting time longer helps even out the load in the event of a transient server outage, but the pseudocode above would very quickly have clients waiting an unreasonable amount of time before their next attempt.

Not only is this a bad idea from a UX perspective, it also means that all of that load is simply deferred to a later point in time, when the server is likely already functioning correctly and the user is possibly no longer interested in the outcome.

So, besides backing off, one would also define a circuit-breaking mechanism to prevent useless or unwanted retries: in this case, a maximum threshold for the waiting time between retries, after which the end-user is informed of the failure.

func sync() {
  val jitter = random(0.5, 1.5)
  var back_off_wait = 0.5
  val max_waiting_time = 10 // circuit-breaker threshold, in seconds
  wait(jitter.to_seconds)
  var success = perform_network_request()
  while !success && back_off_wait <= max_waiting_time {
    wait((back_off_wait * jitter).to_seconds)
    success = perform_network_request()
    back_off_wait *= 2
  }
  // with these values, the loop retries after roughly 0.5, 1, 2, 4 and 8 seconds
  // (scaled by jitter) and then gives up
}

Error codes and HTTP

There are cases in which a client can be sure the server will not be capable of fulfilling a request, and we can account for them. In fact, HTTP defines status codes that help decide whether to issue a retry at all, and that decision can become part of the circuit-breaking condition.

Clearly, one should not retry requests that failed because of client errors (for example, a missing query parameter or an incorrect Authorization header). Regardless of the load on the server, these requests would result in the same errors if retried.

It is, however, potentially a good idea to retry both network errors and server errors. The former may be caused by a temporary fault in connectivity that resolves itself later on, while the latter may succeed once the transient outage is over or the load is relieved.
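A minimal sketch of that decision could look like the following; the function name and the handling of individual codes are illustrative assumptions, not a complete policy. A missing status code stands for a network error, where no response arrived at all:

// Sketch: decide whether a failed request is worth retrying at all.
fun isRetryable(statusCode: Int?): Boolean {
    if (statusCode == null) return true // network error: connectivity may recover
    return when (statusCode) {
        429 -> true          // rate limited: retry, honoring Retry-After (see below)
        in 500..599 -> true  // server error: the outage may be transient
        in 400..499 -> false // client error: retrying yields the same failure
        else -> false        // 2xx/3xx are not failures, nothing to retry
    }
}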

The server in command

For a few specific status codes, headers can also be used to pass additional information that optimizes communication timing and reduces load. In the event of rate limiting (429) or a known interruption of service or overload (503), the Retry-After header can be used to indicate how long the client should wait before making another request.

The header is only a hint, so it is still the client application's responsibility to change its behavior when it is present. In fact, many modern browsers already detect it, and even Google's crawler honors it when deciding when to revisit websites.
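A minimal sketch of honoring that hint, assuming the delay-in-seconds form of the header (it can also carry an HTTP-date, which is not handled here), could prefer the server-provided value and fall back to the regular back-off delay otherwise:

// Sketch: use the server-provided Retry-After value (in seconds) when present,
// otherwise fall back to the locally computed back-off delay.
fun nextWaitSeconds(retryAfterHeader: String?, backOffWaitSeconds: Double): Double {
    val retryAfterSeconds = retryAfterHeader?.trim()?.toDoubleOrNull()
    return retryAfterSeconds ?: backOffWaitSeconds
}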

This mechanism is especially interesting because a dynamic value moves control of the request streams to the server, allowing it to adjust to its current incoming load. In fact, all of the techniques mentioned above could also be driven by remotely-fetched values, from the retry threshold to the amount of jitter per request. The client then becomes responsible only for defining sensible defaults.
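As a sketch of that last point (all names and values here are assumptions for illustration), the retry parameters can live in one structure with client-side defaults, ready to be overridden by whatever the server hands back:

// Sketch: retry configuration with sensible client-side defaults.
// A remote configuration fetch could override any of these fields.
data class RetryPolicy(
    val initialBackOffSeconds: Double = 0.5,
    val maxWaitingTimeSeconds: Double = 10.0,
    val minJitterFactor: Double = 0.5,
    val maxJitterFactor: Double = 1.5
)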

When dealing with I/O and networking, the one certain thing for every engineer is that there must be error-handling logic in place. Even better, when possible, we can pretend errors never happened by gracefully recovering! By being conscious of how clients and servers talk, and keeping in mind the trade-offs within our infrastructure, the smarter we retry, the better.

There’s even more we can do to make our client-server relationship a healthy one. So far, we have looked into techniques for full outage scenarios, but being a good citizen also means behaving well when partial failures are ongoing or, ideally, preventing them altogether. We’ll dive deeper into these in a future blog post.
