SRE: Resiliency: Retries in Action — Availability in Exchange for Latency

dm03514 · Dm03514 Tech Blog · Jan 4, 2019

Retries are a core resiliency pattern which helps enhance service availability by re-attempting failed operations. Retries are a commonly found pattern which many libraries (such as the aws-sdk) regularly employ. This post shows how retries can be used to enhance service availability, and the latency tradeoff that they incur.

What are Retries?

A retry is just a repeated operation. When an error occurs during an operation, a retry repeats the operation. Retries are usually combined with some sort of “backoff strategy”, which provides a timeout between attempts in order to prevent a resource from being overwhelmed.

The state diagram below shows retry states and flow common to many different libraries:

Retry State Diagram

An operation is attempted. If it succeeds, the result is returned. If a failure occurs, the retry logic checks whether it should perform the action again: if it shouldn’t, an exception is raised; if it should, a timeout is (usually) applied and the operation is attempted again. Retries are a pattern which allows services to increase availability at the expense of increased latency.
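As a rough sketch, the flow above can be expressed as a small helper. This is illustrative only, not the resilience4js implementation; operation, maxAttempts and backoffMs are hypothetical names used for this example:

// Illustrative sketch of the retry flow above -- not resilience4js internals.
// `operation` is any function returning a Promise.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function retryOperation(operation, maxAttempts = 3, backoffMs = 50) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();   // success: return the result
    } catch (err) {
      lastError = err;            // failure: decide whether to retry
      if (attempt < maxAttempts) {
        await sleep(backoffMs);   // apply the backoff timeout before retrying
      }
    }
  }
  throw lastError;                // limit reached: surface the error
}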

Why Use Retries?

Many classes of errors (network, application) are transient and rooted in network or server overload. These errors are ephemeral and usually resolve quickly. If the latency of a retry can be tolerated, retries allow for increased availability at the cost of increased latency. Retries allow a client to offer higher availability than its dependencies. To illustrate this, consider a service (“Service”) which has a dependency on another service (“Service Dependency”) that offers an SLO availability of 99%.

Because of this, the client has to assume that 1 out of every 100 requests will fail. If the service is making 1000 requests per second, this results in 10 failures per second! If the operation is retried once, two errors have to occur back to back for the request to fail: a 0.01 * 0.01 = 0.0001 chance, or 1 in 10,000! (This makes the huge assumption that there are no other sources of errors and no correlation between errors.)

With a single retry the rate of success for service calls of the main service goes from 99% to 99.99%, higher than the SLO the Service Dependency offers! With a second retry the failure rate drops to 0.000001 (0.01 * 0.01 * 0.01), a success rate of 99.9999%! Ignoring other sources of errors, the service is able to offer a higher SLO than its dependency. While the real world can’t offer this exact math, retries are able to increase the availability by some amount.
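A quick back-of-the-envelope calculation (assuming every attempt fails independently with the same probability) shows how the failure rate shrinks with each additional attempt:

// Probability that a call still fails after `retries` additional attempts,
// assuming each attempt fails independently with probability `p`.
const failureRate = (p, retries) => Math.pow(p, retries + 1);

console.log(failureRate(0.01, 0)); // ≈ 0.01      -> 99%      availability
console.log(failureRate(0.01, 1)); // ≈ 0.0001    -> 99.99%   availability
console.log(failureRate(0.01, 2)); // ≈ 0.000001  -> 99.9999% availability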

How?

In order to illustrate retries in action, the service scenario from above will be modeled to show how retries can allow a caller to offer higher availability than its dependencies offer and insulate callers from failures. To do this, resilience4js will be used. All code used to generate the test data below can be found here.

A dummy service will be used to model 99% availability:

// Returns a random integer in [0, max)
const getRandomInt = (max) => Math.floor(Math.random() * max);

class DummyService {
  get() {
    return new Promise((resolve, reject) => {
      // 99 / 100 times return a good response
      const num = getRandomInt(100);
      if (num === 0) {
        reject(new Error('failed'));
        return;
      }
      resolve('success');
    });
  }
}

Next the retry policy will be configured:

const retry = resilience4js.Retry.New(
  'dummy_service',
  resilience4js.Retry.Strategies.UntilLimit.New(
    resilience4js.Retry.Timing.FixedInterval.New(50),
    3,
  ),
);

The retry is given a name, dummy_service, which will be emitted as a label in the metrics. The strategy being used is UntilLimit, which will make a fixed number of attempts (`3` in this case). Each attempt is separated by a fixed interval of 50ms. Another common backoff timing is Exponential Backoff. Some libraries, such as Polly, allow for configuring the specific types of errors to be retried. Currently, resilience4js will retry on any error thrown by the operation the retry policy is executing.
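For reference, exponential backoff simply grows the wait between attempts instead of keeping it fixed. A generic sketch of the delay calculation (illustrative only, not the resilience4js Timing API shown above):

// Generic exponential backoff delay -- illustrative, not resilience4js's API.
// attempt: 1, 2, 3, ...   baseMs: initial delay   capMs: upper bound on delay
const exponentialDelay = (attempt, baseMs = 50, capMs = 5000) =>
  Math.min(capMs, baseMs * Math.pow(2, attempt - 1));

// attempt 1 -> 50ms, attempt 2 -> 100ms, attempt 3 -> 200ms, ...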

After creating the retry policy it needs to be configured with an operation to retry. In this case it is the service dependency call from above:

const service = new DummyService();
const wrappedGet = retry.decoratePromise(service.get);

The operation will now retry up to 3 times using a 50ms timeout in between each failed operation.
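The test server itself lives in the linked repository and isn’t reproduced here; a minimal sketch of how the wrapped call might be exposed over HTTP for load testing (using Node’s built-in http module; the handler shape is an assumption, not the actual test harness):

// Minimal sketch of exposing the wrapped call over HTTP so vegeta can hit it.
// The actual server in the linked repository may differ.
const http = require('http');

http.createServer((req, res) => {
  wrappedGet()
    .then((result) => {
      res.writeHead(200);
      res.end(result);
    })
    .catch(() => {
      res.writeHead(500);
      res.end('error');
    });
}).listen(3000);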

Each of the following test cases will apply load at a rate of 500 requests / second using vegeta:

$ echo "GET http://localhost:3000/" | vegeta attack -rate 500 -duration=0 | tee results.bin | vegeta report

No Retry

The first test is performed without any retries:

The left chart above shows the server-reported HTTP rate and the status codes returned. ~495 requests per second (99%) are succeeding while ~5 (1%) are failing. This is also reflected in the chart above on the right. The availability is calculated as the rate of successful requests / the rate of all requests. This shows an availability of 99%.

The chart below shows the request latencies during this time:

Without any resiliency the service can’t offer a higher availability than its dependencies (and may have to offer an even lower one to factor in transient network errors).

1 Retry

The next test configures a single retry:

With a single retry the expected rate of failure is now 1% * 1%, or 0.0001 (1 in 10,000). The graph in the upper left hand corner reflects this improvement, with nearly all requests succeeding. The availability in the right hand graph above reflects the new, higher, availability that is achievable because of the retry.

Additionally, resilience4js emits retry-specific metrics, showing how many times the retry policy was invoked and whether it decided to perform the retry or not:

The chart above on the left shows a retry rate of ~5 retries per second. These retries occur because of the 99% success rate of the service dependency (500 requests / second * 1% failure rate = 5 failed requests per second). The chart above on the right shows the retry decision. The green line representing “false” shows that some of the requests are still failing and the resilience4js retry is reaching its max (there should probably be a more explicit resilience4js metric to reflect this).

3 Retries

The final test configures the service call to retry up to 3 times.

The graph above on the left shows that the service is no longer reporting any non-200 status codes! This is reflected in the chart above on the right, which shows that the service is now offering a 100% availability rate, higher than its dependency!

The graph above on the left shows that up to 2 retries are being used. And the graph on the right above shows that all retries are completing successfully without their maximums being exhausted. The latency graph below illustrates the cost of the availability that retries enable:

The tail latencies (p99, p100) have increased from (94ms, 100ms), with no retries, to (99ms, 250ms) using 3 retries. The p90 and p50 are similar.
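The increase in tail latency follows directly from the retry configuration: a request that uses every attempt pays for each attempt plus each backoff interval. A rough upper bound, using purely illustrative numbers (not measurements from the test above):

// Rough worst-case latency for a request that exhausts all of its attempts.
// These values are assumptions for illustration only.
const attempts = 3;        // 1 initial attempt + 2 retries
const perAttemptMs = 40;   // assumed latency of a single dependency call
const backoffMs = 50;      // fixed interval between attempts

const worstCaseMs = attempts * perAttemptMs + (attempts - 1) * backoffMs;
console.log(worstCaseMs);  // 220ms -- only the slowest percentiles pay this cost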

When to Use?

Because of how effective retries are, they should be used whenever their latency overhead can be tolerated. Even though retries are extremely simple, they have a couple of caveats which, if not handled correctly, could actually reduce availability or cause large failures.

Caveats

Unlike bulkheads, retries have the potential to cause major availability issues. Retries need explicit limits (i.e. maximum attempts) or they could repeatedly hammer a service. Combined with explicit limits, some sort of backoff policy should be used. Consider a service that is overloaded: if it returns a transient error, retrying immediately may still catch the service in the overloaded state. There are a number of different retry policies, but Exponential Backoff may be the most popular. In addition to a backoff policy, retries should be combined with some randomness, i.e. Jitter. Jitter adds a random amount of time to each delay, which prevents multiple clients from retrying at exactly the same frequency.
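A common formulation is “full jitter”, where each delay is drawn uniformly between zero and the exponential backoff cap for that attempt. A generic sketch (not tied to resilience4js):

// Exponential backoff with "full jitter": each client picks a random delay
// up to the exponential cap, so retries from many clients spread out in time.
// Illustrative sketch, not part of resilience4js.
const backoffWithJitter = (attempt, baseMs = 50, capMs = 5000) => {
  const exp = Math.min(capMs, baseMs * Math.pow(2, attempt - 1));
  return Math.random() * exp;
};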

Without explicit limits, backoff, and jitter, retries can exacerbate service issues. Consider a service that is overloaded or flapping. If clients aren’t configured with limits, backoff, and jitter, then when multiple clients encounter the service error they will all immediately retry. If many clients (100’s or 1000’s of them) retry at exactly the same frequency before the service has time to return to full capacity, it can create a state where the service is never able to recover.

Retries are a powerful resilience primitive which allows a client to offer higher availability than its dependencies. Retries should be used with care, since an unrefined retry policy could result in a denial-of-service-like attack.
