Best Practices on API Retries

Majnun Abdurahmanov
Feb 19, 2024
Tallinn

Retries are a fundamental resiliency technique that helps improve service availability by attempting failed operations again. They allow applications to handle temporary failures. Temporary faults are typically recoverable within milliseconds to a few seconds. Some common examples include rate limiting by cloud services, network loss, and timeouts.

If we don’t retry for temporary errors, we risk passing the error back to the clients, which can frustrate end users and potentially lead to customer attrition. By now, we understand the importance of retrying for temporary errors, so let’s explore some of the recommended practices.

Determine Whether the Failed Operation Is Appropriate for Retry

  • Check the response error/status code and verify against the documentation to determine whether retrying can help resolve it or not. If it can, then retry.
  • Avoid retrying if the downstream service is overloaded.
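
As a minimal sketch of the check described above, assuming the downstream service speaks HTTP: the class name, the method name, and the set of retryable status codes are all assumptions made for illustration, and the list should be verified against the documentation of the API being called.

```java
import java.util.Set;

/** Hypothetical helper that decides whether an HTTP status code is worth retrying. */
final class RetryDecision {

    // Assumed set of retryable codes; confirm against the documentation of the API you call.
    // 408 Request Timeout, 429 Too Many Requests, 502 Bad Gateway,
    // 503 Service Unavailable, 504 Gateway Timeout.
    private static final Set<Integer> RETRYABLE_STATUS_CODES = Set.of(408, 429, 502, 503, 504);

    private RetryDecision() {
    }

    /** Returns true only for status codes that usually indicate a temporary fault. */
    static boolean isRetryable(int statusCode) {
        return RETRYABLE_STATUS_CODES.contains(statusCode);
    }
}
```

Codes such as 429 and 503 can also mean the downstream service is overloaded, so they are worth retrying only after waiting, never immediately.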

Consider Exponential Back-Off for Retries

In most cases, it is recommended to use exponential back-off for retry attempts. Exponential back-off increases the wait before each successive retry by a constant multiplier. For example, with 2 retry attempts, an initial interval of 500 ms, and a multiplier of 1.5, the first retry happens after 500 ms and the second is scheduled 750 ms after that.
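
The 500 ms and 750 ms figures above correspond to a multiplier of 1.5. The sketch below shows how such delays could be computed; the class and method names are hypothetical and not part of any library.

```java
import java.time.Duration;

/** Hypothetical helper that computes exponential back-off delays. */
final class ExponentialBackoff {

    private final long initialDelayMillis;
    private final double multiplier;

    ExponentialBackoff(long initialDelayMillis, double multiplier) {
        this.initialDelayMillis = initialDelayMillis;
        this.multiplier = multiplier;
    }

    /**
     * Delay to wait before the given retry (1-based):
     * initialDelay * multiplier^(retryAttempt - 1).
     */
    Duration delayBeforeRetry(int retryAttempt) {
        if (retryAttempt < 1) {
            throw new IllegalArgumentException("retryAttempt must be >= 1");
        }
        double millis = initialDelayMillis * Math.pow(multiplier, retryAttempt - 1);
        return Duration.ofMillis(Math.round(millis));
    }
}
```

With `new ExponentialBackoff(500, 1.5)`, `delayBeforeRetry(1)` is 500 ms and `delayBeforeRetry(2)` is 750 ms.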

Determine the Number of Retry Attempts and the Interval Between Them

  • Avoid retrying indefinitely; use a finite number of retries. Choosing the exact number of retries and the interval between them can be challenging, but an upper bound is always needed. If the retries are exhausted, use a circuit breaker to stop calling the downstream service for a while so that it can recover.
  • When the operation involves user interaction, it is best to minimize the number of retries. It is recommended to have 3 to 5 retries with an interval between them ranging from a few milliseconds to a few seconds. This ensures that the entire process is completed within a few seconds.
  • Occasionally, a failure response includes a header that states the accepted retry policy, such as the HTTP Retry-After header. If this header is available, make sure to honor it. Additionally, always check the error/status code of each response, as the downstream service may start returning irrecoverable errors.
  • Make sure that retries do not outlast the time consumers are willing to wait for a response. Your service should know the timeout set by its clients. Check the request headers for this information; if it is not available, choose a reasonable value. Ensure that the total time spent on retries, including the waits between them, does not exceed this timeout.
  • Take a holistic approach when implementing retries. Avoid introducing retries at multiple levels, because they multiply and can lead to cascading failures that degrade the service. For instance, if the database is unresponsive and both the frontend and the backend each make up to 3 attempts, a single user request can result in up to 3 × 3 = 9 attempts against the database. It is essential to avoid this anti-pattern.
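
The points above can be combined into one retry loop: a finite number of attempts, a check that the error is still retryable, and a time budget derived from the client's timeout. This is a sketch under stated assumptions, not a production implementation; `BoundedRetry` and its parameters are hypothetical names, and the caller is assumed to already know the client timeout.

```java
import java.util.concurrent.Callable;
import java.util.function.Predicate;

/** Hypothetical retry loop bounded by both an attempt count and a client timeout. */
final class BoundedRetry {

    private BoundedRetry() {
    }

    static <T> T call(Callable<T> operation,
                      Predicate<Exception> isRetryable,
                      int maxAttempts,
                      long initialDelayMillis,
                      double multiplier,
                      long clientTimeoutMillis) throws Exception {
        // Point in time after which the client has stopped waiting.
        long deadlineNanos = System.nanoTime() + clientTimeoutMillis * 1_000_000L;
        long delayMillis = initialDelayMillis;

        for (int attempt = 1; ; attempt++) {
            try {
                return operation.call();
            } catch (Exception e) {
                long remainingMillis = (deadlineNanos - System.nanoTime()) / 1_000_000L;
                boolean attemptsLeft = attempt < maxAttempts;
                // Retry only if the next wait still fits into the client's budget.
                boolean timeLeft = delayMillis < remainingMillis;
                if (!isRetryable.test(e) || !attemptsLeft || !timeLeft) {
                    throw e; // irrecoverable error, attempts exhausted, or budget spent
                }
                Thread.sleep(delayMillis);
                // Exponential back-off: grow the wait for the next attempt.
                delayMillis = Math.round(delayMillis * multiplier);
            }
        }
    }
}
```

A call such as `BoundedRetry.call(() -> client.fetch(), e -> e instanceof java.io.IOException, 3, 200, 1.5, 2000)` would make at most 3 attempts and give up as soon as the next wait would exceed the 2-second budget; `client.fetch()` is a placeholder here.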

Monitor and Log Retries and Operations

  • Record details of exceptions, fault codes, retry attempts, API processing time, and error messages.
  • Set up a telemetry and monitoring system that can send alerts if the number and rate of failures exceed the specified limit.
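
To illustrate the alerting idea, here is a minimal in-memory sketch that counts calls, retries, and failures and reports when the failure rate crosses a limit. The class name and the threshold are hypothetical; a real service would more likely export these counters to its existing metrics and alerting system.

```java
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical in-memory counters for retry monitoring. */
final class RetryMetrics {

    private final LongAdder calls = new LongAdder();
    private final LongAdder retries = new LongAdder();
    private final LongAdder failures = new LongAdder();

    void recordCall() {
        calls.increment();
    }

    void recordRetry() {
        retries.increment();
    }

    /** Record a call that still failed after all retries. */
    void recordFailure() {
        failures.increment();
    }

    long retryCount() {
        return retries.sum();
    }

    /** Fraction of recorded calls that ultimately failed; 0.0 when nothing was recorded. */
    double failureRate() {
        long total = calls.sum();
        return total == 0 ? 0.0 : (double) failures.sum() / total;
    }

    /** True when the failure rate exceeds the given limit, for example 0.2 for 20%. */
    boolean shouldAlert(double maxFailureRate) {
        return failureRate() > maxFailureRate;
    }
}
```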

Below is a sample code snippet in Java that uses the Spring Boot framework and the Resilience4j library.

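import io.github.resilience4j.core.IntervalFunction;
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;
import java.net.SocketTimeoutException;

// The fields and the method below belong inside a Spring service class.

private final RetryConfig retryConfig = RetryConfig.custom()
    // Total number of attempts, including the initial call.
    .maxAttempts(5)
    // Exponential back-off: 5 s initial wait, multiplied by 3 after each attempt.
    // Use a much smaller initial interval for user-facing calls.
    .intervalFunction(IntervalFunction.ofExponentialBackoff(5000, 3))
    // Retry only on exceptions that indicate a temporary fault.
    .retryExceptions(
        SomeNetWorkFailureException.class,
        RateLimitHitException.class,
        SocketTimeoutException.class)
    .build();

private final Retry retry = Retry.of("third-party-api", retryConfig);

// ThirdPartyResponse is a placeholder for whatever thirdpartyapi.call() returns.
public ThirdPartyResponse someMethod() {
    try {
        // executeCallable re-invokes the lambda according to retryConfig.
        return retry.executeCallable(() -> thirdpartyapi.call());
    } catch (Exception e) {
        // All attempts failed, or a non-retryable exception was thrown.
        log.error("Call to third-party API failed", e);
        throw new ThirdPartyApiException(e);
    }
}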
