Architectural Patterns: Retry

Roman Kazarov
Published in Gett Tech
Aug 28, 2023 · 8 min read

Retry: A Simple Way to Increase the Fault-Tolerance of Your System

This article is a logical continuation of the previous one, which covered the Circuit Breaker. Today we will discuss a similar architectural pattern called Retry. The pattern is simple and intuitive: we send a request to a remote service, and if the request fails, we try to send it again. Handled properly, in line with the business logic of your application, this improves the reliability of your distributed system.

How it works

Suppose we have three services A, B, and C that make requests to a common, overloaded service D. If a request from service A to service D fails with a 408 Request Timeout, it may be worth repeating it, as the retry could succeed. However, we cannot retry indefinitely: every extra attempt adds load to service D, and when the service recovers, a backlog of retried requests can immediately disrupt it again. It also increases wait times for the client, which can lead to timeouts on service A. For these reasons, the number of repeat requests must be limited.

To avoid wasting server resources and to reduce network load, a delay is added between repeated requests. There are also strategies that add a random amount of jitter to this delay, so that many clients do not all retry at exactly the same moment.
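
As a rough illustration, here is a minimal sketch of such a delay calculation with exponential growth and random jitter. The helper name and the constants are my own assumptions, not part of the implementation we will build below:

package retry

import (
    "math/rand"
    "time"
)

// backoffWithJitter is a hypothetical helper: it doubles the base delay on
// each attempt and adds a random jitter so that many clients recovering at
// the same time do not retry in lockstep. In practice you would also cap
// the delay at some maximum.
func backoffWithJitter(attempt int, base time.Duration) time.Duration {
    delay := base * time.Duration(1<<attempt)         // base, 2*base, 4*base, ...
    jitter := time.Duration(rand.Int63n(int64(base))) // random extra wait in [0, base)
    return delay + jitter
}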

Our distributed system should be built in such a way that our requests are idempotent, and re-sending the same requests does not lead to undesirable consequences. This can be achieved by applying the Idempotency key pattern.
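
As a sketch of what this might look like in Go, the client can attach a key to the request that stays the same for every retry of one logical operation, so the server can de-duplicate it. The URL, the Idempotency-Key header name, and the function names here are assumptions for illustration only:

package client

import (
    "context"
    "crypto/rand"
    "encoding/hex"
    "net/http"
)

// newIdempotencyKey generates a random key once per logical operation; every
// retry of that operation must reuse the same key.
func newIdempotencyKey() (string, error) {
    b := make([]byte, 16)
    if _, err := rand.Read(b); err != nil {
        return "", err
    }
    return hex.EncodeToString(b), nil
}

// newOrderRequest builds a request that carries the idempotency key, so the
// server can recognize a retried request and avoid processing it twice.
func newOrderRequest(ctx context.Context, key string) (*http.Request, error) {
    req, err := http.NewRequestWithContext(ctx, http.MethodPost, "https://example.com/orders", nil)
    if err != nil {
        return nil, err
    }
    req.Header.Set("Idempotency-Key", key)
    return req, nil
}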

It is good practice to handle errors before making a retry request. If there is no point in making a retry request, then it is better to avoid it. For instance, if there is a 404 Not Found error, it means that we are referring to a non-existent URI. Making multiple retry requests will only result in the same 404 error and is a waste of network resources.
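
One simple way to express this is a small classification helper that decides, based on the status code, whether a retry has any chance of succeeding. This is only an illustrative sketch; real projects will encode their own, more nuanced rules:

package retry

import "net/http"

// isRetryable classifies a response before retrying: client errors such as
// 401 or 404 will not change on a retry, while timeouts and most 5xx
// responses might succeed on a later attempt.
func isRetryable(statusCode int) bool {
    switch statusCode {
    case http.StatusRequestTimeout, // 408
        http.StatusTooManyRequests,     // 429
        http.StatusInternalServerError, // 500
        http.StatusBadGateway,          // 502
        http.StatusServiceUnavailable,  // 503
        http.StatusGatewayTimeout:      // 504
        return true
    default:
        return false
    }
}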

Retry Strategies

There are many variations of the Retry pattern, and you should choose among them based on the business logic of your project. These strategies can and should be combined, because each of them addresses one specific concern, and real-world applications are much more complex.

Here are a few of the many such strategies:

  • Immediate Retries — a strategy where we make the retry request immediately. However, it is necessary to set a maximum number of retries so that failing requests do not repeat forever. A simple loop that retries until it either succeeds or exhausts the attempt budget is enough; a minimal sketch of this strategy follows this list.
  • Delayed Retries — as with Immediate Retries, we cap the number of failed attempts, but after each failure we wait before trying again, giving the receiving server time to recover. The delay can grow progressively, so the gap between attempts becomes longer with each retry, which helps avoid overwhelming the receiving server and causing further issues.
  • Cancel Strategy — a common approach to error handling in which we retry only those errors that a retry can actually fix. If the error is not recoverable, or repeating the request makes no sense given the current state of the system, we return the error and cancel further attempts. This improves the reliability and availability of the system, but the impact of repeated requests must be weighed against the goals and priorities of the project.
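
Here is the minimal sketch of the Immediate Retries strategy mentioned above. The function and parameter names are illustrative, not a fixed API:

package retry

import "context"

// ImmediateRetry re-runs the action right away, with no delay, up to
// maxAttempts times, and returns the last error if every attempt fails.
func ImmediateRetry(ctx context.Context, maxAttempts int, action func(ctx context.Context) error) error {
    var err error
    for i := 0; i < maxAttempts; i++ {
        if err = action(ctx); err == nil {
            return nil // success, stop retrying
        }
    }
    return err // every attempt failed; report the last error
}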

Ultimately, the choice of which strategy is used, or whether to use one at all, depends on the specific needs and constraints of the project. Factors such as the criticality of the requests, the performance of the receiving server, and the impact of potential errors must all be taken into account. When done correctly, the implementation of a retry strategy can greatly improve the robustness and dependability of an application, leading to a better user experience and increased customer satisfaction.

Practice

Let’s dive into code and write a simple implementation in Go that will clearly demonstrate the working principle of this pattern. I suggest combining two strategies: Delayed Retries and Cancel Strategy. That is, we will write a function that takes a callback and executes it. If we encounter an error during execution, we will handle it. If we know this error and there is no point in making a retry request, we will interrupt the execution of our method and return an error. However, if we do not know this error, we will delay and then increase the time for the next delay by 20 milliseconds and retry executing our callback. We will repeat the execution of this callback a specified number of times, which in my example corresponds to five.

The source code of my implementation can be found in this GitHub repository.

package retry

import (
    "context"
    "fmt"
    "time"
)

// RetryAction is the callback that will be executed and, if necessary, retried.
type RetryAction func(ctx context.Context) error

// Retry holds the retry budget and the delay between attempts.
type Retry struct {
    attempts int           // maximum number of attempts before giving up
    delay    time.Duration // wait between attempts; grows after each failure
}

// New is the constructor: it returns a Retry configured with the maximum
// number of attempts and the initial delay between them.
func New(attempts int, delay time.Duration) *Retry {
    return &Retry{
        attempts: attempts,
        delay:    delay,
    }
}

// Execute runs the action up to r.attempts times, waiting r.delay between
// attempts and increasing the delay by 20 milliseconds after each failure.
func (r *Retry) Execute(ctx context.Context, action RetryAction) error {
    var err error
    for i := 0; i < r.attempts; i++ {
        err = action(ctx)
        if err == nil {
            return nil
        }

        // Non-recoverable errors: retrying will not change the outcome,
        // so we stop immediately (the cancel strategy).
        if err.Error() == "401 Unauthorized" || err.Error() == "404 Not Found" {
            return err
        }

        // Wait before the next attempt, unless the context is canceled first.
        select {
        case <-ctx.Done():
            return fmt.Errorf("retry canceled: %w", ctx.Err())
        case <-time.After(r.delay):
            r.delay += 20 * time.Millisecond // progressively increase the delay
        }
    }
    return fmt.Errorf("after %d attempts, last error: %w", r.attempts, err)
}

This is very straightforward. We create a Retry structure that has two fields: attempts and delay. The attempts field sets the maximum number of retries that we will allow before giving up on the request. The delay field sets the delay time that we will wait between each retry. We will increase this time later to prevent overloading our service with too many requests. This way, we can handle temporary errors or network issues more gracefully.

The New method is basically a constructor that accepts two parameters — attempts and delay. These parameters specify the maximum number of retries and the delay time between each retry. The method returns a pointer to a new Retry object that has been created and initialized with the given parameters. The pointer can be used to access the fields and methods of the Retry object.

The Execute method is where the actual work happens. It receives a context and a callback as parameters, which we will run in our method. We start a loop that will iterate the number of times specified by the attempts parameter. At the first stage, we run our callback. If the execution result has no error, we exit the loop and return no errors. But if we get an error from the execution, we first check if it makes sense to retry our callback. If we get a 401 or 404, we know that repeating the same request with the same parameters is futile, so we stop the execution. At the last stage, we check the context status; maybe the operation was canceled from outside. If there is no cancel signal, we wait for the time set by the delay parameter. Then, we increase the delay time by 20 milliseconds (we increase the delay time between calls on each iteration to use network resources more wisely) and continue our loop. If our loop finishes, it means we have exhausted all our retry attempts, so we return an error indicating that the maximum number of attempts has been reached.

package main

import (
    "context"
    "fmt"
    "time"

    "github.com/KRR19/retry/retry"
)

// createAction returns a closure that fails with a 408 error on the first
// four calls and succeeds on the fifth.
func createAction() retry.RetryAction {
    c := 1
    return func(ctx context.Context) error {
        if c == 5 {
            fmt.Println("200 OK")
            return nil
        }

        c++
        fmt.Println("408 Request Timeout")
        return fmt.Errorf("408 Request Timeout")
    }
}

func main() {
    a := createAction()
    r := retry.New(5, 100*time.Millisecond)
    ctx := context.Background()

    if err := r.Execute(ctx, a); err != nil {
        fmt.Println(err)
    }
}

I decided to keep the test callback as simple as possible: a function (closure) that keeps returning a 408 error, but on the fifth call it reports success and returns no error.

The main function creates the action that will be executed. We also create an instance of Retry with 5 attempts and an initial delay of 100 milliseconds, as well as an execution context. Finally, we call the Execute method on our instance, passing in the context and the action, and print the error if all attempts fail.

Conclusion

In conclusion, the Retry architectural pattern is a simple yet powerful way to improve the reliability of distributed systems. By intelligently retrying failed requests, we can handle transient issues and service disruptions in an elegant manner without major redesigns. However, care must be taken to not overload systems with too many retry attempts and to handle non-transient errors appropriately.

There are several retry strategies that can be employed based on the needs of the system. Combining multiple strategies, such as delayed retries with a cancelation strategy, allows building sophisticated retry logic that is tailored to the specific domain. Implementing retry logic in a generic, reusable way, e.g. through a retry utility library, promotes consistency and simplifies adoption in projects.

The sample code shown demonstrates a basic retry utility implementing a delayed retry strategy with cancelation. It allows retrying a callback function a configurable number of times with increasing delays between attempts. Non-transient errors are detected and immediately returned without further retries. The example also highlights that retry logic should only be applied to idempotent requests to avoid undesirable side effects.

In summary, the Retry pattern is a must-have tool in the toolbox of any distributed systems architect. When used judiciously, it helps to build robust and fault-tolerant systems that can handle the challenges of distributed environments. Understanding different retry strategies and how to apply them to specific use cases is key to reaping the benefits of this architectural pattern.
