A tale of retries using RabbitMQ

Nazaret Kazarian
Upstream Engineering
11 min read · Oct 30, 2020

A few years ago I came across an interesting question on StackOverflow regarding RabbitMQ retries and tried to answer it based on our experience of implementing such retries at Upstream. However, my answer only scratched the surface of this interesting topic, so I want to take the opportunity in this article to go into much more depth and thoroughly explain our approach to implementing retries within our Microservices architecture at Upstream.

Background

In 2014, when we moved our Core Platform to a Microservices architecture, we decided to follow the asynchronous pattern of communication between Microservices, using RabbitMQ as our message broker. One of the challenges with Microservices, however, is that they form a distributed system by definition, and there are many reasons a Microservice can fail to execute. Architecting systems to be resilient against failure is always a challenge. Failures come in many kinds: concurrent modification errors (e.g. database optimistic locking errors), network, database or other storage failures, resource starvation, race conditions due to bugs, timeouts, and so on. One layer of defence against such failures is retries. Retrying the execution of a failed Microservice, however, presents several challenges to address.

Who should retry

Firstly, who should be responsible for the retry logic? In our architecture we introduced two types of communication models between Microservices. The first is a request-response model, which we implemented using promises (completable futures) on top of RabbitMQ. The second is a broadcast or publish-subscribe model, where Microservices are more decoupled from each other: events are published on the message broker (RabbitMQ in this case) and interested parties subscribe to these events to execute their business logic. Our implementation of retries is very different in these two types of communication. In the request-response type, the service performing the request can check the outcome of the request (through the promise/future completion) and is therefore responsible for retrying in case of failure. In the broadcast type, event publishers have no knowledge of the subscribers and cannot track whether they consumed the event successfully or not. Moreover, an event may have multiple subscribers, one of which may have consumed the event successfully while another may have not. Therefore, in the broadcast type of communication, we put the responsibility of the retry logic on the subscriber.

What to retry

A second challenge to address is which errors to retry. Not every error is temporary. Ideally the retry framework should provide a way to specify which types of errors are retriable and which are not. Another thing to consider is what the default should be for an unspecified error: is it considered retriable or not?

Idempotency

Re-executing a piece of application logic can be tricky, as part of it may have been successfully executed and should not be executed again. This is traditionally handled by transactions, which take care of rolling back everything that has been completed up to the point of failure. Transactions, however, are not a lightweight construct, especially when they involve multiple resources, such as a database and a messaging system, which is the most common scenario in our Microservices.

Our approach was to avoid Distributed Transactions (XA) in favour of application performance. Instead, Microservices execute their logic in a transactional service layer which uses local database transactions, and any message emissions happen outside of the service layer, after the database transaction has been successfully committed. Retries happen only when database transactions fail, not when message emissions fail, as message emission is much less likely to fail. This eliminates the risk of executing logic twice, but leaves a very small risk of not executing some logic if a message emission fails. We handle these edge cases too, using an automated reconciliation process, but we'll leave that for another blog post.
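In code, this pattern looks roughly as follows (class and method names here are illustrative, not our actual code):

import org.springframework.amqp.rabbit.core.RabbitTemplate;
import org.springframework.stereotype.Component;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

// Sketch of the "commit first, then emit" pattern described above.
@Service
class OrderService {

    @Transactional // local database transaction only, no XA
    public String process(String command) {
        // ... business logic and database writes happen here ...
        return "order-processed";
    }
}

@Component
class OrderMessageHandler {

    private final OrderService orderService;
    private final RabbitTemplate rabbitTemplate;

    OrderMessageHandler(OrderService orderService, RabbitTemplate rabbitTemplate) {
        this.orderService = orderService;
        this.rabbitTemplate = rabbitTemplate;
    }

    public void handle(String command) {
        // The database transaction commits when process() returns ...
        String event = orderService.process(command);
        // ... and only then is the event emitted; a failure here is rare and is
        // picked up by a separate reconciliation process.
        rabbitTemplate.convertAndSend("events.exchange", "order.processed", event);
    }
}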

How many times to retry and how often

Retrying a failed message should not go on forever; a message that keeps failing (a “poison” message) would otherwise result in an infinite loop. Ideally the number of retries should be configurable. Also, retries should not happen immediately; it is more reasonable to retry after some delay. The retry pattern we usually use in our applications is “exponential backoff”, which means each retry is performed with a longer delay than the previous one, until a maximum number of attempts is reached.


How to use resources efficiently

Because high throughput is an essential goal of our Core Platform, resource utilization was an important factor in our implementation decisions. Retrying a message after some delay should not block the processing of other messages, and Threads should not be blocked. This means that the failed message should go back to a queue, waiting to be retried while other messages get processed. Consequently, message order is not preserved, but this wasn't important in our general use case. On the other hand, we wanted to have the option to preserve message order wherever needed, trading off increased resource utilization.

RabbitMQ message redelivery basics

RabbitMQ provides some support for retries out of the box, but not with delay. When consuming a message, the client declares whether to use automatic acknowledgements or not. Automatic acknowledgement means the broker can discard messages once delivered to the consumer (but before processing). Manual acknowledgement means the broker should wait for the client (consumer) to explicitly acknowledge that message processing has completed (successfully or not) and that the message can be discarded. Note that there are no timeouts: the message will not be redelivered to a consumer unless a negative acknowledgement is received or the connection is closed. When using automatic acknowledgements, retries due to application errors are practically not applicable. When using manual acknowledgements, a message can be redelivered by RabbitMQ if it is rejected by the consumer (meaning a negative acknowledgement is sent) and the requeue flag is set to true.
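With the plain RabbitMQ Java client, manual acknowledgements and reject-with-requeue look roughly like this (the queue name is illustrative):

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DeliverCallback;

public class ManualAckConsumer {

    public static void main(String[] args) throws Exception {
        Connection connection = new ConnectionFactory().newConnection();
        Channel channel = connection.createChannel();

        DeliverCallback deliverCallback = (consumerTag, delivery) -> {
            long deliveryTag = delivery.getEnvelope().getDeliveryTag();
            try {
                // ... process the message ...
                channel.basicAck(deliveryTag, false);   // positive ack: the broker discards the message
            } catch (Exception e) {
                channel.basicReject(deliveryTag, true); // negative ack with requeue=true: the broker redelivers it
            }
        };

        // autoAck=false enables manual acknowledgements
        channel.basicConsume("queue.microservice1", false, deliverCallback, tag -> { });
    }
}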

Spring AMQP provides a higher level of error handling, as it allows the message listener to throw an unchecked Exception to indicate failure, and Spring will take care of rejecting the message. Spring provides the following interface for listening for messages:
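This is org.springframework.amqp.core.MessageListener, shown here in simplified form:

package org.springframework.amqp.core;

public interface MessageListener {

    // Implementations process the message; throwing an unchecked exception
    // signals Spring AMQP that processing failed and the message should be rejected.
    void onMessage(Message message);
}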

Note that when a message is rejected, by default Spring AMQP sets the requeue flag to true, which means that if the error is not temporary, this will result in an infinite delivery/rejection loop. The default behaviour can be changed by calling setDefaultRequeueRejected(false) on the org.springframework.amqp.rabbit.listener.SimpleMessageListenerContainer.
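For example, a container set up to reject without requeuing, with a dead letter exchange configured on the inbound queue, could look something like this (queue and exchange names are illustrative):

import org.springframework.amqp.core.MessageListener;
import org.springframework.amqp.core.Queue;
import org.springframework.amqp.core.QueueBuilder;
import org.springframework.amqp.rabbit.connection.ConnectionFactory;
import org.springframework.amqp.rabbit.listener.SimpleMessageListenerContainer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
class RejectToDlqConfig {

    @Bean
    Queue inboundQueue() {
        // Rejected (non-requeued) messages are dead-lettered to this exchange,
        // which is assumed to have a DLQ bound to it.
        return QueueBuilder.durable("queue.microservice1")
                .withArgument("x-dead-letter-exchange", "dlx.microservice1")
                .build();
    }

    @Bean
    SimpleMessageListenerContainer listenerContainer(ConnectionFactory connectionFactory) {
        SimpleMessageListenerContainer container = new SimpleMessageListenerContainer(connectionFactory);
        container.setQueueNames("queue.microservice1");
        container.setDefaultRequeueRejected(false); // reject without requeue, i.e. dead-letter
        container.setMessageListener((MessageListener) message -> {
            // ... message processing; throwing an unchecked exception rejects the message ...
        });
        return container;
    }
}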

This will cause the rejected message to go to the configured Dead Letter Queue (DLQ), or to be discarded if no DLQ is configured. What we want, however, is to be able to retry the message in a controlled manner, i.e. for it to be re-executed a maximum number of times with a specified frequency pattern.

Spring Retry

We were already using the Spring AMQP library for interacting with RabbitMQ so using the Spring Retry library for retries was the first thing that came to mind. After all, it does provide a very clean programming interface for declaring complex retry strategies.

For example, here is how you would set up retries with exponential backoff, with an initial delay of 1 second, a multiplier of 2 (each delay is twice as long as the previous one), a maximum interval of 15 seconds, and a maximum of 5 attempts.
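Using Spring AMQP's RetryInterceptorBuilder, that configuration looks roughly like this (the queue name is illustrative):

import org.aopalliance.aop.Advice;
import org.springframework.amqp.rabbit.config.RetryInterceptorBuilder;
import org.springframework.amqp.rabbit.connection.ConnectionFactory;
import org.springframework.amqp.rabbit.listener.SimpleMessageListenerContainer;

class SpringRetryConfig {

    SimpleMessageListenerContainer container(ConnectionFactory connectionFactory) {
        Advice retryInterceptor = RetryInterceptorBuilder.stateless()
                .maxAttempts(5)                   // at most 5 attempts in total
                .backOffOptions(1000, 2.0, 15000) // initial 1s, multiplier 2, max 15s
                .build();

        SimpleMessageListenerContainer container = new SimpleMessageListenerContainer(connectionFactory);
        container.setQueueNames("queue.microservice1");
        container.setAdviceChain(retryInterceptor);
        return container;
    }
}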

With this setup, retries happen within the message processing thread (using Thread.sleep()), without rejecting the message on each retry (so without going back to RabbitMQ for each retry). When retries are exhausted, by default a warning is logged and the message is consumed. If you instead want to send the failed message to a DLQ, you need either a RepublishMessageRecoverer (which publishes the failed message to a different Exchange/Queue) or a custom MessageRecoverer (which rejects the message without requeuing). In the latter case you should also set up a RabbitMQ DLQ on the queue, as explained above.
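A RepublishMessageRecoverer, for instance, can be plugged into the same builder roughly like this (the error exchange and routing key names are illustrative):

import org.aopalliance.aop.Advice;
import org.springframework.amqp.rabbit.config.RetryInterceptorBuilder;
import org.springframework.amqp.rabbit.core.RabbitTemplate;
import org.springframework.amqp.rabbit.retry.RepublishMessageRecoverer;

class RepublishToDlqExample {

    Advice retryInterceptor(RabbitTemplate rabbitTemplate) {
        return RetryInterceptorBuilder.stateless()
                .maxAttempts(5)
                .backOffOptions(1000, 2.0, 15000)
                // on exhaustion, republish the failed message to an error exchange
                .recoverer(new RepublishMessageRecoverer(rabbitTemplate, "dlx.errors", "errors"))
                .build();
    }
}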

Note that the above example used a stateless retry interceptor: all retry attempts happen without exiting the interceptor call, so the interceptor is called only once regardless of the number of retries, and any required state (e.g. the number of attempts) is kept on the stack. If we didn't mind Threads blocking, this would actually be a very convenient and elegant solution to the problem. However, we felt that occupying Threads while waiting for a retry wasn't optimal resource usage, so we wanted the message to actually return to a Queue while waiting to be reprocessed.

Spring Retry also provides a stateful retry interceptor, in which case the Exception is propagated outside of the interceptor. In other words, the interceptor is called once per retry, and therefore it needs to somehow remember the state of the retries (hence it is called stateful). The main use case for a stateful retry interceptor is Hibernate transactions (through the Spring @Transactional interceptor), where, if the Hibernate Session throws an exception, the transaction must be rolled back and the Session discarded. This means that upon failure the call has to exit the retry interceptor, so that the transaction interceptor can close the Session and open a new one for each retry attempt.

We thought of using the stateful retry interceptor for our use case, but keeping the state of the retries would get quite complex, especially given that our applications run on multiple nodes of a cluster.

After spending some time evaluating Spring Retry for our use case, we got to a point where writing our own customized retry “framework” (a mini library) seemed to be worth pursuing.


Exponential backoff retries with RabbitMQ

Firstly we had to figure out how to implement delay in RabbitMQ. As mentioned earlier, RabbitMQ doesn’t support this functionality out of the box. At that time the only workaround for this was to use TTL (time to live) and DLQs (Dead Letter Queues). Later, RabbitMQ offered the Delayed Message Plugin which we started using for other use cases, but not for retries.

For the most complex of our use cases, exponential backoff with up to n retries of increasing delay, we set up n “delay” Queues with a number of surrounding objects, as we will explain. We decided to have a single set of all these objects shared by all of our Microservices, to avoid the overhead of a very large number of delay Queues. Therefore, messages from all our Microservices are multiplexed over the same delay Queues and Exchanges. In the following example, let's assume that we need a maximum of 3 retry attempts, with delays of 5, 10 and 15 seconds.

Exponential backoff retries using RabbitMQ

We call these 3 Queues “delay” Queues because their sole purpose is to delay the delivery of a message by a configured amount of time. This is achieved by setting the x-message-ttl argument of the Queue to e.g. 5000, 10000 and 15000 milliseconds respectively. You can see the RabbitMQ definitions in JSON below:
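Something along these lines (the queue names and vhost are illustrative):

{
  "queues": [
    {
      "name": "delay.queue.1",
      "vhost": "/",
      "durable": true,
      "auto_delete": false,
      "arguments": {
        "x-message-ttl": 5000,
        "x-dead-letter-exchange": "dlx.delay.1"
      }
    },
    {
      "name": "delay.queue.2",
      "vhost": "/",
      "durable": true,
      "auto_delete": false,
      "arguments": {
        "x-message-ttl": 10000,
        "x-dead-letter-exchange": "dlx.delay.2"
      }
    },
    {
      "name": "delay.queue.3",
      "vhost": "/",
      "durable": true,
      "auto_delete": false,
      "arguments": {
        "x-message-ttl": 15000,
        "x-dead-letter-exchange": "dlx.delay.3"
      }
    }
  ]
}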

“Delay” Queues

Also observe the argument x-dead-letter-exchange on these Queues. When a message is enqueued, it waits for the configured amount of time (since there are no consumers on this Queue) and when the TTL expires it is automatically forwarded to the configured x-dead-letter-exchange.

Next, we have these 3 Dead Letter Exchanges, which we named dlx.delay.<num>. These Exchanges are necessary so that we can decide (via Bindings) which Microservice inbound Queue the message should be re-routed to (the original inbound Queue where the message initially arrived). See their definitions in JSON below:
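Along these lines (the topic type is an assumption here; any key-based exchange type would do):

{
  "exchanges": [
    { "name": "dlx.delay.1", "vhost": "/", "type": "topic", "durable": true, "auto_delete": false, "internal": false, "arguments": {} },
    { "name": "dlx.delay.2", "vhost": "/", "type": "topic", "durable": true, "auto_delete": false, "internal": false, "arguments": {} },
    { "name": "dlx.delay.3", "vhost": "/", "type": "topic", "durable": true, "auto_delete": false, "internal": false, "arguments": {} }
  ]
}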

Dead Letter Exchanges

For each one of these Exchanges, and for each Microservice, we also need a Binding. You can see the Binding definitions below for one Microservice:
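Along these lines, assuming a routing key of the form <microservice>.<retryCount> (e.g. microservice1.2), so that each Dead Letter Exchange matches on the Microservice part of the key:

{
  "bindings": [
    { "source": "dlx.delay.1", "vhost": "/", "destination": "queue.microservice1", "destination_type": "queue", "routing_key": "microservice1.*", "arguments": {} },
    { "source": "dlx.delay.2", "vhost": "/", "destination": "queue.microservice1", "destination_type": "queue", "routing_key": "microservice1.*", "arguments": {} },
    { "source": "dlx.delay.3", "vhost": "/", "destination": "queue.microservice1", "destination_type": "queue", "routing_key": "microservice1.*", "arguments": {} }
  ]
}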

Dead Letter Exchange Bindings for each Microservice

Finally, we have one Exchange called route.delay with its own Bindings that will route the message to the correct delay Queue based on the retry count. See the definitions below:
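With the same assumed <microservice>.<retryCount> routing key, route.delay and its Bindings match on the retry count part of the key, roughly like this:

{
  "exchanges": [
    { "name": "route.delay", "vhost": "/", "type": "topic", "durable": true, "auto_delete": false, "internal": false, "arguments": {} }
  ],
  "bindings": [
    { "source": "route.delay", "vhost": "/", "destination": "delay.queue.1", "destination_type": "queue", "routing_key": "*.1", "arguments": {} },
    { "source": "route.delay", "vhost": "/", "destination": "delay.queue.2", "destination_type": "queue", "routing_key": "*.2", "arguments": {} },
    { "source": "route.delay", "vhost": "/", "destination": "delay.queue.3", "destination_type": "queue", "routing_key": "*.3", "arguments": {} }
  ]
}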

Exchange that routes to correct delay Queue
Bindings that route to correct delay queue

Let’s take a step back and review the message flow. Each of our Microservices has its own inbound Queue for incoming messages (e.g. queue.microservice1). When the processing of such a message fails, our retry library will set the routing key of the message to include the Microservice name and the retry count, and then send it to the route.delay exchange. This exchange will check the retry count on the routing key and forward the message to the corresponding delay Queue. When the time expires the message goes to the corresponding Dead Letter Exchange, which in turn will check the routing key of the message and send it to the correct Microservice inbound Queue for re-processing.

Example message flow with 2 retries

The obvious drawback of this approach is that the amount of delay per retry attempt is fixed (the delays are always 5, 10 and 15 seconds), but realistically that is good enough for our case. The maximum number of attempts is not limited to 3 either: we actually have a larger number of delay Queues, and some Microservices use 3 while others use more.

Single delay retries with RabbitMQ

Now that we know how to implement exponential backoff, we can easily implement the simpler case where we only need a constant amount of delay on each attempt. In this case, we only need one delay Queue with a specific x-message-ttl. We can set up the Microservice inbound Queue with an x-dead-letter-exchange argument and simply reject the message without requeuing; RabbitMQ will automatically route the message to the delay Queue. Also, in this case, the message routing key is not modified and the retry count is derived by the retry library from the x-death header, which RabbitMQ automatically adds to the message.
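Deriving the retry count from x-death can be done along these lines (a simplified sketch using Spring AMQP's Message):

import org.springframework.amqp.core.Message;

import java.util.List;
import java.util.Map;

class XDeathRetryCount {

    // Reads the retry count from the x-death header that RabbitMQ adds
    // every time a message is dead-lettered.
    @SuppressWarnings("unchecked")
    static long retryCount(Message message) {
        Object header = message.getMessageProperties().getHeaders().get("x-death");
        if (!(header instanceof List) || ((List<?>) header).isEmpty()) {
            return 0; // never dead-lettered, i.e. first attempt
        }
        Map<String, Object> xDeath = ((List<Map<String, Object>>) header).get(0);
        Object count = xDeath.get("count"); // incremented by RabbitMQ on each dead-lettering
        return (count instanceof Number) ? ((Number) count).longValue() : 1;
    }
}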

RabbitMQ consumer retry library

After setting up RabbitMQ for our retry use cases, we started implementing the retry “library”. The main component is a message interceptor, which we named AmqpRetryInterceptor. It implements AOP's (Aspect Oriented Programming) MethodInterceptor, so that it can be added to the AdviceChain of Spring AMQP's SimpleMessageListenerContainer, much like the Spring Retry interceptor we saw above.

Then we have an interface called RetryStrategy, which determines the type of retry to perform. We have the following implementations:

  • InProcessRetryStrategy: implements retries the same way the Spring Retry stateless interceptor does (by using Thread.sleep()).
  • TTLBackoffRetryStrategy: implements retries by using a single RabbitMQ delay Queue as explained above.
  • TTLExpBackoffRetryStrategy: implements retries by using n RabbitMQ delay Queues as explained above.

We also have the RetryPolicy interface, which holds the configuration of which errors are considered retriable and how long we should keep retrying.
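Putting these pieces together, a simplified sketch of the interceptor and the two interfaces could look something like this (the method names and argument handling are illustrative):

import org.aopalliance.intercept.MethodInterceptor;
import org.aopalliance.intercept.MethodInvocation;
import org.springframework.amqp.core.Message;
import org.springframework.amqp.rabbit.listener.SimpleMessageListenerContainer;

// Minimal stand-ins for the library's interfaces.
interface RetryPolicy {
    boolean isRetriable(Throwable cause, Message message);
}

interface RetryStrategy {
    // Installs an AmqpRetryInterceptor with the given policy on the container's advice chain.
    void configure(SimpleMessageListenerContainer container, RetryPolicy retryPolicy);

    // Re-publishes the failed message towards the appropriate delay Queue.
    void scheduleRetry(Message message, Throwable cause);
}

public class AmqpRetryInterceptor implements MethodInterceptor {

    private final RetryStrategy retryStrategy;
    private final RetryPolicy retryPolicy;

    public AmqpRetryInterceptor(RetryStrategy retryStrategy, RetryPolicy retryPolicy) {
        this.retryStrategy = retryStrategy;
        this.retryPolicy = retryPolicy;
    }

    @Override
    public Object invoke(MethodInvocation invocation) throws Throwable {
        // Assumes the advised listener invocation receives the Message as its second argument.
        Message message = (Message) invocation.getArguments()[1];
        try {
            return invocation.proceed();
        } catch (Exception e) {
            if (retryPolicy.isRetriable(e, message)) {
                retryStrategy.scheduleRetry(message, e); // e.g. send to route.delay with a new routing key
                return null; // swallow the error so the original delivery is acknowledged
            }
            throw e; // non-retriable: let Spring reject / dead-letter it
        }
    }
}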

When a client Microservice wants to use the retry library it may define a custom Exception upon which it wants to retry, e.g. an AmqpRetriableException. It instantiates a RetryPolicy and configures it to retry only upon AmqpRetriableException. It then chooses a RetryStrategy implementation and asks the strategy to configure the SimpleMessageListenerContainer with the specified retry policy. This can be seen in the code snippet below:
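(Simplified; the TTLExpBackoffRetryStrategy constructor and the configure method shown here are illustrative.)

import org.springframework.amqp.rabbit.core.RabbitTemplate;
import org.springframework.amqp.rabbit.listener.SimpleMessageListenerContainer;

// The Microservice's own marker exception for retriable failures.
class AmqpRetriableException extends RuntimeException {
    AmqpRetriableException(String message, Throwable cause) {
        super(message, cause);
    }
}

class MicroserviceRetryConfig {

    void configureRetries(SimpleMessageListenerContainer container, RabbitTemplate rabbitTemplate) {
        // Retry only when the listener throws an AmqpRetriableException
        // (the real policy also caps how many attempts are allowed).
        RetryPolicy retryPolicy = (cause, message) -> cause instanceof AmqpRetriableException;

        // Use the shared delay Queues for exponential backoff (5s, 10s, 15s).
        RetryStrategy retryStrategy = new TTLExpBackoffRetryStrategy(rabbitTemplate);

        // The strategy installs the AmqpRetryInterceptor on the container's advice chain.
        retryStrategy.configure(container, retryPolicy);
    }
}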

Conclusion

Implementing retries in our Microservices architecture, where services are connected through RabbitMQ, required careful consideration of several factors. We successfully used RabbitMQ to implement the delay patterns that we wanted, using TTLs on Queues, and thus freed application Threads to process other messages. The mini retry library that we created was fun to implement and gave us the level of control and configuration we wanted over our retry strategies. Overall, our approach has been successful so far: it has been running in production for the past 6 years, serving thousands of transactions per second with no issues.
