Handling Failure Successfully in RabbitMQ

RabbitMQ is a powerful, flexible message broker that is a great fit for many modern applications, enabling scalability and loose coupling between components. There are many excellent resources and tutorials for adding RabbitMQ to an existing application and going on to great performance and success, at least when things go well. Today we’ll look at what can go wrong in a real-world system using RabbitMQ, and how to design our applications to react appropriately when something does go wrong.

A Message Wasn’t Processed Correctly

The first question here is whether it matters if a message was dropped. If a message is no longer useful once it is late, then it is worth considering whether it’s important to detect or react to failures at all. If it’s not important, then the consumer itself can choose to “auto-ack” the message. In this case, the worker that consumes the message doesn’t need to acknowledge it; as soon as the message has been delivered to a consumer, the broker removes it from the queue and we stop worrying about it. We call this “at-most-once” delivery.

In other scenarios, we do want to detect and potentially react to message processing failures. By setting the auto-ack parameter on the consumer to false, we ensure that a message is removed from the queue only when a consumer has acknowledged that it was successfully processed (the acknowledgement is usually shortened to "ack"). If a consumer takes a message and its channel or connection closes before it sends the ack, whether because the worker crashed or because it exceeded the broker's acknowledgement timeout, the message is requeued and given to another consumer. This can lead to messages being processed more than once, so it's important to make sure that your system can tolerate this outcome, and that the timeout is long enough for the time the worker is expected to need to process the message. On the plus side, this gives us an "at-least-once" delivery guarantee: a message stays on the queue until a consumer confirms that it has been processed.
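As a minimal sketch of manual acknowledgement, here is a consumer callback in the shape the Python pika client uses. The `handle_notification` function and the queue name are placeholders invented for illustration, and the channel is passed in rather than created here so the example does not assume any particular connection setup.

```python
def handle_notification(body):
    """Hypothetical business logic: raise an exception if processing fails."""
    if not body:
        raise ValueError("empty message")


def on_message(channel, method, properties, body):
    """Consumer callback with the argument order pika passes to callbacks."""
    handle_notification(body)
    # Only reached if processing did not raise: tell the broker the
    # message can now be removed from the queue.
    channel.basic_ack(delivery_tag=method.delivery_tag)


def start_worker(channel, queue="notifications"):
    """Register the callback with manual acknowledgements, then block."""
    # auto_ack=False keeps each message on the broker until we ack it.
    channel.basic_consume(
        queue=queue, on_message_callback=on_message, auto_ack=False
    )
    channel.start_consuming()
```

If `handle_notification` raises, the ack is never sent, so the message goes back on the queue once the worker's channel closes.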

This Message Can Never Be Successfully Processed

Problems arise when “at-least-once” doesn’t have room for the fact that some messages will never be successfully processed. Perhaps they are invalid or intentionally harmful, or simply contain data that the worker doesn’t know how to handle. By always leaving them on the queue, the queue could become clogged up with these messages that can’t be processed but aren’t getting drained away. We can configure our queues to direct any rejected messages to a “dead letter” exchange; this allows us to inspect and potentially process later any messages which our existing workers didn’t handle.

If it’s possible to identify which messages can’t be processed and will never be able to be processed (for example because the data is missing or invalid rather than because a 3rd party API is not responding), then we can simply give up on them without any further processing. To do so, we send a reject response rather than our usual ack. Remember that if we simply fail to acknowledge, perhaps because the invalid message causes our worker to crash, the message is automatically requeued. Instead we need to create defensive workers that react to failure situations appropriately. In the case of "poison" messages that can never successfully be processed, this means sending the reject response with the requeue option set to false so that the message is not put back on the queue. If there is a dead letter exchange, the message is routed there and if not, it gets binned.
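A defensive worker along these lines might look like the sketch below. It assumes messages are JSON objects with a `user_id` field; that payload shape and the `PoisonMessage` exception are made up for the example, while `basic_reject` and `basic_ack` are the pika channel methods.

```python
import json


class PoisonMessage(Exception):
    """Hypothetical marker for messages that can never be processed."""


def parse_notification(body):
    """Decode the payload, raising PoisonMessage if it is invalid."""
    try:
        data = json.loads(body)
    except ValueError as exc:  # invalid JSON or undecodable bytes
        raise PoisonMessage("not valid JSON") from exc
    if not isinstance(data, dict) or "user_id" not in data:
        raise PoisonMessage("missing user_id")
    return data


def on_message(channel, method, properties, body):
    """Ack good messages; reject poison ones without requeueing."""
    try:
        parse_notification(body)
    except PoisonMessage:
        # requeue=False: the broker dead-letters the message if the queue
        # has a dead letter exchange, and otherwise discards it.
        channel.basic_reject(delivery_tag=method.delivery_tag, requeue=False)
        return
    channel.basic_ack(delivery_tag=method.delivery_tag)
```

Catching the failure inside the callback is what stops a bad payload from crashing the worker and being requeued.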

Configuring the dead letter exchange is part of the queue setup. If the main exchange is called “events” and the dead letter exchange is called “mishaps”, then creating and binding a queue called “notifications” would look like this:

./rabbitmqadmin declare queue name="notifications" 'arguments={"x-dead-letter-exchange": "mishaps"}'
./rabbitmqadmin declare binding source="events" destination="notifications" routing_key="notifications.v1" destination_type="queue"

The examples here use the rabbitmqadmin tool, but you could use the same approach in any of the language-specific libraries.
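For instance, the two commands above could be written with the Python pika client roughly as follows. This is a sketch that assumes the "events" and "mishaps" exchanges already exist and that `channel` is an open pika channel.

```python
def declare_notifications_queue(channel):
    """Declare the queue with a dead letter exchange, then bind it."""
    channel.queue_declare(
        queue="notifications",
        arguments={"x-dead-letter-exchange": "mishaps"},
    )
    channel.queue_bind(
        queue="notifications",
        exchange="events",
        routing_key="notifications.v1",
    )


# Typical usage (needs the pika package and a running broker):
#   import pika
#   connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
#   declare_notifications_queue(connection.channel())
```

Note that pika declares non-durable queues unless you pass `durable=True`, and redeclaring an existing queue with different settings is refused by the broker, so the two approaches need to agree on those settings.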

For messages which could potentially be reprocessed later (a good example is the unresponsive 3rd party API that could be experiencing temporary failure), it could be more appropriate to send the reject response to indicate that the message wasn't processed, but with the requeue option set to true. This puts the message back on the queue for another worker to collect later. But if the failure persists, these essentially become poison messages as well, because our workers will try to process, fail to process, put them back on the queue ... and repeat.

On a classic queue, a requeued message does not carry a retry count, but it is straightforward to implement one yourself: when the message fails to process, but the failure could be temporary, create a new message exactly like the existing one, but add some additional metadata indicating that this is a retry attempt. (You can't edit a message that is already on the queue, which is why you need a new one.) Acknowledge the original message, and put the new message onto the queue. Then you can check the retries count each time the message goes to a worker for processing. If it fails again, either create a new message with updated metadata or, once the count reaches your limit, permanently fail the message by sending reject with requeue set to false. Again, if there's a dead letter exchange the message goes there, so we'll be able to see what didn't get processed.
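The retry bookkeeping could be sketched like this. The header name `x-retries`, the limit of 3 and the function name are all assumptions chosen for the example. The copy is published through the default exchange (the empty string), which routes by queue name, so it lands only on the queue it came from.

```python
MAX_RETRIES = 3             # assumed limit; tune for your system
RETRY_HEADER = "x-retries"  # hypothetical header name


def retry_or_give_up(channel, method, properties, body, queue="notifications"):
    """Call this when processing failed for a possibly temporary reason."""
    headers = dict(properties.headers or {})
    retries = headers.get(RETRY_HEADER, 0)

    if retries >= MAX_RETRIES:
        # Out of attempts: dead-letter or drop the message for good.
        channel.basic_reject(delivery_tag=method.delivery_tag, requeue=False)
        return "rejected"

    # Build the replacement message: same body, incremented retry count.
    headers[RETRY_HEADER] = retries + 1
    properties.headers = headers
    channel.basic_publish(
        exchange="", routing_key=queue, body=body, properties=properties
    )
    # Ack the original only after the copy is published. A crash in between
    # can produce a duplicate, but it will not lose the message.
    channel.basic_ack(delivery_tag=method.delivery_tag)
    return "retried"
```

This sketch publishes the copy before acknowledging the original, which favours an occasional duplicate over a lost message; swap the order if your system prefers the opposite trade-off.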

Choosing Failure Strategies

Contemplating what we know about the status and experience of any given message can help to work out what the next move is when dealing with failure cases. The following diagram attempts to summarise the cases and some recommended reactions to different outcomes.

[Figure: Identifying and dealing with failure outcomes in RabbitMQ]

Thinking about whether the message should remain in the queue until it can be processed, or whether the system should just move on so that it services the immediate messages as best it can, is a very important step in working with queues. For example, if you have an application that sends both email notifications and a push message to the user’s browser, are those types of message equivalent in whether they can be delivered late? Does it matter if one gets missed? A common setup there would be to add retry logic to the emails so that they are sent eventually, even if they are delayed by a few minutes. If the push message to the browser fails, then the application may not bother to try to recover as the notification is less valuable if not delivered at the right moment.

Queue Is Too Full

Sometimes it makes sense to maintain very long queues. What matters most is the time a message typically spends in the queue rather than the queue length itself. Usually queues are pretty small, and only grow large if there’s a problem processing the messages. In this case, it is often useful to specify a maximum length of queue. Beyond this size, messages are either dropped, or sent to the dead letter exchange if there is one.

To specify a queue with a maximum length, use a declaration like this:

./rabbitmqadmin declare queue name="notifications" 'arguments={"x-max-length": 10000}'

When the queue reaches 10000 messages, messages are discarded from the front of the queue (the oldest ones) to make room for the new messages arriving in the queue. If the queue also has an x-dead-letter-exchange declared then the discarded messages go there, otherwise they are just thrown away. This approach can be very useful to avoid RabbitMQ outgrowing its allocated resources and losing messages in an uncontrolled way.
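To make the overflow behaviour concrete, here is a toy model in plain Python. It is not RabbitMQ code, just an illustration of which messages get discarded when a length-limited queue overflows; the `publish` function and the `dead_letters` list are stand-ins invented for the example.

```python
from collections import deque


def publish(queue, dead_letters, message, max_length):
    """Append a message, evicting from the front if the queue is too long."""
    queue.append(message)
    while len(queue) > max_length:
        # The oldest message is the one that makes room for the newcomer.
        dead_letters.append(queue.popleft())


queue, dead_letters = deque(), []
for number in range(5):
    publish(queue, dead_letters, number, max_length=3)

print(list(queue))    # the three newest messages
print(dead_letters)   # the two oldest, in the order they were evicted
```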

Another option to keep queue length down is to set a TTL (“time to live”) for messages in a given queue — this is actually a property of the queue so it’s configured with something like:

./rabbitmqadmin declare queue name="notifications" 'arguments={"x-message-ttl": 60000}'

The units of the TTL are milliseconds, so this command allows a message to remain in the queue for 60 seconds. After this time, if it hasn’t been processed, the message self-destructs! If there’s a dead letter exchange, it goes there, otherwise it is simply dropped. For queues where timely delivery is very important, this can be an excellent way of avoiding a situation where the queue quickly becomes larger than can be processed when something goes wrong. If you see a situation where the only real option is to clear the queue completely and let the system start again, then adding message TTLs can help you retain only the current information and bin the stale messages clogging up the queue.
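Since the queue arguments are plain key/value pairs, a small helper can keep the unit conversion in one place. The function below is a hypothetical convenience, not part of any library. It builds the arguments dictionary you would pass to a client library's queue declaration, converting seconds to the milliseconds that x-message-ttl expects.

```python
def queue_arguments(max_length=None, ttl_seconds=None, dead_letter_exchange=None):
    """Build queue arguments, skipping any option that was not requested."""
    arguments = {}
    if max_length is not None:
        arguments["x-max-length"] = int(max_length)
    if ttl_seconds is not None:
        # x-message-ttl is expressed in milliseconds.
        arguments["x-message-ttl"] = int(ttl_seconds * 1000)
    if dead_letter_exchange is not None:
        arguments["x-dead-letter-exchange"] = dead_letter_exchange
    return arguments


print(queue_arguments(max_length=10000, ttl_seconds=60,
                      dead_letter_exchange="mishaps"))
```

The three options combine freely, so a queue can be capped in both length and age while still dead-lettering whatever it sheds.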

What About These Dead Messages?

The important thing to understand about messages that end up at the dead letter exchange is that they are routed in the normal way. These exchanges are not special; they can be declared in the usual way as direct, topic or fanout exchanges, and you bind queues to them in the normal way. When you configure a dead letter exchange on a queue, you can also set the routing key that messages should use when being routed there. But if this isn’t specified, then the messages keep the routing key they previously had when they were routed through the “live” exchange.
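Put together, the dead letter side of the setup might be declared like this with the Python pika client. The queue name "mishaps.unprocessed" and the routing key "notifications.failed" are invented for the example; x-dead-letter-routing-key is the queue argument that overrides the routing key.

```python
def declare_dead_letter_topology(channel):
    """Declare the dead letter exchange, a queue to catch everything sent
    to it, and a live queue that dead-letters with its own routing key."""
    channel.exchange_declare(exchange="mishaps", exchange_type="topic")
    channel.queue_declare(queue="mishaps.unprocessed")
    # On a topic exchange, "#" matches every routing key.
    channel.queue_bind(
        queue="mishaps.unprocessed", exchange="mishaps", routing_key="#"
    )
    # Arguments are fixed when a queue is created, so an existing
    # "notifications" queue must be deleted before this can succeed.
    channel.queue_declare(
        queue="notifications",
        arguments={
            "x-dead-letter-exchange": "mishaps",
            "x-dead-letter-routing-key": "notifications.failed",
        },
    )
```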

Once messages are on the dead letter exchange and routed into queues, we can then process them as we normally would. In most cases, you may just choose to log the contents of messages that haven’t been processed, so they can be reported on later. However, if messages have been routed to the dead letter exchange as a result of a problem that is later fixed (perhaps a queue overflowed or an external service was temporarily unavailable), one option is to write a one-off consumer to pick up those messages and simply route them back to their original destination.
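A one-off replay script could be as small as the sketch below, which uses pika's `basic_get` to pull messages one at a time until the dead letter queue is empty. The function name and the `max_messages` safety cap are assumptions for the example. It also assumes the messages kept their original routing key when they were dead-lettered, so republishing with the same key reaches the original queue.

```python
def replay_dead_letters(channel, dead_letter_queue, target_exchange,
                        max_messages=None):
    """Move messages from a dead letter queue back to the live exchange.

    Returns the number of messages replayed.
    """
    replayed = 0
    while max_messages is None or replayed < max_messages:
        method, properties, body = channel.basic_get(
            queue=dead_letter_queue, auto_ack=False
        )
        if method is None:
            break  # the queue is empty
        channel.basic_publish(
            exchange=target_exchange,
            routing_key=method.routing_key,
            body=body,
            properties=properties,
        )
        # Remove the message from the dead letter queue only once the
        # copy has been handed back to the live exchange.
        channel.basic_ack(delivery_tag=method.delivery_tag)
        replayed += 1
    return replayed
```

The cap is worth setting if there is any chance the replayed messages fail again, since they would come straight back to the dead letter queue while the script is still draining it.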

RabbitMQ and Handling Failure

Queues enable us to write complex systems, and in those complex systems sometimes things do go wrong! This article covers some handy configuration options and architecture concepts that let you make the most of RabbitMQ features, so you can handle adversity gracefully. If you have other tactics that you’d like to share, we’d love to hear from you in the comments.