Handling errors like a pro or nah? Let’s talk about Dead Letters
In my previous article, I introduced asynchronous messaging. If you missed it, I invite you to read it to get a solid understanding of messaging before diving into this article.
We discussed how queues and topics can decouple two services — typically referred to as the publisher and the consumer. The publisher sends a message or event to a queue or topic, which is then processed by one or more consumers. However, we didn’t address the biggest drawback of asynchronous communication — handling errors.
In a synchronous HTTP request, errors such as bad requests, server downtime, or database failures are handled by responding with status codes like 400, 404, or 500. But in the asynchronous world, we follow the “fire and forget” model, where once the message is published, the client is disconnected, and we can’t notify them directly if something fails. Doing so would negate the benefits of asynchronous processing.
This is where the Dead Letter Queue (DLQ) pattern comes in. It’s a common practice in distributed systems and messaging architectures. A DLQ stores messages that cannot be processed successfully after a set number of retries — usually due to persistent errors. This allows us to isolate and manage failed messages, investigate the causes, and recover from failures, improving the resilience and reliability of the system.
In this article, I’ll focus on the methods I personally favor for handling dead letters. While there are other approaches, such as automatically altering messages (e.g., adding fields), I find those to be more of a hack than a fix for the root problem. As engineers, our goal is to keep the system consistent and elegant.
What Are Dead Letters?
Dead letters come into play when a consumer fails to process a message. The failed message is moved to the Dead Letter Queue, either explicitly by the consumer’s code or automatically by the messaging system when an unhandled exception occurs.
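To make the two paths concrete, here is a minimal in-memory sketch. Nothing in it is a real broker API: `Message`, `MalformedMessage`, `handle`, and `consume` are hypothetical names I made up for illustration, and the `except Exception` branch only stands in for what a messaging system would do on its own.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Message:
    body: dict
    dead_letter_reason: Optional[str] = None


class MalformedMessage(Exception):
    """Raised by the handler when a message can never be processed."""


def handle(message: Message) -> None:
    # Hypothetical business rule: a bid must carry a positive amount.
    if "amount" not in message.body:
        raise MalformedMessage("missing 'amount'")
    if message.body["amount"] <= 0:
        # Simulates a failure the handler did not anticipate.
        raise ValueError("amount must be positive")


def consume(message: Message, dead_letters: list) -> bool:
    """Return True if processed, False if the message was dead-lettered."""
    try:
        handle(message)
        return True
    except MalformedMessage as exc:
        # Manual path: the code decides the message is unprocessable.
        message.dead_letter_reason = f"manual: {exc}"
    except Exception as exc:
        # Automatic path: stands in for the broker reacting to an unhandled error.
        message.dead_letter_reason = f"unhandled: {type(exc).__name__}"
    dead_letters.append(message)
    return False
```

Recording a reason next to the dead-lettered message is a design choice of this sketch; it makes the later investigation easier.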
The question is: what do we do with dead letters? Do we simply delete them, or should there be a process in place to handle them and ensure consistency in the system?
Simply deleting them isn’t a good idea, unless you’re absolutely sure that the message isn’t needed anymore. A dead-lettered message means the consumer failed to fulfill its obligations. In a “fire and forget” model, the consumer is responsible for ensuring the message is processed correctly. When it fails, it’s an exceptional case that requires attention. In the next sections, we’ll explore how to handle these dead letters in an efficient manner while maintaining system consistency.
Automated Retries
One straightforward method is to configure the queue to redeliver a failed message a set number of times, for example 10. If it still fails after the last attempt, the message is moved to the dead-letter queue. For this to be safe, the consumer must be idempotent, meaning that processing the same message multiple times won’t lead to duplicate or inconsistent results.
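Here is a rough sketch of how retries and idempotency fit together. It is an in-memory simulation, not a real broker client: `IdempotentConsumer`, `MAX_DELIVERIES`, and the `message_id` used for deduplication are assumptions of mine, and a production system would keep the processed IDs in durable storage and wait between attempts.

```python
from dataclasses import dataclass

MAX_DELIVERIES = 10  # mirrors the "retry 10 times" example; tune per system


@dataclass
class Message:
    message_id: str
    body: dict
    delivery_count: int = 0


class IdempotentConsumer:
    def __init__(self, handler):
        self.handler = handler
        self.processed_ids = set()  # assumption: durable storage in a real system
        self.dead_letters = []

    def deliver(self, message):
        """Deliver until success or MAX_DELIVERIES; return True on success."""
        if message.message_id in self.processed_ids:
            return True  # duplicate delivery: acknowledge without side effects
        while message.delivery_count < MAX_DELIVERIES:
            message.delivery_count += 1
            try:
                self.handler(message)
            except Exception:
                continue  # a real broker would redeliver, ideally after a backoff
            self.processed_ids.add(message.message_id)
            return True
        # All attempts failed: park the message for investigation.
        self.dead_letters.append(message)
        return False
```

With a handler that fails twice and then succeeds, `deliver` returns `True` on the third attempt, and delivering the same `message_id` again does not call the handler a second time. A handler that always fails ends up in `dead_letters` after `MAX_DELIVERIES` attempts.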
Communicating with the Client (If Possible)
Although not strictly about handling dead letters, one way to prevent them is by notifying the client when processing fails. While this contradicts the idea of asynchronous communication, it can be useful in certain scenarios where the client needs to know the outcome.
Remember the bidding system we discussed in the previous article? In that system, asynchronous processing was used to handle a higher request rate, but the user is still interested in whether their bid was accepted and persisted.
When the CreateBidEndpoint receives a request, it returns a 202 Accepted status, indicating the system has accepted the bid but not necessarily processed it yet. If an error occurs in the asynchronous processing, it’s a good idea to notify the user. In this case, WebSockets can be used to notify the user about any failures, asking them to try bidding again.
With this approach, if the bid fails to process, we simply notify the user, avoiding the need for dead letters. Raising an alert is less helpful in this scenario: by the time support gets involved, the bidding window may have passed, so dead-lettering the bid would serve little purpose.
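The flow can be sketched roughly like this. Everything here is illustrative: `create_bid_endpoint` is a plain function standing in for the CreateBidEndpoint, `persist` stands in for the database write, and `notify_user` is a hypothetical callback where a real system would push over the user’s WebSocket connection.

```python
import queue
from dataclasses import dataclass
from typing import Callable


@dataclass
class Bid:
    bid_id: str
    user_id: str
    amount: float


bid_queue = queue.Queue()  # in-memory stand-in for the real message queue


def create_bid_endpoint(bid: Bid) -> int:
    """Accept the bid and return immediately; processing happens later."""
    bid_queue.put(bid)
    return 202  # Accepted: queued, not yet persisted


def process_bids(persist: Callable[[Bid], None],
                 notify_user: Callable[[str, dict], None]) -> None:
    """Drain the queue; on failure, tell the user instead of dead-lettering."""
    while not bid_queue.empty():
        bid = bid_queue.get()
        try:
            persist(bid)
            notify_user(bid.user_id, {"bid_id": bid.bid_id, "status": "accepted"})
        except Exception as exc:
            # notify_user stands in for a WebSocket push to the user's session.
            notify_user(bid.user_id, {
                "bid_id": bid.bid_id,
                "status": "failed",
                "detail": f"{exc}; please place your bid again",
            })
```

The failed bid is dropped after the notification on purpose: the user is the one who decides whether to bid again.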
Alerting
Dead letters happen occasionally, and we need to be prepared for that. They are a signal that something is wrong, whether it’s due to the message format, infrastructure issues, or service faults. When dead letters occur, it’s critical to raise an alert. Ideally, you should be notified as soon as the issue arises, allowing you to investigate and fix it.
In large enterprise systems, this is typically handled by raising incidents that notify the on-call support team. Tools like PagerDuty, Opsgenie, or Splunk can be used to monitor DLQs and trigger alerts when messages are dead-lettered, so the issue gets immediate attention.
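As a rough idea of what such a check does, here is a small polling sketch. It does not use any vendor API: `get_depth` is a hypothetical function that returns the number of messages in the DLQ, and `raise_alert` is a hypothetical hook where an incident tool would be called.

```python
from typing import Callable


class DeadLetterMonitor:
    """Alert when the DLQ depth is above the threshold and has grown since the last alert."""

    def __init__(self, get_depth: Callable[[], int],
                 raise_alert: Callable[[str], None], threshold: int = 0):
        self.get_depth = get_depth      # hypothetical: reads the DLQ message count
        self.raise_alert = raise_alert  # hypothetical: opens an incident / pages on-call
        self.threshold = threshold
        self.last_reported = 0

    def check(self) -> bool:
        """Run one poll; return True if an alert was raised."""
        depth = self.get_depth()
        if depth <= self.threshold:
            self.last_reported = 0  # queue drained, so re-arm the alert
            return False
        if depth <= self.last_reported:
            return False  # this backlog was already reported; avoid paging twice
        self.raise_alert(f"{depth} message(s) in the dead-letter queue")
        self.last_reported = depth
        return True
```

Calling `check()` on a schedule is enough for the sketch. The default threshold of 0 means a single dead letter already raises an alert, which matches the idea that every dead letter deserves attention.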
What About Topics?
Dead letters aren’t limited to queues; messages published to topics can be dead-lettered as well. In many messaging systems, DLQs operate at the subscription level, meaning you can replay messages for each consumer individually. However, I don’t consider relying on subscriptions for dead-letter handling to be best practice.
A better approach is to use forwarding queues or failover queues.
- Forwarding Queues: Each consumer has its own forwarding queue. When an event is sent to a topic, it’s copied and delivered to each consumer’s forwarding queue. If a consumer fails to process it, the message is dead-lettered at the forwarding queue level, not at the subscription level.
- Failover Queues: If a consumer can’t process an event, it’s moved to a failover queue with alerting configured. The on-call support team will be notified to address the failure.
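The forwarding-queue idea can be sketched in a few lines. This is a toy in-memory model under my own assumptions, not a real broker: `Topic` and `ForwardingQueue` are hypothetical classes, and retries are left out to keep the example short.

```python
from collections import deque
from typing import Callable


class ForwardingQueue:
    """A per-consumer queue with its own dead-letter queue."""

    def __init__(self, name: str, handler: Callable[[dict], None]):
        self.name = name
        self.handler = handler
        self.pending = deque()
        self.dead_letters = []

    def drain(self) -> None:
        while self.pending:
            event = self.pending.popleft()
            try:
                self.handler(event)
            except Exception:
                # Dead-lettered at the queue level; other consumers are unaffected.
                self.dead_letters.append(event)


class Topic:
    def __init__(self):
        self.forwarding_queues = []

    def subscribe(self, forwarding_queue: ForwardingQueue) -> None:
        self.forwarding_queues.append(forwarding_queue)

    def publish(self, event: dict) -> None:
        for forwarding_queue in self.forwarding_queues:
            # Each consumer gets its own copy of the event.
            forwarding_queue.pending.append(dict(event))
```

If one consumer rejects an event, only that consumer’s `dead_letters` list receives it, so it can be replayed for that consumer alone.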
An in-depth analysis of forwarding queues vs failover queues will be covered in an upcoming article, so stay tuned.
Conclusion
Asynchronous communication with queues and topics isn’t a silver bullet. While it offers major advantages in performance and decoupling, it also comes with challenges — especially when it comes to handling failures. With the right strategies, like automated retries, alerting, and proper DLQ management, we can build systems that are not only performant and decoupled but also consistent and reliable.
Thank you for reading! If you found this article helpful, I’d appreciate your feedback!
Follow for more of these.