Kafka Resiliency — Retry/Delay Topic, Dead Letter Queue (DLQ)

With Confluent.Kafka .Net Client

Sheshnath Kumar
5 min read · Jun 7, 2020

In my previous article on Kafka, I walked through some Kafka basics and how to start using Kafka with .Net Core. Now we'll look at how to set up a Retry/Delay topic and a DLQ.

Microservices architecture consists of smaller, independently deployable and maintainable services, so a single error won't bring down the entire system. But completing a single task may require calls to several services spread across the system. For example, booking an airline ticket may involve services like Customer, Reservation, Payment, Campaign, and Account. A single failure may leave the system inconsistent or force a rollback/compensation of the entire operation. We should not mark a booking complete if account posting failed due to a transient failure or an unavailable dependency.

Retry enables an application to handle anticipated, temporary failures when it connects to a service or network resource by transparently retrying an operation that previously failed. We can categorize retries into two types: hot retry and cold retry. A hot retry is attempted immediately after a failure, for example calling a REST API again when it failed because of a transient error or a temporarily unavailable resource. A cold retry defers the operation to a later time, because the failure may need some time to resolve. We'll discuss different cold retry approaches in detail using the Kafka message bus.
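A hot retry can be as simple as a bounded loop with exponential backoff. The sketch below is illustrative; the function and parameter names are my own, not part of any Kafka client:

```python
import time

def hot_retry(operation, max_attempts=3, base_delay=0.1):
    """Immediately retry a failed operation a bounded number of times.

    Illustrative sketch: `operation` stands in for any transient call,
    e.g. a REST request. After the last attempt the exception propagates,
    at which point a cold retry (retry topic, DB, ...) can take over.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise
            # Exponential backoff between hot attempts: 0.1s, 0.2s, 0.4s, ...
            time.sleep(base_delay * 2 ** (attempt - 1))
```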

Using Retry Topic

Publish failed messages to one or more retry topics. The retry topic consumer consumes these messages and, after a defined delay, publishes them back to the original topic.

  • Messages in a partition are sequential and are consumed in the order they were added. At the same time, you can navigate through the log using the Seek API to move the offset backward or forward.
  • If you want to postpone/delay processing of some messages (due to a processing error, a dependency, or any other reason), you can republish them to separate topics, called retry topics, and commit the offset. This way you avoid blocking the main topic where the message was originally produced.
Publish Message — Single Retry Topic
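The republish-and-commit flow can be sketched as a pure function. Here `process` and `produce` are injected stand-ins for the business logic and the Kafka producer (`ProduceAsync` in Confluent.Kafka), and the retry topic name is hypothetical:

```python
def handle_message(msg, process, produce, retry_topic="orders-retry"):
    """Process one consumed message; on failure, republish it to the retry
    topic instead of blocking the main topic. The offset is committed
    either way, so the main consumer keeps moving.
    """
    try:
        process(msg)
    except Exception:
        produce(retry_topic, msg)  # hand the message to the retry flow
    return "commit-offset"         # the consumer commits in both cases
```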
  • Before publishing to the retry topic, be sure to add some metadata to the message, like the remaining retry count (decremented with each retry), the consumer group id (the message should be reprocessed only by the group that failed), and the next retry timestamp.
  • Many strategies can be adopted: a single retry topic, multiple retry topics with exponential delays, a dedicated retry topic per main topic, and so on. It all depends on the scenario and how you want to handle it.
  • Multiple retry topics can be used to add exponential delay for messages that keep failing (see the diagram below). This mechanism follows a leaky bucket pattern, where the flow rate is expressed by the blocking nature of delayed message consumption within the retry topics.
  • Once all configured retries are exhausted, move the message to a Dead Letter Queue (DLQ); you should not keep it open and retry infinitely.
  • There are different approaches for the DLQ: publish the message to a DLQ topic with a separate consumer that takes the required action, or persist the message in a DB.
  • Later you can have a mechanism, like a notification or reporting tool, to surface these messages and take the required actions.
Publish Message — Multiple Retry Topics
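The retry metadata and the exponential escalation across retry topics might look like this. The topic names, header names, and delays are all assumptions for illustration; the values are encoded because Kafka headers carry bytes:

```python
import time

# Hypothetical retry chain: each topic's consumer delays redelivery longer.
RETRY_TOPICS = ["orders-retry-5s", "orders-retry-30s", "orders-retry-5m"]
DLQ_TOPIC = "orders-dlq"

def retry_headers(remaining_retries, consumer_group, delay_secs, now=None):
    """Metadata to attach as Kafka headers when republishing a failed message."""
    now = time.time() if now is None else now
    return [
        ("remaining-retries", str(remaining_retries).encode()),
        ("consumer-group", consumer_group.encode()),  # only this group replays it
        ("next-retry-ts", str(now + delay_secs).encode()),
    ]

def next_destination(current_topic):
    """Escalate a failed message to the next (slower) retry topic, or the DLQ."""
    if current_topic not in RETRY_TOPICS:
        return RETRY_TOPICS[0]  # first failure: shortest delay
    i = RETRY_TOPICS.index(current_topic)
    return RETRY_TOPICS[i + 1] if i + 1 < len(RETRY_TOPICS) else DLQ_TOPIC
```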
  • Consumers of retry topics should pause the topic partition until it is time to process the message, as indicated by its retry timestamp.
  • Once the next retry timestamp arrives, consumers should resume the paused partition, process the message, and commit the stored offset to advance the committed offset.
Retry Topic — Consume And Replay Message
  • Messages in retry topics are naturally in chronological order, sorted by the time each message was produced.
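The pause/resume timing itself is just a comparison against the message's retry timestamp; in the Confluent.Kafka .NET client the consumer exposes `Pause`, `Resume`, and `StoreOffset` for the actual calls. A minimal sketch of the decision:

```python
def pause_seconds(next_retry_ts, now):
    """How long the retry-topic consumer should keep the partition paused
    before resuming and processing the head message. Zero means the
    message is already due. Illustrative helper, not a client API."""
    return max(0.0, next_retry_ts - now)
```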

Using Separate Consumer Group

This approach uses a separate Kafka consumer group that fetches only failed messages and replays them so the respective message consumers can process them again.

  • Persist failed message metadata, along with details like topic, partition, offset, remaining retry count, consumer group id, and next retry time, in a DB.
Message Retry — Persist In DB
  • Commit the offset of the failed message to avoid blocking the main topic.
  • Add a scheduler job that picks from the DB the failed messages whose next retry time has passed.
  • Subscribe to Kafka as a separate retry consumer group. This ensures no impact on the existing consumer groups.
Message Retry — Consume Message Using Separate Consumer Group
  • Assign the topic partition and seek to the offset based on the message metadata retrieved from the DB.
  • Consume the message and check whether it should be retried based on its metadata. If yes, update the metadata (remaining retry count, consumer group id) and publish it again to the original topic.
  • Remove this offset entry from the DB and unsubscribe from Kafka.
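The separate-consumer-group flow can be sketched as a decision over the persisted metadata. The `FailedMessage` schema and action names are my own; in the real consumer, `assignment` would drive the `Assign`/`Seek` calls before consuming:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class FailedMessage:
    # Illustrative shape of the DB row described above.
    topic: str
    partition: int
    offset: int
    remaining_retries: int
    consumer_group: str
    next_retry_ts: float

def replay_plan(record, now):
    """Decide what the retry consumer group does with one DB record.

    Returns (assignment, action, record): the consumer would assign and
    seek to `assignment`, then republish to the original topic, skip
    (not due yet), or give up (retries exhausted)."""
    assignment = (record.topic, record.partition, record.offset)
    if record.next_retry_ts > now:
        return assignment, "skip", record
    if record.remaining_retries <= 0:
        return assignment, "give-up", record
    updated = replace(record, remaining_retries=record.remaining_retries - 1)
    return assignment, "republish", updated
```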

Using DB

Another approach is to persist failed messages in a DB. A scheduler service picks failed messages from the DB based on filters like remaining retry count and retry time, publishes them back to the original topic, and updates the status in the DB. This way we avoid a separate Kafka topic for retrying failed messages.

Retry Message — With Scheduler
  • Persist failed messages to the DB along with metadata like the consumer group id (the message should be reprocessed only by the group that failed) and the original topic (to replay the message).
  • The scheduler service has configurations like number of retries, retry delay, exponential delay, etc. There can be per-topic configuration as well, in case any topic has special needs.
  • The scheduler fetches messages based on filters like remaining retry count greater than zero and retry time passed.
  • The scheduler then publishes each message to its original topic and updates the message status in the DB.
  • If the message fails again, update its metadata, decrementing the remaining retry count.
  • Once all configured retries are exhausted due to continued failure, mark the message as Dead (remaining retry count 0).
  • Later we can have a mechanism, like a notification or reporting tool, to surface these messages and take the required actions.
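The scheduler's fetch filter and the post-failure update can be sketched with plain dicts standing in for DB rows; the field names and the backoff formula are assumptions:

```python
def due_messages(rows, now):
    """Rows the scheduler should republish: retries left and retry time passed."""
    return [r for r in rows
            if r["remaining_retries"] > 0 and r["next_retry_ts"] <= now]

def after_failure(row, base_delay, now):
    """Update a row when its replay failed again: decrement the remaining
    retry count, back off exponentially, and mark the message Dead once
    the count reaches zero."""
    attempts = row.get("attempts", 0) + 1
    retries = row["remaining_retries"] - 1
    return {**row,
            "attempts": attempts,
            "remaining_retries": retries,
            "status": "dead" if retries <= 0 else "pending",
            "next_retry_ts": now + base_delay * 2 ** (attempts - 1)}
```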

Each approach has pros and cons. Apply one based on your particular use case.

Keep learning! Thank you!!


Sheshnath Kumar

Cloud Solutions, Distributed Systems/Microservices, Gen AI