Kafka retries and maintaining the order of retry events

Thắng Đỗ
Published in Altitude · Jan 21, 2022 · 9 min read

Retries are important, especially in a microservices system where services must collaborate to handle a request. What happens if a service goes down for just a couple of seconds? The other services can either return an error to the customer or retry a few times before giving up.

Let's take a simple example: services calling each other in a chain via HTTP:

Services calling in a chain

If service C is down, the simple way is for service B to throw an error to service A, and service A returns that error to the customer.

But if service C is only down for a couple of seconds, the better way is for service B to retry after 2s, 4s, 8s… With luck, the 3rd retry succeeds and the end result for the customer is still a success:

Retry request

So we can immediately see the benefits of retries in a simple example. But in an event-driven architecture, where service B receives events from a message queue such as Kafka, retry strategies become more complex. In this article, we will see why and find a way to solve it.

Problem

I will reuse the problem described in my previous article.

For a short overview, Altitude is a hotel digital platform that runs on a large microservices system. One of its features is automatic key management: when a reservation checks in, RFID cards and mobile keys are created; when a reservation changes room or updates the departure date, all keys are updated too. Everything runs automatically.

We use Kafka as a stream processor, which basically means a Kafka Connect source listens for changes in the reservation database and publishes them to a Kafka topic, and our services consume those events. If you care about the details of the system design with Kafka, you can read the previous article.

System architecture with Kafka

This architecture has been simplified for the purpose of this article. Please focus on the key service: it is responsible for consuming the events reservation-check-in, reservation-check-out and reservation-update from Kafka, doing some logic, and sending requests to the key server, which runs as a Windows app on our customer's PC or in the cloud.

We manage the key service, but the key server Windows app is managed by the customer, and it is what concerns us most. It runs on a single instance per customer, so it can go down for a couple of seconds because of networking issues, Windows crashes, etc. Especially in an automated environment, no one sees the returned error until they check the logs. So a retry policy is important.

Simple retry

Let’s say we’re going to implement a backoff strategy that retries in the following intervals: 2s, 4s, 8s, 16s.

Loop retry in consumer client

The first way we can think of is a retry loop at the consumer: when the request to the key server fails, it waits 2s before sending the second request; if that request still fails, it waits 4s before sending the next request, and so on.

Looks quick and simple? But there is a problem with this approach: Kafka processes events sequentially, so while event 1 is retrying, event 2 is still waiting in the queue. In the worst case, event 2 must wait 16s to be processed, and most of that waiting is meaningless.
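To make the problem concrete, here is a minimal sketch of this blocking approach with KafkaJS; the topic name, group id and the callKeyServer placeholder are assumptions for illustration only:

```typescript
import { Kafka } from 'kafkajs';

const kafka = new Kafka({ clientId: 'key-service', brokers: ['localhost:9092'] });
const consumer = kafka.consumer({ groupId: 'key-service' });

const delays = [2000, 4000, 8000, 16000]; // backoff intervals in ms
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

// Placeholder for the real HTTP request to the key server; throws on failure.
async function callKeyServer(_payload: Buffer | null): Promise<void> {}

const run = async () => {
  await consumer.connect();
  await consumer.subscribe({ topic: 'reservation-check-in' });

  await consumer.run({
    eachMessage: async ({ message }) => {
      for (let attempt = 0; ; attempt++) {
        try {
          await callKeyServer(message.value);
          return; // success: the consumer moves on to the next event
        } catch (err) {
          if (attempt >= delays.length) throw err; // give up after the last interval
          // Everything behind this event on the same partition waits here too.
          await sleep(delays[attempt]);
        }
      }
    },
  });
};

run().catch(console.error);
```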

Non-blocking retry

It's clear that we should process event 2 while event 1 is retrying. So a better way is to put event 1 onto a different topic and continue processing event 2.

One topic retry

When the key service processes event 1 and it fails, it puts event 1 onto a retry topic and continues processing event 2. The retry topic is consumed by another consumer that processes event 1, mimicking the key service's logic to send it to the key server.

If we only retry once, then keeping one retry topic is fine. But we are using a backoff strategy that retries at the intervals 2s, 4s, 8s… So if the first retry of an event fails, the next retries of this event must happen inside the retry consumer, and that loop blocks other events in the retry topic (the same weakness as the first approach).

So we should create more retry topics, each corresponding to one retry interval:

Multiple retry topics

If the first retry attempt in the first retry consumer fails, it sends the event to the next retry topic and continues processing other retry events.

When all retries have failed, the last retry consumer sends the event to the dead letter queue topic, where it can be viewed or reprocessed manually if needed.
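A small sketch of that routing; the topic names and intervals here are illustrative, and nextTopic is a helper I am introducing, not part of any library:

```typescript
// Illustrative retry intervals (ms) and topic naming, not the production values.
const retryDelays = [2000, 4000, 8000, 16000];
const mainTopic = 'reservation-check-in';
const dlqTopic = `${mainTopic}-dlq`;

const retryTopicFor = (delayMs: number) => `${mainTopic}-retry-${delayMs}`;

// Given the topic a failed event was consumed from, decide where it goes next.
function nextTopic(currentTopic: string): string {
  if (currentTopic === mainTopic) return retryTopicFor(retryDelays[0]);
  const currentDelay = Number(currentTopic.split('-retry-')[1]);
  const index = retryDelays.indexOf(currentDelay);
  const isLastRetry = index === retryDelays.length - 1;
  return isLastRetry ? dlqTopic : retryTopicFor(retryDelays[index + 1]);
}

// nextTopic('reservation-check-in')             -> 'reservation-check-in-retry-2000'
// nextTopic('reservation-check-in-retry-8000')  -> 'reservation-check-in-retry-16000'
// nextTopic('reservation-check-in-retry-16000') -> 'reservation-check-in-dlq'
```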

Maintain order of retry events

The order of events within the same reservation is important. Back to our requirements, a reservation has multiple events that should run in sequence: check-in, update departure date, check-out. If we can't keep this order when retrying, we are in trouble. Let's see why:

A reservation generates two events: event 1 for check-in and event 2 for updating the departure date. When event 1 reaches the key service consumer and processing fails, the key service sends it to the retry topic 2s. The key service consumer then continues with event 2, the departure date update. But the reservation hasn't been checked in yet, so the key service will reject event 2.

So if events belong to the same reservation and event 1 is retrying, event 2 should by default be put into the retry flow too.

Maintain order of retry events

We can use storage to keep the state of all retrying events. If an event of reservation A is in the retry flow, then all other events of reservation A are moved to the retry topics immediately, without being processed in the consumer. This flow includes the following steps (a sketch follows the list):

  1. The main consumer (key service) sends a request to the server.
  2. The request fails with a timeout, internal server error, etc. and needs to be retried.
  3. The main consumer marks event 1 as retrying in storage.
  4. The main consumer sends event 1 to the topic retry_2s.
  5. The main consumer commits the offset, so for Kafka event 1 is processed.
  6. Kafka sends event 2 from the main topic to the main consumer. The main consumer sees that event 1 of the same reservation is retrying.
  7. The main consumer marks event 2 as retrying in storage.
  8. The main consumer sends event 2 to the topic retry_2s.
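A sketch of how the main consumer could apply these steps; the RetryRepository interface and the helper names are assumptions for illustration, not the actual implementation:

```typescript
// Assumed retry-storage interface: tracks which events of a group are retrying.
interface RetryRepository {
  markRetrying(groupId: string, messageId: string): Promise<void>;
  hasRetrying(groupId: string): Promise<boolean>;
  removeRetrying(groupId: string, messageId: string): Promise<void>;
}

interface EventMessage {
  reservationId: string; // group id: events of one reservation stay ordered
  messageId: string;
  value: Buffer | null;
}

// Factory so the sketch stays self-contained: callers inject the real pieces.
function createMainHandler(deps: {
  repository: RetryRepository;
  processEvent: (value: Buffer | null) => Promise<void>;
  sendToRetryTopic: (event: EventMessage, delayMs: number) => Promise<void>;
}) {
  return async function handle(event: EventMessage): Promise<void> {
    const { repository, processEvent, sendToRetryTopic } = deps;

    // Step 6: an earlier event of this reservation is already retrying,
    // so this event skips processing and joins the retry flow immediately.
    if (await repository.hasRetrying(event.reservationId)) {
      await repository.markRetrying(event.reservationId, event.messageId); // step 7
      await sendToRetryTopic(event, 2000); // step 8
      return;
    }

    try {
      await processEvent(event.value); // step 1: request to the key server
    } catch {
      // Steps 2-4: mark as retrying and send to the first retry topic.
      await repository.markRetrying(event.reservationId, event.messageId);
      await sendToRetryTopic(event, 2000);
    }
    // Step 5: returning lets the consumer commit, so Kafka moves past event 1.
  };
}
```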

At each retry consumer, we mimic the main consumer's check for a retrying previous event: if event 2 arrives at the 2s retry consumer while event 1 is sitting in the 4s retry topic, then event 2 is not processed by that consumer and keeps moving on to the 4s retry topic.

So what happens when event 1 is processed successfully? It is removed from retry storage, so event 2 can be handled by the consumers.

If event 1 has been processed, event 2 is still in the retry flow, and event 3 arrives at the main consumer for the same reservation as 1 and 2, then the main consumer will find event 2 in retry storage and move event 3 into the retry flow.

Of course, if event 3 doesn't belong to the same reservation, it will be processed as usual.

Most of the time, consumers only read from retry storage and write only when something fails, so the workload is read-heavy rather than write-heavy. We can use Redis to improve speed.
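For illustration, a retry-storage repository backed by Redis could look like the sketch below (using ioredis; the key naming and interface shape are my assumptions):

```typescript
import Redis from 'ioredis';

// Assumed shape of the retry-storage repository (see the Implementation section).
interface RetryRepository {
  markRetrying(groupId: string, messageId: string): Promise<void>;
  removeRetrying(groupId: string, messageId: string): Promise<void>;
  hasRetrying(groupId: string): Promise<boolean>;
}

// Each group (reservation) gets a Redis set holding the ids of its retrying events.
class RedisRetryRepository implements RetryRepository {
  constructor(private readonly redis = new Redis()) {}

  private key(groupId: string): string {
    return `retrying:${groupId}`;
  }

  async markRetrying(groupId: string, messageId: string): Promise<void> {
    await this.redis.sadd(this.key(groupId), messageId);
  }

  async removeRetrying(groupId: string, messageId: string): Promise<void> {
    await this.redis.srem(this.key(groupId), messageId);
  }

  // Read path: the main and retry consumers call this for every event,
  // which is why a fast, read-heavy store like Redis fits well.
  async hasRetrying(groupId: string): Promise<boolean> {
    return (await this.redis.scard(this.key(groupId))) > 0;
  }
}
```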

Implementation

This flow is complex and we want to hide that complexity by writing it once and shipping it as a library. I built an npm package and use it within my company. Unfortunately, I can't make it public because of our policy, but I will describe it in this article with some pseudo-code.

Before starting, it's better to see how this library is used:
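Roughly, the wiring could look like this sketch. The package name, the start() method and the callback signatures are my assumptions; the option names are the ones explained in the list below, and RedisRetryRepository refers to the Redis sketch shown earlier:

```typescript
import { KafkaReprocessing } from '@altitude/kafka-reprocessing'; // hypothetical package name

// Placeholder for the real call to the key server.
async function sendToKeyServer(_message: { value: Buffer | null }): Promise<void> {}

const reprocessing = new KafkaReprocessing({
  kafkaConfig: { clientId: 'key-service', brokers: ['localhost:9092'] },
  consumerConfig: { groupId: 'key-service' },

  // Retry after 3, 5 and 7 minutes; each delay gets its own topic,
  // e.g. originTopic-retry-180000.
  retryDelays: [180000, 300000, 420000],

  // Events with the same groupId keep their order through the retry flow.
  getGroupId: (message) => JSON.parse(String(message.value)).reservationId,

  // Throw here to trigger the retry flow; swallow errors that shouldn't retry.
  consumeMessageCallback: async (message) => {
    await sendToKeyServer(message);
  },

  // Optional: called when an event has exhausted all retries and lands in the DLQ.
  movedToDLQCallback: async (message) => {
    console.error('Moved to DLQ', message);
  },

  // Retry storage: any implementation of the repository interface (Redis, MongoDB…).
  repository: new RedisRetryRepository(), // from the earlier Redis sketch
});

reprocessing.start().catch(console.error); // hypothetical startup method
```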

I use KafkaJS to work with Kafka. Let's see the input for this library:

  • kafkaConfig: Kafka configuration to connect. Basically the broker addresses and the client id.
  • consumerConfig: Kafka consumer configuration. It is used for the main consumer, the retry consumers, etc.
  • retryDelays: Intervals to retry at, in milliseconds. With the script above, we retry after 3 minutes, 5 minutes and 7 minutes. For each interval, a retry topic is created by combining the name of the main topic and the retry interval, e.g. originTopic-retry-180000.
  • getGroupId: Extracts a group id from an event. All events with the same groupId keep their order. For example, all check-in, check-out and update-departure-date events of a reservation have the reservationId as groupId, so they are processed in sequence even when retries happen.
  • consumeMessageCallback: Callback called when an event is consumed by the main consumer or a retry consumer. Put your code to process an event here. To trigger the retry flow for an event, you must throw an exception in this function. We shouldn't throw for every error: for example, an HTTP status code 400 shouldn't be retried, but a timeout should be.
  • movedToDLQCallback: Callback called when an event has been retried all the configured times but still failed, so it must be moved to the DLQ topic. It is optional. You can put code here to send an email or a notification if needed, but do not write events to retry storage here, because that should be implemented in the repository.
  • repository: Used for retry storage. It's an interface through which you mark an event as retrying in storage. You can implement this interface with any kind of storage such as MongoDB, Redis, etc.

Looks simple for a complex flow, right? Let's go into detail to see what runs inside.

When we initialize the KafkaReprocessing object with the above config, it will create:

  • One main Kafka consumer and three retry Kafka consumers, one for each interval (3 min, 5 min, 7 min)
  • One Kafka producer that the consumers use to publish events

When the main consumer processes a message and it fails, it attaches these headers to the message:

  • x-altitude-retry-message-id: id of the message, used for the repository; unique for each message.
  • x-altitude-retry-attempts: count of retries, increased after each failed attempt.
  • x-altitude-retry-timestamp-ms: epoch time of the retry; when a retry consumer consumes an event, it checks this header and waits until the retry time.

When the main consumer fails to process an event, it adds those headers to the Kafka message and publishes it to the first retry topic. The first retry consumer checks the header x-altitude-retry-timestamp-ms and waits until that time to retry, and it also increases the header x-altitude-retry-attempts by 1.

If a retry consumer sees that the header x-altitude-retry-attempts has reached the maximum number of attempts, it publishes the message to the DLQ topic.
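A sketch of that retry-consumer logic; the header names are the ones listed above, while maxAttempts, the next-interval value and sendToKeyServer are placeholders for illustration:

```typescript
import { Kafka, KafkaMessage } from 'kafkajs';

const kafka = new Kafka({ clientId: 'key-service-retry', brokers: ['localhost:9092'] });
const producer = kafka.producer(); // producer.connect() is assumed to run at startup

const maxAttempts = 3; // illustrative: one attempt per retry topic
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

// Placeholder for the real call to the key server; throws on failure.
async function sendToKeyServer(_message: KafkaMessage): Promise<void> {}

async function handleRetryMessage(
  message: KafkaMessage,
  nextRetryTopic: string,
  dlqTopic: string,
): Promise<void> {
  const headers = message.headers ?? {};
  const retryAt = Number(headers['x-altitude-retry-timestamp-ms']?.toString() ?? '0');
  const attempts = Number(headers['x-altitude-retry-attempts']?.toString() ?? '0');

  // Wait until the scheduled retry time before calling the key server again.
  const waitMs = retryAt - Date.now();
  if (waitMs > 0) await sleep(waitMs);

  try {
    await sendToKeyServer(message);
  } catch {
    const updatedHeaders = {
      ...headers,
      'x-altitude-retry-attempts': String(attempts + 1),
      'x-altitude-retry-timestamp-ms': String(Date.now() + 4000), // next interval, illustrative
    };
    // Max attempts reached: hand over to the DLQ; otherwise go to the next retry topic.
    const target = attempts + 1 >= maxAttempts ? dlqTopic : nextRetryTopic;
    await producer.send({
      topic: target,
      messages: [{ key: message.key, value: message.value, headers: updatedHeaders }],
    });
  }
}
```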

This is pseudo-code for the full event-processing flow.
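Since the actual source can't be shared, the following is a condensed sketch of that flow rather than the library's real code; all the names here (RetryableEvent, FlowDeps, produce, retryTopicFor) are illustrative:

```typescript
// Condensed sketch of the full flow; not the library's actual source.
interface RetryableEvent {
  groupId: string;   // from getGroupId, e.g. the reservationId
  messageId: string; // x-altitude-retry-message-id
  attempts: number;  // x-altitude-retry-attempts
  retryAtMs: number; // x-altitude-retry-timestamp-ms
  value: Buffer | null;
}

interface FlowDeps {
  retryDelays: number[];                      // e.g. [180000, 300000, 420000]
  retryTopicFor: (delayMs: number) => string; // e.g. originTopic-retry-180000
  dlqTopic: string;
  repository: {
    hasRetrying(groupId: string): Promise<boolean>;
    markRetrying(groupId: string, messageId: string): Promise<void>;
    removeRetrying(groupId: string, messageId: string): Promise<void>;
  };
  produce: (topic: string, event: RetryableEvent) => Promise<void>;
  consumeMessageCallback: (event: RetryableEvent) => Promise<void>;
  movedToDLQCallback?: (event: RetryableEvent) => Promise<void>;
}

// Runs in the main consumer (attempts === 0) and in every retry consumer.
async function processEvent(deps: FlowDeps, event: RetryableEvent): Promise<void> {
  // Ordering guard: is an earlier event of this group still retrying?
  const blocked = await deps.repository.hasRetrying(event.groupId);

  if (!blocked) {
    // Retry consumers wait until the scheduled retry time; the main consumer doesn't.
    const waitMs = event.retryAtMs - Date.now();
    if (waitMs > 0) await new Promise((resolve) => setTimeout(resolve, waitMs));

    try {
      await deps.consumeMessageCallback(event); // user code; throws to trigger a retry
      await deps.repository.removeRetrying(event.groupId, event.messageId); // unblock the group
      return;
    } catch {
      // fall through to the retry flow below
    }
  }

  if (event.attempts >= deps.retryDelays.length) {
    // All intervals exhausted: hand the event over to the dead letter queue.
    await deps.produce(deps.dlqTopic, event);
    await deps.movedToDLQCallback?.(event);
    return;
  }

  // Mark as retrying and push to the next retry topic with updated headers.
  await deps.repository.markRetrying(event.groupId, event.messageId);
  const delay = deps.retryDelays[event.attempts];
  await deps.produce(deps.retryTopicFor(delay), {
    ...event,
    attempts: event.attempts + 1,
    retryAtMs: Date.now() + delay,
  });
}
```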

Conclusion

Using multiple retry topics and a dead letter queue topic enables us to retry requests without blocking the main consumer. Developers can easily monitor events that failed all retries in the DLQ and manually remove or reprocess them.

One downside of this approach is that it creates multiple topics if we want to retry many times, so we should limit the number of retries.

Another issue to consider is interrupting the retry flow of an event when needed. We can delete records in Kafka's retry topics (using tombstones).
