A tale of missing offsets

One of the projects at Sixt Tech is a new mobility platform based on concepts like microservices and event sourcing, with technologies including Kubernetes and Apache Kafka. On this platform, we recently encountered an issue we’d like to share, since it’s a mistake that’s easy to make without noticing, and one with potentially massive impact.

Most of our services are written in either Java or Go. The service in this story is written in Go; we mention this only because we’ll be referring to Sarama. The issue itself is a configuration lapse, so it can bite you in any other language just as well.


tl;dr

The default offset retention in Kafka is 24 hours (offsets.retention.minutes).

When there is no update for 24 hours, the offset is gone.

This won’t impact the service while it’s running, because it tracks its position in memory; the missing offset only becomes apparent when there’s a rebalance or the service is restarted.

sarama-cluster used to have a method to avoid this by frequently refreshing the offset, but this feature seems to have been removed again.

Think carefully about Consumer.Offsets.Initial (auto.offset.reset in Java); the current default in our libraries is sarama.OffsetOldest. This means that when no offset is found, the consumer will reprocess ALL messages in the topic/partition. sarama.OffsetNewest, on the other hand, would start with the next message in the topic/partition, so you might miss events.
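For a Sarama-based consumer, this is the setting in question. A minimal sketch, assuming a Sarama version with built-in consumer-group support (the broker address and group name are placeholders for illustration):

```go
package main

import (
	"log"

	"github.com/Shopify/sarama"
)

func main() {
	cfg := sarama.NewConfig()

	// What happens when no committed offset exists (e.g. because it expired):
	// OffsetOldest replays everything still in the partition,
	// OffsetNewest jumps to the next message and may skip events.
	cfg.Consumer.Offsets.Initial = sarama.OffsetOldest

	// Placeholder broker address and group name.
	group, err := sarama.NewConsumerGroup([]string{"localhost:9092"}, "example-group", cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer group.Close()
}
```

Whichever value you pick, make it a deliberate decision: reprocessing everything is only safe if your handlers are idempotent (or deduplicated for long enough), and skipping to the newest message is only safe if losing events is acceptable.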

A Kafka retention period of, for example, 7 days is only a guaranteed minimum: Kafka guarantees that messages are available for at least that long. There is no upper bound, since retention is enforced per log segment and the active segment is never deleted; depending on the message volume in a partition, there may be messages that are months old.


Prelude

We’re currently running Kafka 1.0.1 in a multi-node cluster with default settings. The defaults are sensible enough, and we haven’t had any requirements yet that would justify adjusting them. Additionally, we wanted to avoid premature optimisation that might cause trouble in the long run.


Act I

Once upon a time, there was a service (that we’ll not name). That service had been receiving change notifications over Kafka and happily processing them for years. But on one rainy day in February, it suddenly started to send out weeks-old emails, which left its developers confused and sad.

What confused them even more was the fact that they had actually prepared for that case: their tiny little service contained a deduplication mechanism that prevented messages they had already seen from being processed again. The initial response from their community included replies like “Have you tried turning it off and on again?” and “You’re holding it wrong”, but this didn’t distract them, as they were on a quest to shine light on these sinister events.

When looking at those nasty messages, they noticed that the messages were several weeks old and should have been long gone. So they went on a journey to find a villain who, by accident or on purpose, had reingested those events into the topic they had been joyfully reading all along. However, no such actor was found. The elders of the Internet presented them with a clue: the retention configured in Kafka is only a lower bound, which ensures that data is available for at least the configured period (7 days), but there is no upper limit. This struck them, as their deduplication only covered a period of 14 days, which was clearly not sufficient in this case.

But it remained a mystery how these messages reappeared magically in their service.


Act II

After some council meetings, it emerged that there might be something wrong with the offset handling. So another quest was started.

The first stop was their initialisation code, to verify that a proper consumer group was being used. It was.

So they made their way into the depths of Kafka. There they unpacked their tools and invoked kafka-consumer-groups.sh, which let them view the current state of the consumer group and the partition offsets for the topic in question. Everything looked nice and clean; the lag was 0. But what was THAT? The offset column for one partition showed only a dash. That didn’t seem normal.

So they hastily moved over to their metrics in Grafana to check the consumer group dashboard. Looking at the data for the past week, they noticed many instances where the lag was reported as -1, which seemed very odd to them.

Nervously they convened again for a council meeting to find explanations for this unusual behaviour.

Eventually they formed the suspicion that the offset was being removed due to inactivity, especially since the topic in question did not receive many messages. So they went back to their metrics to look at their handler statistics. There they could see that the topic had received fewer than 50 messages in the last 7 days. Even without a spreadsheet, they could quickly grasp that 50 messages divided across 12 partitions meant each partition in that topic received only about 4 messages in 7 days, i.e. well over a day between messages on average. This fact, combined with their knowledge that partition offsets are only retained for 24 hours, let them realise that the missing offset was caused by the retention period expiring.
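The back-of-the-envelope maths from the paragraph above can be sketched in a few lines of Go (the numbers are the ones from the story):

```go
package main

import "fmt"

func main() {
	// Numbers from the story: ~50 messages spread over 12 partitions
	// within 7 days, versus the default offset retention of 24 hours.
	const (
		messages             = 50.0
		partitions           = 12.0
		windowHours          = 7 * 24.0
		offsetRetentionHours = 24.0
	)

	perPartition := messages / partitions     // ≈ 4.2 messages per partition per week
	avgGapHours := windowHours / perPartition // ≈ 40 hours between messages

	fmt.Printf("messages per partition per week: %.1f\n", perPartition)
	fmt.Printf("average gap between messages: %.0f hours\n", avgGapHours)
	fmt.Printf("offset retention exceeded: %v\n", avgGapHours > offsetRetentionHours)
}
```

With an average gap of roughly 40 hours between commits, a 24-hour offset retention routinely expires on every partition of such a low-volume topic.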

With this enlightenment, they sent another request to the elders of the Internet, who responded by presenting them with a Kafka change request titled Revise Expiration Semantics of Consumer Group Offsets. So they knew they were right: the behaviour they had seen was indeed caused by Kafka deleting their precious offset due to inactivity. And when they innocently deployed a new version of their service, it could not find an offset, so it summoned sarama.OffsetOldest, which triggered the reappearance of those ghostly messages.


Returning peace

With their newly gained knowledge about the cascading edge cases that had led to their journey in the first place, a request was sent to the SREs to have that limit increased. This is also a change that was made in Kafka 2.0, which raised the default offset retention from 24 hours to 7 days.
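The broker setting involved is offsets.retention.minutes. Raising it to 7 days in server.properties, the value Kafka 2.0 later made the default, would look like this (10080 minutes, shown as an example value):

```properties
# Keep committed consumer offsets for 7 days instead of the pre-2.0 default of 24 hours.
offsets.retention.minutes=10080
```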

They also consulted the library to see whether any of the many scrolls stored there contained prophecies that this would eventually happen. It left them scared when they not only found no warnings of this, but also discovered inaccuracies in some places. Those would not necessarily have prevented the events, but they were wrong anyway and needed correction.

https://xkcd.com/386/

They also started crafting this report as a means to warn future adventurers about the dangers that are hidden in Kafka.

To be continued…