Kafka in F1 — Replaying messages
For the past few years, I've been leading the development of a large, highly scalable, event-driven Microservices platform, acting as the organs for an exciting application at the heart of Formula One.
Why Kafka? In an environment where every millisecond counts and telemetry is being spewed out of cars that drive at 372.5 km/h, you need to process thousands of pieces of data every second. Although a complex beast, Kafka can stand up to the challenge.
Why would you need to replay messages?
Imagine it’s Friday, cars on track, data coming in at a rate of knots. One heroic Microservice scaled across the horizon ingesting raw telemetry via UDP, processing and Producing that data into a topic on the message-bus. Another Service consuming said topic whilst piping it into a Database. A perfect bliss of events streaming through the cluster.
Unbeknownst to the engineers, a colleague has tripped over a wire, disconnecting power to a large array of SSDs (Ok, for the sake of dramatic effect we can pretend all our storage is in one chassis…).
Alerts start to fire… The engineers rush to find the problem. One single dangling power-cable is found at the scene of the crime. Instantly, it’s plugged back in. But it’s too late. The cars have returned to the garages…
By this point millions of messages have failed to reach their final destination in the Database. The practice session is wasted, and the team is unable to perform analytics on the session’s data.
Through all of this, the consuming service has been happily reading from the topic, unable to place the data into the target DB while the disk chassis was offline.
If only there were a way to go back in time…
How does Kafka even work?!
I’m not going to go into great depth about the inner workings of Kafka, otherwise this article will have a read time of roughly 35280 minutes…
To replay our messages, though, we need a little understanding of how things work. Well, at least Topics, Partitions and Offsets…
Kafka gives us the means to categorise, group, order and store messages of certain types by using Topics. A topic is essentially an append-only log of events: any time a new event is received, it is appended to the end of the log.
Partitions are Kafka’s way of enabling concurrency. A topic has n partitions, each holding its own ordered event log, so ordering is guaranteed within a partition rather than across the whole topic.
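Every message in a partition is identified by an offset, which is its position in that partition’s log. As a Consumer reads messages from the event-log (topic), it commits the offset it has reached on each partition it reads from, and Kafka stores that position against the consumer’s group.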
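To make topics, partitions and offsets a little more concrete, here is a minimal in-memory sketch in Python. It is a toy model, not the Kafka client API: the names ToyTopic, ToyConsumerGroup and the "car-44" key are made up for illustration, and crc32 is only a stand-in for the hash a real partitioner would use.

```python
from collections import defaultdict
from typing import Optional, Tuple
from zlib import crc32


class ToyTopic:
    """In-memory stand-in for a topic: one append-only list per partition."""

    def __init__(self, num_partitions: int) -> None:
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key: str, value: str) -> Tuple[int, int]:
        # Hash the key so the same key always lands on the same partition.
        # crc32 is a stand-in here, not the hash Kafka's partitioner uses.
        partition = crc32(key.encode()) % len(self.partitions)
        log = self.partitions[partition]
        log.append(value)
        return partition, len(log) - 1  # offset = index in that partition's log


class ToyConsumerGroup:
    """Tracks, per partition, the offset of the next message to read."""

    def __init__(self) -> None:
        self.committed = defaultdict(int)  # partition -> next offset

    def poll(self, topic: ToyTopic, partition: int) -> Optional[str]:
        offset = self.committed[partition]
        log = topic.partitions[partition]
        if offset >= len(log):
            return None  # caught up: nothing new on this partition
        self.committed[partition] = offset + 1  # "commit" after reading
        return log[offset]


# Two events for the same (hypothetical) car end up on the same partition.
topic = ToyTopic(num_partitions=3)
partition, first_offset = topic.append("car-44", "speed=310")
_, second_offset = topic.append("car-44", "speed=312")

group = ToyConsumerGroup()
first_read = group.poll(topic, partition)
second_read = group.poll(topic, partition)
```

The important detail is that reading never removes anything from the log. The group only moves its own position forward, which is what makes rewinding that position possible later.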
Back to the story… Our Consumer is now at the latest offset on each partition, hundreds of thousands of offsets away from the start of the session. Luckily, all these events are retained inside our topic. Could we use this to go back in time?
Going back to the future…
Now that we understand topics, partitions and offsets, we can use this knowledge to populate our now healthy Database!
If we could just set the offset on our consumer to a place before the practice session began, our consuming service would be able to read and process the events as if they were happening live.
Thankfully, there are lots of tools available that allow us to do just that. Using the kafka-consumer-groups CLI we can reset our consumer’s offsets on each partition back to a given time. Hoorah!
kafka-consumer-groups --bootstrap-server <kafkahost:port> --group <group_id> --topic <topic_name> --reset-offsets --to-datetime 2021-08-04T12:30:00.000 --execute
We can even reset the consumer’s offsets back by a given duration (e.g. move them back a few hours from now), by using --by-duration PT2H0M0S.
Note: All Consumers in the target consumer group must be stopped before resetting a topic’s offsets.
Once we’ve wound back time to before the disaster, we can bring our Service back up. The consumer will start to re-consume all of our events into our now healthy database, starting from the exact time we specified.
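To take some of the magic out of a time-based reset, here is a rough Python sketch of the lookup it boils down to: for each partition, find the first offset whose timestamp is at or after the target time. This is a toy model under stated assumptions, not the tool's actual implementation. The function names and sample timestamps are made up, and it assumes timestamps never decrease within a partition.

```python
from bisect import bisect_left
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def offset_for_time(timestamps: List[datetime], target: datetime) -> int:
    """First offset whose timestamp is >= target, or the log end if none is."""
    # Assumes timestamps are sorted, i.e. they never go backwards in a partition.
    return bisect_left(timestamps, target)


def reset_to_datetime(
    partitions: Dict[int, List[datetime]], target: datetime
) -> Dict[int, int]:
    """New offset per partition for a reset to an absolute point in time."""
    return {p: offset_for_time(ts, target) for p, ts in partitions.items()}


def reset_by_duration(
    partitions: Dict[int, List[datetime]], duration: timedelta, now: datetime
) -> Dict[int, int]:
    """New offset per partition for a reset to 'now minus duration'."""
    return reset_to_datetime(partitions, now - duration)


def at(hour: int, minute: int, second: int = 0) -> datetime:
    """Helper for building sample timestamps on the day of the session."""
    return datetime(2021, 8, 4, hour, minute, second, tzinfo=timezone.utc)


# Hypothetical message timestamps, indexed by offset, for two partitions.
partitions = {
    0: [at(12, 29, 58), at(12, 30, 0), at(12, 30, 2)],
    1: [at(12, 20), at(12, 25), at(12, 29, 59), at(12, 30, 1)],
}

by_datetime = reset_to_datetime(partitions, at(12, 30))
by_duration = reset_by_duration(partitions, timedelta(hours=2), now=at(14, 30))
```

Each partition gets its own new offset, because the same moment in time sits at a different position in each partition's log.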
Pitfalls
In the solution above we were required to bring down our consuming service in order to replay our old events, which completely halts the processing of any live data we may be receiving.
One way around this is by having an identical service with a different consumer group ID that acts as a dormant replay-service until it is required. This replay-service would then be the consumer which gets its offset reset, leaving the live service to function as normal.
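The reason this works is that committed offsets belong to a consumer group, not to the topic itself. A tiny Python sketch of that idea follows. ToyOffsetStore and the group IDs "telemetry-live" and "telemetry-replay" are made-up names for illustration, and this is an in-memory model rather than anything Kafka ships.

```python
from typing import Dict, Tuple


class ToyOffsetStore:
    """Committed offsets keyed by (group ID, partition)."""

    def __init__(self) -> None:
        self._committed: Dict[Tuple[str, int], int] = {}

    def commit(self, group_id: str, partition: int, offset: int) -> None:
        self._committed[(group_id, partition)] = offset

    def position(self, group_id: str, partition: int) -> int:
        # A group that has never committed starts from the beginning here.
        return self._committed.get((group_id, partition), 0)


store = ToyOffsetStore()

# Both groups have read the same partition up to the same point.
store.commit("telemetry-live", 0, 500_000)
store.commit("telemetry-replay", 0, 500_000)

# Rewind only the replay group to where the session started.
store.commit("telemetry-replay", 0, 120_000)

live_position = store.position("telemetry-live", 0)
replay_position = store.position("telemetry-replay", 0)
```

Rewinding the replay group leaves the live group's position untouched, so the live service never needs to stop.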
However, we have to be careful. More complications could arise from replaying past messages, such as duplication of data. Typically we can solve this by ensuring our consumers handle messages in an idempotent manner, so that processing the same message twice leaves the data in the same state as processing it once.
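One possible way to get idempotent writes is to give every row a key that is unique per message and let the database ignore repeats. The Python sketch below uses an in-memory SQLite database with the partition and offset as that key. The table layout and function names are assumptions made for illustration, and a natural key taken from the event itself may suit a real system better.

```python
import sqlite3


def create_store() -> sqlite3.Connection:
    """Create an in-memory table keyed by where each message came from."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE telemetry (
            partition_id INTEGER NOT NULL,
            msg_offset   INTEGER NOT NULL,
            payload      TEXT    NOT NULL,
            PRIMARY KEY (partition_id, msg_offset)
        )
        """
    )
    return conn


def handle_message(
    conn: sqlite3.Connection, partition: int, offset: int, payload: str
) -> None:
    """Store a message; a repeat of the same partition and offset is a no-op."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO telemetry VALUES (?, ?, ?)",
            (partition, offset, payload),
        )


conn = create_store()

# The same message arrives twice, as it would during a replay.
for _ in range(2):
    handle_message(conn, partition=0, offset=42, payload="speed=310")
handle_message(conn, partition=0, offset=43, payload="speed=312")

row_count = conn.execute("SELECT COUNT(*) FROM telemetry").fetchone()[0]
```

With this in place, replaying a window that overlaps data already stored only fills in the gaps.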
Conclusions
Kafka allows us to recover from disasters like this one when all hope seems lost, as long as the events are still retained in the topic, whilst satisfying the need for low-latency, high-throughput messaging.
There’s lots we can do to ensure we don’t lose our data: topic replication, acknowledgements and much more. But that’s for another article…
Jordan is a Full-stack Software Engineer who specialises in leading and designing both highly scalable Microservices and Progressive web applications!