Moving a Busy Kafka Cluster (Part 3)

Ralph Brendler
project44 TechBlog
Published in
Dec 6, 2022

In previous installments, we set up XDCR to migrate the data for a busy Kafka cluster onto entirely new infrastructure, and updated the Kafka clients to work in a multi-cluster environment.

With these changes in place, we have one final hurdle to overcome before we can actually migrate the services — consumer offsets.

Managing Consumer Offsets

When we deploy one of our services to the NEW cluster for the first time, there will be no saved offsets. How the consumer handles this depends on the auto-offset-reset configuration:

  • auto-offset-reset: latest (the default) means to ignore any messages already in the topic, and start processing new records as they come in
  • auto-offset-reset: earliest means to replay all messages from the beginning of the topic history
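
As a rough illustration, here is what that setting looks like with the plain Java Kafka consumer (in the raw client config the property is spelled auto.offset.reset). The bootstrap address, group id, and topic below are placeholders, not our actual configuration:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OffsetResetExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "new-cluster:9092"); // placeholder address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-service");                // placeholder group id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // With no committed offsets for this group on the NEW cluster, this setting decides
        // where the consumer starts: "latest" skips the backlog, "earliest" replays it.
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("some-topic")); // placeholder topic
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("%s-%d@%d%n", record.topic(), record.partition(), record.offset());
            }
        }
    }
}
```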

If we use earliest, we are guaranteed not to miss any messages, but we may end up replaying millions of messages, which can take time and resources. If the topic is fairly small, this is usually the safest choice.

If we use latest we may miss some messages that come in during the initial cutover to the new cluster. This may or may not be a problem, depending on how the service works. For low-value messages or messages that will be replayed periodically, this approach works well.

Most of our topics fell into one of these two buckets, but in the cases where we didn’t want to replay all of the messages but also couldn’t allow any messages to be missed, the deployment became a bit more complicated.

We experimented with many ways to deal with this issue, but ended up with a fairly straightforward multi-step solution that became affectionately known as “The Rollback Trick”.

The Rollback Trick

The secret here was to create the consumer offsets on the NEW (Strimzi-managed) cluster by doing a deploy, then immediately rolling back to the previous version.

The basic procedure goes something like this:

  • Set auto-offset-reset to latest in the service configuration
  • Deploy the new version of the service, pointing at the NEW cluster. This will start reading from the end of the topic, and establish the consumer offsets in the NEW cluster. Note that some messages sent to the OLD cluster during this migration may have been missed!
  • Once the service comes up, immediately roll back to the previous version (pointing at the OLD cluster), and let it run for a few minutes. This will pick up processing where the OLD service left off, and process any messages that might have been missed during the move to the NEW cluster.
  • Once we are sure all of the missed messages have been replayed, we can deploy the new version again. The service will pick up where it left off, and no messages will have been missed.
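
One way to sanity-check the first deploy is to confirm that the consumer group’s offsets actually exist on the NEW cluster before flipping back and forth. Here is a rough sketch using the Kafka AdminClient; the group id and bootstrap address are placeholders:

```java
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class VerifyOffsets {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "new-cluster:9092"); // placeholder address

        try (AdminClient admin = AdminClient.create(props)) {
            // List the committed offsets for the service's consumer group on the NEW cluster.
            // If the first deploy worked, every partition should have an entry near the log end.
            Map<TopicPartition, OffsetAndMetadata> offsets = admin
                .listConsumerGroupOffsets("my-service") // placeholder group id
                .partitionsToOffsetAndMetadata()
                .get();

            offsets.forEach((tp, om) ->
                System.out.printf("%s -> offset %d%n", tp, om.offset()));
        }
    }
}
```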

This trick also minimizes the number of messages replayed, which is really helpful for messages that are expensive to process.

Decommissioning the Old Cluster

Once all of the services/applications have been migrated to the NEW cluster, we are finally in a state where all consumers are on the NEW cluster, but all producers are still pointing to the OLD cluster.

The final step in the migration is to flip the producers to the NEW cluster (we store this configuration in a shared Kubernetes ConfigMap, so it’s a pretty easy change), and do a rolling restart of all applications.
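
As a sketch of what that wiring might look like, here is a producer that takes its bootstrap address from an environment variable populated from a shared ConfigMap, so changing the ConfigMap and rolling the pods repoints every producer without a code change. The variable name, topic, and address are placeholders, not our actual configuration:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ConfigMapBackedProducer {
    public static void main(String[] args) {
        // KAFKA_BOOTSTRAP_SERVERS is a hypothetical env var name, injected from the shared
        // ConfigMap; flipping the ConfigMap value and restarting the pods is all it takes
        // to point producers at the NEW cluster.
        String bootstrap = System.getenv().getOrDefault("KAFKA_BOOTSTRAP_SERVERS", "localhost:9092");

        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("some-topic", "key", "value")); // placeholder topic
        }
    }
}
```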

Once the restart completes, all traffic is being processed through the NEW cluster, but there may still be some “in-flight” traffic being mirrored. Once we validated that there was no traffic on the mirror topics (monitored over 24 hours, just to be on the safe side), we could fully decommission: MM2 was removed, the OLD.* topics were deleted, and the old Kafka brokers were scaled down and removed.
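
A quick way to check that a mirror topic has gone quiet is to compare its end offsets at two points in time; if they haven’t moved, nothing new is arriving. Here is a rough sketch (the topic name, bootstrap address, and one-minute sample window are placeholders; in practice we watched for much longer):

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class MirrorTopicQuietCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "new-cluster:9092"); // placeholder address
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Resolve the partitions of one mirrored topic (placeholder name).
            List<TopicPartition> partitions = consumer.partitionsFor("OLD.some-topic").stream()
                .map(pi -> new TopicPartition(pi.topic(), pi.partition()))
                .collect(Collectors.toList());

            // Sample the end offsets twice; if nothing moved, the mirror is quiet.
            Map<TopicPartition, Long> before = consumer.endOffsets(partitions);
            Thread.sleep(60_000);
            Map<TopicPartition, Long> after = consumer.endOffsets(partitions);

            boolean quiet = partitions.stream().allMatch(tp -> before.get(tp).equals(after.get(tp)));
            System.out.println(quiet ? "no new traffic on mirror topic" : "mirror topic still moving");
        }
    }
}
```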

The Result

Using the steps outlined here, we were able to migrate several large and busy Kafka clusters to an entirely new environment with no downtime and minimal impact on the existing applications. Over the course of a few weeks, we migrated dozens of services to new Kafka clusters with minimal lag and zero messages lost, all while maintaining our target uptime.

Ralph Brendler
project44 TechBlog

Principal software engineer with a long history of startup work. Primary focus is currently on scalability and distributed computing.