How to Migrate a Kafka Cluster with Zero Downtime

Dheeraj Kulakarni
MiQ Tech and Analytics
5 min read · Jun 22, 2021

At MiQ, we strive to make sure all our applications and microservices are Highly Available (HA) and have a Disaster Recovery (DR) mechanism in place. We have a proprietary messaging broker built on top of Apache Kafka. We had been using a single-node managed Kafka cluster, which needed to be migrated to an HA cluster for the service to be highly available. Since this is a critical production application, we wanted to avoid downtime during the upgrade.

To upgrade without downtime, we had to make sure that no new messages were lost and that consumers picked up messages from where they left off. To achieve this, we wanted to keep the old and new clusters in sync; once they were in sync, we could simply toggle the datastore.

Mirror Maker 2

Mirror Maker 2 (MM2) is a utility that ships with Apache Kafka itself for replicating data. It copies both topic data and consumer offsets to the new Kafka cluster. One of the main advantages of MM2 over other tools in the market is its strong support for replicating consumer offsets: you don't need to stop consumers while their offsets are replicated, which reduces consumer downtime during the migration. Configuration is very simple; you just need to provide some basic information. Below is a sample Mirror Maker 2 configuration.
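A minimal mm2.properties along these lines should work (this is a sketch; the bootstrap server addresses are placeholders for your old and new clusters):

clusters = A, B
A.bootstrap.servers = old-kafka:9092
B.bootstrap.servers = new-kafka-1:9092,new-kafka-2:9092,new-kafka-3:9092

# replicate all topics and consumer groups from A to B
A->B.enabled = true
A->B.topics = .*
A->B.groups = .*

# periodically sync consumer group offsets to the target cluster (Kafka 2.7+)
sync.group.offsets.enabled = true
sync.group.offsets.interval.seconds = 60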

Notice that we have named our Kafka clusters A and B, but you can name them anything and enable replication from A to B using the property A->B.enabled=true. You can define any number of clusters and enable any combination you want. If you want to run two clusters and maintain the same copy of data in both, you can do that by enabling both A->B and B->A; this is one of the main purposes of Mirror Maker 2. For more information on Mirror Maker 2, refer to the Apache Kafka documentation.

When MM2 replicates topic data, it writes the data to a topic prefixed with the source cluster name. For example, to copy the data of topic X in cluster A to cluster B, MM2 replicates it to a topic named A.X in cluster B.

In our case, we needed to keep the same topic name so that consumers wouldn't break when pointed to the new cluster. Luckily, Mirror Maker 2 provides a way to override this behavior by extending the DefaultReplicationPolicy class and overriding its methods. The project https://github.com/strimzi/mirror-maker-2-extensions has already done this job if you just want to copy to a topic with the same name; you only need to add its jar to the libs folder of Kafka and set the property below:

replication.policy.class=io.strimzi.kafka.connect.mirror.IdentityReplicationPolicy

Okay, so everything is fine with Mirror Maker 2, as long as you don't use Kafka transactions!

We use Kafka transactions in our service, and they are important for our use case. Unfortunately, Mirror Maker 2 doesn't work well with transactions. Each transaction uses up an extra offset in the partition (for the transaction marker) while publishing data to a topic. So when Mirror Maker consumes the data and republishes it to the new cluster, the offsets don't line up: the end offset of the new topic will be less than the end offset of the original topic. The consumer's current offset, however, is copied as-is, which leads to the possibility of negative lag.

Note: This is based on the current behavior of Mirror Maker 2; we hope it will be resolved in the future.

Let's understand this with an example:

  1. Consider consumer A listening to topic X, with a topic end offset of 10 and a consumer current offset of 10, which means the consumer has consumed all the data.
  2. Run Mirror Maker 2.
  3. When the data is copied to the new cluster, the consumer's current offset will be 10, but the actual end offset of the topic in the new Kafka cluster will be less than 10 (say 6 or 7), since some of the original offsets were transaction markers.
  4. Now when the consumer connects to this topic in the new cluster, it doesn't know from which offset to poll: normally it would poll from offset 11, but that offset doesn't exist at all.
  5. In this case the consumer falls back to the auto.offset.reset property. If auto.offset.reset is set to:
  • latest: you will receive only events published after the consumer connects. Note that this also means every new consumer receives only events published after it connects to the topic.
  • earliest: you will receive all the events from the earliest offset available in the topic.
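You can observe this mismatch on the new cluster with Kafka's standard consumer-groups tool (the broker address and group name here are placeholders):

bin/kafka-consumer-groups.sh --bootstrap-server new-kafka:9092 --describe --group consumer-A

With the numbers from the example above, the describe output would show a CURRENT-OFFSET of 10 against a LOG-END-OFFSET of 6 or 7, i.e. a negative LAG.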

Unfortunately, neither of the available options worked for us: we couldn't afford to lose unconsumed events, nor to send duplicate events to our consumers. After exploring many approaches and doing some POCs, we narrowed it down to two approaches and finalized one. Here is a quick overview of both, followed by the one we went ahead with.

Approach 1:

In this approach, we replicate the topic data to the new cluster with MM2, stop accepting publish calls, and wait for every consumer's lag on the old cluster to become 0. Once all consumers have fully caught up, producers and consumers are toggled to the new cluster, where the old data is also available.

Advantages:

  1. No loss of unconsumed events.
  2. Availability of old events: if a new consumer joins, it can also receive whatever old data is in the topic.

Disadvantages:

  1. Downtime of messaging publish calls during the transition state, while we wait for all consumers' lag to become 0 (see the check below).
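The gating step in this approach is verifying that every consumer group on the old cluster has fully caught up. One simple way to check (a sketch; the broker address is a placeholder) is to describe each group and confirm the LAG column reads 0 everywhere:

for g in $(bin/kafka-consumer-groups.sh --bootstrap-server old-kafka:9092 --list); do
  bin/kafka-consumer-groups.sh --bootstrap-server old-kafka:9092 --describe --group "$g"
done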

Approach 2:

This approach is based on the assumption that consumers don't need the old events. Going with it means accepting that once we move to the new Kafka cluster, the older data will no longer be available; a consumer that joins at that point will receive only events published to the new Kafka cluster.

Advantages:

  1. No loss of unconsumed events.
  2. No messaging downtime, only deployment downtime.

Disadvantages:

  1. Delay of events during the transition, while we wait for all consumers' lag to become 0.
  2. No old events will be available in the new Kafka cluster.
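One way to wire up the toggle for this approach, assuming producers are switched to the new cluster first and consumers follow once their lag on the old cluster reaches 0 (a sketch; the broker addresses are placeholders):

# consumer configuration after the toggle
bootstrap.servers = new-kafka-1:9092,new-kafka-2:9092,new-kafka-3:9092
# topics in the new cluster start empty, so earliest picks up exactly the
# events published after producers switched over: nothing lost, nothing duplicated
auto.offset.reset = earliest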

Conclusion:

When dealing with such migrations, it is important to keep the number of moving parts small and the approach as simple as possible; this ensures that the risk of failure in production is minimal. For our use case, we

  1. decided not to keep the already-consumed data, and
  2. were okay with a delay in consumption of a couple of minutes,

which made the migration process simple enough to avoid any downtime and failures, and that is exactly what Approach 2 offered.

For your own migrations, we suggest first understanding what kinds of failures or losses are acceptable during the migration; keeping only the most important constraints will significantly reduce the effort and risk of the migration. Below is a small checklist that we followed:

  1. Time taken for the migration should be minimal
  2. Minimal manual intervention
  3. Runbook for the changes
  4. Good to have a rollback strategy
