Migrating Kafka with Mirror Maker 2 and Kafka Connect: A Step-by-Step Guide
Problem Statement
We need to migrate data from one Kafka cluster to another without compromising data integrity. The solution must move data between the two clusters quickly and reliably while preserving message delivery order, and it must scale to handle large volumes of data.
This brings several challenges when migrating data from one Kafka cluster to another:
1. Data Loss: One of the biggest challenges in migrating data from one Kafka cluster to another is the potential for data loss. Kafka brokers do not replicate data across clusters on their own, so data can be lost if replication between the two clusters is not set up and monitored correctly.
2. Synchronization: Another challenge is ensuring that the data is synchronized between the two clusters. Data must be copied from one cluster to the other in order to maintain consistency.
3. Security: Security is also a challenge when migrating data between clusters. Security credentials must be configured for each cluster in order to ensure that the data is secure and encrypted.
4. Performance: Performance can also be affected when migrating data between clusters. The transfer of data between clusters can take time, which can affect the performance of applications that rely on the data.
5. Data Formatting: Data must also be properly formatted in order to be moved between clusters. Different types of data may require different formatting, which can add complexity to the migration process.
Here are a few possible options to migrate Kafka from one cluster to another or have an Active-active DR strategy:
1. Replicator Tool: Confluent provides a tool called Replicator (part of Confluent Platform) that can be used to replicate topics from one cluster to another.
2. Mirror Maker 2: Mirror Maker 2 is an open-source tool, shipped with Apache Kafka, that replicates topics from one Kafka cluster to another. It is built on the Kafka Connect framework, which allows topics to be replicated with minimal disruption to the source cluster and supports parallel replication of topics, which is especially useful if you are migrating large amounts of data. In addition, Mirror Maker 2 can replicate data from multiple source clusters to a single target cluster, which makes it a great option for consolidating data from multiple sources into a single target Kafka cluster.
It is important to note that Mirror Maker 2 does not move your consumer applications for you: translated consumer offsets are exposed through its checkpoint mechanism rather than being committed on the target cluster by default, so cutting consumers over still requires some planning. It is, however, well suited for quickly migrating data between clusters.
3. Kafka Connect: Kafka Connect is a distributed, fault-tolerant, and scalable data integration tool for streaming data between Apache Kafka and other systems. It has several connectors that can be used to move data from source systems to target Kafka clusters. It is a great choice if you need to move data from a source system that cannot be directly connected to the Kafka cluster, such as a database or a flat file system. Kafka Connect also allows you to set up multiple tasks to move data from the source system to the target Kafka cluster. This is especially useful for large data migrations since it allows you to parallelize the data migration process.
4. Custom Scripts/Manually: It is also possible to write custom scripts to replicate topics from one cluster to another, for example with the Kafka command-line tools, kcat (formerly Kafkacat), or a similar utility; a minimal sketch follows this list.
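As a rough illustration of the manual approach, a single topic can be copied by piping a kcat consumer into a kcat producer. This is only a sketch: the broker addresses and topic name are placeholders, and a plain pipe like this does not preserve message keys, partition assignment, or cross-partition ordering, so it is only suitable for small, ad-hoc copies.

# consume everything from the source topic and produce it to the target cluster
kcat -b source-broker:9092 -t my-topic -C -e -q | \
  kcat -b target-broker:9092 -t my-topic -P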
Mirror Maker 2 can be leveraged either in Standalone mode or using Kafka Connect. In this blog post, we will focus on how Mirror Maker 2 and Kafka Connect can be used to migrate Kafka data.
Kafka Connect + Mirror Maker 2 Migration Steps
A complete solution is available in the following GitHub repo:
Clone this repo and follow the steps.
git clone https://github.com/maxyermayank/kafka-migration-mirror-maker2.git
Step 1 — Set up Kafka Connect and other services as needed
Deploy Kafka Connect using Docker or your preferred approach. From the repository cloned above, run the commands below to deploy the services in Docker Swarm mode, or modify the docker-compose.yml file to run them as regular containers with Docker Compose.
The steps below assume Docker Swarm is already initialized.
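If it is not, a single-node swarm (which is enough for this setup) can be created with:

docker swarm init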
Update the following variables in the .env file for your environment (a sample .env follows the list):
- BOOTSTRAP_SERVERS — your destination/target Kafka brokers
- SCHEMA_REGISTRY_HOSTNAME — test.company.com
- CONNECT_REST_ADVERTISED_HOST_NAME — test.company.com
- KAFKA_REST_HOST_NAME — test.company.com
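For reference, a minimal .env might look like this; the broker addresses are placeholders and the hostnames should match your own environment:

# .env (example values only)
BOOTSTRAP_SERVERS=target-broker-1:9092,target-broker-2:9092
SCHEMA_REGISTRY_HOSTNAME=test.company.com
CONNECT_REST_ADVERTISED_HOST_NAME=test.company.com
KAFKA_REST_HOST_NAME=test.company.com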
Run the services in Docker Swarm mode using the following commands:
export $(cat .env) #Source Environment Variables
docker stack deploy -c compose-files/docker-compose.yml -c compose-files/development.yml kafka-services
Check the status of docker swarm services using the following command:
docker service ls
Once all services are up and running, you can verify that Kafka Connect is reachable by hitting http://localhost:8083/connectors in a browser or with curl.
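For example, the root endpoint reports the Connect worker version, and /connectors lists the configured connectors (an empty array until Step 2 is complete):

# returns the Connect worker version and the Kafka cluster id
curl http://localhost:8083/

# returns a JSON array of configured connectors
curl http://localhost:8083/connectors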
Step 2 — Configure Kafka Connectors for Mirror Maker
A total of three connectors are required:
- Source Replication Connector — syncs topics and consumer groups between the source and destination Kafka clusters.
- Checkpoint Connector — Mirror Maker records checkpoints as it replicates data, so if replication is interrupted it can pick up where it left off.
- Heartbeat Connector — as the name suggests, maintains a heartbeat between the two clusters.
You can set up the connectors using Postman, curl, or any other HTTP client. The repository provided above also includes a Postman collection and an environment variables JSON file.
Once Postman is open, click File → Import, which brings up an upload prompt. Select the environment file and import it.
Modify the environment variables as shown in the screenshot below:
You may or may not need the username/password variables, depending on how your Kafka clusters are secured. If the two clusters use different credentials, make sure to set both SOURCE_USERNAME and TARGET_USERNAME.
Now import the Postman collection in the same way. Add a comma-separated list of the topics and consumer groups you do not wish to replicate to the blacklist, and update the Kafka cluster security variables as needed. In my case, I had ACLs with username/password authentication as well as SSL communication.
As seen in the side navbar of the screenshot, there are three PUT requests that set up the replication, checkpoint, and heartbeat connectors. Delete endpoints are also provided in case you want to run through the process several times and need to delete the connectors for a clean run.
There are also GET requests you can hit to check connector status.
In the JSON body of the MirrorMakerReplication connector request, you can also increase tasks.max for more parallel execution.
Make sure to run all 3 PUT requests to set up these connectors.
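Behind the scenes, these requests configure the standard Mirror Maker 2 connector classes (MirrorSourceConnector, MirrorCheckpointConnector, and MirrorHeartbeatConnector) through the Kafka Connect REST API. As a rough sketch, a curl equivalent of the replication request might look like the following; the connector name, cluster aliases, broker addresses, and blacklist entries are placeholders (newer Kafka versions use topics.exclude instead of topics.blacklist), and the exact JSON bodies, including the security settings, are the ones in the Postman collection from the repo.

curl -X PUT http://localhost:8083/connectors/mm2-source/config \
  -H "Content-Type: application/json" \
  -d '{
    "connector.class": "org.apache.kafka.connect.mirror.MirrorSourceConnector",
    "source.cluster.alias": "source",
    "target.cluster.alias": "target",
    "source.cluster.bootstrap.servers": "source-broker:9092",
    "target.cluster.bootstrap.servers": "target-broker:9092",
    "topics": ".*",
    "topics.blacklist": "topic-to-skip-1,topic-to-skip-2",
    "tasks.max": "4"
  }'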
Step 3 — Validate Connectors’ status
Once the connectors are up and running, the data migration starts automatically: Kafka Connect reads data from the source cluster and writes it to the destination cluster.
You can check in a browser at http://localhost:8083/connectors, or use the GET requests provided in the Postman collection.
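With curl, you can also check the status of an individual connector; mm2-source below is a placeholder for whatever connector name your Postman collection uses. The connector and each of its tasks should report a state of RUNNING:

curl http://localhost:8083/connectors/mm2-source/status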
Step 4 — Monitor the replication process
You can monitor the replication process using any of the following approaches, or combine them to get comfortable with this solution:
- Grafana
- Tools like Kafdrop
- The kafka-topics and kafka-consumer-groups CLI tools.
A few useful CLI commands:
List Kafka Topics
kafka-topics \
--command-config <YOUR_PROPERTIES_FILE> \
--bootstrap-server <COMMA_SEPARATED_BROKERS> \
--list
Get Topic Message Count (latest offset per partition)
kafka-run-class kafka.tools.GetOffsetShell --command-config <YOUR_PROPERTIES_FILE> \
--bootstrap-server <COMMA_SEPARATED_BROKERS> \
--topic '<TOPIC_NAME>' --time -1
List Consumer Groups
kafka-consumer-groups --command-config <YOUR_PROPERTIES_FILE> \
--bootstrap-server <COMMA_SEPARATED_BROKERS> \
--list
Describe Consumer Groups
kafka-consumer-groups --command-config <YOUR_PROPERTIES_FILE> \
--bootstrap-server <COMMA_SEPARATED_BROKERS> \
--group mm2 --describe
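In the commands above, <YOUR_PROPERTIES_FILE> is a standard Kafka client configuration file. As a rough example for a cluster secured with SASL/PLAIN over SSL (the mechanism, credentials, and truststore path are placeholders; adjust them to match your cluster's security setup), it could be created like this:

cat > client.properties <<'EOF'
# client settings for the Kafka CLI tools (example: SASL_SSL with PLAIN auth)
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required \
  username="<USERNAME>" \
  password="<PASSWORD>";
ssl.truststore.location=/path/to/truststore.jks
ssl.truststore.password=<TRUSTSTORE_PASSWORD>
EOF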