Migrating to a Multi-Cluster Managed Kafka with 0 Downtime


Split Overloaded Clusters

Kafka usage at Wix before the migration

Why Managed Kafka

  1. Better Cluster performance & flexibility
    Broker partition rebalancing is done for you, so brokers don’t become performance bottlenecks. You can also easily scale cluster capacity up or down to stay cost effective.
  2. Transparent version upgrade
    The Kafka code base is improved all the time, with much of the focus on KIP-500: Replace ZooKeeper with a Self-Managed Metadata Quorum, which, when completed, will dramatically increase the amount of metadata each cluster and broker can hold. By going managed, you get automatic version upgrades that improve performance and fix bugs.
  3. Easy to add a new Cluster
    If you require a new cluster, setting it up is dead simple.
  4. Tiered Storage
    Confluent Platform offers Tiered Storage, which lets you dramatically increase the retention period for Kafka records without paying for expensive disk space: older records are shifted to much cheaper S3 storage, without adding latency to latest-offset record consumption (a configuration sketch follows this list).
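
To make this concrete, here is a minimal sketch of what enabling tiered storage on an existing topic could look like using Kafka’s AdminClient. The topic name and values are made up, and the confluent.tier.* property names are Confluent-Platform-specific settings taken as an assumption here; check the Confluent documentation for the exact knobs in your version.

```java
// Hypothetical example: enable tiered storage on an existing topic via Kafka's
// AdminClient. The confluent.tier.* property names are Confluent-specific and
// assumed here; the topic name and values are made up.
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class EnableTieredStorage {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "managed-cluster:9092");

    try (AdminClient admin = AdminClient.create(props)) {
      ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "site-events");
      Collection<AlterConfigOp> ops = List.of(
          // Shift segments older than the local "hotset" (1 hour here) to S3.
          new AlterConfigOp(new ConfigEntry("confluent.tier.enable", "true"),
                            AlterConfigOp.OpType.SET),
          new AlterConfigOp(new ConfigEntry("confluent.tier.local.hotset.ms", "3600000"),
                            AlterConfigOp.OpType.SET),
          // Long retention (1 year) becomes affordable once old data lives in S3.
          new AlterConfigOp(new ConfigEntry("retention.ms", "31536000000"),
                            AlterConfigOp.OpType.SET));
      admin.incrementalAlterConfigs(Map.of(topic, ops)).all().get();
    }
  }
}
```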

Switching 2000 Microservices to a Multi-Cluster Kafka Architecture

How to Split

Migrate a Traffic-Drained Data Center?

  1. There are services that are deployed in only one of the data centers, and moving them is difficult.
  2. This design is all-or-nothing and is very risky for the moment traffic returns: any edge case that was not switched may suffer data loss.
  3. Traffic cannot be completely drained from a data center for a long period of time, because that would greatly increase the risk of downtime for some services.

Migrate With Traffic

Replication

Consumer Migration

Producer Migration

Replication bottleneck

Sharded Replicator Consumers

Beyond Migration — External Consumer Control

  1. Switch cluster — unsubscribe from the current cluster and subscribe to another one (see the sketch after this list)
  2. Skip records — skip records that can’t be processed
  3. Change processing rate — dynamically increase or decrease the amount of parallel processing, or add delay for throttling and back-pressure
  4. Redistribute records — in case lag is growing on a single partition, redistribute its records among all partitions (and skip the old ones)
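
As an illustration of how the first three controls could be wired into a consumer, here is a hedged sketch. Commands arrive on an in-memory queue (in practice they would more likely arrive over a control channel such as a dedicated Kafka topic or an RPC endpoint, which this sketch does not model) and are applied between polls, where it is safe to touch the consumer. Class and command names are hypothetical, record redistribution is omitted for brevity, and baseProps is assumed to carry the usual deserializer and group settings.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Hypothetical control loop: not Wix's actual implementation, just one way to
// apply external commands to a running consumer between polls.
public class ControllableConsumer implements Runnable {
  public sealed interface Command permits SwitchCluster, SkipToEnd, SetDelay {}
  public record SwitchCluster(String bootstrapServers) implements Command {}
  public record SkipToEnd() implements Command {}
  public record SetDelay(long millis) implements Command {}

  private final BlockingQueue<Command> commands = new LinkedBlockingQueue<>();
  private final Properties baseProps; // assumed to include deserializers and group.id
  private final String topic;
  private KafkaConsumer<String, String> consumer;
  private long delayMillis = 0;

  public ControllableConsumer(Properties baseProps, String topic) {
    this.baseProps = baseProps;
    this.topic = topic;
  }

  public void send(Command command) { commands.add(command); }

  @Override public void run() {
    consumer = subscribe((String) baseProps.get("bootstrap.servers"));
    while (true) {
      // Apply pending commands only between polls, on the consumer's own thread.
      for (Command c = commands.poll(); c != null; c = commands.poll()) apply(c);
      for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
        process(record);
        sleep(delayMillis); // crude throttling / back-pressure
      }
      consumer.commitSync();
    }
  }

  private void apply(Command c) {
    if (c instanceof SwitchCluster sc) {        // 1. switch cluster
      consumer.close();
      consumer = subscribe(sc.bootstrapServers());
    } else if (c instanceof SkipToEnd) {        // 2. skip unprocessable records
      consumer.seekToEnd(consumer.assignment());
    } else if (c instanceof SetDelay sd) {      // 3. change processing rate
      delayMillis = sd.millis();
    }
  }

  private KafkaConsumer<String, String> subscribe(String bootstrapServers) {
    Properties props = new Properties();
    props.putAll(baseProps);
    props.put("bootstrap.servers", bootstrapServers);
    KafkaConsumer<String, String> c = new KafkaConsumer<>(props);
    c.subscribe(List.of(topic));
    return c;
  }

  private void process(ConsumerRecord<String, String> record) { /* business logic */ }

  private static void sleep(long millis) {
    if (millis <= 0) return;
    try { Thread.sleep(millis); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
  }
}
```

The key design point is that all commands funnel through the poll loop’s thread, since a KafkaConsumer is not thread-safe and must only be touched by the thread that polls it.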

Best Practices and Tips

  1. Create a script that checks state by itself and stops if the expected state is not reached
    Making the migration process as automatic as possible is key, so having the script check by itself whether it can move to the next phase greatly expedites the process (see the sketch after this list). On the other hand, auto-rollback and self-healing are very difficult to get right, so it is best to leave those for manual intervention.
  2. Have a rollback readily available
    No matter how well you tested your migration code, a production environment is the realm of uncertainty. Having a readily available rollback option for each phase is very important. Make sure to prepare it in advance and to test it as much as possible before you start running the migration.
  3. Start with test/relay topics and no impact topics
    Migration can be very dangerous, as records can potentially be lost and recovery can be painful. Make sure to start testing your migration code with test topics. For a realistic examination, use test topics that mimic production topics by replicating real production records to them in a dedicated test app. This way, a failure in the consumer migration process won’t have any production effect, while still providing a realistic production simulation.
  4. Create custom metrics dashboards that show current and evolving state
    Even if you create a fully unattended automatic migration process, you still need to be able to monitor what is going on and have tools to investigate in case there is an issue. Make sure to prepare dedicated monitoring dashboards in advance that clearly show the current and historical state of consumers and producers you are migrating.
    In the graph below we can see how a producer successfully switched from the self-hosted cluster to the managed cluster (throughput goes down as more and more pods are restarted and read the new configuration).
    [Figure: custom metrics dashboard for producer migration]
  5. Make sure to be on latest patch version of your self-hosted Kafka broker
    As our self-hosted Kafka brokers were not on the latest patch version, we ended up with a production incident when we tried to increase the value of message.max.bytes several times (see the Replication bottleneck section of this post). My recommendation is to first upgrade your self-hosted Kafka broker versions. If not to the latest version, then at least to the latest patch version.
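
As an example of the first tip, here is a minimal, hypothetical phase runner: each migration phase performs an action and then polls a state check until a deadline; if the expected state is not reached, the script stops rather than attempting auto-rollback. None of this is Wix’s actual tooling; the names and structure are illustrative only.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.function.BooleanSupplier;

// Hypothetical migration phase runner. Each phase runs an action, then polls
// its state check; if the expected state is not reached in time, the runner
// stops and leaves rollback to a human, since auto-rollback and self-healing
// are hard to get right.
public class MigrationRunner {
  record Phase(String name, Runnable action, BooleanSupplier expectedStateReached) {}

  public static void run(List<Phase> phases, Duration timeout, Duration pollInterval)
      throws InterruptedException {
    for (Phase phase : phases) {
      System.out.printf("Starting phase: %s%n", phase.name());
      phase.action().run();

      Instant deadline = Instant.now().plus(timeout);
      while (!phase.expectedStateReached().getAsBoolean()) {
        if (Instant.now().isAfter(deadline)) {
          System.err.printf("Phase '%s' did not reach the expected state; "
              + "stopping for manual inspection and rollback.%n", phase.name());
          return; // no auto-rollback: a human decides what happens next
        }
        Thread.sleep(pollInterval.toMillis());
      }
      System.out.printf("Phase completed: %s%n", phase.name());
    }
    System.out.println("Migration finished successfully.");
  }
}
```

A phase might, for example, switch a group of consumers to the managed cluster, with a state check that verifies consumer lag on the old cluster has dropped to zero before the producers are moved.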

Summary

A talk I gave at Devoxx UK 2022 based on this article
