Consolidating Kafka, or How To Stay Sane and Save Money

  • Michael Wagler
  • Jason Moore
  • Ken Ho
  • Nur Ozgun

1. Why Consolidate

2. Analysis and Planning

3. Potential Designs and Processes

  • Ease of implementation
  • How to ensure data integrity during the process, i.e., no missing data and no out-of-order consumption of data
  • How much lag consumers would experience during the process
  • Could we do any sort of rollback?

Downsides

  • These two approaches were highly complicated. They required a large number of pull requests whenever we added or removed a topic from a cluster, and they both involved multiple versions of each topic, with different prefixes depending on the cluster, which would have been extremely difficult to keep track of.
  • Neither of these approaches allowed for any real kind of rollback, since each would eventually reach a point of no return. This turned out to be true for our final design as well, but because that design was much simpler to execute and required far fewer steps, the effort needed to roll forward made it much more viable.
  • In general, the other consumer-first approaches we encountered (using MirrorMaker) resulted in interleaved events. If you mirror events to a new cluster, then the moment you move the producers to produce directly to the new cluster instead of the original, the mirrored events from the original cluster interleave with the events newly produced to the new cluster.
  • The alternative to interleaving for the consumer-first approach is generally data loss: some of the other approaches we encountered suggested that in order to avoid interleaving when the producers move to the new cluster, you first turn off the producers briefly until the mirroring lag from the old cluster to the new cluster drops to 0. Since we could not tolerate any data loss, this was also not an option for us.

Final Design and Justification: Producer First, New Cluster

  1. Make all of the producers produce to the new cluster.
  2. Wait for all the consumers to finish consuming on all of the old clusters (see the lag-check sketch at the end of this section).
  3. Move all the consumers to consume from the new cluster.
  • There would be no risk of duplicate events, since we were not bothering with mirroring.
  • There would not need to be any interleaved events, since events were only ever produced directly by the producers, never also by MirrorMaker.
  • There would be no data loss, since the producers would have no downtime.
  • Stage 0: Consumers that can handle out-of-order events
  • Stage 1: Producers producing only to consumers that handle out-of-order events
  • Stage 2: Services in self-contained subgraphs
  • Stage 3: Producers and consumers that cannot tolerate out-of-order events or high lag
  • Stage 4: Consumers that can tolerate a high amount of lag
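
To make step 2 of this plan concrete, here is a minimal sketch, not our actual tooling, of how a consumer group's lag on one of the old clusters could be checked with the Kafka AdminClient before its consumers are moved; the broker address and group ID are placeholders.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

public class LagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder address of one of the old clusters
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "old-kafka-1:9092");
        String groupId = "example-consumer-group"; // placeholder consumer group

        try (AdminClient admin = AdminClient.create(props)) {
            // Offsets the group has committed on the old cluster
            Map<TopicPartition, OffsetAndMetadata> committed =
                admin.listConsumerGroupOffsets(groupId)
                     .partitionsToOffsetAndMetadata().get();

            // Latest end offsets for the same partitions
            Map<TopicPartition, OffsetSpec> latest = committed.keySet().stream()
                .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> ends =
                admin.listOffsets(latest).all().get();

            // Total lag = sum over partitions of (end offset - committed offset);
            // consumers are only moved once this reaches zero on every old cluster.
            long totalLag = committed.entrySet().stream()
                .mapToLong(e -> ends.get(e.getKey()).offset() - e.getValue().offset())
                .sum();
            System.out.println(groupId + " lag: " + totalLag);
        }
    }
}
```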

4. Migration Process

  • Setting up the infrastructure for the new Kafka cluster
  • Creating the same topic structure in the new cluster
  • Creating pull requests to change the Kafka configuration for each consumer/producer service and getting approvals for those changes
  • Ensuring all the tools we wanted to use for monitoring were set up correctly and accessible by the team. These included our Prometheus metrics, logs, CMAK (a Kafka monitoring tool), as well as direct access to the Kafka brokers themselves so that we could use the Kafka CLI tools, if needed.
  • Preparing new Grafana dashboards to be used during migration. These leveraged the Kafka server metrics exposed via JMX and focused on consumer group lag, incoming messages per topic and general health indicators like producer request rate and latency, which could indicate misconfiguration.
  • Making necessary library upgrades to the clients
  • Setting the “auto.offset.reset” config value to “earliest”, so that consumer groups with no committed offsets on the new cluster would start from the beginning of each topic rather than skipping ahead (a minimal sketch follows this checklist)
  • Making sure all the CI pipelines were working properly a day before the migration
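
For the “auto.offset.reset” item above, here is a minimal consumer-configuration sketch; the broker address, group ID, and topic name are placeholders, not our real services.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.util.List;
import java.util.Properties;

public class ConsumerSetup {
    public static KafkaConsumer<String, String> build() {
        Properties props = new Properties();
        // Placeholder new-cluster address and group ID
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "new-kafka:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "example-consumer-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // No committed offsets exist on the new cluster yet, so "earliest"
        // makes the group start from the beginning of each partition instead
        // of the latest offset (which would silently skip older events).
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(List.of("example-topic")); // placeholder topic
        return consumer;
    }
}
```
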
  • Stage 0: Consumers that handle out-of-order events
  • Stage 1: Producers producing only to consumers that handle out-of-order events
  • Stage 2: Services in self-contained subgraphs
  • Stage 3: Producers and consumers that cannot tolerate out-of-order events or high lag
  • Stage 4: Consumers that can tolerate a high amount of lag
  • The number of PRs we had to deal with was too large
  • The manual Kafka configuration change for each service was too error-prone (typos were easy to miss)
  • CI pipelines could be flaky, and they slowed down the process
  • We initially missed some producers, which were therefore not updated; missing them in production could have caused data loss or an outage
  • Monitoring so many moving parts at the same time was challenging
  • The process was smooth enough that we didn’t need to follow our disaster recovery process
  • The action items we came up with during the retrospective meeting
  • Creating new key/value pairs for the Kafka broker configuration in a Kubernetes configmap
  • Creating a new set of PRs for services to read broker strings from the configmap, and deploying these changes ahead of time (see the sketch after this list)
  • Creating configuration change PRs for the non-k8s services in stage 3
  • Searching for newly created producers/consumers since the staging migration, and adding them to the correct stage of the production plan
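
As a rough illustration of the configmap approach above, a service could read its broker string from an environment variable populated from the shared Kubernetes configmap, so that switching clusters becomes a single configmap change rather than a per-service code change. The variable name and wiring here are assumptions, not our actual setup.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class ProducerFactory {
    public static KafkaProducer<String, String> build() {
        // KAFKA_BROKERS is a hypothetical env var injected from the shared
        // Kubernetes configmap; changing the configmap repoints every service.
        String brokers = System.getenv().getOrDefault("KAFKA_BROKERS", "localhost:9092");

        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        return new KafkaProducer<>(props);
    }
}
```
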
  • Performing a diff of the consumer groups on the old and new clusters to ensure they were the same (a minimal sketch follows this list)
  • Updating our Jenkins job management pipelines with the new cluster and removing the old clusters
  • Updating some of our monitoring tools to point to the new cluster and not to the old ones
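
Here is a minimal sketch of that consumer-group diff, using the AdminClient to list groups on an old cluster and the new one and report anything missing; the broker addresses are placeholders and our real check covered every old cluster.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ConsumerGroupListing;

import java.util.HashSet;
import java.util.Properties;
import java.util.Set;
import java.util.stream.Collectors;

public class GroupDiff {
    // List the consumer group IDs known to one cluster
    static Set<String> groups(String bootstrap) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
        try (AdminClient admin = AdminClient.create(props)) {
            return admin.listConsumerGroups().all().get().stream()
                        .map(ConsumerGroupListing::groupId)
                        .collect(Collectors.toSet());
        }
    }

    public static void main(String[] args) throws Exception {
        // Placeholder addresses for an old cluster and the new cluster
        Set<String> oldGroups = groups("old-kafka-1:9092");
        Set<String> newGroups = groups("new-kafka:9092");

        Set<String> missing = new HashSet<>(oldGroups);
        missing.removeAll(newGroups);
        System.out.println("Groups on the old cluster but not the new one: " + missing);
    }
}
```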

5. Outcomes, Learnings, and Next Steps

Stats

Learnings

Next steps
