How to Roll Your Kafka Cluster With Zero Downtime and No Data Loss

Tomer Peleg
Riskified Tech
4 min read · Oct 31, 2022


In my previous post, I described how we migrated our Kafka infrastructure from AWS CloudFormation to Terraform to gain control and stability. I can honestly say that the effort paid off, and we were really satisfied with our infrastructure being fully managed by Terraform.

However, it didn’t take long to realize that this alone was not enough. The one thing we hadn’t considered was getting the balance right in terms of flexibility.

In other words, how fast could we react to compliance and security requirements and apply changes to our infrastructure?
And what impact would our changes have on our users and developers?

In this article, I will cover how we approached these challenges, in terms of design and architecture, to support a rolling deployment of our Kafka clusters with zero downtime and no data loss.

The Problem

From the infrastructure perspective, managing a cluster-based platform in Terraform has clear advantages. Issues arise when we want to apply breaking changes (like changing the instance type, AMI, or user data) to our instances. In this case, Terraform terminates and replaces all instances simultaneously, causing downtime and data loss and directly impacting the business.

We wanted to decouple infrastructure changes from their deployment, so that a change wouldn’t hit all instances at once and we could control the rollout gradually.

Kafka Cluster Architecture

The real magic behind our deployment is the ability to apply changes to our infrastructure without immediately replacing the running instances. What makes it tick is simply wrapping each instance in its own Auto Scaling Group (ASG) with a max capacity of 1.

So, when we change something, it isn’t applied directly to the instance but to the ASG launch template, which is versioned. This allows us to control the cluster’s gradual replacement by rolling out the new version one broker at a time.
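To make this concrete, here is a minimal sketch (Python with boto3; the ASG names and the one-ASG-per-broker layout follow the description above, but the names themselves are hypothetical) of how one could list the brokers whose instances still lag behind their group’s current launch template version:

```python
import boto3

asg_client = boto3.client("autoscaling")

def stale_brokers(asg_names):
    """Return ASGs whose running instance is on an older launch template version."""
    groups = asg_client.describe_auto_scaling_groups(
        AutoScalingGroupNames=asg_names
    )["AutoScalingGroups"]
    stale = []
    for group in groups:
        # Group-level version may be a number or an alias like "$Latest".
        target = group["LaunchTemplate"]["Version"]
        for instance in group["Instances"]:
            if instance["LaunchTemplate"]["Version"] != target:
                stale.append(group["AutoScalingGroupName"])
    return stale

# Hypothetical broker ASG names, one ASG per broker:
print(stale_brokers([f"kafka-broker-{i}" for i in range(3)]))
```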

The following diagram illustrates our cluster architecture using AWS ASGs, EC2, and AZ distribution. Additional components include Cruise Control for cluster supervision, Schema Registry for managing Avro/Protobuf schemas, AWS Cloud Map for dynamic DNS records, and HashiCorp Vault for storing secrets and users.

Kafka Cluster Architecture Diagram

Deployment Pipeline

We use GitHub Actions Workflows as our CI/CD for deploying changes to our Kafka clusters.
Each environment/state has a dedicated workflow that manages its deployed clusters.

In terms of security, the GitHub runner lives in a dedicated management account and assumes a role in each target account (production, staging, sandbox, etc.) for isolation purposes.
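A minimal sketch of that pattern, assuming a per-account deployment role (the role name, session name, and account ID here are hypothetical, not our actual setup):

```python
import boto3

def session_for(account_id: str, role_name: str = "kafka-deployer") -> boto3.Session:
    """Assume a role in the target account and return a session scoped to it."""
    sts = boto3.client("sts")
    creds = sts.assume_role(
        RoleArn=f"arn:aws:iam::{account_id}:role/{role_name}",
        RoleSessionName="kafka-rolling-deploy",
    )["Credentials"]
    return boto3.Session(
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )

# The runner only ever holds short-lived credentials for one account at a time:
production = session_for("123456789012")  # hypothetical account ID
asg_client = production.client("autoscaling")
```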

This pipeline allows us to:

  • Review the configuration changes like AMI, tags, security, etc. (terraform plan)
  • Apply changes to a new ASG launch template (terraform apply)
  • Deploy the new version to the desired Kafka cluster (by approval), using a rolling deployment mechanism that I will cover next
GitHub Deployment Pipeline

This two-step pipeline allows us to first apply the Terraform changes and create a new launch template version, and only then decide who deploys it, and when.

Rolling Deployment

Kafka is a stateful service, designed for resiliency and high availability. Brokers need to be rolled one by one, waiting for rebalancing in between, to avoid potential data loss. Otherwise, we risk multiple brokers being down at the same time, resulting in offline partitions.

So, now that we can push changes and deploy a new version to the Kafka cluster gradually, we simply need to roll each node gracefully while taking into account the demotion of its leader partitions.

The main players in the deployment are:

  • The Cloud Provider for executing scale operations
  • Cruise Control for demoting and rebalancing our brokers, avoiding offline partitions and the downtime and data loss they could cause

This decouples our deployment from the user level, allowing us to make infrastructure changes beneath the surface without impacting SLOs or causing downtime.
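Putting the pieces together, a single rolling step could look roughly like the sketch below. It assumes Cruise Control’s REST API is reachable at a fixed URL and one ASG per broker; the URL, broker IDs, and wait logic are simplified placeholders, not our exact tooling:

```python
import time
import boto3
import requests

CRUISE_CONTROL_URL = "http://cruise-control:9090/kafkacruisecontrol"  # hypothetical
asg_client = boto3.client("autoscaling")

def wait_for_task(task_id: str, interval: int = 30) -> None:
    """Poll Cruise Control until the given user task completes."""
    while True:
        tasks = requests.get(
            f"{CRUISE_CONTROL_URL}/user_tasks", params={"json": "true"}
        ).json()
        status = next(
            t["Status"] for t in tasks["userTasks"] if t["UserTaskId"] == task_id
        )
        if status == "Completed":
            return
        time.sleep(interval)

def roll_broker(broker_id: int, instance_id: str) -> None:
    # 1. Demote: move partition leadership off the broker before stopping it.
    resp = requests.post(
        f"{CRUISE_CONTROL_URL}/demote_broker",
        params={"brokerid": broker_id, "dryrun": "false", "json": "true"},
    )
    wait_for_task(resp.headers["User-Task-ID"])

    # 2. Terminate without decrementing capacity: the ASG launches a
    #    replacement from the latest launch template version.
    asg_client.terminate_instance_in_auto_scaling_group(
        InstanceId=instance_id, ShouldDecrementDesiredCapacity=False
    )

    # 3. Before rolling the next broker, wait until the replacement has
    #    rejoined and there are no offline/under-replicated partitions
    #    (e.g. via Cruise Control's kafka_cluster_state endpoint).
```

The key detail is `ShouldDecrementDesiredCapacity=False`: the ASG notices the missing instance and brings up a replacement from the new launch template version, which is exactly the “change without changing” trick described earlier.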

Kafka Deployment Diagram

Wrapping up

Our work is never done.

This part of the journey exemplifies that we should always question our methods to understand if they are still relevant and serve their purpose. Not everything is worth the effort and time, but the one thing we can and should do is think in terms of scale: designing and building scalable solutions for today while thinking of tomorrow.
