From AWS CloudFormation to Terraform: Migrating Apache Kafka

Tomer Peleg
Riskified Tech

--

“With great power comes great responsibility.”

Achieving control of your system is not something that can be accomplished overnight. It requires time, tests, and research. You need to learn it the hard way, handling real-time crises to understand what your system is missing.

In Riskified’s case, facing multiple production issues with Kafka made us realize that what we had was not enough. Our Kafka deployment simply didn’t give us the flexibility we needed for deploying changes to our cluster, applying security updates, giving our Kafka users independence, supervising the cluster, and more.

To address this and overcome its challenges, Riskified committed to a significant effort with one well-defined goal:
We want to be able to define ownership & provide independence.

Fine-grained ownership and self-service tools would lead to less coupling between teams and improved developer productivity. This way, developers would be able to focus on their business layer rather than on handling the infrastructure.

Sounds difficult? Perhaps. But after some research, we did manage to decide on a solution: Terraform.

I won’t cover our entire process here but would instead like to focus specifically on Kafka.

By the end of this blog post, I hope to convince you that user independence & system control can co-exist.

Apache Kafka: An Unexpected Journey

We started this initiative by migrating our system from a big monolith to microservices, with Apache Kafka as a message bus.

Kafka was still a mystery to us, and as with every new technology you adopt, you don’t know all the factors at the beginning or what to expect when the scale and load start to increase.

The Initial Kafka Deployment

Our Kafka servers were running in Docker containers on dedicated EC2 instances, with AWS CloudFormation managing the infrastructure. It wasn’t the best solution for our growing requirements, but as you all know, when you build something new, you tend to rely on something you already know and feel comfortable with.

This setup was good enough for the POC in the early days of Kafka at Riskified, but as more and more products and teams started to use Kafka, we realized we needed a more stable and robust infrastructure.

Gaining Control With Terraform

First things first — we needed to map our main pain points from the current Kafka deployment, understand how Terraform could help us address them, and then basically redesign our Kafka blueprints.

If we compare AWS CloudFormation to Terraform, we can find different pros & cons for each. Both are powerful and mature tools that provide a reliable way to automate the deployment of our cloud resources.

But when we thought about what would benefit our system the most and enable us to achieve the goals we’d set for ourselves, we realized that Terraform could fill the gaps better in various aspects:

  • Independence
    Users can use the Terraform infrastructure to provision their resources by themselves, while the configuration is distributed across the code repositories that own them. Resources like S3 buckets, DynamoDB tables, or in our case Kafka topics can be configured using dedicated Terraform providers and, of course, be moderated by us.
  • Ownership
    Since each service owns its own resources, we can map the system’s dependency graph and resource owners, and evaluate cost per service/team/product etc.
    In addition, Terraform allows us to switch ownership by moving specific resources from one state to another without any change to the actual resource.
  • Environments
    Terraform gives us the ability to easily manage our cloud resources per environment by simply configuring our modules with different values (variables), for example, creating a Kafka topic with a different number of partitions in Staging and Production (see the sketch after this list).
  • Modules
    Terraform comes with native support for modules, which enables us to build complex infrastructure (like Kafka) with all of the required resources, simple configuration, and the ability to pass values from one module to another by sharing their outputs.
  • Visibility
    Terraform separates the planning phase from the execution phase by using the concept of an execution plan. This plan includes all actions to be taken: which resources will be created, destroyed, or modified. We use GitOps as our CD to execute Terraform plans, which provides visibility in terms of changes and executors.
  • Testability
    We can manage and compare the current state of the infrastructure by using GitOps. The fact that the infrastructure is now version-controlled makes it possible to test, deploy, rollback, etc. with a complete audit trail.
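
To make the Environments and Modules points concrete, here is a minimal sketch of what per-environment configuration could look like. The module name, variables, and values below are illustrative, not our actual setup:

    # staging/kafka.tf (illustrative)
    module "orders_topic" {
      source             = "../modules/kafka-topic"   # hypothetical wrapper module
      name               = "orders"
      partitions         = 6                          # lower parallelism in Staging
      replication_factor = 2
    }

    # production/kafka.tf (illustrative)
    module "orders_topic" {
      source             = "../modules/kafka-topic"
      name               = "orders"
      partitions         = 24                         # higher parallelism in Production
      replication_factor = 3
    }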

Templating with Packer

Our first deployment also included Ansible as a tool for automating the setup of each node (broker/ZooKeeper). However, we quickly understood that this alone would allow nodes to diverge over time.
Version changes, unavailable packages, deprecations, etc., are things we would have to deal with every time we deployed new nodes.
So the obvious solution was to combine this mechanism with something fixed: a base image dedicated to Kafka, with the initial setup already installed and no external dependencies.

The tool we use for this purpose is called Packer.

Packer is an open source tool for creating identical images from a single source configuration. We use it to manage the Kafka cluster by creating a pre-configured template to define an AMI with all the required modules and software installed. This makes it possible to preserve our configuration and to quickly create new nodes.
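
As a rough illustration, a Packer template for such a Kafka base image could look something like this (HCL2 syntax; the region, base AMI filter, and provisioning script are placeholders rather than our actual configuration):

    locals {
      # Timestamp suffix so every build produces a uniquely named AMI
      timestamp = regex_replace(timestamp(), "[- TZ:]", "")
    }

    source "amazon-ebs" "kafka" {
      region        = "us-east-1"                    # placeholder region
      instance_type = "t3.large"
      ssh_username  = "ubuntu"
      ami_name      = "kafka-base-${local.timestamp}"

      source_ami_filter {
        filters = {
          name                = "ubuntu/images/*ubuntu-focal-20.04-amd64-server-*"
          virtualization-type = "hvm"
        }
        owners      = ["099720109477"]               # Canonical
        most_recent = true
      }
    }

    build {
      sources = ["source.amazon-ebs.kafka"]

      # Bake Kafka, the JVM, and any agents into the image up front,
      # so new nodes come up with no external dependencies
      provisioner "shell" {
        script = "scripts/install_kafka.sh"          # hypothetical setup script
      }
    }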

Bonus Points

In our new Terraform deployment, we use Auto Scaling Groups instead of standalone EC2 instances to manage the servers. This allows us to do rolling upgrades of the cluster without downtime.
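
As a sketch of that idea (resource names, instance sizes, and variables here are illustrative), the brokers can be launched from the Packer-built AMI through a launch template and an Auto Scaling Group:

    # Brokers boot from the AMI that Packer baked
    resource "aws_launch_template" "kafka_broker" {
      name_prefix   = "kafka-broker-"
      image_id      = var.kafka_ami_id     # AMI produced by the Packer build
      instance_type = "m5.2xlarge"         # illustrative size
    }

    resource "aws_autoscaling_group" "kafka_brokers" {
      name                = "kafka-brokers"
      min_size            = 3
      max_size            = 3
      desired_capacity    = 3
      vpc_zone_identifier = var.private_subnet_ids

      launch_template {
        id      = aws_launch_template.kafka_broker.id
        version = "$Latest"
      }
    }

    # Rolling out a new AMI then means updating the launch template and
    # cycling the instances one by one, instead of rebuilding the cluster.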

The Migration: Talking From Experience

Now that we had a new Kafka cluster managed by Terraform in each environment, we needed to migrate all active clients.

It should be noted that although it’s possible to combine the two clusters, move the load between nodes (e.g., via partition reassignment), and decommission the idle ones, we chose to go with a full client migration due to significant differences between the cluster nodes.

This decision is not something you can take lightly, since it involves potential downtime, data loss, increased latency, etc., especially if your system is running in production at high scale, as in our case.

First, we prepared our cluster by plugging in or creating all of our Kafka resources and components. For example, redirecting our Schema Registry to use the new Kafka cluster as its data store, or re-creating all Kafka topics that exist in the old cluster (using a dedicated Terraform provider). We excluded Kafka Streams changelog topics and __consumer_offsets since they are created automatically.

Kafka Topic Terraform Provider

We use Terraform to manage our Kafka topics. Users can create/delete/alter topics that they own, and we can monitor and supervise changes as code owners.

Note that we already had the Terraform infrastructure needed to create all of our Kafka topics. We therefore simply needed to create a new TF state, or add the topics to an existing state, by configuring the new cluster in all code repositories (services) that own any Kafka topic. So once we applied the changes, all topics were created, and along the way unused topics were dropped, which is also good.

As a side note, I encourage you to pay attention to unused Kafka Streams changelog topics (left over as a result of changing the application.id). We discovered that although they are no longer active, Kafka still needs to manage them, and their cost may be high in terms of CPU.

Below is an example of how we use our TF module to create Kafka topics.
You can use the external provider (Mongey/kafka) directly, or create a wrapper, as we do, to control which settings users can provide and to configure defaults that benefit the system the most.

Kafka topic Terraform provider usage
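
For reference, direct usage of the Mongey/kafka provider looks roughly like the following; the broker addresses, topic name, and settings are placeholders, not our production values:

    terraform {
      required_providers {
        kafka = {
          source = "Mongey/kafka"
        }
      }
    }

    provider "kafka" {
      bootstrap_servers = ["broker-1.example.com:9092"]   # placeholder brokers
    }

    resource "kafka_topic" "orders" {
      name               = "orders"
      partitions         = 12
      replication_factor = 3

      config = {
        "retention.ms"   = "86400000"   # 1 day
        "cleanup.policy" = "delete"
      }
    }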

The Migration Plan

To execute this kind of migration, we first needed to plan our steps:

1. Ownership
Map all Kafka topics to their owner, and check their recent throughput.
A migration is your chance to remove unused topics.

2. Flow
Once the topic ownership map is complete, you can continue building the migration flow by placing the producer-topic-consumer chains on a dependency graph (example below).
This flow will guide you through the actual migration process, like a checklist for each service that is redirected to the new Kafka cluster.
In terms of impact on the system, we cared more about data loss and data order than about latency. This is why we chose to migrate one producer at a time, wait for its consumers to finish reading all the data, and only then continue.

3. Dependencies
Once we migrate a producer to the new cluster, new data is no longer available in the old one. We need to monitor the consumers that are left behind and migrate them according to their lag and progress, until the dependency chain is migrated in full. We have 3 kinds of dependencies when migrating to a new cluster:

  • No dependencies — metric collectors, visualization, schema registry, etc.
    If possible, I suggest having eyes (metrics) on both clusters during the migration in order to monitor consumer lags on the old cluster and verify incoming data in the new one.
  • A “classic” dependency (1:1) — a topic with one producer and one consumer. Those we can migrate without any side effects or sync between teams. Once the producer is migrated to the new cluster, we simply monitor the consumer lag and migrate it as well once it gets to 0.
  • A “complex” dependency (N:N) — a chain of producers and consumers that send data to or read data from one or more topics, including consumers that also act as producers. If topics have multiple producers, we need to migrate them all together to preserve order and avoid data loss. In parallel, we monitor the lag of these topics’ consumers and continue to migrate accordingly.
    This requires sync between teams (producer/consumer owners) and orchestration, so I suggest migrating one complex dependency at a time.

4. Impact
It goes without saying that this kind of migration can have all kinds of impact on the system. We are basically transferring our traffic from one cluster to another, but by doing so we are breaking our flow. So we need to consider the following questions:

  • Data Loss
    Do we want/need the data in the old cluster? If so, we will need to send the data to both clusters for a period of time, or use some kind of replication tool like MirrorMaker to sync between the two clusters.
  • Latency
    In the migration process, each consumer that is left behind to finish consuming the data remaining in a topic is effectively lagging behind on the new cluster once the topic’s producer is redirected. This can introduce latency into the system and also affect SLA-based procedures.

5. Process
This is a very delicate process, and issues that arise during the migration can be harmful to the system. The migration should be held in a dedicated war room and include a representative from each dev team, a data engineer who can handle any issues that may occur, and an orchestrator to decide which services (producers/consumers) should be moved at any given moment. The orchestrator is needed to monitor the traffic of the two clusters and the lag of the consumers left in the old cluster, to avoid downtime.

Kafka migration flow — dependencies

Wrapping up

Control and stability are not something you can just buy at the market; they require effort and motivation. For the motivation part, I hope I managed to give you at least a starting point… the rest is up to you.

If I can leave you with one message, it is this one:
Own your infrastructure, be in control!
