Achieving high availability in Apache Kafka
Using a single Apache Kafka cluster is a risky proposition, leaving your organization with a single point of failure that might affect all services in your architecture. In our case, we found that two are vastly better than one.
Apache Kafka is an open-source distributed event streaming platform. It’s used by thousands of companies looking for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.
Like many other companies today, we implement a microservice-based architecture with a multi-region active-active configuration, and with Kafka as the event streaming platform for asynchronous communication. However, we didn’t stop there. We went the extra mile and built a high-availability setup.
Why high availability in Kafka? When using a single Kafka cluster, we expose our system to a single point of failure that affects almost all the services in the architecture.
So, how do we achieve this high-availability thing? We use two Kafka clusters instead of one: read from both and write to one. An active-passive configuration. Sounds simple, but like most things that come in clusters, it’s easier said than done. Read on.
I can hear you asking: “Why not just use a multiple-region Kafka cluster?”
Yes, we can achieve this resiliency with a multi-region Kafka cluster. This solution creates a single cluster that spans two different regions, and if the main region becomes unavailable for some reason, the service automatically fails over to the second region.
However, downtime, which we can’t afford, is still a risk with the multi-region setup: if Kafka is down, our system is down. A multi-region cluster shares its configuration (topics, consumer groups, etc.) across regions, so if something goes wrong, it can affect both regions at once.
Take a Kafka broker version upgrade, a very common occurrence: we want to keep our clusters updated with the latest features and security fixes. Any upgrade carries a chance that something goes wrong and causes downtime or latency, and with a multi-region cluster we can’t avoid that risk, since both regions are updated simultaneously. There’s also the chance that the upgrade itself goes smoothly, but the new version introduces a bug that affects our entire system.
When you have two completely different Kafka clusters (each one of them can be a multi-region cluster) this is what an upgrade looks like:
- We start by upgrading the secondary cluster. We can update it worry-free while it’s not receiving any messages.
- After the upgrade completes successfully (if it fails, it won’t affect our system, and we can investigate the cause without any risk), we run tests on the secondary cluster and make sure everything is in order before moving forward.
- Once we’re done with the upgrade and testing on the secondary cluster, we fail over to it (a sketch of this flip appears after these steps). This moves all new messages to the secondary cluster, where consumers read them. We can now upgrade the primary cluster risk-free.
- We can run the same tests again on the primary cluster.
- Once testing completes successfully, we move writes back to the primary cluster.
These steps minimize the risk during a Kafka upgrade, allowing us to make any configuration changes freely.
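Here is a minimal sketch of what the failover flip might look like, assuming (as described later in this post) that the active cluster is tracked in Redis. The key name "active_kafka_cluster", the Redis host, and the aliases "primary"/"secondary" are illustrative placeholders, not our production values.

```python
import redis

r = redis.Redis(host="redis.internal", port=6379, decode_responses=True)

def failover_to(cluster_alias: str) -> None:
    """Point all producers at the given cluster; consumers keep reading from both."""
    r.set("active_kafka_cluster", cluster_alias)

# During a primary-cluster upgrade:
failover_to("secondary")   # new messages now go to the (already upgraded) secondary
# ... upgrade and test the primary cluster ...
failover_to("primary")     # move writes back once testing passes
```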
How high availability with two clusters works
Today, we have two clusters in two different regions, with the same configuration and the same hardware. Our consumer services consume from both clusters at the same time, so regardless of which cluster we produce to, they’ll consume the messages.
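To illustrate the consumer side, here is a minimal sketch (using the confluent-kafka Python client) of a service subscribing to the same topic on both clusters and processing messages identically regardless of their source. The bootstrap addresses, group id, topic name, and handler are placeholders.

```python
from confluent_kafka import Consumer

def make_consumer(bootstrap_servers: str) -> Consumer:
    return Consumer({
        "bootstrap.servers": bootstrap_servers,
        "group.id": "orders-service",        # same group id on both clusters
        "auto.offset.reset": "latest",
    })

def handle(payload: bytes) -> None:
    print(payload)  # stand-in for the real message processing

# One consumer per cluster, both subscribed to the same topic.
consumers = [make_consumer("kafka-primary:9092"),
             make_consumer("kafka-secondary:9092")]
for c in consumers:
    c.subscribe(["orders"])

while True:
    for c in consumers:
        msg = c.poll(0.1)                    # poll each cluster in turn
        if msg is None or msg.error():
            continue
        handle(msg.value())                  # same handling regardless of source cluster
```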
On the producers’ side, we want to produce every message only once, to avoid duplicates. Before producing a message, we need to know which cluster to write it to, so we need a source of truth that tells us which cluster is currently “active” for producing, and a place to store that state.
To that end, we chose Redis: an in-memory key-value database with very low latency and a minimal effect on performance. Before producing, we read the active cluster from Redis and produce the message to that cluster.
If we change the value in Redis to the secondary cluster, the producers will pick it up and start producing to it; nothing else about them changes except where the messages go. On the consumers’ side, we keep consuming from both clusters. Once the value in Redis changes and the producers start producing to the secondary cluster, the consumers simply stop receiving messages from the primary. From that point on, changes to the primary cluster won’t affect anything in our flow.
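Here is a minimal sketch of what the producer side might look like under these assumptions, again with placeholder addresses and the hypothetical Redis key "active_kafka_cluster". Looking the value up per message is what lets a failover take effect without redeploying anything.

```python
import redis
from confluent_kafka import Producer

CLUSTERS = {
    "primary": Producer({"bootstrap.servers": "kafka-primary:9092"}),
    "secondary": Producer({"bootstrap.servers": "kafka-secondary:9092"}),
}
r = redis.Redis(host="redis.internal", port=6379, decode_responses=True)

def produce(topic: str, value: bytes) -> None:
    # Read the active cluster right before producing, so a failover takes
    # effect immediately without redeploying the producers.
    active = r.get("active_kafka_cluster") or "primary"
    producer = CLUSTERS[active]
    producer.produce(topic, value=value)
    producer.poll(0)  # serve delivery callbacks; batching/flush policy is up to you
```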
In addition, we have a service that controls which cluster we produce to, so we never have to change the value in Redis directly.
But wait! There are downsides
Two completely different Kafka clusters give us an extra layer of protection from incidents and let us upgrade and make changes to our clusters with minimal risk. This sounds perfect, but like any solution, it has its faults. This setup has four downsides:
1. Complexity
We have two clusters, which means we need a mechanism that enables us to move between them with minimal impact on the production workflows. This adds some complexity to the system, especially when new developers join the team and need to understand how the whole thing works.
2. Cost
Of course, maintaining two clusters costs more than one, and a single multi-region cluster would also be cheaper than two separate clusters. We also need somewhere to store the active-cluster state (primary or secondary), which means a database or other storage, and that means additional cost.
3. Synchronicity
We need to keep both clusters completely in sync. When we add or change a topic, we have to do it in both clusters (a sketch of this appears after this list).
4. Ordering issue
When we fail over to the secondary cluster, there is a small window in which consumers are consuming data from both clusters (leftover records from the primary and new records from the secondary). During that window, records won’t be processed in order, meaning we get out-of-order delivery. Most of the time, this is a tiny price to pay in exchange for avoiding downtime.
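To give a concrete (and again, illustrative) example of the synchronicity point above: any topic change has to be applied to both clusters, which can be scripted with the Kafka admin API. The addresses and topic settings below are placeholders; in practice this kind of change is often driven by infrastructure-as-code rather than an ad-hoc script.

```python
from confluent_kafka.admin import AdminClient, NewTopic

BOOTSTRAPS = ("kafka-primary:9092", "kafka-secondary:9092")  # placeholder addresses

def create_topic_on_both(name: str, partitions: int, replication: int) -> None:
    for bootstrap in BOOTSTRAPS:
        admin = AdminClient({"bootstrap.servers": bootstrap})
        futures = admin.create_topics(
            [NewTopic(name, num_partitions=partitions, replication_factor=replication)]
        )
        for _, future in futures.items():
            future.result()  # raises if creation failed on this cluster

create_topic_on_both("orders", partitions=12, replication=3)
```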
It takes two to tango
Having two completely different Kafka clusters with a failover mechanism gives us an extra layer of protection, helps us avoid incidents, and allows for safe changes and updates to our Kafka clusters. Any disadvantages of this setup seem like an afterthought compared to the crisis a service-wide downtime might cause.
Double down on your Kafka clusters and never look back.