In the Land of Streams — Kafka Part 4: My Cluster is Lost!! — Embracing Failure

A Kafka Streaming Ledger

Giannis Polyzos
5 min read · Dec 13, 2022
(Image credit: https://www.vecteezy.com/free-vector/squirrel-cartoon)

The blog post series consists of the following parts:
- Part 1: A Producer’s Message
- Part 2: The Rise of the Consumers
- Part 3: Offsets and how to handle them
- Part 4: My Cluster is Lost!! — Embracing Failure (this blog post)

In the previous parts of the series, we have looked at how things work mainly from the application’s perspective.

In this final part, we will focus a little bit on the infrastructure part.

Typically, enterprises use Kafka as the backbone of their whole data platform. This means it accommodates a wide range of business-critical workloads, and sooner or later, no matter how well-prepared you are, something will fail. This is why it is important to embrace failure when designing your overall architecture (yes, the fallacies of distributed computing are real) and think in terms of backups and disaster recovery.

In the previous parts, I mentioned (and used) Aiven for Kafka. Aiven is a complete, future-proof data platform, and what I really like is how easy it is to set up disaster recovery solutions, whether on one or multiple cloud providers.

Let’s assume you want to implement a Multi-Cloud Disaster Recovery Solution with Aiven for Kafka. There are two things to highlight here:

1. As Aiven offers no cloud vendor lock-in, it allows you to easily deploy clusters across different clouds.

2. For the data replication needed to build disaster recovery solutions, it uses MirrorMaker 2.

The two main disaster recovery patterns are:

  1. Active / Passive Pattern:
    uses a secondary cluster that acts as a backup
  2. Active / Active Pattern:
    uses two clusters replicating data between them

Some data syncing/replication patterns can also account for disaster, since the data gets replicated as well:

  1. Fan-out Pattern:
    data is replicated from one cluster to multiple clusters.
    Use Case: replicate data to different clouds and/or regions
  2. Aggregation Pattern:
    many edge clusters send data to a centralized cluster that acts as the aggregator
    Use Case: aggregate data in a centralized location, for example for a data warehouse or operational analytics
  3. Full-Mesh Pattern:
    many clusters send data to each other
    Use Case: different clusters operate in different regions/countries, and the data needs to become available as a whole in every region/country, for example to meet operational requirements.

The most common pattern (at least in my own experience) is Active/Passive, i.e., having one active cluster that syncs its data to a secondary one, which becomes active in case the currently active cluster becomes unavailable (failover).

Data replication between multi-cloud clusters

An important thing to note for the Aiven services is that the overall communication between the different clouds takes place using an IPSec Tunnel to provide secure networking.

Data syncing between the active and the failover clusters.

MirrorMaker 2 leverages the Kafka Connect framework to replicate data between Kafka clusters. In this setup, you can deploy MirrorMaker 2 on either AWS or GCP.

MM2 includes several new features, like:

  • topic data and consumer group replication
  • topic configuration and ACL replication
  • cross-cluster offset synchronization
  • preservation of partitioning
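To make the active/passive setup concrete, a minimal MirrorMaker 2 properties file might look like the sketch below. The cluster aliases and hostnames are illustrative; the property names come from Kafka's MirrorMaker 2 configuration:

```properties
# Sketch: MirrorMaker 2 config for an active/passive topology
# (aliases and bootstrap addresses are illustrative)
clusters = active, passive
active.bootstrap.servers = active-kafka.aws.example.com:9092
passive.bootstrap.servers = passive-kafka.gcp.example.com:9092

# Replicate all topics from the active cluster to the passive one
active->passive.enabled = true
active->passive.topics = .*

# Mirror topic configuration and ACLs
sync.topic.configs.enabled = true
sync.topic.acls.enabled = true

# Emit checkpoints and translate consumer group offsets, so consumers
# can resume close to where they left off after a failover
emit.checkpoints.enabled = true
sync.group.offsets.enabled = true
```

Note that by default MirrorMaker 2 prefixes replicated topics with the source cluster alias, so a topic named orders on the active cluster shows up as active.orders on the passive one.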

On the happy path, you have the producing and consuming applications operating against the active cluster (on AWS in this case), and then, boom, the unhappy path reveals itself.

AWS cluster becomes unavailable

The whole cluster goes down and the applications become unavailable.

When working with distributed systems and planning for disaster, one crucial question you and your team need to answer is: what does disaster mean for us?

For example, your cluster might become temporarily unavailable due to some networking issue, but after a few minutes everything is operational again. Is this kind of scenario acceptable for your business? Can you tolerate it, or does your application need to switch as soon as the cluster seems lost? Do you need to switch all your applications immediately?

Answering these kinds of questions will put you in a better position when designing your solution.
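One way to make the answer concrete is to encode it as an explicit policy: the cluster counts as lost only after, say, a number of consecutive failed health checks or an outage that exceeds an agreed budget. The sketch below shows this idea in plain Java; the class name and thresholds are illustrative, not part of any library.

```java
import java.time.Duration;
import java.time.Instant;

// Sketch: encode "what does disaster mean for us?" as an explicit policy.
// The cluster is declared lost only after enough consecutive failures,
// OR once the outage has lasted longer than an agreed budget.
public class ClusterLossDetector {
    private final int maxConsecutiveFailures;
    private final Duration outageBudget;
    private int consecutiveFailures = 0;
    private Instant firstFailure = null;

    public ClusterLossDetector(int maxConsecutiveFailures, Duration outageBudget) {
        this.maxConsecutiveFailures = maxConsecutiveFailures;
        this.outageBudget = outageBudget;
    }

    // A successful health check resets the counters: the blip was tolerable.
    public void recordSuccess() {
        consecutiveFailures = 0;
        firstFailure = null;
    }

    // Returns true once the failure pattern crosses either threshold.
    public boolean recordFailure(Instant now) {
        if (consecutiveFailures == 0) firstFailure = now;
        consecutiveFailures++;
        boolean tooManyFailures = consecutiveFailures >= maxConsecutiveFailures;
        boolean outageTooLong =
            Duration.between(firstFailure, now).compareTo(outageBudget) >= 0;
        return tooManyFailures || outageTooLong;
    }

    public static void main(String[] args) {
        ClusterLossDetector detector = new ClusterLossDetector(3, Duration.ofMinutes(5));
        Instant t0 = Instant.now();
        System.out.println(detector.recordFailure(t0));                 // false: first blip
        detector.recordSuccess();                                       // recovered, reset
        System.out.println(detector.recordFailure(t0));                 // false
        System.out.println(detector.recordFailure(t0.plusSeconds(30))); // false
        System.out.println(detector.recordFailure(t0.plusSeconds(60))); // true: 3 in a row
    }
}
```

Whatever the exact numbers, the point is that the thresholds fall out of the business questions above, not out of the code.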

The last missing piece is how failover will actually look in practice.
One approach to this problem is to wrap your application logic with some retry logic.
There are many good libraries out there that provide fault tolerance and resiliency, like Resilience4j for Java, Arrow Fx for Kotlin, and ZIO for Scala (notice how all of them are functional libraries).
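The shape of that retry-then-failover logic can be sketched in plain Java without any library. Here `sendToCluster` is a hypothetical stand-in for a real produce or consume call, and the bootstrap addresses are illustrative:

```java
import java.util.function.Supplier;

// Sketch: retry against the active cluster a few times,
// then fail over to the passive (healthy) cluster.
public class FailoverRetry {

    static final int MAX_RETRIES = 3;

    // Hypothetical stand-in for a produce/consume call against a given cluster.
    static String sendToCluster(String bootstrap, boolean healthy) {
        if (!healthy) throw new RuntimeException("cluster unreachable: " + bootstrap);
        return "ok via " + bootstrap;
    }

    static String sendWithFailover(Supplier<String> primary, Supplier<String> fallback) {
        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            try {
                return primary.get();
            } catch (RuntimeException e) {
                System.out.println("attempt " + attempt + " failed: " + e.getMessage());
            }
        }
        // All retries exhausted: fail over to the healthy cluster.
        return fallback.get();
    }

    public static void main(String[] args) {
        String result = sendWithFailover(
            () -> sendToCluster("aws-kafka:9092", false),  // active cluster is down
            () -> sendToCluster("gcp-kafka:9092", true));  // passive cluster takes over
        System.out.println(result);
    }
}
```

A library like Resilience4j adds the parts this sketch leaves out, such as exponential backoff, circuit breaking, and metrics.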

The goal here is: once the applications lose connectivity to the cluster, catch that, have the retry logic kick in, and eventually, if needed, fall back to the healthy cluster.

Note: You might already be running hundreds of apps in production, or have many different teams, each creating their own apps. Adding resiliency to your application logic may be hard, both to implement and to coordinate across teams. Alternatives include putting a load balancer in front of the clusters to handle the traffic, or using service mesh technologies.

Wrapping Up

Failure is something that will happen sooner or later and you need to account for it. When designing disaster recovery solutions you might need to ask:

  1. What does "my cluster is lost" mean within our business context?
  2. How fast do my apps need to switch to a healthy cluster and be operational?
  3. What is the best approach to implementing an application failover?

This is the end of the In the Land of Streams Kafka series.

Stay around for more StreamingLedger stories 👋😁


Giannis Polyzos

Staff Streaming Product Architect @ Ververica ~ Stateful Stream Processing and Streaming Lakehouse https://www.linkedin.com/in/polyzos/