Kafka Resilience Strategies: Active-Active vs. Active-Passive
Kafka has become a cornerstone of many business-critical IT architectures thanks to its robust data replication and resilience against single-node failures. However, when it comes to ensuring business continuity and disaster recovery (DR), choosing the right backup strategy is crucial. This post explores the differences between active-active and active-passive backup solutions in Kafka-based environments.
Active-Active Backup
Active-active backup involves maintaining a primary Kafka cluster that handles all business operations, alongside a secondary cluster that mirrors the primary one in real time. In the event of a primary cluster failure, the secondary cluster can immediately take over, resulting in minimal downtime.
Tools such as MirrorMaker 2, Confluent Replicator, and Confluent Cluster Linking are common ways to implement this.
The advantages of this approach:
- Near-zero downtime during failover.
- Continuous availability for critical applications.
The disadvantages:
- Double provisioning costs.
- Problems such as corrupted or accidentally deleted data can be replicated to both clusters.
- Higher maintenance and operational complexity.
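The trade-off between fast failover and replicated mistakes can be illustrated with a toy model. The sketch below is plain Python with no Kafka client involved; `Cluster` and `mirror` are hypothetical names, and a real mirroring tool deals with topics, partitions, offsets, and consumer groups in far more detail.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy stand-in for a Kafka cluster: one append-only log of records."""
    name: str
    log: list = field(default_factory=list)


def mirror(primary: Cluster, secondary: Cluster) -> int:
    """Copy records the secondary has not seen yet; return how many were copied.

    The mirror does not inspect the records, so bad data is copied as well.
    """
    missing = primary.log[len(secondary.log):]
    secondary.log.extend(missing)
    return len(missing)


primary = Cluster("primary")
secondary = Cluster("secondary")

primary.log += ["order-1", "order-2"]
mirror(primary, secondary)

# A faulty deployment writes a corrupt record; mirroring copies it too.
primary.log.append("CORRUPT")
mirror(primary, secondary)

# After a failover, clients would read this log, corrupt record included.
print(secondary.name, secondary.log)
```

The secondary is always ready to take over, but it is only as healthy as the data the primary sends it.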
Active-Passive Backup
Active-passive backup involves backing up the primary Kafka cluster’s data to a different storage medium, such as disks or blob storage. This method is particularly useful for long-term data preservation and recovery from severe incidents that affect data integrity.
The advantages of this approach:
- Comprehensive protection against data corruption and loss due to technical or human causes.
- Flexibility: files are a lot easier to handle than an active Kafka cluster. This also opens the door to air-gapped or isolated storage solutions.
- By far the cheapest option.
The disadvantages:
- Recovery times are longer.
- Backing up the data is usually quite easy; many teams have set up a Kafka Connect sink connector to their blob storage of choice. Restoring it, however, is non-trivial to do well without specialized solutions.
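One reason restoring is harder than backing up: a restored cluster assigns new offsets, so consumer positions recorded against the old offsets no longer line up. The sketch below is a simplified model in plain Python, assuming records are backed up as JSON lines together with their original partition and offset. The functions `backup` and `restore` and the record layout are hypothetical and do not reflect the file format of any particular Kafka Connect sink.

```python
import io
import json


def backup(records, out):
    """Write each record as one JSON line, keeping its partition and offset."""
    for rec in records:
        out.write(json.dumps(rec) + "\n")


def restore(src, num_partitions):
    """Rebuild per-partition logs and a map from old offsets to new offsets.

    Records are replayed in original offset order within each partition, so
    ordering is preserved even though the offsets themselves change.
    """
    records = [json.loads(line) for line in src if line.strip()]
    records.sort(key=lambda r: (r["partition"], r["offset"]))

    logs = {p: [] for p in range(num_partitions)}
    offset_map = {}  # (partition, old_offset) -> new_offset
    for rec in records:
        log = logs[rec["partition"]]
        offset_map[(rec["partition"], rec["offset"])] = len(log)
        log.append(rec["value"])
    return logs, offset_map


# Offsets start at 100 because older records were already removed by retention.
original = [
    {"partition": 0, "offset": 100, "value": "a"},
    {"partition": 1, "offset": 100, "value": "x"},
    {"partition": 0, "offset": 101, "value": "b"},
]

buffer = io.StringIO()  # stands in for a file on disk or in blob storage
backup(original, buffer)
buffer.seek(0)
logs, offset_map = restore(buffer, num_partitions=2)

# A consumer that had committed offset 101 on partition 0 must resume here.
print(offset_map[(0, 101)])
```

A real restore also has to deal with keys, headers, timestamps, topic configuration, and consumer group state, which is where specialized tooling earns its keep.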
Choosing the Right Strategy
As the saying goes, "to choose is to lose". That applies to both options here.
Businesses need to assess several factors to determine the most suitable backup approach:
- Business Continuity Requirements: assess acceptable downtime and the criticality of immediate switchover. Also assess the kinds of failures you want to protect against.
- Data Value and Integrity: evaluate the importance of data and the potential impact of data loss or corruption.
- Risk Profile and Strategy: understand the overall risk appetite and develop a robust Disaster Recovery strategy accordingly.
Accepting a risk can be a valid choice, but weigh it against the potential damage in case of a disaster.
Also keep in mind that a disaster can take many forms. Cyberattacks, hardware or datacenter outages, developer mistakes, or plain "oops" moments in operations can all qualify as disasters for your data integrity.
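The assessment factors in this section can be turned into a rough first-pass check. The function below is only a sketch: its name, its parameters, and the 15-minute threshold are hypothetical placeholders, and a real assessment would also weigh cost and risk appetite.

```python
def suggest_strategies(max_downtime_minutes, must_survive_corruption):
    """Return the backup approaches suggested by two of the factors above.

    max_downtime_minutes: acceptable downtime for the workload.
    must_survive_corruption: True if corrupted or deleted data must be
        recoverable, not only cluster outages.
    """
    fast_failover_threshold = 15  # hypothetical cut-off, in minutes

    strategies = []
    if max_downtime_minutes < fast_failover_threshold:
        # Only a live mirrored cluster can take over this quickly.
        strategies.append("active-active")
    if must_survive_corruption or not strategies:
        # A mirror replicates bad data too, so a separate backup is needed;
        # it is also the cheaper default when downtime is tolerable.
        strategies.append("active-passive")
    return strategies


print(suggest_strategies(5, True))
print(suggest_strategies(240, False))
```

Note that the two approaches are not mutually exclusive: a workload that needs both fast switchover and protection against corruption ends up with both.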
Incremental Adoption
The chosen strategy and required capabilities can also adapt over time.
One notable example started with a non-critical order processing segment focused on stock reservations and forecasting. Initially, an active-passive backup approach was sufficient, as minor downtime would not immediately impact operations. As more critical processes were integrated, active-active backup was added to ensure continuous availability, while the active-passive backup remained in place for comprehensive disaster recovery.
To summarize
TL;DR:
- Understand the bigger picture of your Kafka usage and the associated risks.
- A single solution is often only one piece of a larger puzzle. There are no silver bullets.
- Make informed decisions based on what’s possible, practical and affordable.
- Re-evaluate your choices and strategy periodically to keep them aligned with ever-evolving business needs.