Apache Kafka — Resiliency, Fault Tolerance, & High Availability

Understanding Resiliency, Fault Tolerance, & High Availability with Apache Kafka

Shreyas Chaudhari
Aug 20, 2019 · 2 min read

Apache Kafka is a distributed system, and distributed systems are subject to multiple types of faults. Some of the classic cases include:

  1. Broker stops working, becomes unresponsive, and cannot be accessed
  2. Data is stored on disks and a scenario causes the disks to fail and then the data cannot be accessed
  3. Suppose that there are multiple brokers in a cluster. Each broker is a leader of more than one partition. If one of those brokers fails or is inaccessible, then it will result in loss of data.

These are the scenarios where ZooKeeper comes to the rescue. The moment ZooKeeper realizes that one of the brokers is down, it does the following:

  1. It will find another broker to take the place of the failed broker
  2. It will update the metadata used for work distribution for producers and consumers to make sure that the system continues to function

Once ZooKeeper has performed the two above steps, the publishing and consumption of the messages will continue as normal. The challenge here is that the failed broker contains some data. Unless a provision is made to replicate the data somewhere else as well, there is a case of lost data.

Kafka provides a configuration property in order to handle this scenario — the replication factor. This property makes sure that all data is stored at more than one broker. Even in the case of the faults listed above, the replication factor will make sure that there is no risk of loss of data.

Another important thing to note is that the number for the replication factor must be determined. E.g. — If the replication factor is set to five, then it means that the data is replicated on five brokers. So even in the case where four out of five brokers go down, there will be no data loss.

Another term to be understood here is In Sync Replicas or ISRs. When the replica set is fully synchronized, i.e. ISR is equal to the Replication factor, it means that each topic and every partition within the topic is in a healthy state.

Let us see how Apache Kafka exhibits these features in the video below -

Steps of the video tutorial above are listed below

References -


Better Programming

Shreyas Chaudhari

Written by

Lead Consultant @ ThoughtWorks | Twitter : shreyasc_tweets | Instagram : shreyasc_clicks

Better Programming

Advice for programmers.

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade