Kafka — Core Concepts & Terminologies
Fundamentals of Kafka
Kafka has revolutionised the processing of events stream in distributed computing. Before the advent of Kafka, the industry employed message queues and publisher-subscriber systems. Both these systems had their limitations in scalability and performance. Kafka took the best of both the world from the two systems. Thus, delivering a scalable, highly-available, distributed commit log.
There is a common misconception that Kafka acts as a message queue. However, the semantics of message queue is different than that of Kafka. In essence, Kafka is a distributed commit log. It’s a highly available, scalable & a fault-tolerant system. It can act both as a message broker and a publisher-subscriber system.
In this article, we will take a walkthrough of the fundamentals of Kafka. We will see all the building blocks which power it’s working. Additionally, we will also take a look at the basics of replication and leader election in Kafka.
An event is anything that takes place in the world. Software applications generate events every time. A user browsing a social media website sends all its activity information to the backend system. Systems can gather statistics from the events data & extract patterns. For eg:- Netflix can recommend new movies to the users based on previously watched ones.
A topic is a logical aggregation of events. A group of events having some common characteristic becomes a topic. A News website may categorise all the events happening into topics like Politics, Sports, Entertainment, Technology, etc
The broker is an application which persists the events or messages. Publishers can use APIs to publish messages to a broker. The broker stores messages in sequential order. It’s similar to a write-ahead log (WAL). A log file is stored on the disk. As soon as a new message is published, the broker appends a new entry to the end of the file.
The messages are persisted as per the retention policy. They can get deleted by default after a certain duration of time. Kafka can be configured to persist the messages indefinitely. Every message written to a topic is given an offset index. This index is incremented every time a new message is written.
A Publisher is a system that communicates events to a broker. Once the message is published, Publisher doesn’t care about what happens to it next. In this manner, Publisher and the consumer are decoupled from each other.
A publisher can be an upstream server application that is producing logs or a client application sending user website activity.
A single broker system is not scalable as it’s a single point of failure. Further, high throughput load can’t be served by a single broker. For scaling the messaging system, Kafka uses the concept of partition.
The partitioning concept is similar to database partitioning. The core idea is to distribute the data on multiple brokers instead of using a single broker. A topic is divided into partitions. Each partition is managed by a different broker. In case of a single broker failure, messages from one partition will become unavailable. However, messages from other partitions can still be read.
Each event can be thought of as a Key-Value pair. Producers can use multiple strategies for distributing the Keys among partitions. A common and efficient approach is Consistent Hashing. Although, others such as Hashing and range-based partitioning can still be used.
As seen from the above diagram, the News topic is partitioned based on type. Hence, related news is now clubbed together in a single partition. For instance, sports news will be added to the partition 1, & politics news to partition 2.
Kafka doesn’t guarantee message ordering across the partitions. However, all the messages written to a single partition are ordered as per their timestamp. To illustrate, in the above figure, it’s guaranteed that the event “Politics — UK Election” has come after “Politics — US Election”. However, we can’t say that “Technology — Self-driving cars” has come before “Politics — US Election”. Since the two events belong to different partitions.
In the last section, we saw how partitioning overcomes a single point of failure. But what if a broker managing a single partition stops responding or goes down? How would you read the messages written to that partition?
Replication or data redundancy comes to rescue here. Every partition has a replica stored on a different broker. Messages are synced or copied to the replica. If the primary partition fails, then the messages can be read from the replica. Broker managing the primary partition is known as Leader for that partition. Other brokers which handle the replica for the same partition are followers.
A broker can be a leader for a given partition and it can function as a follower for other partitions. As shown in the diagram, Broker 1 is a leader for partition 0 whereas it’s a follower for partition 1.
Producers always write the data to the Leader. Once the Leader commits the data, followers poll the leader to bring their replicas in syn with the Leader. The above diagram illustrates the process of data replication in Kafka. Thus, Kafka achieves fault-tolerance through the mechanism of replication.
A consumer group can subscribe to one or more topics. Each consumer group has at least one consumer. Every consumer reads data from a topic partition using an offset. The messages in partition are not deleted as soon as the consumer reads the data. Hence, those are still available for other consumer groups to consume. Every time a client reads data at a given offset, it’s offset is incremented. Further, the consumer will read data at the next offset.
Let’s assume that we have a single consumer in the group and it’s reading from a topic with two partitions. In this case, the consumer will have to read from both the partitions. To increase the throughput, we can add another consumer in the group. Kafka will distribute the load equally among the two consumers in the group. Adding a new consumer will result in 1–1 mapping between partitions and the consumers.
If more consumers are added than the number of partitions, those consumers will stay idle. So, the number of consumers in any group must be less than or equal to the count of partitions.
One of the Broker in the cluster is selected as a Controller. When a new broker process is started, it registers in Zookeeper. The first broker to register in Zookeeper becomes the Controller.
The primary function of a Controller is to check the status of other brokers. Also, when a new leader for a partition is elected, it communicates the same to other brokers.
The Controller periodically checks the health of other brokers in the system. In case, it doesn’t receive a response from a particular broker, it performs a failover to another broker. It also communicates the result of leader election to other brokers in the system.
Currently, the Kafka cluster doesn’t function without a Zookeeper instance. Running a Kafka cluster requires management of topics, partitions, controller and brokers. Zookeeper manages the configuration of the Kafka cluster.
We saw in the previous section how one of the brokers is selected as a controller by Zookeeper. Zookeeper elects a new Controller if the existing Controller stops working.
Zookeeper maintains a list of all the brokers running in the cluster. This list is updated whenever a new broker joins or leaves the system. It tracks the topics, number of partitions assigned to those topics, & replica location. It also manages the access control list to different topics in the cluster.
How Kafka functions as a message queue and publisher-subscriber?
A Message Queue works like a queue data structure. Producers add messages to the rear of the queue. Consumers read the messages from the front of the queue.
As soon as a Consumer reads a message, it is removed from the queue. This is one of the drawbacks of Message Queue. Multiple Consumers can’t read the same message from the queue.
In a Pub-Sub architecture, topics are defined. Publishers publish messages on different topics. Subscribers subscribe to topics and read events published on those topics.
Subscribers can subscribe to multiple topics. Subscribers have to subscribe to multiple partitions of a given topic.
If the message throughput on a topic increases, it will result in scalability issues. Since there are limitations on improving the performance of the Subscriber. Moreover, we can’t leverage parallelism here to improve the overall throughput.
Kafka harnesses the capabilities of the message queue and the publisher-subscriber pattern. The concept of Consumer groups allows Kafka to get best of both the worlds.
Similar to Publisher-Subscriber, Kafka Consumer groups can subscribe to multiple topics. Within a Consumer group, Kafka distributes the partition among different consumers. It rebalances the load distribution with the addition or removal of a consumer.
When the number of partitions equals the count of consumers, Kafka works as a message queue. Each consumer reads the message only once from a topic partition. Additionally, it also provides message persistence within a topic partition.
We have seen in this article the basic building blocks of Kafka. Following is a summary:-
- Events are immutable & are published by the Producers or Publishers
- Kafka works as a distributed commit log. Message brokers are responsible for processing & persisting the messages
- A Kafka topic is a logical aggregation of events. The topic is further divided into partitions
- Every partition is replicated for fault tolerance and redundancy
- Zookeeper functions as a centralized configuration management service
- Kafka can function both as a message queue and a publisher-subscriber system