In this article, we’ll look into Kafka fundamentals which helps Kafka beginners to get insights of Kafka components.
3. Consumer Group
8. Replication Factor
9. Leader and Follower
10. ISR-In Sync Replica
11. Acknowledgement (Acks: 0, 1, all)
13. Broker / Node / Bootstrap Server
Apache Kafka is a scalable, fault tolerant and distributed streaming platform with key capabilities as below.
- Publishes and subscribes to streams of records/events.
- Stores streams of records/events in a fault-tolerant durable way.
- Processes streams of records/events as they occur.
Kafka is best suited for real-time streaming requirements
- Building real time streaming data pipelines that get the data between apps
- Building real time streaming apps that react to the data streams (Event driven systems)
Producer creates new messages and publishes data to the topic. Producer is responsible for choosing which record to assign to which partition within the topic.
By default, the producer doesn’t care about the topic-partition on which the messages are written to and will balance the messages fairly over all the partitions of a topic. Producer directs the message/record to the partition based on the message key and the partitioner.
Consumer and Consumer Group
The consumers can subscribe to one or more topics and read messages in the same order as they are produced. Consumer also keeps track of the messages which are already read with the help of message offsets.
Consumer Group refers to a group of one or more consumers which share the workload. Consumers are mapped to the specific consumer group with group-id/name.
The consumer group assures each record published to a topic can be read by one of the consumer instance from the same consumer group. The records will be load balanced across the consumers from the same consumer group.
A Record can be read by multiple consumers from different consumer groups.
Topic and Partition :
Topic can be a considered as table in database or a folder in a filesystem. Kafka publishes the records into Topics and can have zero or one or more consumers to read the data written to topics. Topics are additionally broken into partitions. Kafka cluster maintains a a commit log for every topic.
Below image shows a topic with 3 partitions
Partition can be described as a immutable collection of messages where the messages can be written in append-only fashion. Record ordering in Kafka can guaranteed per partition.
Each topic can have 1 or more partitions. Number of copies of a partition for a topic depends on replication factor(RF) configured during topic creation.
- If RF=1, only leader has the partition and there are no followers for the partition.
- If RF=2, means there are 2 copies of the partition. One with leader and the other with follower.
- If RF=3, means there are 3 copies of the partition. One with leader and the 2 with followers.
Offset is a unique number assigned to a record on the partition. Whenever a new record is published, current offset will be incremented by one and add the record on the partition. Offsets plays a vital role while reading the data from consumers (consumers read the data based on last committed offset).
Each partition is subdivided into segments, which is a collection of records of a partition. Instead of saving all the information of a partition in a single log, Kafka breaks it into smaller chunks (segments).
Kafka always write records to a active segment of a partition. If a segment size is reached, a new segment will be created and acts as an active segment.
It signifies the number of copies exists for a partition. Replication factor configuration plays important role when comes to availability. A replication factor of N allows you to lose N-1 brokers while still be able to read/write records to the topic reliably.
There are two types of replicas
A partition is owned by a single broker in the cluster and that broker is referred as the leader of the partition. As we know producers and consumers performs writes/reads data on the leader partition and the same will be replicated on the follower partitions. Once the partition leader is elected with leader election, information about the new leader will be notified to all the brokers
A follower partition is responsible to monitor the partition leader to replicate the reads/writes same as in leader and stays up-to-date with leader. Followers which are up to date or in-sync with current leader are part of ISR’s list and are eligible for leader election when the current leader replica dies.
How a record propagate from Producer to Topic partitions?
Producer gets the partition leader information from Zookeeper and it always sends a record to the leader of the partition and the follower gets the record from the partition leader. When a Producer sends a record to the Leader of the partition, records will be replicated to the follower (replicas) partitions as below.
- Producer send a record to partition-0 (leader at Broker-0) it will be propagated to follower partitions (Broker-1 and Broker-2 as in diagram).
- Producer send a record to partition-1 (leader at Broker-2) it will be propagated to follower partitions (Broker-0 and Broker-1 as in diagram).
- Producer send a record to partition-2 (leader at Broker-1) it will be propagated to follower partitions (Broker-0 and Broker-2 as in diagram).
ISR — In-Sync Replica
An ISR is a broker which is fully in-sync with the current partition leader data (i.e., partition offset number reflects same for both leader and ISR’s) , when current partition leader fails, controller is responsible to elect one of the broker from the ISR’s list as partition leader through leader election.
Acknowledgement — ‘acks’ indicates the number of brokers to acknowledge the message before considering it as a successful write. Acks will be configured at Producer.
‘acks’ accepts 3 values ‘0’, ‘1’ and ‘All’
- Acks = 0
Here Producer will not wait for the acknowledgement from the brokers before moving to the next request. It will be assumed as record sent successfully once the producer send the record. In this case, producer wouldn’t have any information whether the record published to the broker successfully or failed due to some error. With this setting there is a high chance for message loss.
This setting can be used to achieve very high throughput since the producer can send the messages as fast as the network support.
2. Acks = 1
Broker (with leader partition) sends a success response to the producer once the leader replica receives the message.
Producer may receive error response and can retry sending the message, if the leader broker dies/failed to receive the message.
This setting reduce the percentage of lost messages compared to ‘acks=0’ since producer waits for acknowledgement from leader replica but the throughput may go down (depends on sync/async transmission) since producer wait for leader acknowledgement.
3. Acks = All
Here the producer will consider the write as successful once all the in-sync replicas receives the record.
This setting makes sure the record is safe since multiple brokers receives the message and the message will survive even in the case of crash.
Note: There are cases, where leader is the only in-sync replica (or) topic created with replication-factor as 1, partition as 1, one broker. The setting ‘acks=all’ may give error response based on minimum ISR configuration.
Zookeeper is responsible for
- Cluster membership: Kafka uses Zookeeper to maintain the list of brokers that are currently members of the cluster. Every broker registers itself with Zookeeper with unique broker id.
- Controller election: First broker that starts in the cluster becomes the controller by creating an ephermal node in Zookeeper. Zookeeper maintains the controller information and other brokers information. when the current controller dies Zookeeper elects new controller from the active brokers in the list.
- Topic configuration: Zookeeper maintains topic metadata — like list of all the topics, the number of partitions for each topic, partition leader information and location of all the replicas, customized/overridden configuration for the topics etc..
- ACL (Access Control Lists): Zookeeper maintains ACL information which helps to control who can do what.
Broker (or) Node (or) Bootstrap Server
It’s an instance/Virtual machine/physical machine, where Kafka is running. In Kafka world, server is referred as Broker.
Kafka Broker is responsible for managing the partitions, handle reads/write requests from the clients (producers, consumers), managing partition replications.
Controller Broker (or) Cluster Controller (or) Controller:
Kafka cluster is nothing but a group of one or more brokers, out of which one broker acts as a cluster controller (first broker that starts in the cluster become the controller (or) elected automatically from the live members of the cluster when active controller dies/fails for some reason). At any point in time, cluster can only have one active controller.
Controller is just like a normal broker with additional responsibility. It will do its common broker responsibilities as managing partitions, handle read/write requests, manage replication of partitions.
Controller’s additional responsibility include — assigning partitions to the brokers, keeping track of the nodes and appropriately handle broker that join/fails/leaves the cluster and reassign the partitions accordingly.
That’s all for this article on Kafka fundamentals, hope it helps to have basic overview on Kafka components to start with Kafka learning.
Will come up with new article to understand these components in depth.
Learn Kafka Connect: Kafka Connect: Quick Start