Data Engineering concepts: Apache Kafka
Introduction
Apache Kafka is an open-source distributed event streaming platform for building real-time data pipelines and streaming applications.
Originally developed by LinkedIn, it was later open-sourced and became a part of the Apache Software Foundation.
Apache Kafka is a log-based message broker.
Kafka is designed to handle high-throughput, fault-tolerant, and scalable event streaming in real time.
Architecture
Below are the different components of Kafka:
- Producer
- Broker
- Topic
- Consumer
- Partition
- Replication
- ZooKeeper
Producer:
- The Producer is responsible for publishing records (messages) to Kafka topics.
- It sends records to Kafka brokers, which handle the data distribution.
- Producers can be configured to send records to specific partitions within a topic or let Kafka handle partitioning using a configurable partitioner.
- Producers can be built into various applications and systems, allowing them to push data into Kafka for real-time processing and analysis.
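To make this concrete, here is a minimal sketch of a Java producer. The broker address (localhost:9092), topic name ("events"), and key/value payload are illustrative placeholders, not values from the original text:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder bootstrap address; point this at your own cluster.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key ("user-42") determines the partition via the default partitioner.
            ProducerRecord<String, String> record =
                new ProducerRecord<>("events", "user-42", "page_view");
            producer.send(record, (RecordMetadata metadata, Exception e) -> {
                if (e == null) {
                    System.out.printf("Written to partition %d at offset %d%n",
                        metadata.partition(), metadata.offset());
                }
            });
        } // close() flushes any buffered records
    }
}
```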
Broker:
- Kafka brokers are individual instances or nodes within a Kafka cluster.
- A broker can run on bare metal hardware, a cloud instance, in a container managed by Kubernetes, in Docker on your laptop, or wherever JVM processes can run.
- Brokers store and manage the data records received from producers.
- They handle data replication, partitioning, and distribution across the cluster.
- Each broker is capable of serving multiple partitions from multiple topics.
- Kafka brokers communicate with each other to ensure data consistency and fault tolerance, as well as to maintain metadata about topics, partitions, and consumer group coordination.
- They are responsible for writing new events to partitions, serving reads on existing partitions, and replicating partitions among themselves. They don’t do any computation over messages or routing of messages between topics.
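As a small illustration of brokers as nodes in a cluster, an application can list them with the Java AdminClient. The bootstrap address is a placeholder:

```java
import java.util.Collection;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.DescribeClusterResult;
import org.apache.kafka.common.Node;

public class ListBrokers {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            DescribeClusterResult cluster = admin.describeCluster();
            // Each Node is one broker in the cluster.
            Collection<Node> brokers = cluster.nodes().get();
            for (Node broker : brokers) {
                System.out.printf("Broker id=%d host=%s port=%d%n",
                    broker.id(), broker.host(), broker.port());
            }
        }
    }
}
```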
Topic:
- A topic is an ordered log of events. When an external system writes an event to Kafka, it is appended to the end of a topic.
- By default, messages aren’t deleted from topics until a configurable amount of time has elapsed, even if they’ve been read.
- Topics are proper logs, not queues; they are durable, replicated, fault-tolerant records of the events stored in them. Logs are a handy data structure that is efficient to store and maintain, but reading them is deliberately simple: you can only scan a log sequentially from some offset, not query it.
- Each topic can have one or more partitions, which allow for parallel processing and scalability.
- Topics can be configured with properties such as retention period, replication factor, and cleanup policies.
- Producers publish records to topics, and consumers subscribe to topics to consume records.
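As a sketch of configuring a topic programmatically, the Java AdminClient can create one with a chosen partition count, replication factor, and retention period. The topic name and numeric values below are illustrative assumptions:

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, replication factor 2: illustrative values only.
            NewTopic topic = new NewTopic("events", 3, (short) 2)
                // Retain records for 7 days (retention.ms is in milliseconds).
                .configs(Map.of(TopicConfig.RETENTION_MS_CONFIG, "604800000"));
            admin.createTopics(Collections.singleton(topic)).all().get();
            System.out.println("Topic created");
        }
    }
}
```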
Consumer:
- Consumers read records from Kafka topics.
- They subscribe to one or more topics and process the records published on those topics.
- Consumers can be part of a consumer group, where each consumer in the group is assigned a subset of partitions within the subscribed topics.
- Consumers can be implemented in various applications and systems to ingest and process data from Kafka in real time.
- Kafka supports consumer groups containing a single consumer or many, providing fault tolerance and scalability for consuming records.
- Kafka is different from legacy message queues in that reading a message does not destroy it; it is still there to be read by any other consumer that might be interested in it. In fact, it’s perfectly normal in Kafka for many consumers to read from one topic.
- Also, consumers need to handle the scenario in which the rate at which messages arrive on a topic, combined with the computational cost of processing each message, is too high for a single instance of the application to keep up. That is, consumers need to scale. In Kafka, scaling consumer groups is more or less automatic.
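A minimal Java consumer might look like the following sketch. The group id ("analytics-service"), topic name, and broker address are placeholders; any other consumer started with the same group.id would share the topic's partitions with this one:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // Consumers sharing this group.id split the topic's partitions among themselves.
        props.put("group.id", "analytics-service");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                        record.partition(), record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```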
Partition:
- Kafka gives us the option of breaking topics into partitions. Partitions are a systematic way of breaking the one-topic log file into many logs, each of which can be hosted on a separate server.
- Each partition is an ordered and immutable sequence of records.
- Kafka writes records with the same key to the same partition, which guarantees ordering among records that share a key.
- Partitions can be spread across multiple brokers in a Kafka cluster for fault tolerance and load balancing.
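The following is a simplified sketch of the idea behind key-based partitioning. It is not Kafka's internal code (the default partitioner hashes the serialized key with murmur2); it only illustrates that hashing the key modulo the partition count sends equal keys to the same partition:

```java
public class PartitionSketch {
    // Simplified stand-in for Kafka's default partitioner; hashCode() is
    // only illustrative, Kafka itself uses a murmur2 hash of the key bytes.
    static int partitionFor(String key, int numPartitions) {
        int hash = key.hashCode();
        return (hash & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 3;
        // Records with the same key always map to the same partition,
        // which preserves per-key ordering.
        System.out.println(partitionFor("user-42", numPartitions)); // same value
        System.out.println(partitionFor("user-42", numPartitions)); // every time
        System.out.println(partitionFor("user-99", numPartitions)); // may differ
    }
}
```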
Replication:
- Replication is the process of duplicating data across multiple brokers for fault tolerance and high availability.
- Each topic has a configurable replication factor that determines how many copies of each partition exist in the cluster in total.
- Each partition can have one or more replicas, which are copies of the partition’s data.
- Replicas are distributed across different brokers to ensure data durability and resilience to broker failures.
- Kafka maintains consistency among replicas using a leader-follower replication model, where one replica serves as the leader and handles read and write requests while the others act as followers and replicate data from the leader.
- If a broker fails, Kafka automatically elects a new leader from the available replicas to ensure continuous data availability.
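To see the leader, replica set, and in-sync replicas (ISR) for each partition, the AdminClient can describe a topic. The topic name and broker address are placeholders, and allTopicNames() assumes a Kafka clients library of version 3.1 or newer:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class ShowReplicas {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription description = admin
                .describeTopics(Collections.singleton("events"))
                .allTopicNames().get().get("events");
            for (TopicPartitionInfo partition : description.partitions()) {
                // leader() is the replica serving reads and writes;
                // isr() lists the replicas currently in sync with it.
                System.out.printf("partition=%d leader=%d replicas=%s isr=%s%n",
                    partition.partition(), partition.leader().id(),
                    partition.replicas(), partition.isr());
            }
        }
    }
}
```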
ZooKeeper
ZooKeeper plays a crucial role in the architecture of Apache Kafka. It serves as a distributed coordination service responsible for managing and maintaining metadata and state information about the Kafka cluster. Here’s how ZooKeeper fits into the Kafka ecosystem:
Cluster Coordination: ZooKeeper manages the coordination and synchronization of Kafka brokers in a cluster. It is used to elect the cluster controller among the brokers and to coordinate the distribution of partitions across them.
ZooKeeper ensures that only one broker serves as the controller at any given time, responsible for tasks such as leader election, partition reassignment, and handling administrative requests.
Metadata Storage: Kafka relies on ZooKeeper to store critical metadata about topics, partitions, and brokers.
Metadata includes information such as the list of topics, their partition assignments, and the leader replica for each partition. (Consumer group offsets, by contrast, are stored in Kafka's internal __consumer_offsets topic in modern versions, not in ZooKeeper.)
Producers and consumers do not connect to ZooKeeper directly; they contact the brokers listed in bootstrap.servers and fetch cluster metadata from them to determine which brokers to produce to or consume from.
Leader Election: ZooKeeper facilitates leader election for Kafka partitions.
Each partition in Kafka has a leader replica that handles read and write requests for the partition. If the leader replica fails, ZooKeeper helps in electing a new leader from the available replicas to ensure continuous data availability and fault tolerance.
Consumer Group Coordination: In older Kafka versions, ZooKeeper stored consumer group metadata, including consumer offsets, which indicate the position of each consumer within a partition.
In current versions, group membership, partition rebalancing, and offset storage are handled by a broker-side group coordinator and the internal __consumer_offsets topic, so consumers no longer talk to ZooKeeper.
Quorum: ZooKeeper operates in a replicated mode with a quorum of servers forming a ZooKeeper ensemble. Kafka administrators configure an odd number of ZooKeeper servers to ensure fault tolerance and consistency.
ZooKeeper requires a majority of servers (quorum) to be operational to maintain consistency and make progress.
In summary, ZooKeeper acts as the central nervous system of a ZooKeeper-based Kafka cluster, providing distributed coordination, metadata management, and leader election capabilities essential for Kafka’s operation and reliability. (Newer Kafka releases can instead run in KRaft mode, which replaces ZooKeeper with a Raft-based controller quorum built into the Kafka nodes themselves.)
Why use Consumer Offsets?
Offsets are critical for many applications. If a consumer crashes, a group rebalance occurs and the latest committed offsets tell the remaining consumers where to resume reading and processing messages.
When a new consumer joins a group, another rebalance happens and the committed offsets are again used to tell each consumer where to start reading.
Therefore, consumer offsets must be committed regularly.
In an AMQP/JMS-style message broker, once a consumer acknowledges a message, the broker deletes it; a newly added consumer only receives messages published after it registers.
In Kafka, by contrast, acknowledging (committing) a message does not delete it from the broker, so a newly added consumer group can still process older messages that remain within the topic's retention period.
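As a sketch of committing offsets explicitly rather than relying on auto-commit, a consumer can disable enable.auto.commit and call commitSync() after processing each batch. Group id, topic name, and broker address are placeholder assumptions:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ManualCommitConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "billing-service");
        // Disable auto-commit so offsets are committed only after processing succeeds.
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record);
                }
                // Commit the offsets of the records returned by the last poll.
                // If this consumer crashes before committing, another group member
                // re-reads from the last committed offset after the rebalance.
                consumer.commitSync();
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.printf("processing offset %d: %s%n", record.offset(), record.value());
    }
}
```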
In conclusion, with its core capabilities of high throughput, scalability, permanent storage, and built-in stream processing, Kafka enables a new generation of distributed applications capable of handling billions of streamed events per minute.