Apache Kafka

Sruthi Sree Kumar
Big Data Processing
8 min read · Jul 30, 2021

Apache Kafka is an open-source, distributed event streaming platform. It decouples data streams from the systems that process them, acting as a communication channel between the two: source systems produce data by writing it to Kafka, and consumers receive that data by reading it from Kafka.

Fig a: Sources write to Kafka and consumers read from Kafka

Some of the important features of Kafka are:

  • Distributed
  • Resilient
  • Fault-tolerant (Due to replication)
  • Horizontal scalability (Can scale up to 100s of brokers and millions of messages per second)
  • High performance (Very low latency)

Some of the major use cases of Kafka in the real world include:

  1. Messaging System: Kafka can replace traditional message brokers. A message broker decouples producers from consumers: it runs as a server, with producers and consumers connecting to it as clients. Compared to most existing messaging systems, Kafka offers better throughput, built-in partitioning, replication, and fault tolerance.
  2. Activity Tracker: Activity tracking generates a very high volume of data (e.g., website activity tracking that records page views, clicks, and searches). According to the creators of Apache Kafka, tracking website activity was its original use case. Because a message is generated for every user action, activity tracking requires very high throughput, which makes Kafka a good fit.
  3. Metrics collection: Metrics are the indicators of a system’s performance. Usually, metrics are collected in real-time. This implies that metrics data are constantly generated and can be sent as a data stream. As Kafka can stream data, it can be used for collecting metrics.
  4. Log Aggregation: Log aggregation consolidates log data from the application into a single centralized platform where it can be reviewed and analyzed. With Kafka, we get a cleaner abstraction of log data as a stream of messages and this allows for lower-latency processing and easier support for multiple data sources and distributed data consumption.
  5. Stream Processing: Stream processing is the processing of data in motion or in other words computing on data directly as it is produced or received. Users of Kafka process data in processing pipelines consisting of multiple stages, where raw input data is consumed from Kafka topics and then processed into new topics for further consumption or follow-up processing.

Some of the real-world use cases of Kafka include:

  • Netflix uses Kafka to apply recommendations in real-time while you are watching
  • Uber uses Kafka to gather user, taxi, and trip data in real time in order to forecast demand and compute surge pricing in real time
  • LinkedIn uses Kafka to prevent spam and to collect user interactions so it can make better connection recommendations in real time

Now, let us look into some of the Kafka terminologies and how Kafka actually works.

Kafka Topic

In Kafka, events are organized and durably stored in topics. A topic is a stream of data. You can have as many topics as you want, and each topic is identified by its name.

Each topic is split into partitions; the number of partitions is configured when the topic is created. Messages are appended to a topic partition in the order they are sent. Partitions are numbered starting from 0. Each partition is ordered, and each message within a partition gets an incremental id called the offset. An offset is meaningful only within its own partition.

Fig b: A Kafka topic with 3 partitions

In figure b, topic 1 has 3 partitions (0, 1, and 2), and each message in a partition has an offset. The last message written to Partition 0 has an offset of 4, so the next message will get an offset of 5. As already mentioned, an offset has scope only within its partition; offset 1 in partition 0 therefore holds different data from offset 1 in partition 1 or partition 2.

In Kafka, ordering is guaranteed only within a partition; Kafka provides no ordering guarantees across partitions. There are various ways to partition the data for a topic: unless a key is specified, data is written to a random partition. Kafka retains data only for a limited, configurable time period. The data is immutable, i.e., once data is written to a partition, it cannot be updated.
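
As a minimal sketch, a topic with a chosen partition count, replication factor, and retention period can be created with the Java AdminClient. The topic name, broker address, and all values below are placeholders, not settings from this article.

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;
    import java.util.Collections;
    import java.util.Map;
    import java.util.Properties;

    public class CreateTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address

            try (AdminClient admin = AdminClient.create(props)) {
                // hypothetical topic: 3 partitions, replication factor 2
                NewTopic topic = new NewTopic("topic-1", 3, (short) 2)
                        .configs(Map.of("retention.ms", "604800000")); // retain data for 7 days
                admin.createTopics(Collections.singletonList(topic)).all().get();
            }
        }
    }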

Broker

A Kafka cluster is composed of multiple brokers. A broker is a server that holds some of the topic partitions, and each broker is identified by a unique integer id.

Fig c: A Kafka cluster with 3 brokers

In figure c, we have 3 brokers and each broker holds partitions of different topics.

Kafka uses replication for failover: if a broker is down, another one can serve the data. A replication factor of 3 is a common choice for surviving failures. At any time, only one broker can be the leader for a given partition, and only the leader can receive and serve data for that partition. The other brokers simply synchronize (replicate) the data.

Fig d: Kafka cluster with 3 brokers and replication factor 2

Figure d depicts a Kafka cluster with 3 brokers and replication factor 2. We have a topic (Topic A) with 4 partitions. As the replication factor is 2, each partition is written to 2 different brokers (The leader and replica are denoted in different colors).

Fig e: Broker failure

If a broker fails, as in figure e, no data is lost because the partitions are already replicated: Topic A Partition 1 can still be served from Broker 1 and Topic A Partition 2 from Broker 2.

Broker discovery

Every broker in a Kafka cluster can act as a bootstrap server. The bootstrap servers are a list of host/port pairs used to establish the initial connection to the Kafka cluster; hence, by connecting to one bootstrap broker, you are effectively connected to the entire cluster. Each broker knows about all other brokers, topics, and partitions. When a client connects to a Kafka cluster, it connects to one of the brokers and requests metadata; the broker returns metadata describing the other brokers, topics, and partitions, and the client uses this information to connect to whichever broker it needs. Brokers are managed by Zookeeper, which performs leader election for partitions and notifies Kafka of changes.
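
As a rough illustration of broker discovery (the two broker addresses below are placeholders), a client only needs the bootstrap list to learn about every broker in the cluster:

    import org.apache.kafka.clients.admin.AdminClient;
    import java.util.Properties;

    public class ListBrokers {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // any reachable subset of brokers works as the bootstrap list
            props.put("bootstrap.servers", "broker1:9092,broker2:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                // the metadata response lists every broker in the cluster
                admin.describeCluster().nodes().get().forEach(node ->
                        System.out.printf("broker id=%d at %s:%d%n",
                                node.id(), node.host(), node.port()));
            }
        }
    }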

Kafka Producers

Kafka producers are client applications that publish (write) events to Kafka; in other words, producers write data to topics. Producers automatically know which broker and partition to write to, and when a broker fails, producers automatically recover.

Fig f: A producer writing data to multiple partitions

Producers usually send a key along with the message. A key can be any value, such as an integer or a string. The data is routed to a partition based on the key: if a key is defined, the partitioner hashes it to compute the partition for the record, so all messages with the same key go to the same partition as long as the number of partitions stays constant. If no key is specified, the data is distributed across partitions in a round-robin fashion.
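
A minimal producer sketch using the Java client follows; the topic name, key, values, and broker address are hypothetical. Records that share the key "user-42" will always land in the same partition:

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import java.util.Properties;

    public class SimpleProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // same key => same partition (as long as the partition count is unchanged)
                producer.send(new ProducerRecord<>("topic-1", "user-42", "page_view"));
                producer.send(new ProducerRecord<>("topic-1", "user-42", "click"));
                // no key => the partitioner spreads records across partitions
                producer.send(new ProducerRecord<>("topic-1", "search"));
            }
        }
    }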

Producers can choose to receive acknowledgment of data writes. There are 3 possible acknowledgment models (a configuration sketch follows the list):

  1. acks=0: producer does not wait for an acknowledgment. There is a possibility for data loss in this model.
  2. acks=1: producer waits for the acknowledgment from the leader only. There is still a limited possibility of data loss if the leader fails before the replicas have copied the data.
  3. acks=all: producer waits for acknowledgment from the leader and all in-sync replicas. This gives the strongest guarantee against data loss.
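
In the Java producer sketch above, the model is chosen through the acks property. Shown here as a fragment of that same hypothetical configuration:

    // added to the Properties object passed to KafkaProducer in the sketch above
    // "0" = no wait, "1" = leader only, "all" = leader + in-sync replicas
    props.put("acks", "all");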

Kafka Consumers

Kafka consumers are the applications that subscribe to (read and process) the events written by producers. Within each partition, consumers read the data in the order in which it was stored; across partitions, there is no ordering guarantee.

Fig g: Consumers reading events from different partitions

Consumer Groups

A consumer group is a set of consumers that cooperate to consume data from one or more topics. Each consumer group keeps its own offsets per partition, and each consumer within a group reads from an exclusive set of partitions. Different consumer groups can therefore read from different positions in the same partition.

Fig h: Example of different consumer groups

Because each consumer within a group reads from exclusive partitions, if you have more consumers than partitions, some of the consumers will be inactive (Fig h, consumer group 2).
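
A minimal consumer sketch follows; the group id, topic, and broker address are placeholders. Starting several copies of this program with the same group.id makes Kafka split the partitions among them:

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    public class SimpleConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
            props.put("group.id", "group-1");                   // consumers sharing this id form one group
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("topic-1"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                                record.partition(), record.offset(), record.key(), record.value());
                    }
                }
            }
        }
    }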

Consumer Offsets

Kafka stores the offsets at which a consumer group has been reading in the internal topic __consumer_offsets. Offsets are committed live to this topic: when a consumer in the group has processed some data from Kafka, it commits the corresponding offset. Hence, when a consumer dies and later restarts, it can resume reading from the offset that was already committed. Consumers can choose when to commit an offset based on the configured delivery semantics. There are 3 possible semantics:

  1. At most once: offsets are committed as soon as the message is received. If the processing then goes wrong, the message is lost; each message is read at most once.
  2. At least once: the offset is committed only after the message has been processed. At least once is usually preferred over at most once, but it can result in duplicate processing of messages (a sketch of this pattern follows the list).
  3. Exactly once: even if a producer retries sending a message, the message is delivered exactly once to the end consumer.
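
As a sketch of at-least-once consumption, building on the hypothetical consumer above (process is a stand-in for your own logic): automatic commits are disabled and the offset is committed only after the records have been processed.

    // added to the consumer Properties above: do not commit offsets automatically
    props.put("enable.auto.commit", "false");

    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
            process(record);        // hypothetical processing step
        }
        consumer.commitSync();      // commit to __consumer_offsets only after processing
    }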

Consumers automatically know which broker to read events from, and in case of broker failures, consumers recover automatically.

To conclude, Apache Kafka is a high-performance, scalable, fault-tolerant distributed messaging system. These features enable users to build real-time distributed applications for a wide range of use cases.

References:

  1. Apache Kafka Series — Learn Apache Kafka for Beginners By Stéphane Maarek
  2. https://kafka.apache.org/uses
