Understanding Kafka — A Distributed Streaming Platform

Nimesh Khandelwal · Published in The Startup · Jul 2, 2019

An overview of the industry's most popular streaming platform

Introduction

Kafka has become a buzzword in the industry: almost every leading tech company uses it somewhere in its platform. But what exactly is it, and why all the hype?

Kafka is a distributed, horizontally scalable, open-source streaming platform. It originated at LinkedIn, became an open-source Apache project in 2011, and graduated to a top-level Apache project in 2012. Kafka is written in Scala and Java, and it aims to provide a high-throughput, low-latency platform for handling real-time data feeds.

How Kafka works

Kafka is based on the pub/sub model and, at a high level, resembles any other messaging system. Applications (producers) send messages (records) to a Kafka node (broker), and those messages are processed by other applications called consumers. Messages are stored in a topic, and consumers subscribe to the topic to receive them. A message can be anything: a sensor measurement, a meter reading, a user action, and so on.

[Figure: Kafka producer and consumer model]

A topic can be thought of as a category or feed name to which messages are published and stored.
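
To make this concrete, here is a minimal sketch of a producer using Kafka's Java client. The broker address (`localhost:9092`) and the topic name (`sensor-readings`) are made up for illustration:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Broker address is an assumption; point this at your own cluster.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish one record to the (hypothetical) "sensor-readings" topic.
            producer.send(new ProducerRecord<>("sensor-readings", "sensor-42", "23.7"));
        }
    }
}
```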

Partitions

A Kafka topic can grow very large, so it might not be possible to store all of a topic's data on a single node; the data needs to be divided into partitions. Partitions let us parallelize a topic by splitting its data across multiple brokers (Kafka nodes): each partition can be placed on a separate machine, allowing multiple consumers to read from a topic in parallel. Messages can be assigned to partitions in several ways, for example round-robin in arrival order, by hashing a key, or by an ID.
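
For intuition, here is a simplified sketch of key-based partitioning. Kafka's real default partitioner uses a murmur2 hash rather than Java's `hashCode`, but the principle is identical: the same key always maps to the same partition.

```java
// Simplified illustration; Kafka's default partitioner actually uses murmur2.
static int partitionFor(String key, int numPartitions) {
    // Mask off the sign bit so the result is a valid partition index.
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
}
```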

To increase availability, each partition also has replicas. To understand this better, consider a Kafka cluster consisting of three nodes/brokers.

Now a topic is split into three partitions, and each broker holds a copy of each partition. Among the copies of a partition, one is elected leader, while the others passively keep themselves in sync with the leader.

[Figure: A topic split into three partitions, named Partition 0, Partition 1, and Partition 2]

All writes and reads to a topic go through the respective partition leader and the leader coordinates updating replicas with new data. If a leader fails, a replica takes over as the new leader.

[Figure: Writes go only through the leader; replicas are updated asynchronously]
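
A topic like the one pictured can be created programmatically with the Java `AdminClient`. This sketch assumes a three-broker cluster reachable at `localhost:9092` and reuses the hypothetical `sensor-readings` topic name:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, each replicated on all 3 brokers (one leader + two followers).
            NewTopic topic = new NewTopic("sensor-readings", 3, (short) 3);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```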

For a producer or consumer to write to or read from a partition, it needs to know the partition's leader, right? This information has to be available from somewhere. Kafka stores such metadata in a service called ZooKeeper.

Log Anatomy

The key to Kafka's performance and scalability is the log. Developers are often confused when first hearing about this "log," because we are used to thinking of "logs" in terms of application logs. What we are talking about here, however, is the log data structure: a persistent, ordered data structure that supports only appends. You cannot modify or delete records from it. It is read from left to right and guarantees item ordering.

[Figure: Log structure]

A data source writes messages to the log, and one or more consumers read from the log at whatever point in time they choose.

Each entry in this log is uniquely identified by an offset, which is simply a sequential index number, just like an array index.

Because this sequence of offsets is maintained per partition on a single node/broker, and not across the whole cluster, Kafka guarantees the ordering of data only within a partition. If your use case requires complete ordering of data, you can define the partition key accordingly, for example by giving all related messages the same key so they land in the same partition.
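
As a sketch, here is a consumer that attaches directly to one partition, rewinds to the start of the log, and prints each record's offset (topic name and broker address are again assumptions):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class OffsetReader {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        // No consumer group is involved here, so there is nothing to commit.
        props.put("enable.auto.commit", "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Attach directly to partition 0 and rewind to the oldest retained record.
            TopicPartition partition = new TopicPartition("sensor-readings", 0);
            consumer.assign(Collections.singletonList(partition));
            consumer.seekToBeginning(Collections.singletonList(partition));

            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                // Offsets are strictly increasing within this one partition.
                System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
            }
        }
    }
}
```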

Persistence of Data in Kafka

Kafka actually stores all of its messages on disk, not in RAM, and having them ordered in the log structure lets it take advantage of sequential disk reads and writes.

A common question is how storing data on a hard disk can be a good choice, and how Kafka is still performant. There are several reasons:

  1. Kafka has a protocol that groups messages together. This allows network requests to batch messages and reduces network overhead; the server, in turn, persists a chunk of messages in one go, and consumers fetch large linear chunks at once, reducing disk operations. (A sketch of the relevant producer settings follows this list.)
  2. Kafka relies heavily on the OS pagecache for data storage, i.e., it makes effective use of the machine's free RAM.
  3. Because Kafka stores messages in a standardized binary format that is left unmodified throughout the whole flow (producer -> broker -> consumer), it can use the zero-copy optimization: the OS copies data from the pagecache directly to a socket, effectively bypassing the Kafka broker application entirely.
  4. Linear reads and writes on a disk are fast. The notion that modern disks are slow stems from the cost of numerous disk seeks; Kafka does linear reads and writes, which makes it performant.
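
The batching in point 1 can be tuned from the producer side. `batch.size` and `linger.ms` are real producer configs; the values below are arbitrary and would be added to the producer `Properties` from the earlier sketch:

```java
// Added to the producer Properties from the earlier sketch.
// Collect up to 64 KB of records per partition into one batch...
props.put("batch.size", 65536);
// ...and wait up to 10 ms for more records before sending a partial batch.
props.put("linger.ms", 10);
```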

Consumer and Consumer Groups

Consumers each read from a single partition, letting you scale the throughput of message consumption in a similar fashion to message production. Consumers can also be organized into consumer groups for a given topic: each consumer within the group reads from a unique partition, so no two processes in the group receive the same message, and the group as a whole consumes all messages from the entire topic. If there are more consumers than partitions, some consumers will be idle because they have no partition to read from. If there are more partitions than consumers, some consumers will receive messages from multiple partitions. If consumers equal partitions, each consumer reads messages in order from exactly one partition.

This can be further illustrated with an image from the Kafka documentation:

Server 1 holds partitions 0 and 3, and server 2 holds partitions 1 and 2. There are two consumer groups, A and B: A is made up of two consumers, and B of four. Consumer group A has two consumers, so each reads from two partitions. Consumer group B, on the other hand, has as many consumers as there are partitions, so each reads from exactly one partition.
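
In code, joining a consumer group is just a matter of sharing a `group.id`. Run several copies of this sketch (the group and topic names are made up) and Kafka will divide the partitions among them:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "group-a"); // every process with this id shares the work
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("sensor-readings"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```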

Kafka follows the principle of a dumb broker and smart consumers. The broker does not keep track of which records a consumer has read; it is unaware of consumer behavior and simply retains messages for a configurable period of time, leaving it to consumers to manage their own position. Consumers poll Kafka for new messages and specify which records they want to read. This lets them increment or decrement their offset at will, and thus replay and reprocess events in case of an accident.

For instance, if Kafka is configured to keep messages for a day and a consumer is down for longer than a day, the consumer will lose messages. However, if it is down for only an hour, it can begin reading again from its last known offset.
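
What a consumer does when it has no valid offset, for example because retention has already deleted the messages at its last known position, is controlled by the `auto.offset.reset` consumer config:

```java
// Added to the consumer Properties: where to start when no valid offset exists.
// "earliest" = oldest retained message, "latest" = only new messages, "none" = fail.
props.put("auto.offset.reset", "earliest");
```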

Role of ZooKeeper

ZooKeeper is a distributed key-value store. It is highly optimized for reads, while writes are slower. Kafka uses ZooKeeper for leader election of Kafka brokers and topic partitions. ZooKeeper is also extremely fault-tolerant, and it ought to be, since Kafka depends on it heavily.

It is used for storing all sorts of metadata, for example:

  • The consumer group's offset per partition (although modern clients store offsets in a separate Kafka topic)
  • ACLs (Access Control Lists), used for limiting access/authorization
  • Producer and consumer quotas, i.e., maximum messages/sec boundaries
  • Partition leaders and their health

Producers and consumers don't interact with ZooKeeper directly to learn a partition's leader or other metadata. Instead, they send metadata requests to a Kafka broker, which in turn talks to ZooKeeper and returns a metadata response; the producer or consumer then knows which broker to write to or read from for each partition.
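
You can observe this broker-mediated metadata yourself. The sketch below asks a broker, not ZooKeeper, for a topic's partition leaders via the `AdminClient` (cluster address and topic name are assumptions):

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class ShowLeaders {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // a broker, not ZooKeeper

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin
                    .describeTopics(Collections.singletonList("sensor-readings"))
                    .all().get().get("sensor-readings");
            for (TopicPartitionInfo p : desc.partitions()) {
                System.out.printf("partition %d: leader is broker %d%n",
                        p.partition(), p.leader().id());
            }
        }
    }
}
```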

Conclusion

Kafka is quickly becoming the backbone of many organizations' data pipelines, and with good reason. It lets you push a huge volume of messages through a centralized medium and store them without worrying about things like performance or data loss.

Kafka can be the centerpiece of event-driven architecture and allows you to truly decouple applications from one another.

Further Reading for a Deep Dive

  1. Kafka Documentation — a great and detailed guide to Kafka
  2. Kafka Streams made simple — a brief introduction to the Kafka Streams API
  3. Kafka Connect — to learn more about the Kafka Connect API

Thank you for taking the time to read this.
