Foundational Concepts of Kafka and Its Key Principles

Mahesh Saini · Published in The Life Titbits · May 2, 2023

Apache Kafka is a distributed event store and stream-processing platform. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.

Image from — https://developer.confluent.io/learn-kafka/architecture/get-started/
  • Kafka can connect to external systems (for data import/export) via Kafka Connect and provides the Kafka Streams libraries for stream processing applications.
  • It has numerous use cases including distributed logging, stream processing, data integration, and pub/sub messaging.
  • Kafka is a data streaming system that allows developers to react to new events as they occur in real time.
  • Kafka architecture consists of a storage layer and a compute layer. The storage layer is designed to store data efficiently and is a distributed system such that if your storage needs grow over time you can easily scale out the system to accommodate the growth.

The compute layer consists of four core components, which allow Kafka to scale applications across distributed systems:

  1. the producer API,
  2. the consumer API,
  3. the Streams API, and
  4. the Connect API.

1. Producer and Consumer APIs: The foundation of Kafka’s powerful application layer is two primitive APIs for accessing the storage — the producer API for writing events and the consumer API for reading them. On top of these are APIs built for integration and processing.

2. Kafka Connect: Kafka Connect, which is built on top of the producer and consumer APIs, provides a simple way to integrate data across Kafka and external systems. Source connectors bring data from external systems and produce it to Kafka topics. Sink connectors take data from Kafka topics and write it to external systems.

3. Kafka Streams: For processing events as they arrive, we have Kafka Streams, a Java library that is built on top of the producer and consumer APIs. Kafka Streams allows you to perform real-time stream processing, powerful transformations, and aggregations of event data.

Foundational Concepts

Message

  • A message is a record of information. Each message has an optional key, which is used to route the message to the appropriate partition of a topic, and a mandatory value, which holds the actual information. Both the key and the value of a message are arrays of bytes.
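To make the byte-array nature of keys and values concrete, here is a tiny sketch using Kafka's StringSerializer; the topic name and the key/value data are made-up examples.

```java
import org.apache.kafka.common.serialization.StringSerializer;

public class MessageBytesSketch {
    public static void main(String[] args) {
        // Illustrative only: whatever objects your application works with,
        // a serializer turns the key and the value into byte arrays before
        // they are sent to the broker. "orders", "customer-42" and the JSON
        // value below are invented example data.
        try (StringSerializer serializer = new StringSerializer()) {
            byte[] keyBytes = serializer.serialize("orders", "customer-42");
            byte[] valueBytes = serializer.serialize("orders", "{\"item\":\"book\",\"qty\":1}");
            System.out.printf("key = %d bytes, value = %d bytes%n",
                    keyBytes.length, valueBytes.length);
        }
    }
}
```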

Kafka Topics

Image from — https://developer.confluent.io/learn-kafka/architecture/get-started/
  • A topic is a log of events.
  • Apache Kafka’s most fundamental unit of organization is the topic, which is something like a table in a relational database.
  • You create different topics to hold different kinds of events and different topics to hold filtered and transformed versions of the same kind of event.

Important Properties

  • First, they are append only: When you write a new message into a log, it always goes to the end.
  • Second, they can only be read by seeking an arbitrary offset in the log, then by scanning sequential log entries.
  • Third, events in the log are immutable — once something has happened, it is exceedingly difficult to make it un-happen.
  • Logs are also fundamentally durable. Traditional enterprise messaging systems have topics and queues, which store messages only temporarily to buffer them between source and destination; Kafka topics, by contrast, persist their events to disk.
  • Every topic can be configured to expire data after it has reached a certain age, known as the retention period (see the sketch after this list).
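As a rough illustration of age-based expiry, the sketch below creates a topic with a seven-day retention using the Java Admin client. The broker address, topic name, partition count, and replication factor are all assumptions made for the example.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicWithRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumed broker address; replace with your cluster's bootstrap servers.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // "orders" with 3 partitions and replication factor 3 is illustrative.
            NewTopic topic = new NewTopic("orders", 3, (short) 3)
                    // Expire data older than 7 days (retention.ms is in milliseconds).
                    .configs(Map.of("retention.ms", "604800000"));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```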

Kafka Partitioning

Image from — https://developer.confluent.io/learn-kafka/architecture/get-started/
  • In order to distribute the storage and processing of events in a topic, Kafka uses the concept of partitions. A topic is made up of one or more partitions and these partitions can reside on different nodes in the Kafka cluster.
  • The partition is the main unit of storage for Kafka events, although with Tiered Storage, which we’ll talk about later, some event storage is moved off of partitions.
  • The partition is also the main unit of parallelism. Events can be produced to a topic in parallel by writing to multiple partitions at the same time.
  • Likewise, consumers can spread their workload by individual consumer instances reading from different partitions. If we only used one partition, we could only effectively use one consumer instance.

How Partitioning Works

  • Having broken a topic up into partitions, we need a way of deciding which messages to write to which partitions. If a message has no key, messages are typically distributed round-robin among all the topic’s partitions. If a message does have a key, the producer hashes the key and uses the result, modulo the number of partitions, to choose the partition, so every message with the same key lands in the same partition (see the sketch below).
  • For example, if you are producing events that are all associated with the same customer, using the customer ID as the key guarantees that all of the events from a given customer will always arrive in order.
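The sketch below roughly mirrors how the default partitioner maps a keyed record to a partition (hash the key bytes, take the result modulo the partition count). The customer ID and partition count are invented for illustration; real producers do this for you automatically.

```java
import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.utils.Utils;

public class KeyPartitioningSketch {
    public static void main(String[] args) {
        // Illustrative only: the key and partition count below are made up.
        String customerId = "customer-42";
        int numPartitions = 6;

        // Hash the key bytes and map the (positive) result onto a partition.
        byte[] keyBytes = customerId.getBytes(StandardCharsets.UTF_8);
        int partition = Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;

        // Every record with this key maps to the same partition,
        // which is what preserves per-key ordering.
        System.out.println("customer-42 -> partition " + partition);
    }
}
```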

Kafka Brokers

  • From a physical infrastructure standpoint, Apache Kafka is composed of a network of machines called brokers.
  • They are independent machines, each running the Kafka broker process.
  • Each broker hosts some set of partitions and handles incoming requests to write new events to those partitions or read events from them. Brokers also handle the replication of partitions between each other.

Cluster

  • Brokers operate as part of the cluster to share the load and provide fault tolerance.

Offset

  • Each message is uniquely identified by its topic, the partition it belongs to, and its offset. The offset is a monotonically increasing integer that identifies a message uniquely within a given topic and partition. Messages within a partition are ordered by offset.

Replication

Image from — https://developer.confluent.io/learn-kafka/architecture/get-started/
  • Data replication is a critical feature of Kafka that allows it to provide high durability and availability. We enable replication at the topic level.
  • When a new topic is created we can specify, explicitly or through defaults, how many replicas we want. Each partition of that topic will then be replicated that many times.
  • This number is referred to as the replication factor. With a replication factor of N, in general, we can tolerate N-1 failures without data loss while maintaining availability.
  • Every read and write to the partition goes through the leader.
  • A message is considered committed only if all in-sync replicas write the message to their write-ahead log.
  • Producers can separately configure how many acknowledgements they wait for through the acks setting (sketched below).
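A small sketch of the standard acks settings on the producer side; how you combine them with topic- or broker-level settings such as min.insync.replicas depends on your durability requirements.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class AcksPolicySketch {
    public static void main(String[] args) {
        // Illustrative only: acks controls when a send is considered acknowledged.
        Properties props = new Properties();

        // acks=0   – fire and forget; the producer does not wait for the broker.
        // acks=1   – the partition leader must append the record to its log.
        // acks=all – all in-sync replicas must have the record (strongest durability).
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        System.out.println("acks = " + props.getProperty("acks"));
    }
}
```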

Kafka Producers

  • The API surface of the producer library is fairly lightweight: In Java, there is a class called KafkaProducer that you use to connect to the cluster.
  • To a first-order approximation, this is all the API surface area there is to producing messages. Under the covers, the library is managing connection pools, network buffering, waiting for brokers to acknowledge messages, retransmitting messages when necessary, and a host of other details no application programmer need concern herself with.
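Here is a minimal producer sketch along those lines; the broker address, topic name, key, and value are placeholders, and error handling is kept deliberately simple.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SimpleProducer {
    public static void main(String[] args) {
        // Assumed broker address; adjust for your cluster.
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keyed record: all events for "customer-42" land in the same partition.
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("orders", "customer-42", "order-created");

            // send() is asynchronous; the callback fires once the broker acknowledges.
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace();
                } else {
                    System.out.printf("written to %s-%d at offset %d%n",
                            metadata.topic(), metadata.partition(), metadata.offset());
                }
            });
            producer.flush();
        }
    }
}
```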

Consumers

  • Using the consumer API is similar in principle to the producer. You use a class called KafkaConsumer to connect to the cluster.
  • First of all, Kafka is different from legacy message queues in that reading a message does not destroy it; the message stays in the log and can be read again, by the same consumer or by others, until retention removes it.
  • In fact, it’s perfectly normal in Kafka for many consumers to read from one topic.
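And a matching minimal consumer sketch; again, the broker address, consumer group id, and topic name are placeholders.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SimpleConsumer {
    public static void main(String[] args) {
        // Assumed broker address, group id, and topic name; adjust as needed.
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-readers");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                // Reading does not delete anything: the same records remain
                // available to other consumers (or for re-reading after a seek).
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("%s / partition %d / offset %d : %s=%s%n",
                            record.topic(), record.partition(), record.offset(),
                            record.key(), record.value());
                }
            }
        }
    }
}
```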

Kafka Connect

  • On the one hand, Kafka Connect is an ecosystem of pluggable connectors, and on the other, a client application. As a client application, Connect is a server process that runs on hardware independent of the Kafka brokers themselves.
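For a feel of what a connector configuration looks like, here is a standalone-mode source connector config in the spirit of the file-source example that ships with Kafka; the file path and topic name are assumptions.

```properties
# Illustrative standalone source connector config. It tails a local file and
# produces each line to a Kafka topic. File path and topic name are made up.
name=local-file-source
connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
tasks.max=1
file=/tmp/input.txt
topic=connect-file-events
```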

Kafka Streams

  • Kafka Streams is a Java API that gives you easy access to all of the computational primitives of stream processing: filtering, grouping, aggregating, joining, and more, keeping you from having to write framework code on top of the consumer API to do all those things.
  • It also provides support for the potentially large amounts of state that result from stream processing computations.
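As a small illustration, the sketch below builds a Kafka Streams topology that filters an input topic, groups records by key, and maintains a running count per key. The application id, broker address, and topic names are assumptions.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class OrderCountStream {
    public static void main(String[] args) {
        // Assumed application id, broker address, and topic names.
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-count-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Filter, group, and aggregate: count how many order events each
        // customer (the record key) has produced so far.
        KStream<String, String> orders = builder.stream("orders");
        KTable<String, Long> ordersPerCustomer = orders
                .filter((customerId, value) -> value != null)
                .groupByKey()
                .count();

        // Write the running counts to an output topic. The state behind count()
        // lives in a local, fault-tolerant state store backed by a changelog topic.
        ordersPerCustomer.toStream()
                .to("orders-per-customer", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```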

What Is an Event in Stream Processing?

Image from — https://developer.confluent.io/learn-kafka/architecture/get-started/
  • An event is a record of something that happened that also provides information about what happened. Examples of events are customer orders, payments, clicks on a website, or sensor readings.
  • An event record consists of a timestamp, a key, a value, and optional headers. The event payload is usually stored in the value. The key is also optional.

Kafka Persistence

  • Each batch of messages published is stored in the active log segment of the partition in exactly the same format as published by the producer. The message format is consistent across producers, consumers, and brokers, removing the overhead of re-serializing and deserializing data on the broker.
  • Kafka relies on the operating system’s page cache for reads and writes: reads are served directly from the page cache where possible, and writes are first applied to the page cache and synced to disk periodically.

Log compaction

  • The retention policy on Kafka topics can be configured as either “compact” or “delete”. Delete purges old segment files based on size (log.retention.bytes) or age (log.retention.ms).
  • If the retention policy is configured as compact, Kafka will retain only the latest message for every message key. This is especially useful if we are getting a lot of updates for a given key and are interested only in the latest one, for example user profile update events.
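For illustration, the sketch below creates a compacted topic with the Java Admin client by setting cleanup.policy=compact; the topic name, partition count, and replication factor are assumptions.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CreateCompactedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumed broker address.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // "user-profiles" is illustrative: with cleanup.policy=compact, Kafka
            // eventually keeps only the latest value for each user id (the key).
            NewTopic topic = new NewTopic("user-profiles", 3, (short) 3)
                    .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG,
                                    TopicConfig.CLEANUP_POLICY_COMPACT));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```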

Leader Election

  • When a broker crashes, it stops sending heartbeats to ZooKeeper. Its ZooKeeper session times out, and ZooKeeper notifies the cluster controller of the broker failure. The controller then finds all partitions for which the failed broker was the leader, picks the next replica in each partition’s ISR, and promotes it to leader.

Don’t forget to hit the Clap and Follow buttons to help me write more articles like this.

And, if you are looking for summarized articles on Apache Kafka, you can also check my previous articles like Why is Apache Kafka fast?
