Inside Apache Kafka: An In-Depth Analysis of Kafka’s Internal Architecture

Sutanu Dutta
Sep 3, 2023


Apache Kafka is a popular name when we think of building distributed applications that involve real-time data processing and streaming. Today we will dive deep into the internals of Kafka and try to figure out why it is a preferred choice for building high-throughput, fault-tolerant, real-time data processing pipelines. We will have a closer look at the individual components and understand how it all fits together from a system design perspective.

At its core, Apache Kafka is a distributed streaming platform designed to handle real-time data streams. It was originally developed at LinkedIn and later open-sourced as an Apache project. Kafka is built around the concept of publishing and subscribing to data streams, where data is organised into topics and can be consumed by multiple subscribers.

Let’s look at the core components of Kafka:

  • Producers: The applications or processes that publish messages to Kafka topics.
  • Consumers: The applications or processes that consume messages from Kafka topics.
  • Brokers: The servers that sit between producers and consumers, acting as a buffer; they are responsible for replication, durability, and delivery guarantees.
  • Topics: Named streams to which messages are published and from which they are consumed.
  • Partitions: Each topic is divided into multiple partitions; through partition replication, Kafka ensures that a message is never lost once it has been acknowledged.
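To make these roles concrete, here is a minimal producer sketch using the Java client. The broker address (localhost:9092), the topic name ("events"), and the key are assumptions for illustration, not anything prescribed by Kafka.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class HelloProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Messages with the same key always land in the same partition of the topic.
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("events", "user-42", "signed-up"); // hypothetical topic and key
            RecordMetadata metadata = producer.send(record).get();
            System.out.printf("written to partition %d at offset %d%n",
                    metadata.partition(), metadata.offset());
        }
    }
}
```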

Let’s understand the reasoning behind some of the architectural design decisions taken by the Kafka team:

Why does Kafka use the file system?

We have highlighted that Kafka is a high-throughput system, so we need to understand the rationale for using the file system to store messages. Kafka leverages the operating system’s page cache: once data is read from disk, it stays in main memory (the page cache) until it is evicted, which adds only minor overhead. Kafka writes to the page cache first, and the dirty pages are then flushed to disk efficiently by the operating system.

The design choice was heavily influenced by the performance characteristics of disks: sequential reads and writes on modern hardware are very fast, while random seeks remain expensive, and Kafka’s access pattern is almost entirely sequential.

Why doesn’t Kafka use an in-memory data structure or a B-tree structure like an RDBMS?

Kafka is built on top of the JVM, so it might seem natural to use an efficient in-memory data structure; however, the Kafka design team realised that JVM objects take a considerable amount of extra memory due to the metadata they carry. Kafka is designed to deal with high volumes of data, so maintaining a large object pool in the available main memory would eventually become a bottleneck; it is a much better choice to store a compact byte structure.

Kafka maintains its data in files that work much like a write-ahead log (WAL). This is understandably slower than a B-tree-like structure for random lookups. Kafka also lets us retain the produced data for as long as we want, which is easy to accommodate with cheap storage. Because the observed performance of tree-based structures degrades superlinearly once the data outgrows the cache (every operation costs disk seeks), implementing long retention on a B-tree would slow down at higher volumes, whereas the append-and-scan file-based implementation is largely volume agnostic.
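Retention is just per-topic configuration. Below is a rough sketch using the Java AdminClient that switches a hypothetical existing topic ("clickstream") to unlimited retention by setting retention.ms to -1; the broker address is assumed.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class UnlimitedRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        try (AdminClient admin = AdminClient.create(props)) {
            // Hypothetical existing topic "clickstream": keep every record indefinitely.
            // retention.ms = -1 disables time-based deletion; the log only grows.
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "clickstream");
            AlterConfigOp setRetention =
                    new AlterConfigOp(new ConfigEntry("retention.ms", "-1"), AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(topic, List.of(setRetention))).all().get();
        }
    }
}
```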

It’s generally acknowledged that disk reads are slow. How is Kafka still so performant?

This is where things get interesting. We need to understand how data is traditionally transferred from disk over the network.

There are four steps:

  1. The operating system reads data from the disk into page cache in kernel space.
  2. The application reads the data from kernel space into a user-space buffer.
  3. The application writes the data back into kernel space into a socket buffer.
  4. The operating system copies the data from the socket buffer to the NIC buffer where it is sent over the network.

Modern operating systems implement a zero-copy optimisation (the sendfile system call on Linux) that lets data flow from the page cache directly to the NIC buffer, eliminating the two intermediate copies through user space.
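On the JVM this zero-copy path is exposed through FileChannel.transferTo (backed by sendfile on Linux). The following is a simplified sketch of the idea, not Kafka’s actual code; the file path, host, and port are placeholders.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopySend {
    // Streams a log file to a socket. transferTo() lets the kernel move bytes
    // straight from the page cache to the NIC, skipping the user-space copies
    // described in steps 2 and 3 above.
    static void sendFile(Path logFile, String host, int port) throws IOException {
        try (FileChannel file = FileChannel.open(logFile, StandardOpenOption.READ);
             SocketChannel socket = SocketChannel.open(new InetSocketAddress(host, port))) {
            long position = 0;
            long remaining = file.size();
            while (remaining > 0) {
                long sent = file.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}
```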

Kafka assumes all messages will eventually be consumed by one or more consumers. After the first read, the data stays in the page cache, and any subsequent consumption is served from the page cache without touching the disk.

Is there anything else the Kafka team thought of to improve performance?

Certainly, there are other areas of optimisation. We haven’t discussed message size yet. Kafka’s default maximum message size is about 1 MB, and anything significantly beyond that is considered an anti-pattern. In a real-time data streaming pipeline, the bottleneck is more often network bandwidth than disk or CPU. Kafka addresses this by compressing messages in bulk, with codecs such as GZIP, Snappy, and LZ4.

Kafka uses a standardised binary message format that is understood by every piece of the Kafka infrastructure (broker, producer, and consumer).

Kafka uses the RecordBatch abstraction to group messages together, which reduces network round trips and effectively addresses the small-I/O problem.
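Both batching and compression are surfaced as ordinary producer configuration. Here is a sketch with illustrative values; the broker address and the specific numbers are assumptions, not recommendations.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class BatchingProducerConfig {
    static KafkaProducer<String, String> build() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Compress whole record batches rather than individual messages.
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");
        // Accumulate up to 64 KB per partition, or wait up to 20 ms,
        // before sending a batch; this amortises network round trips.
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024);
        props.put(ProducerConfig.LINGER_MS_CONFIG, 20);
        return new KafkaProducer<>(props);
    }
}
```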

How does Kafka ensure durability?

Kafka messages are written to topics. Each topic has multiple partitions, and each partition is replicated across multiple brokers to ensure durability. Among the replicas, Kafka maintains one leader; the rest act as followers. The leader ensures the data is written to the followers’ logs successfully, so that if the current leader fails, one of the followers can take over as leader.

I have written a couple of articles on leader election, so I won’t go into much detail here, but it’s worth noting that Kafka uses a slightly different approach:

Kafka dynamically maintains a set of in-sync replicas (ISR) that are always in sync with the leader. Only members of this set are eligible for election as leader. A write to a Kafka partition is not considered committed until all in-sync replicas have received the write.
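In practice this durability comes from topic-level replication settings combined with the producer’s acknowledgement mode (acks=all). A sketch with a hypothetical topic and illustrative values follows; the broker address is assumed.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class DurableTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // Hypothetical "orders" topic: every partition is copied to 3 brokers,
            // and at least 2 replicas must be in sync for a write to be accepted
            // when the producer asks for full acknowledgement (acks=all).
            NewTopic orders = new NewTopic("orders", 3, (short) 3)
                    .configs(Map.of("min.insync.replicas", "2"));
            admin.createTopics(List.of(orders)).all().get();
        }
    }
}
```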

Let’s talk about message delivery guarantees.

Below are the possible message delivery guarantees:

  • At most once — Messages may be lost but are never redelivered.
  • At least once — Messages are never lost but may be redelivered.
  • Exactly once — this is what people actually want: each message is delivered once and only once.

Kafka follows an “at least once” policy by default, but with coordination between the broker and the consumer application (conceptually similar to a two-phase commit, and provided out of the box since Kafka 0.11 through idempotent and transactional producers) we can achieve “exactly once” guarantees.
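Kafka packages this coordination as its transactions API, so a “consume, transform, produce” loop can commit output records and consumed offsets atomically. Below is a rough sketch, assuming the producer was created with a transactional.id and the consumer with isolation.level=read_committed and auto-commit disabled; topic names and the transformation are hypothetical.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;

public class ExactlyOnceLoop {
    static void run(KafkaConsumer<String, String> consumer, KafkaProducer<String, String> producer) {
        producer.initTransactions();
        consumer.subscribe(List.of("input-topic")); // hypothetical topic
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            if (records.isEmpty()) continue;
            producer.beginTransaction();
            Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
            for (ConsumerRecord<String, String> record : records) {
                producer.send(new ProducerRecord<>("output-topic", record.key(),
                        record.value().toUpperCase())); // hypothetical transformation
                offsets.put(new TopicPartition(record.topic(), record.partition()),
                        new OffsetAndMetadata(record.offset() + 1));
            }
            // Output records and consumed offsets commit atomically, or not at all.
            producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
            producer.commitTransaction();
        }
    }
}
```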

How is data actually stored in the file system?

Let’s drill down into the storage of a single topic partition. Each partition directory contains multiple segment (log) files and their index files. The log file contains the following information for each record batch:

  • BaseOffset: Offset of the first message in the batch
  • LastOffset: Offset of the last message in the batch
  • Count: Number of messages in the batch
  • Position: Position of the batch in the file
  • CreatedTime: Created time of the last message in the batch
  • Size: Size of the batch (in bytes)
  • Messages: List of messages (and their details) in the batch

The index file maps relative offsets to positions in the log file. It contains the following information:

  • Offset: Relative offset of the message
  • Position: Position in the log file

Kafka uses the index and log files together to quickly locate the message for a given offset.
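To make the lookup concrete, here is a simplified, non-Kafka sketch of the idea: binary-search the sparse index for the greatest indexed offset at or below the target, then scan the log segment forward from that byte position. The sample entries are made up.

```java
import java.util.List;

public class OffsetLookup {
    // One sparse index entry: relative offset -> byte position in the log segment.
    record IndexEntry(long relativeOffset, long position) {}

    // Returns the byte position from which to start scanning the log segment
    // for the given relative offset. Mirrors the idea behind Kafka's .index
    // files; the real implementation is more involved.
    static long startPositionFor(List<IndexEntry> index, long targetRelativeOffset) {
        int lo = 0, hi = index.size() - 1;
        long position = 0; // if the target precedes every entry, scan from the start
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (index.get(mid).relativeOffset() <= targetRelativeOffset) {
                position = index.get(mid).position(); // best candidate so far
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return position; // caller scans record batches forward from here
    }

    public static void main(String[] args) {
        List<IndexEntry> index = List.of(
                new IndexEntry(0, 0), new IndexEntry(40, 8_192), new IndexEntry(80, 16_384));
        System.out.println(startPositionFor(index, 57)); // prints 8192
    }
}
```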

How does Kafka retain data for so long?

Kafka needs to retain messages, and to do that it needs to optimise storage. One such optimisation technique is log compaction. Log compaction ensures that Kafka always retains at least the last known value for each message key within the log of a single topic partition. The idea is to selectively remove records for which a more recent update with the same key exists, so the log is guaranteed to hold at least the latest state for each key. After a failure, the system can rebuild its last committed state by replaying the messages from the log.

Log compaction is a mechanism to give finer-grained per-record retention, rather than the coarser-grained time-based retention.

It’s worth noting that compaction never changes the offsets of the remaining records; deleted records simply leave gaps in the offset sequence.
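Compaction is enabled per topic through the cleanup.policy setting. Here is a sketch that creates a hypothetical compacted topic; the broker address and topic name are assumptions.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CompactedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        try (AdminClient admin = AdminClient.create(props)) {
            // Hypothetical "user-profiles" topic: the cleaner keeps at least the
            // most recent record for every key instead of deleting by age.
            NewTopic topic = new NewTopic("user-profiles", 3, (short) 3)
                    .configs(Map.of("cleanup.policy", "compact"));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```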

Why do consumers pull data from the Kafka broker?

  • In a push-based system it is difficult to deal with diverse consumers, because the broker controls the rate at which data is pushed.
  • In a push-based system a consumer can get overwhelmed when its rate of consumption falls below the rate of production (similar to a denial-of-service attack).
  • A pull-based system lets the consumer catch up when it can if it falls behind production; catch-up can be implemented with some kind of backoff protocol.
  • A pull-based system enables aggressive batching of data sent to the consumer. In contrast, a push-based system must either send a request immediately or accumulate more data and send it later, without knowing whether the downstream consumer will be able to process it immediately (see the consumer sketch after this list).
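The pull model is visible directly in the consumer API: the application decides when to call poll and how much data each fetch should wait for. A minimal sketch follows; the broker address, group id, topic name, and fetch values are all illustrative.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PullingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "analytics");               // hypothetical group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Batching knobs: wait for at least 64 KB of data (or 500 ms) per fetch,
        // so a lagging consumer catches up in large, efficient chunks.
        props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 64 * 1024);
        props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 500);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events")); // hypothetical topic
            while (true) {
                // The consumer, not the broker, decides when to ask for more data.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(250));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
            }
        }
    }
}
```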

Conclusion:

Apache Kafka’s internal architecture is a well-engineered system designed to handle real-time data streams at scale. Its core components work together to provide data durability, fault tolerance, scalability, and efficient data processing. Understanding Kafka’s internals is essential for building robust, high-performance streaming applications and data pipelines in today’s data-driven world. As Kafka continues to evolve, it remains a fundamental tool for organisations dealing with large volumes of real-time data.

If you have read this far, please check out part 2 of the Kafka series: “kafka for real world”.

Reference:

  • Kafka documentation
  • Kafka storage internals

