Demystifying Apache Kafka: A Comprehensive Overview

Reetesh Kumar
7 min read · Sep 23, 2023


Introduction

In today’s data-driven world, businesses are constantly seeking ways to harness the power of real-time data to gain a competitive edge. One technology that has risen to prominence in this quest is Apache Kafka.

In this blog, we’ll take a comprehensive look at Apache Kafka, its architecture, its key components, and its most common use cases.

What is Apache Kafka?

Apache Kafka is a distributed streaming platform that was originally developed by LinkedIn and later open-sourced as an Apache Software Foundation project. It was created to address the challenges of handling large volumes of data in real time and has since become a cornerstone of modern data architectures.

At its core, Kafka enables the capture, storage, and processing of streams of records or events in a fault-tolerant and highly scalable manner. These records can be anything from log data, sensor readings, and user interactions, to financial transactions — essentially any data that is generated continuously and needs to be processed in real-time.

Kafka can be deployed on bare-metal hardware, virtual machines, containers, and on-premises as well as in the cloud.

Apache Kafka is an event streaming platform. What does that mean?

Kafka combines three key capabilities so we can implement our use cases for event streaming end-to-end with a single battle-tested solution:

  1. To publish (write) and subscribe to (read) streams of events, including continuous import/export of data from other systems.
  2. To store streams of events durably and reliably for as long as we want.
  3. To process streams of events as they occur or retrospectively.

And all this functionality is provided in a distributed, highly scalable, elastic, fault-tolerant, and secure manner.

How does Kafka work?

In essence, Kafka operates as a distributed system comprising servers and clients, all communicating through a high-performance TCP network protocol.

Servers within a Kafka setup function as a cluster, with some of them serving as storage brokers while others run Kafka Connect to facilitate continuous data import and export as event streams, connecting Kafka with various systems, including relational databases and other Kafka clusters. Kafka clusters are designed for scalability and fault tolerance, ensuring uninterrupted operation even in the event of server failures, without data loss.

On the client side, Kafka empowers developers to create distributed applications and microservices that can read, write, and process event streams concurrently, at scale, and in a fault-tolerant manner, even when facing network issues or machine failures. Kafka comes equipped with built-in clients, and the Kafka community offers numerous additional clients for various programming languages, including Java, Scala, Go, Python, and C/C++, as well as REST APIs. These clients enable seamless integration and utilization of Kafka’s capabilities within a wide range of applications and systems.

Main Concepts and Terminology

(Figure: Kafka Architecture)

1. Event:

An event records the fact that “something happened” in the world or in our business. It is also called a record or message. When we read or write data to Kafka, we do this in the form of events. Conceptually, an event has a key, value, timestamp, and optional metadata headers. Here’s an example event:

  • Event key: “Reet”
  • Event value: “Made a payment of $700 to Kumar”
  • Event timestamp: “September 20, 2023 at 12:26 p.m.”

2. Producers:

Producers are responsible for publishing data to Kafka topics. These data producers could be applications, devices, or any source generating data. Kafka provides client libraries in various programming languages to make publishing data straightforward.
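As a quick illustration, here is a minimal producer sketch using the Java client. The broker address (localhost:9092) and the topic name "payments" are assumptions made for this example; it simply publishes the sample payment event described above.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class PaymentProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Broker address used to bootstrap the connection (assumed local, single-broker setup).
        props.put("bootstrap.servers", "localhost:9092");
        // Keys and values in this example are plain strings.
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The example event from above: key "Reet", value describing the payment.
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("payments", "Reet", "Made a payment of $700 to Kumar");
            // send() is asynchronous; get() blocks until the broker acknowledges the write.
            producer.send(record).get();
        }
    }
}
```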

3. Brokers:

Kafka brokers are the heart of the system. They manage the storage, distribution, and retrieval of data. Each Kafka broker is part of a Kafka cluster and stores topic partitions. Kafka clusters can have multiple brokers for fault tolerance and scalability.

4. Topics and Partitions:

Events are organized and persistently stored within topics. As a simplified analogy, think of a topic as a folder in a file system, and of the events as the files in that folder.

In Kafka, topics inherently support multiple producers and multiple subscribers. This means that a given topic can accommodate zero, one, or numerous producers responsible for writing events into it, as well as zero, one, or multiple consumers that subscribe to and read these events. Unlike conventional messaging systems, Kafka does not automatically delete events after they have been consumed. Instead, we have the flexibility to specify how long Kafka should retain the events for each individual topic through a configuration setting. Once this configured retention period elapses, older events within the topic will be purged.

One remarkable attribute of Kafka is its consistent performance, which remains unaffected by the size of the data being managed. Consequently, storing data within Kafka for an extended period is entirely feasible and doesn’t result in performance degradation.

Topics in Kafka are divided into multiple partitions, which are distributed across the brokers in the cluster. This partitioning is crucial for scalability, because it allows client applications to read and write data through many brokers at the same time.

When a new event is published on a topic, it is appended to one of the partitions associated with that topic. Events sharing the same event key, such as a customer or user ID, are directed to the same partition. Kafka ensures that any consumer of a specific topic-partition consistently reads events in the exact order they were originally written, thereby maintaining data integrity and reliability.

To enhance the reliability and availability of the data, topics can be replicated, even spanning different geographical regions or data centers. This ensures that multiple brokers always hold a backup of the data, safeguarding against potential issues, broker maintenance, and other unexpected events. In many production environments, it’s common to set a replication factor of 3, meaning there will always be three replicas of the data. Replication is carried out at the topic-partition level.
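To make the topic settings above concrete, here is a rough sketch using the Java Admin client that creates a hypothetical "payments" topic with 3 partitions, a replication factor of 3, and a 7-day retention override. The broker address and all of the values are illustrative assumptions, not recommendations.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreatePaymentsTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions for parallelism, replication factor 3 so each partition has three copies.
            NewTopic payments = new NewTopic("payments", 3, (short) 3);
            // Per-topic retention override: keep events for 7 days (value is in milliseconds).
            payments.configs(Map.of("retention.ms", "604800000"));
            // createTopics() is asynchronous; all().get() waits until the cluster has applied it.
            admin.createTopics(Collections.singleton(payments)).all().get();
        }
    }
}
```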

5. Consumers:

Consumers subscribe to Kafka topics and process the data. Each topic partition is assigned to at most one consumer instance per consumer group. The number of consumer instances in a group that actively receive data is therefore at most the number of partitions; to support more parallel consumers, the number of partitions must be increased.

This also guarantees the ordering of events, but only within a partition. Within a consumer group, each event is processed by exactly one consumer. However, if multiple consumer groups subscribe to the same topic, every group receives every event.

Consumer Scenarios:

  1. The number of partitions equals the number of consumers in the same group: each consumer reads from exactly one partition.
  2. More consumers in a group than partitions: the extra consumer (consumer 5) sits idle. Kafka can use idle consumers for failover (if an active consumer dies) or if the number of partitions increases.
  3. Fewer consumers in a group than partitions: some consumers read from multiple partitions (consumer 3 processes more messages than consumers 1 and 2).
  4. Multiple consumer groups: each event from each partition is broadcast to every group.

Kafka manages consumer group membership dynamically. If new instances join the group, they take over some partitions from the existing members, and if an instance dies, its partitions are redistributed to the remaining instances.
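Below is a minimal consumer sketch using the Java client. The group id "payment-processors", the broker address, and the topic name are assumptions for illustration; starting several copies of this program with the same group id would cause Kafka to spread the topic's partitions across them, exactly as described above.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PaymentConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("group.id", "payment-processors");        // instances sharing this id split the partitions
        props.put("auto.offset.reset", "earliest");         // start from the beginning if no offset is saved
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("payments"));
            while (true) {
                // poll() fetches new events and also keeps this instance's group membership alive.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                            record.partition(), record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```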

6. ZooKeeper (or KRaft in newer versions):

Older Kafka versions rely on ZooKeeper to manage broker metadata and coordinate leader elections. Newer Kafka versions can instead run in KRaft mode, in which the brokers manage this metadata themselves through a built-in Raft quorum, removing the dependency on an external system like ZooKeeper.

Kafka APIs

Kafka offers a set of core APIs, primarily designed for Java and Scala, alongside command-line tools for effective management and administration. These APIs include:

  1. Admin API: This API enables the management and inspection of topics, brokers, and other essential Kafka components.
  2. Producer API: With the Producer API, we can efficiently publish (write) a continuous stream of events to one or more Kafka topics.
  3. Consumer API: The Consumer API allows us to subscribe to (read) one or more topics and process the stream of events produced to these topics.
  4. Kafka Streams API: Designed for implementing stream processing applications and microservices, this API offers higher-level functions for event stream processing. It supports transformations, stateful operations such as aggregations and joins, windowing, event-time-based processing, and more (see the sketch after this list).
  5. Kafka Connect API: This API is used to create and execute reusable connectors for data import/export. These connectors can consume (read) or produce (write) streams of events from/to external systems and applications, facilitating seamless integration with Kafka.
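To give a feel for the Kafka Streams API, here is a small, hypothetical topology that counts payment events per key and writes the running counts to an output topic. The topic names, application id, and broker address are assumed for illustration only.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class PaymentCountApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "payment-count-app");  // also used as the group id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Read the "payments" topic as a continuous stream of (key, value) events.
        KStream<String, String> payments = builder.stream("payments");
        // Stateful aggregation: count how many payment events each key has produced so far.
        KTable<String, Long> countsPerKey = payments.groupByKey().count();
        // Publish the continuously updated counts to an output topic.
        countsPerKey.toStream()
                .to("payment-counts", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        // Close the topology cleanly when the JVM shuts down.
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

The count() step is an example of the stateful operations mentioned above: Kafka Streams keeps its state in local stores backed by internal Kafka topics, so the application can be restarted or scaled out without losing the counts.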

Kafka Use Cases

Kafka can be used for a variety of use cases such as log aggregation, messaging, audit trail, stream processing, website activity tracking, monitoring, and more.

Conclusion

Apache Kafka has revolutionized the way businesses handle real-time data. Its distributed streaming platform provides a scalable, fault-tolerant, and highly performant solution for ingesting, storing, and processing data streams. Whether we’re building a real-time analytics platform, a recommendation engine, or a monitoring system, Kafka’s architecture and capabilities make it a good choice.

In subsequent articles, we will delve deeper into Kafka’s components, explore Kafka Streams for stream processing, and provide practical guides on how to harness the full potential of Apache Kafka in your applications.

Happy Learning!!!

References:

https://kafka.apache.org/documentation/
