Understanding Apache Kafka

Rafaela Grison
4 min read · Apr 14, 2023

Before diving into Apache Kafka, it’s important to understand the publish-subscribe pattern and the benefits it provides.

Publish-Subscribe with Message Queues:

In a distributed system, multiple running processes communicate over the network to exchange messages. The traditional approach involves a static binding between senders and receivers, which results in tight coupling and management complexity. The publish-subscribe pattern (pub-sub) provides a clean and scalable alternative built around a message queue: each sender only needs to know about the queue, and each subscriber only needs to subscribe to it. Pub-sub decouples publishers from subscribers, makes the setup easy to manage and scale, supports back-pressure handling, and adds reliability by persisting queued data and tracking its consumption. Apache Kafka is a technology that provides these advantages.
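
As a toy illustration of the decoupling (plain Python, no Kafka yet): the sender and receiver below share only a queue object and never reference each other. A real pub-sub system also fans each message out to many independent subscribers, which this single-consumer sketch does not show.

```python
import queue
import threading

q = queue.Queue()

def publisher(name: str, count: int) -> None:
    # The sender knows only the queue, not who will consume.
    for i in range(count):
        q.put(f"{name}: message {i}")

def subscriber() -> None:
    # The receiver knows only the queue, not who produced.
    while True:
        msg = q.get()
        if msg is None:  # sentinel: shut down
            break
        print("received:", msg)

t = threading.Thread(target=subscriber)
t.start()
publisher("sensor-a", 3)
publisher("sensor-b", 3)
q.put(None)
t.join()
```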

What is Kafka?

— An event/message streaming platform (events or messages represent the actual data exchanged through Kafka)

— A critical piece of the Big Data puzzle that plays an integral part in many big data pipelines

— Open source with commercial options; it can be downloaded and deployed free of cost

— Producers (the publishers) push messages to Kafka

— Consumers (the subscribers) listen for and receive messages (a minimal sketch of this flow follows the list)
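
A minimal sketch of that flow, assuming the kafka-python client, a broker reachable at localhost:9092, and a hypothetical topic named orders (none of which come from the course notes):

```python
from kafka import KafkaProducer, KafkaConsumer

# Producer side: push a message to the "orders" topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("orders", b"order-42 created")
producer.flush()  # block until the broker has the message

# Consumer side: listen on the same topic and receive messages.
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # read from the oldest retained message
)
for record in consumer:
    print(record.offset, record.value)
```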

Kafka:

  • Collects messages from multiple producers concurrently
  • Provides persistent storage of the messages it receives, which gives it fault-tolerance capabilities
  • Transports data from producers to consumers and, with mirroring, can also transport it across networks
  • Distributes data to multiple concurrent consumers for downstream processing
  • Tracks message consumption by each consumer (a consumer-group sketch follows this list)
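
Here is a sketch of the last point, under the same assumptions as before (kafka-python, a broker at localhost:9092, the hypothetical orders topic): consumers that share a group_id form a consumer group, Kafka assigns each topic partition to one member, and the committed offsets record how far each group has consumed. process() is a hypothetical stand-in for real handling logic.

```python
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="billing-service",  # consumption is tracked per group
    enable_auto_commit=False,    # we commit offsets explicitly below
)

def process(payload: bytes) -> None:
    print("handling", payload)  # hypothetical business logic

for record in consumer:
    process(record.value)
    consumer.commit()  # record our position; a restart resumes from here
```

Running several copies of this script with the same group_id spreads the topic's partitions across them, which is also how consumption scales out.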

Benefits of Kafka

  • High Throughput: Kafka is designed to handle large amounts of data in real time. It can process millions of messages per second, making it a good choice for use cases that require high throughput, such as log aggregation, real-time analytics, and stream processing.
  • Low Latency: Kafka can deliver data with low latency, which is important for use cases that require real-time data processing. Kafka’s design ensures that data is processed and delivered as quickly as possible, minimizing the time between data creation and consumption.
  • Fault Tolerance: Kafka is highly fault-tolerant and can continue to operate even if a broker node fails. It uses replication to ensure that data is not lost even in the event of a node failure.
  • Decoupling: Kafka enables decoupling of data producers and consumers, allowing them to operate independently of each other. This provides greater flexibility and scalability for data processing and consumption.
  • Back Pressure Handling: Kafka handles back pressure through its pull-based design: each consumer fetches data at its own pace while the brokers retain the backlog, so a slow consumer falls behind and catches up later instead of being overwhelmed, which would otherwise lead to performance degradation (see the pull-based sketch after this list).
  • Horizontal Scalability: Kafka’s distributed architecture makes it highly scalable. It can easily scale horizontally by adding more broker nodes to the cluster, which increases the system’s capacity to handle more data.
  • Streaming and Batching: Kafka provides both streaming and batching capabilities, allowing data to be processed in real-time or in batches, depending on the use case. This makes Kafka a flexible choice for a wide range of data processing and consumption scenarios.
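
To make the back-pressure point concrete (same assumptions as the earlier sketches): kafka-python consumers pull, so capping how much each poll() returns bounds the work in flight, and anything not yet fetched simply stays on the broker.

```python
import time
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="slow-analytics",
    max_poll_records=100,  # upper bound on records returned per poll()
)

def handle(payload: bytes) -> None:
    time.sleep(0.1)  # stand-in for slow processing
    print("processed", payload)

while True:
    # poll() is pull-based: a slow consumer simply polls less often,
    # and the broker retains the unread backlog, so nothing floods us.
    batch = consumer.poll(timeout_ms=1000)
    for _partition, records in batch.items():
        for record in records:
            handle(record.value)
```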

Kafka Use Cases:

  • Messaging: Kafka can be used as a messaging system to transport data between different applications or microservices within a distributed architecture.
  • Log Aggregation: Kafka can be used to centralize logs from multiple applications or servers, making it easier to manage and analyze log data.
  • Stream Processing: Kafka can be used as a stream processing platform, allowing for real-time data processing and analysis of large-scale data streams.
  • Event Sourcing: Kafka can be used to implement event sourcing, which involves storing every change to application state as a sequence of events in a Kafka topic (a minimal sketch follows this list).
  • Metrics Collection: Kafka can be used to collect and aggregate metrics from various systems, enabling real-time monitoring and alerting.
  • Commit Log: Kafka can be used as a commit log for distributed systems, allowing for durable storage of data changes and ensuring data consistency and reliability.
  • Clickstream Data Processing: Kafka can be used to process clickstream data in real-time, enabling personalized recommendations, ad targeting, and other applications that require immediate user feedback.
  • IoT Data Processing: Kafka can be used as a data processing platform for IoT devices, allowing for real-time data processing and analysis of large-scale data streams from sensors and other devices.
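
As one concrete sketch, here is the event-sourcing idea under the same assumptions (kafka-python, a broker at localhost:9092) with a hypothetical account-events topic: every state change is appended as an event, and the current state is rebuilt by replaying the topic from the beginning.

```python
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda e: json.dumps(e).encode("utf-8"),
)

# Append each change to application state as an immutable event.
producer.send("account-events", {"type": "deposit", "amount": 100})
producer.send("account-events", {"type": "withdraw", "amount": 30})
producer.flush()

# Rebuild current state by replaying every event from the beginning.
consumer = KafkaConsumer(
    "account-events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating once we are caught up
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

balance = 0
for event in consumer:
    if event.value["type"] == "deposit":
        balance += event.value["amount"]
    elif event.value["type"] == "withdraw":
        balance -= event.value["amount"]

print("balance:", balance)  # 70, given only the two events above
```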

Notes based on the course “Apache Kafka Essential Training: Getting Started” by Kumaran Ponnambalam.
