Understanding Apache Kafka: A Deep Dive into its Architecture

Raheel Butt
9 min read · Apr 29, 2024



In today’s interconnected digital landscape, real-time data processing is crucial for powering various applications, from e-commerce platforms to IoT devices. Message streaming architectures play a pivotal role in enabling real-time communication and data processing.

Apache Kafka has become a cornerstone technology in the world of real-time data processing and streaming applications. Its robust architecture enables handling massive volumes of data with high throughput, fault tolerance, and scalability.

Definition: Apache Kafka is an open-source, distributed message streaming platform that uses a publish-subscribe mechanism to stream records as key-value pairs, which allows for the development of real-time, event-driven applications. Specifically, it allows developers to continuously produce and consume streams of data records, and every produce and consume operation is tied to a specific topic.

Kafka was originally developed at LinkedIn and is now maintained by the Apache Software Foundation.
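To make the definition concrete, here is a minimal sketch of a producer and a consumer in Node.js. It assumes the kafkajs client library, a broker at localhost:9092, and a topic named events, none of which are prescribed by Kafka itself:

```typescript
import { Kafka } from "kafkajs";

// Illustrative setup: a single local broker and a topic named "events".
const kafka = new Kafka({ clientId: "demo-app", brokers: ["localhost:9092"] });

async function produce() {
  const producer = kafka.producer();
  await producer.connect();
  // Each record is a key-value pair; Kafka attaches a timestamp on write.
  await producer.send({
    topic: "events",
    messages: [{ key: "user-1", value: JSON.stringify({ action: "signup" }) }],
  });
  await producer.disconnect();
}

async function consume() {
  const consumer = kafka.consumer({ groupId: "demo-group" });
  await consumer.connect();
  await consumer.subscribe({ topic: "events", fromBeginning: true });
  // The consumer keeps receiving records published to the subscribed topic.
  await consumer.run({
    eachMessage: async ({ message }) => {
      console.log(message.key?.toString(), message.value?.toString());
    },
  });
}

produce().then(consume).catch(console.error);
```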

Example: Let’s consider a simplified example of how Apache Kafka could be used in a real-time process for Uber, involving drivers, riders, and their trip data. We’ll include a database component and compare the process with and without Apache Kafka.

Without Apache Kafka:

  • Data Ingestion: Rider requests are directly stored in the database upon receipt, triggering an insertion operation.
  • Driver Matching: The matching algorithm periodically queries the database for new ride requests, performs matching calculations, and updates the database with driver assignments.
  • Trip Execution: Location updates from the driver and rider apps are directly stored in the database, triggering frequent read and write operations.
  • Real-time Updates: Both driver and rider apps continuously query the database for trip updates, resulting in high database load and potential latency issues.
  • Trip Completion: Upon trip completion, trip data is updated in the database, triggering additional write operations for fare calculation, account updates, and archival.

With Apache Kafka:

  • Data Ingestion: Rider requests are ingested into Kafka topics for ride requests (see the sketch after this list).
  • Driver Matching: The matching algorithm consumes ride request events from Kafka, processes them in real-time, and produces driver assignment events to Kafka topics for ride assignments.
  • Trip Execution: Trip tracking data, including location updates from both the driver and rider, is continuously streamed to Kafka topics for real-time updates.
  • Real-time Updates: Both driver and rider apps consume real-time updates from Kafka topics, ensuring they have accurate trip information throughout the journey.
  • Trip Completion: Upon trip completion, the driver’s app sends a trip completion event to Kafka, which is processed by Uber’s backend system. Trip data is archived and stored in a database for analytics and reporting purposes.
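As a rough sketch of the Data Ingestion step above, a rider-facing service could publish each ride request to a Kafka topic. The topic name, event shape, and the kafkajs client are assumptions made for illustration, not Uber’s actual code:

```typescript
import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "ride-api", brokers: ["localhost:9092"] });
const producer = kafka.producer();

// Hypothetical event published when a rider taps "Request ride".
async function publishRideRequest(riderId: string, pickup: string, dropoff: string) {
  await producer.connect();
  await producer.send({
    topic: "ride-requests",
    messages: [
      {
        // Keying by rider ID keeps all of one rider's events in the same partition, in order.
        key: riderId,
        value: JSON.stringify({ riderId, pickup, dropoff, requestedAt: Date.now() }),
      },
    ],
  });
}
```

The matching service would then consume from ride-requests and produce driver assignment events to a ride-assignments topic in the same way.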

In summary, Apache Kafka enhances the efficiency, scalability, and fault tolerance of Uber’s ride-hailing process by serving as a reliable, real-time messaging system. Its integration with a database complements Kafka’s capabilities by providing persistent storage for archival and analytics purposes. Without Kafka, the process may suffer from scalability issues, latency, and complexity in managing real-time interactions and data integration.

Components of Apache Kafka’s architecture

  • Topics: Data in Kafka is organized into topics, which are essentially feeds of messages. A topic can be thought of as a category or a stream of data. Each message in Kafka consists of a key, a value, and a timestamp. Topics are partitioned to enable parallel processing and scalability. You can think of a topic as a table in a database (without all the constraints); a topic is uniquely identified by its name.
  • Partitions: Each topic is divided into one or more partitions. Partitions allow Kafka to scale horizontally by distributing data across multiple servers. Each partition is an ordered, immutable sequence of messages, and each message in a partition gets an incremental ID called an offset. A common rule of thumb is around 10 partitions per topic and at most roughly 10,000 partitions per Kafka cluster, though the right numbers depend on your workload (see the snippet below for creating a partitioned topic).
[Figure: Kafka partitions]
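As a sketch of how a partitioned topic might be created programmatically, here is an example using the kafkajs admin client; the topic name, partition count, and replication factor are illustrative choices:

```typescript
import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "admin-demo", brokers: ["localhost:9092"] });

async function createTopic() {
  const admin = kafka.admin();
  await admin.connect();
  // 10 partitions let up to 10 consumers in a group read the topic in parallel.
  // A replication factor of 3 assumes a cluster with at least 3 brokers.
  await admin.createTopics({
    topics: [{ topic: "ride-requests", numPartitions: 10, replicationFactor: 3 }],
  });
  await admin.disconnect();
}

createTopic().catch(console.error);
```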
  • Brokers: The Kafka cluster consists of one or more servers called brokers. Brokers are responsible for storing and managing data, handling read and write requests from producers and consumers, and replicating data across the cluster for fault tolerance. Each broker can hold one or more partitions of different topics, and each broker is identified by an integer ID. Brokers can be added to a cluster dynamically (horizontal scaling).
  • Cluster: A Kafka cluster is a group of one or more Kafka brokers working together to manage and store data across multiple topics and partitions.
  • Cluster Controller: The cluster controller is a broker elected to handle administrative tasks for the Kafka cluster, which is what distinguishes it from the other brokers in the same cluster. It monitors the health of other brokers, handles leader elections, and coordinates partition reassignments and broker metadata updates.
  • Producers: Producers are applications that publish data to Kafka topics. They write messages to Kafka brokers, specifying the topic and, optionally, a key. Producers can choose to send messages to a specific partition or let Kafka pick the partition using a partitioning strategy. If a key is passed, the producer is guaranteed that all messages with that key will always go to the same partition. Producers can also choose how they receive acknowledgment of data writes:
  • acks = 0: the producer does not wait for an acknowledgment (possible data loss).
  • acks = 1: the producer waits for the leader’s acknowledgment (limited data loss).
  • acks = all: the producer waits for the leader and all in-sync replicas to acknowledge (no data loss).
  • Consumers: Consumers are applications that subscribe to topics and process the messages in them. Consumers read messages from partitions in a topic in the order they were written. Kafka allows for both parallel processing and fault tolerance by allowing multiple consumers to form consumer groups. Each message in a partition is consumed by only one consumer within a consumer group.
  • Consumer Groups: Consumer groups are sets of consumers that jointly consume and process data from a topic. Each consumer within a group is assigned one or more partitions of the topic. Kafka ensures that messages within a partition are delivered to only one consumer in each consumer group, enabling parallel processing while maintaining order. Within a group, you can’t usefully have more consumers than partitions (the extra consumers will sit idle), as sketched below.
[Figure: Consumer groups]
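A minimal sketch of a consumer group, again assuming the kafkajs client: start this worker several times, and every instance that joins the driver-matching group is assigned a subset of the topic’s partitions, so the work is shared across the group.

```typescript
import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "matcher", brokers: ["localhost:9092"] });

async function runWorker() {
  // All instances that use the same groupId form one consumer group.
  const consumer = kafka.consumer({ groupId: "driver-matching" });
  await consumer.connect();
  await consumer.subscribe({ topic: "ride-requests" });
  await consumer.run({
    eachMessage: async ({ partition, message }) => {
      // Each partition is read by exactly one member of the group at a time.
      console.log(`partition ${partition}, offset ${message.offset}:`, message.value?.toString());
    },
  });
}

runWorker().catch(console.error);
```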
  • Replication: Kafka provides fault tolerance through data replication. Each partition has one leader and one or more followers. The leader handles all read and write requests for the partition, while the followers replicate the data from the leader. If a leader fails, one of the followers is elected as the new leader (see the snippet after this list for inspecting leaders and replicas).
  • ZooKeeper: Kafka uses Apache ZooKeeper for managing and coordinating its cluster. ZooKeeper keeps track of broker metadata, leader election, and consumer group coordination. However, starting with Kafka 2.8, Kafka has been moving away from its ZooKeeper dependency toward a self-managed metadata approach (KRaft mode).
  • ZooKeeper cluster: A ZooKeeper cluster is a group of one or more ZooKeeper servers working together to provide coordination and management services for distributed systems like Apache Kafka.
[Figure: Kafka architecture]
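To see replication in practice, the admin client can report which broker currently leads each partition and which replicas are in sync. A small sketch, assuming the kafkajs client and the ride-requests topic from the earlier examples:

```typescript
import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "ops-demo", brokers: ["localhost:9092"] });

async function describeReplication() {
  const admin = kafka.admin();
  await admin.connect();
  const metadata = await admin.fetchTopicMetadata({ topics: ["ride-requests"] });
  for (const topic of metadata.topics) {
    for (const p of topic.partitions) {
      // "leader" is the broker ID serving reads and writes for this partition;
      // "isr" lists the in-sync replicas that could take over if the leader fails.
      console.log(`partition ${p.partitionId}: leader=${p.leader} replicas=${p.replicas} isr=${p.isr}`);
    }
  }
  await admin.disconnect();
}

describeReplication().catch(console.error);
```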

Core APIs of Apache Kafka: Kafka offers a rich set of core APIs that empower developers to build robust, scalable, and real-time data processing applications.

  • Producer API: The Producer API enables applications to publish records to Kafka topics. This API provides options for handling acknowledgments, retries, and batching records for optimal performance (see the sketch after this list).
  • Consumer API: The Consumer API enables applications to subscribe to Kafka topics and consume records from them. This API supports both simple and high-level consumer configurations, allowing developers to control aspects like offset management, parallelism, and fault tolerance.
  • Streams API: The Streams API is very powerful. It is used for building stream processing applications on top of Kafka. It allows developers to create and deploy stateful stream processing applications that consume input from Kafka topics, process data in real time, and produce output to the same or other Kafka topics. In other words, it can process and enrich data before downstream consumers read it; in the Uber example, a streams application could compute the best route for the trip.
  • Connector API: The Connector API simplifies the integration of Kafka with external systems by providing pre-built connectors for popular data sources and sinks. Connectors allow seamless data movement between Kafka topics and external systems such as databases, message queues, and data lakes. Many teams need to integrate the same type of data source, such as MongoDB, and not every developer should have to write that integration from scratch. The Connector API allows the integration to be written once; after that, a developer only needs to configure it to get that data source into their cluster.
  • AdminClient API: The AdminClient API offers administrative operations for managing Kafka clusters, topics, and configurations programmatically. It allows developers to create, delete, describe, and alter topics, as well as manage consumer groups, partitions, and broker configurations. The AdminClient API is particularly useful for automating administrative tasks and monitoring Kafka clusters.
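Here is a sketch of the Producer and Consumer APIs side by side, showing the acknowledgment setting described earlier and a batched send. The kafkajs client, topic, and group names are assumptions for illustration:

```typescript
import { Kafka } from "kafkajs";

const kafka = new Kafka({
  clientId: "orders-service",
  brokers: ["localhost:9092"],
  retry: { retries: 5 }, // client-level retries for transient failures
});

async function publishOrders() {
  const producer = kafka.producer();
  await producer.connect();
  await producer.send({
    topic: "orders",
    // acks: 0 = don't wait, 1 = leader only, -1 = leader + all in-sync replicas
    acks: -1,
    // Several messages in one send() are written as a batch.
    messages: [
      { key: "order-1001", value: JSON.stringify({ total: 42 }) },
      { key: "order-1002", value: JSON.stringify({ total: 17 }) },
    ],
  });
  await producer.disconnect();
}

async function consumeOrders() {
  const consumer = kafka.consumer({ groupId: "billing" });
  await consumer.connect();
  await consumer.subscribe({ topic: "orders", fromBeginning: false });
  await consumer.run({
    eachMessage: async ({ message }) => {
      console.log("billing saw:", message.value?.toString());
    },
  });
}

publishOrders().then(consumeOrders).catch(console.error);
```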

Important points to remember

  • Once data is written to a partition, it cannot be changed — this is referred to as ‘immutability’.
  • Data in Kafka is only kept for a limited time; the default retention is one week, but this can be configured (see the snippet after this list).
  • An offset only has meaning within its partition; the same offset number in two different partitions refers to two different messages.
  • Offsets are never reused; they keep increasing as you send messages into a Kafka topic. That also means the order of messages is guaranteed only within a partition, not across partitions. This is very important to understand.
  • Messages within each partition have increasing offsets, which means they’re in order. We read them in the order of the offsets. But across partitions, we don’t have any control.
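A sketch of these points in practice: retention can be set per topic when it is created, and every consumed record exposes the partition and offset it came from. As before, the kafkajs client and the names used are assumptions:

```typescript
import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "retention-demo", brokers: ["localhost:9092"] });

async function main() {
  const admin = kafka.admin();
  await admin.connect();
  // Keep data in this topic for 3 days instead of the broker's default (about one week).
  await admin.createTopics({
    topics: [
      {
        topic: "clickstream",
        numPartitions: 3,
        configEntries: [{ name: "retention.ms", value: String(3 * 24 * 60 * 60 * 1000) }],
      },
    ],
  });
  await admin.disconnect();

  const consumer = kafka.consumer({ groupId: "analytics" });
  await consumer.connect();
  await consumer.subscribe({ topic: "clickstream", fromBeginning: true });
  await consumer.run({
    eachMessage: async ({ partition, message }) => {
      // Offsets only ever grow within a partition; ordering holds per partition, not across them.
      console.log(`partition=${partition} offset=${message.offset}`);
    },
  });
}

main().catch(console.error);
```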

Real-Time Usage Examples

Apache Kafka is widely used by several tech giants like Uber, LinkedIn, and Netflix for various real-time data processing and streaming applications:

  • Uber: uses Kafka to capture real-time location data from drivers and riders, optimize routes, calculate ETAs, and enhance user experience with personalized recommendations and promotions.
  • LinkedIn: uses Kafka to capture and analyze user engagement metrics, personalize content recommendations, and deliver real-time notifications to users about profile views, job updates, and network activities.
  • Netflix: uses Kafka for real-time recommendation systems, enabling personalized content recommendations based on user preferences, viewing history, and behavior patterns.

Key characteristics of Kafka

  • Real-time Data Processing: In today’s fast-paced digital landscape, organizations need to process data in real-time to gain actionable insights, make timely decisions, and provide responsive services to customers. Kafka’s distributed architecture and low-latency message processing capabilities enable real-time data ingestion, processing, and analysis at scale.
  • Scalability: Kafka’s distributed nature allows it to scale horizontally by adding more brokers and partitions to the cluster. This scalability ensures that Kafka can handle large volumes of data and growing workloads without compromising performance.
  • Fault Tolerance and Reliability: Kafka is designed to be highly fault-tolerant. Data replication across multiple brokers ensures that even if a broker fails, data remains available and processing continues uninterrupted. Additionally, Kafka’s partitioning and replication strategies ensure data durability and reliability.
  • Data Integration and Pipeline Building: Kafka serves as a central nervous system for data integration, enabling seamless communication between various components of a data ecosystem. It acts as a reliable data pipeline for ingesting data from multiple sources, transforming it if necessary, and delivering it to downstream systems for further processing or analysis.
  • Decoupling of Systems: Kafka enables loose coupling between different components of a distributed system. Producers and consumers interact with Kafka independently, allowing them to operate at their own pace and scale independently. This decoupling promotes system resilience and flexibility.
  • Stream Processing: Kafka’s support for stream processing allows applications to process data in motion. With Kafka Streams or other stream processing frameworks integrated with Kafka, developers can build real-time analytics, monitoring, and event-driven applications efficiently.
  • Unified Event Platform: Kafka serves as a unified platform for handling various types of data, including logs, metrics, clickstreams, IoT telemetry, and more. This versatility makes Kafka suitable for a wide range of use cases across industries, from financial services to e-commerce to IoT.

Apache Kafka Alternatives:

  1. Amazon Kinesis
  2. Google Cloud Pub/Sub
  3. Azure Event Hubs
  4. IBM MQ
  5. NATS
  6. Redis Streams
  7. Apache Flink
  8. Apache Pulsar

Apache Kafka’s architecture is meticulously designed to meet the demands of modern data processing and streaming applications. By leveraging concepts such as topics, partitions, brokers, producers, consumers, and replication, Kafka enables organizations to build scalable, fault-tolerant, and real-time data pipelines. As Kafka continues to evolve, it remains a vital component in the data infrastructure stack, empowering businesses to harness the power of streaming data effectively.

That is all for this post. Apache Kafka is quite an extensive topic to cover in a single post, and I’m excited to see you in the next couple of posts, which will cover the following topics:

  1. Installation of Apache Kafka and Zookeeper.
  2. Implementation of Apache Kafka in Node.js
  3. Kafka vs RabbitMQ
  4. Many more….

If you found this blog post useful then clap, comment, and follow.

🤝 Let’s connect on LinkedIn: Raheel Butt
