Understanding Apache Kafka: One Component at a Time — Part 1

Sachin Tripathi
8 min read · May 12, 2024


“The nice thing about standards is that you have so many to choose from.” ~ Andrew S. Tanenbaum

In the world of data streaming, developers are faced with many choices. Among these options, Apache Kafka stands tall for real-time data processing.

In this blog series, we’ll go on a journey to explore the key components and concepts of Kafka, so you can make informed decisions and build robust data streaming applications.

What is Apache Kafka?

Apache Kafka is a distributed event streaming platform that enables the development of real-time, event-driven applications.

It follows the publish-subscribe messaging pattern, where producers publish messages to topics, and consumers subscribe to those topics to receive and process the messages.

Kafka’s architecture is designed for high throughput, fault tolerance, and scalability.

Errr… too much jargon. Let’s simplify it!!

Distributed event streaming platform: all it really means is that Kafka is like a big postal system for delivering messages (events) from senders to receivers. The “distributed” part just means that this postal system has multiple post offices (servers) spread out across different locations, so it can handle a huge amount of mail without getting overburdened.

Publish-subscribe messaging pattern: it’s like your favourite sports magazine subscription — you don’t know exactly who is writing the articles, but you know you will get the content you are interested in.

High throughput: Kafka can handle a ton of messages at a time. It’s like running a super-efficient postal system that delivers an enormous number of letters per day without any trouble.

Fault tolerance: Kafka is built to keep working when things go wrong. Even if one of the post offices catches fire, Kafka has backup copies of all the messages stored in other post offices. So, even if something goes wrong, your messages will still get delivered.

Scalability: Kafka can easily grow and adapt to handle more messages and more subscribers as your system gets bigger.

What is Kafka made of, at a high level?

The key components of Kafka include:

  1. Producers: Applications that publish messages to Kafka topics.
  2. Topics: Logical feeds to which messages are published.
  3. Partitions: Each topic is divided into partitions for parallel processing.
  4. Brokers: Kafka servers that store and manage message persistence and distribution.
  5. Consumers: Applications that subscribe to topics and process the messages.
  6. Consumer Groups: Collaborative consumption where multiple consumers share the workload.

Design Principles

  • Event Sourcing: Kafka can serve as an event store, persisting an ordered sequence of events.
  • Command Query Responsibility Segregation (CQRS): Separate read and write paths for better scalability.
  • Log Aggregation: Collect logs from multiple services into a central Kafka cluster for analysis.

Anti-Patterns:

  • Using Kafka as a database replacement: Kafka is not designed for primary data storage or querying.
  • Overloading Kafka with large payloads: Kafka is optimized for high-throughput, small messages.

In this blog, we will take a deep dive into Kafka Producers.

What Exactly is a Kafka Producer?

Simply put, it is a client application that publishes messages to Kafka topics.
But there’s more to it. Producers are responsible for:
determining which records are sent to which Kafka partition within the specified topic. This is more important than you might think: it has a significant impact on the overall scalability, performance, and reliability of your system.

They ensure that data is efficiently and reliably ingested into Kafka, so it can be processed and analyzed downstream.

Core Components:
At the heart of every Kafka Producer are two key components: the Producer API and Serialization.

Producer API: The main tool here is the KafkaProducer class.
It is responsible for managing the connection to the Kafka brokers and sending your data as messages.
Think of the KafkaProducer as a messenger who takes your data, packages it nicely, and delivers it to the right place in the Kafka cluster.

Enough now, show me the code!!

from kafka import KafkaProducer
# Create a KafkaProducer instance
producer = KafkaProducer(bootstrap_servers=['localhost:9092'])
# Send a message to a topic
topic = 'my_topic'
message = b'Hello, Kafka!'
producer.send(topic, message)
# Flush the producer to ensure all messages are sent
producer.flush()

Let’s see the code:

1- First, we import the KafkaProducer class from the kafka module.
2- We create an instance of the KafkaProducer by specifying the bootstrap_servers parameter, which is a list of Kafka broker addresses. In this example, we assume that Kafka is running locally on the default port 9092.
3- We define the topic name (‘my_topic’) and the message we want to send. The message is a bytes object (b’Hello, Kafka!’).
4- We use the send() method of the KafkaProducer to send the message to the specified topic. The producer takes care of serializing the message and sending it to the appropriate Kafka broker.
5- Finally, we call the flush() method to ensure that all the messages in the producer’s buffer are sent to Kafka before the program exits.

Serialization is the process of converting messages from native data structures into byte arrays that Kafka can store and transmit. Think of it as the translator that helps Kafka understand the language of your data.

Serializers are configured on the producer, separately for message keys and values.
Some common serializers in the Java client include StringSerializer, ByteArraySerializer, and KafkaAvroSerializer, which integrates with schema-based systems like Confluent Schema Registry.
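
As a minimal sketch with kafka-python (the topic name 'user_events' and the payload here are made up for illustration), serializers are supplied as callables when the producer is created:

import json
from kafka import KafkaProducer

# Serialize string keys to UTF-8 bytes and dict values to JSON bytes
producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    key_serializer=lambda k: k.encode('utf-8'),
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

# The producer now accepts native Python objects and converts them to bytes
producer.send('user_events', key='user-42', value={'action': 'login', 'ts': 1715500800})
producer.flush()

For schema-based formats such as Avro, teams typically pair the producer with a schema registry instead of hand-rolled serializers.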

Key Features of Producers:

Message Batching and Buffering:
Each producer has a buffer, specified by buffer.memory, that temporarily stores records ready to be sent to the broker. This buffer acts as a holding area, decoupling the application thread from network I/O and allowing producers to handle high data volumes asynchronously.

It’s like having a waiting room where messages can hang out until they’re ready to be sent.

Batching, controlled by batch.size and linger.ms, groups multiple records into a single request to improve throughput and reduce network overhead.

batch.size defines the maximum batch size in bytes,
linger.ms specifies the maximum time to wait before sending a batch

Instead of sending each message individually, Kafka Producers batch them together for a more efficient ride. This batching mechanism optimizes network utilization and improves overall system performance.

When a producer sends messages to Kafka, it uses the TCP/IP protocol .

TCP is a connection-oriented protocol that ensures reliable, ordered, and error-checked delivery of data between the producer and the Kafka broker. IP is responsible for routing the data packets across the network.

Reduced Network Overhead: By batching messages together, the producer reduces the number of network round trips required to send data to Kafka. Each request carries its own protocol and TCP overhead, and batching minimizes this by sending multiple messages in a single request over one TCP connection, thereby reducing the overall network traffic.

Reduced Latency: While batching introduces a slight delay before individual messages are sent, it can significantly reduce the overall latency of message delivery. By sending batches of messages, the producer avoids the per-message header and round-trip overhead of sending each record individually.

LinkedIn Engineering Blog, “Benchmarking Apache Kafka: 2 Million Writes Per Second (On Three Cheap Machines)” — https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    batch_size=10000,  # maximum batch size in bytes
    linger_ms=1000     # delay in milliseconds before sending a batch
)
In this example, we set batch_size to 10000 bytes and linger_ms to 1000 milliseconds. The producer accumulates messages and sends a batch as soon as either the batch reaches 10000 bytes or 1 second has elapsed, whichever comes first.

Partitioning Strategy:

One of the most important decisions a Kafka Producer has to make is which partition to send a message to.
By default, Kafka hashes the record key to choose a partition, or falls back to spreading messages across partitions (e.g., round-robin) if no key is provided.

If a key is present, Kafka uses it to hash the message to a specific partition deterministically, ensuring that all messages with the same key go to the same partition.
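
A small sketch of key-based routing with kafka-python (the topic 'orders' and the key 'order-123' are hypothetical): because the key is hashed deterministically, every event for the same order lands on the same partition and is therefore read back in the order it was produced.

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    key_serializer=lambda k: k.encode('utf-8')
)

# All records sharing the key 'order-123' hash to the same partition,
# so their relative order is preserved for consumers
for event in (b'created', b'paid', b'shipped'):
    producer.send('orders', key='order-123', value=event)

producer.flush()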

But sometimes that’s not enough. That’s where custom partitioning comes in: it allows developers to implement routing logic based on message content or external factors.
For example, a custom partitioner could route messages based on user demographics to optimize consumer processing. In the Java client this means implementing the Partitioner interface and overriding its partition() method; kafka-python instead accepts a callable via the producer’s partitioner config, as sketched below (the topic, keys, and region-to-partition mapping are made up for illustration).

from kafka import KafkaProducer

# Hypothetical mapping from geographic keys to partition numbers
REGION_TO_PARTITION = {
    b'new_york': 0,
    b'london': 1,
    b'tokyo': 2,
}

def geographic_partitioner(key_bytes, all_partitions, available_partitions):
    """Route each record to a partition based on its geographic key.

    kafka-python calls this (after key serialization) with the serialized key
    and the lists of all / currently available partitions for the topic.
    """
    if key_bytes is None:
        # No key: fall back to the first available partition
        return (available_partitions or all_partitions)[0]
    partition = REGION_TO_PARTITION.get(key_bytes)
    if partition is not None and partition in available_partitions:
        return partition
    # Unknown region or target partition unavailable: use the same fallback
    return (available_partitions or all_partitions)[0]

# Create a Kafka producer that uses the custom partitioner
producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    key_serializer=lambda k: k.encode('utf-8'),
    partitioner=geographic_partitioner
)

# Send messages with geographic keys
producer.send('geographic_topic', key='new_york', value=b'Message for New York')
producer.send('geographic_topic', key='london', value=b'Message for London')
producer.send('geographic_topic', key='tokyo', value=b'Message for Tokyo')

# Flush and close the producer
producer.flush()
producer.close()

Reliability and Fault Tolerance:

Acknowledgments (acks) determine how many partition replicas must acknowledge a record’s receipt before considering the send operation successful.

acks=0 means the producer does not wait for any acknowledgment, acks=1 means only the partition leader must acknowledge, and acks=all means all in-sync replicas must acknowledge.

The more acknowledgments you require, the more confident you can be that your message was successfully delivered.

(We will dive into replicas and partitions in more detail in the next article, when we discuss brokers.)

Retries and Idempotence: To handle transient failures, producers can automatically retry sending records. Setting enable.idempotence=true ensures that retries do not result in duplicated records.
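
A minimal sketch of a reliability-focused configuration with kafka-python; note that enable.idempotence is a Java-client setting and idempotence support in kafka-python depends on the client version, so only acks and retries are shown here:

from kafka import KafkaProducer

# Wait for all in-sync replicas to acknowledge each record, and retry
# transient failures a few times before giving up
producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    acks='all',
    retries=5
)

producer.send('my_topic', b'reliably delivered message')
producer.flush()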

Performance Optimizations:

Compression: Producers can compress data at the message-set (batch) level before sending it to brokers, reducing the size of the data sent over the network and stored on disk. Compression types include none, gzip, snappy, and lz4.
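
As a sketch with kafka-python, compression is a single producer setting (the brokers and consumers must support the chosen codec):

from kafka import KafkaProducer

# Compress each batch with gzip before it is sent over the network
producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    compression_type='gzip'   # also supports 'snappy' and 'lz4'
)

producer.send('my_topic', b'compressed message')
producer.flush()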

Batching (discussed above) is another way to achieve performance optimization.

Monitoring Kafka Producers:

There are a few key metrics to watch:

Send Rate and Error Rate: track the rate of data being sent and any errors encountered. You can generate load and observe these numbers with the kafka-producer-perf-test tool that ships with Kafka:

kafka-producer-perf-test --topic <topic-name> --num-records <number-of-records> --record-size <record-size> --throughput <throughput> --producer-props bootstrap.servers=<broker-list>

Buffer Utilization: monitors how much of the buffer memory is being used, indicating whether the producer is producing data faster than it can send it.
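
kafka-python also exposes client-side metrics programmatically via producer.metrics(); a rough sketch follows (the exact metric group and metric names, such as 'record-send-rate', may vary by client version):

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers=['localhost:9092'])
producer.send('my_topic', b'sample message')
producer.flush()

# metrics() returns a nested dict of {metric group: {metric name: value}}
metrics = producer.metrics()
producer_metrics = metrics.get('producer-metrics', {})
print('send rate:', producer_metrics.get('record-send-rate'))
print('error rate:', producer_metrics.get('record-error-rate'))
print('buffer available bytes:', producer_metrics.get('buffer-available-bytes'))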

Conclusion:

Our journey doesn’t end here. In the upcoming posts of this series, we will dive into other essential components of the Kafka ecosystem:

Brokers, Topics, and Partitions: How Kafka organizes and stores data for optimal performance and fault tolerance.
Consumers and Consumer Groups: Learn how to efficiently process and consume data from Kafka topics.
ZooKeeper and Kafka Raft (KRaft): Understand the role of distributed coordination in Kafka and explore the future of Kafka without ZooKeeper.
Kafka Streams: Use the power of Kafka’s native stream processing capabilities to build real-time data pipelines.
Apache Flink: Discover how Flink integrates with Kafka to enable complex, stateful stream processing.
