Kafka ins and outs: Part 2

Let’s talk about Producers

Andrews Azevedo dos Reis
WAES
10 min read · Aug 24, 2023

If you read (and liked) the Kafka ins and outs: Part 1 article, you should be familiar with the basics of Apache Kafka — core capabilities, what a Topic is, the purpose of Partitions, and how the Replication Factor works.

Now, it's time to explore another crucial component of Kafka's architecture: the producer. In this second part, we will discuss what a producer is, how it writes data to Kafka, and how it maintains the order of records.
Let's keep going on this journey to become a Kafka Specialist!

What is a Producer?

In the context of Apache Kafka, a producer is an entity, usually an application, that publishes, or writes, data to topics. Producers send records, which are key-value pairs, to Kafka, where they are stored in topics for consumption by consumers. The role of a producer is to ensure that data is correctly sent to the Kafka cluster.

The Producer representation in the traditional Pub-Sub architecture

Producers don't need to worry about the data's destination in terms of specific brokers or partitions. They specify the Topic, and Kafka handles data distribution across the different partitions within that Topic. This abstraction simplifies writing data to Kafka, allowing producers to focus on creating records. Talking about records, before we go deep into how the writing process works, let's first take a peek at the Record structure.

Kafka Records: Key and Value

A Kafka record, also known as a message, is Kafka's fundamental unit of data. Each record consists of a key and a value, both of which are byte arrays. The key and value are used to store and retrieve data, and they can be of any type that can be serialized into a byte array, such as String, Integer, Long, or even a custom object.

A representation of a record with the key and the value

Key: the key in a Kafka record determines the Partition within the Topic where the record will be stored. Records with the same key are stored in the same Partition. This is important because Kafka guarantees order within a partition, not across partitions in a topic. So, if the order of records matters, it's important to define the key so that records that need to be processed in order are produced with the same key.

The key can also be null. In this case, the producer spreads records across partitions on its own (older clients use a round-robin algorithm; newer ones use a sticky partitioner that fills a batch for one partition before moving to the next), effectively load-balancing the records across the available partitions.

Value: the value in a Kafka record is the actual data you want to store and process. It can be anything serialized into a byte array, from simple string messages to complex, structured data like JSON or Avro records.

It's important to note that Kafka itself is agnostic to the format of your data. It doesn't know or care what is in your records. It simply stores the byte arrays and leaves it up to the producers and consumers to serialize and deserialize them.
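To make this concrete, here is a minimal sketch of a Java producer built with the standard kafka-clients library. The broker address, the topic name players-topic, the key player-10, and the JSON payload are assumptions for illustration only; the point is simply that the key and value are plain Strings that the configured serializers turn into byte arrays for Kafka.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SimplePlayerProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Broker address is an assumption for this sketch.
        props.put("bootstrap.servers", "localhost:9092");
        // Kafka only stores byte arrays; these serializers turn our String key and value into bytes.
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key identifies the player, the value carries the data (JSON here, but Kafka does not care).
            ProducerRecord<String, String> record = new ProducerRecord<>(
                    "players-topic", "player-10", "{\"matches\": 8, \"goals\": 6}");
            producer.send(record);
        } // closing the producer flushes any buffered records
    }
}
```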

How Does Writing Work?

When a producer is ready to send a record, it must first know which Partition to send it to. If a partition is not specified in the producer record, the producer uses a partitioner to decide which Partition the record should go to. The default partitioner hashes the record's key, ensuring that all records with the same key end up in the same Partition.

The producer then sends the record to the leader broker of the target partition. The record is added to a batch destined for the same Partition. These batches are sent to the broker when they are full or after a short, configurable wait.

The Kafka broker then writes the batch of records to the partition log and returns an acknowledgement to the producer. The producer can resend the batch if the acknowledgement is not received within a specified time.
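As a rough sketch of the knobs involved (all standard producer settings; the concrete values and the broker address are illustrative assumptions), batching is controlled by batch.size and linger.ms, while redelivery after a missing acknowledgement is bounded by retries and delivery.timeout.ms:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class BatchingProducerConfig {
    static KafkaProducer<String, String> buildProducer() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Batching: records headed for the same partition are grouped into one request.
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 32 * 1024); // max batch size in bytes
        props.put(ProducerConfig.LINGER_MS_CONFIG, 10);         // wait up to 10 ms to fill a batch

        // Redelivery: if the acknowledgement does not arrive in time, the batch is retried.
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120_000); // overall upper bound in ms

        return new KafkaProducer<>(props);
    }
}
```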

How Does Ordering Work?

In Kafka, records are written to partitions in the order the producer sends them, and each record is assigned a sequential id number that uniquely identifies it within the Partition. This id is called the offset. As new records arrive, they are appended to the end of the log, and the offset is incremented.

This means that Kafka maintains the order of records within each partition. If the order of records is important, the producer can ensure that records are sent to the same partition by specifying a key. All records with the same key will go to the same partition, and since Kafka maintains the order within a partition, the order of these records is also maintained.

However, Kafka does not guarantee the order across partitions if records are sent to different partitions. This is because records in different partitions can be written and read independently, so their order can differ.
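A hedged sketch of how a producer might observe this in practice: both sends below use the same (hypothetical) key, so they land in the same partition, and the callback prints the offsets, which grow in send order. The topic name, key, and broker address are assumptions for illustration.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderedUpdates {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Two updates for the same key: both land in the same partition, in send order.
            for (String stats : new String[]{"{\"matches\": 8}", "{\"matches\": 10}"}) {
                producer.send(
                        new ProducerRecord<>("players-topic", "player-10", stats),
                        (metadata, exception) -> {
                            if (exception == null) {
                                // Same partition every time; the offset grows monotonically.
                                System.out.printf("partition=%d offset=%d%n",
                                        metadata.partition(), metadata.offset());
                            }
                        });
            }
        }
    }
}
```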

Why is Ordering Important?

Ordering in Kafka is essential when your application's sequence of events or data matters. This is particularly important in scenarios where the state of a particular entity is updated over time, and the sequence of these updates affects the entity's current state.

Let's consider our Topic_Awesome_Players, which contains data about football players, including the number of matches played, goals scored, and assists made. Each record might represent a player's statistics update after a match.

If these records are not processed in the order they were produced, we might end up with incorrect statistics. For example, let's take into account the following image, which contains two records:

Example of 2 records sharing the same key

If we pay attention to the number of matches played, we can see that the record on the right side should be consumed first because it is older than the left one. It's a simple business rule: Messi cannot have played 10 matches before completing 8. Unless we have some time machine involved, which is not the case.

By ensuring that all records for a specific player — identified by a unique key — go to the same Partition, Kafka guarantees the order of these records. This means the system processing these records will process them in the order of matches played, ensuring the player's statistics are calculated correctly.

Hard to follow? No problem, let's walk through a producer simulation together!

Topic Awesome Players: producing some records!

Imagine we have our Topic Topic_Awesome_Players, and we are responsible for the producer that populates this Topic with data about some of the best players in the world. Let's do it!

First, we want to produce a record with the key messi. Since we have never used this key before, we don't know yet which partition it belongs to; the default partitioner hashes the key, and in our example messi maps to partition 0.

Producing the first record with the key messi

Then we produce a record with the key cr7. Again, it is a new key; the partitioner hashes it, and in our example cr7 maps to partition 1.

Producing the first record with the key cr7

Now we produce a new record with the key messi. At this point, we are producing a record with a key that already maps to a partition. Because of this, Kafka stores the record in that same Partition, in this case, partition 0.

Producing the second record with the key messi

Let's produce a record with a new key again, this time lewandovski. It's a brand new key, so the partitioner hashes it and, in our example, it maps to partition 2.

Producing the first record with the key lewandovski

Again, we produce a record with a new key, haaland. The same happens here: a new key is hashed and, in our example, it maps to partition 2.

Producing the first record with the key haaland

So now we produce a new record with the key cr7. This key already maps to partition 1, so Kafka writes the record to that very same Partition.

Producing the second record with the key cr7

Finally, we produce a record with the key mbappe. Because it's a new key, the partitioner hashes it and, in our example, it maps to partition 2.

Producing the first record with the key mbappe

In summary, every key maps deterministically to exactly one Partition, so from the very first record onwards, all records sharing the same key are stored in the same Partition. With this, Kafka can guarantee the ordering of our records!
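A small caveat on the mechanics: Kafka does not keep a table of "assigned" keys. The default partitioner simply computes a murmur2 hash of the serialized key modulo the number of partitions, which is what makes the mapping deterministic. The sketch below uses the public hashing helpers shipped with kafka-clients to show the idea; the partition count of 3 is an assumption taken from the example, and the exact partition numbers it prints may differ from the ones drawn in the images above.

```java
import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.utils.Utils;

public class KeyToPartition {
    public static void main(String[] args) {
        int numPartitions = 3; // assumed partition count, matching the example drawings

        for (String key : new String[]{"messi", "cr7", "lewandovski", "haaland", "mbappe"}) {
            byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
            // Mirrors what the default partitioner does for keyed records:
            // a murmur2 hash of the serialized key, modulo the partition count.
            int partition = Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
            System.out.printf("key=%s -> partition %d%n", key, partition);
        }
    }
}
```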

Producer Acknowledgments (Acks)

Producer acknowledgements, often called "acks", are an important part of Kafka's fault-tolerance mechanism.

When a producer sends a record to a Kafka broker, it can choose to receive an acknowledgement of receipt. This acknowledgement informs the producer that the broker has received and written the record to the log. The level of acknowledgement required can be configured based on the producer's tolerance for data loss.

There are three main types of acknowledgement settings, summarized below and followed by a short configuration sketch:

  1. Acks = 0: The producer will not wait for any acknowledgement from the broker. This setting provides the lowest latency but also the highest risk of data loss. Also known as fire and forget, this option can be good when you need performance over delivery guarantees, for instance, if you are generating records of user clicks on your website. You don't mind losing some clicks, but you must perform well.
  2. Acks = 1: The producer will wait for the leader broker to acknowledge the receipt of the record. This setting provides a balance between latency and data durability. This could be a good choice when you are generating important — but not crucial — records, like user report logs of your application.
  3. Acks = all or Acks = -1: The producer will wait for the full set of in-sync replicas (ISRs) to acknowledge the receipt of the record. This setting ensures the highest data durability as it guarantees that the record will not be lost as long as one in-sync replica remains alive. This is the perfect choice for your application's crucial flows, for instance, financial transactions.
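Putting these settings into code, here is a minimal configuration sketch for the durability-focused end of the spectrum (acks=all). The broker address is an assumption, and enabling idempotence is an extra, optional safeguard against duplicates when a batch has to be retried.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class DurableProducerConfig {
    static KafkaProducer<String, String> buildProducer() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // "0"   -> fire and forget: lowest latency, possible data loss
        // "1"   -> wait for the partition leader only
        // "all" -> wait for all in-sync replicas: highest durability (same as "-1")
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        // Optional extra safeguard: avoids duplicates when a batch has to be retried.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);

        return new KafkaProducer<>(props);
    }
}
```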

Log Compaction

Log compaction is a mechanism in Kafka that helps maintain the health of the log over time. Logs can grow indefinitely in a typical Kafka system as producers continue to write new records. Over time, this can lead to storage issues and degrade performance.

Log compaction addresses this issue by preserving the latest update for each record key within the log of each topic partition. It operates on a per-partition basis and retains the last known value for each record key for a given topic partition. For example, let's take a look into our topic Topic_Awesome_Players:

The Topic_Awesome_Players we populated in our previous example

In a compacted log, Kafka ensures that, for each key, at least the latest update is always present in the log. This means that applications can rely on Kafka to hold a full snapshot of the final state of every record, rather than the complete history of changes.

This is how our Topic_Awesome_Players looks when we use the log compaction:

Example of Topic_Awesome_Players with Log Compaction enabled

And what happens when new events come in? The same rule applies!
Kafka will keep only the last record state for each key:

Example of new records produced in Topic_Awesome_Players with Log Compaction enabled

This behaviour is defined by the Topic property cleanup.policy, whose default value is delete. To enable Log Compaction, we should change this property to compact.

Example of command to enable Log Compaction
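The command in the image is typically a kafka-configs call; to stay consistent with the other examples, here is an alternative, hedged sketch that sets cleanup.policy=compact programmatically with the Java AdminClient. The broker address is an assumption, and in practice you may prefer the CLI or your infrastructure tooling.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class EnableCompaction {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed address

        try (Admin admin = Admin.create(props)) {
            ConfigResource topic =
                    new ConfigResource(ConfigResource.Type.TOPIC, "Topic_Awesome_Players");
            // Switch the cleanup policy from the default "delete" to "compact".
            AlterConfigOp setCompact = new AlterConfigOp(
                    new ConfigEntry("cleanup.policy", "compact"), AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(topic, List.of(setCompact))).all().get();
        }
    }
}
```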

Log compaction is particularly useful for topics that store data where each key's latest state is what matters, like a changelog from a database in which updates and deletes overwrite previous values for a specific key.

Conclusion

If you are reading this, you now know what a producer is, how it sends data to Kafka, and how it keeps everything in order. You also understand how producers receive acknowledgements and how log compaction helps keep things tidy in Kafka. You are ready to start producing records!

But that's not all! Next time, we'll switch gears and focus on the other side of the equation — the consumers. We'll see how they get data from Kafka and interact with the data that producers send. We'll also talk about what Distributed Coordination Services are and why they are important.

Stay tuned for the last part of our Kafka ins and outs article soon!

Do you think you have what it takes to be one of us?

At WAES, we are always looking for the best developers and data engineers to help Dutch companies succeed. If you are interested in becoming a part of our team and moving to The Netherlands, look at our open positions here.

WAES publication

Our content creators constantly create new articles about software development, lifestyle, and WAES. So make sure to follow us on Medium to learn more.

Also, make sure to follow us on our social media:
LinkedIn · Instagram · Twitter · YouTube
