In the Land of Streams — Kafka Part 1: A Producer’s Message

A Kafka Streaming Ledger

Giannis Polyzos
6 min read · Dec 13, 2022

In the land of streams, Kafka is a well-known event streaming platform. This blog post series assumes some basic familiarity with Kafka — creating producers and consumers — and focuses on providing a better understanding of how Kafka works under the hood, so you can better design and tune your applications.

The blog post series consists of the following parts:
- Part 1: A Producer’s Message (this blog post)
- Part 2: The Rise of the Consumers
- Part 3: Offsets and how to Handle them
- Part 4: My Cluster is Lost!!— Embracing Failure

Message Lifecycle: PPC (Produce, Persist, Consume)

Using a simple file ingestion data pipeline, Part 1 of the series aims to cover the following:

  1. Ingest data files (with click event data) into Kafka
  2. Explain how the producing side works
  3. Producer configuration tuning
  4. Throughput vs Latency

I will be using some e-commerce datasets you can find here. The code samples are written in Kotlin, but the implementation should be easy to port to any other programming language.

You can find the source code on Github here.

Note: If you want to run the code samples, you will need a Kafka cluster; you can find a docker-compose file in the repo.

But if you are like me, when you test different configurations and investigate tuning options you will want a real cluster, in order to account for network delays, disk behavior, and the overall experience.

Aiven for Kafka is the easiest way to get started, and you get many extras along the way, like out-of-the-box observability.

Let us dive right in.

👀 Show me the Code 👀

Our producer will send some events to Kafka. The data model for the click events should look similar to the following payload:
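The exact schema depends on the dataset, but a payload along the following lines captures the idea (the field names here are illustrative, not necessarily the exact schema used in the repo):

```kotlin
// Illustrative data model for a click event; field names are assumptions
// based on typical e-commerce click datasets, not the exact schema from the repo.
data class ClickEvent(
    val eventTime: String,      // e.g. "2022-12-13 09:15:32 UTC"
    val eventType: String,      // e.g. "view", "cart", "purchase"
    val productId: Long,
    val categoryCode: String?,  // may be missing in the raw data
    val brand: String?,
    val price: Double,
    val userId: Long,           // used as the message key when producing
    val userSession: String
)
```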

I will create a topic called ecommerce.events which I will use to store my messages. The topic will have 5 partitions and a replication factor of 3 (leader + 2 replicas).
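For reference, here is one way to create such a topic programmatically with the Kafka AdminClient; the bootstrap address is an assumption, and a plain kafka-topics.sh command with the same settings works just as well:

```kotlin
import org.apache.kafka.clients.admin.AdminClient
import org.apache.kafka.clients.admin.AdminClientConfig
import org.apache.kafka.clients.admin.NewTopic
import java.util.Properties

fun createTopic() {
    val props = Properties().apply {
        put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // assumed address
    }
    AdminClient.create(props).use { admin ->
        // ecommerce.events: 5 partitions, replication factor 3 (leader + 2 replicas)
        val topic = NewTopic("ecommerce.events", 5, 3.toShort())
        admin.createTopics(listOf(topic)).all().get()
    }
}
```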

Creating Kafka producers is pretty straightforward; the important part is creating and sending records.

Producing to Kafka
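The snippet below is a minimal sketch of the producing side; the bootstrap address, input file path, and the column index used to extract the userId are assumptions, so the actual code in the repo may differ:

```kotlin
import org.apache.kafka.clients.producer.KafkaProducer
import org.apache.kafka.clients.producer.ProducerConfig
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.common.serialization.StringSerializer
import java.io.File
import java.util.Properties

fun main() {
    val props = Properties().apply {
        put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")   // assumed address
        put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer::class.java)
        put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer::class.java)
    }

    KafkaProducer<String, String>(props).use { producer ->
        File("data/events.csv").readLines()       // hypothetical input file
            .drop(1)                              // skip the CSV header
            .forEach { line ->
                val userId = line.split(",")[7]   // assumed position of the user id column
                val record = ProducerRecord("ecommerce.events", userId, line)
                producer.send(record) { metadata, exception ->
                    if (exception != null) {
                        println("Failed to send record: ${exception.message}")
                    } else {
                        println("Sent to ${metadata.topic()}-${metadata.partition()} @ offset ${metadata.offset()}")
                    }
                }
            }
        producer.flush()
    }
}
```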

For every event, we create a ProducerRecord object and specify the topic, the key (here we partition based on the userId), and finally the event payload as the value.

The send() method is asynchronous, so we specify a callback that gets triggered when we receive a result back.

If the message was successfully written to Kafka, we print the metadata; otherwise, if an exception is returned, we log it.

But what actually happens when the send() method is called?

Producer Internals

Kafka does a lot of things under the hood when the send() method is invoked, so let’s outline them below:

  1. The message is serialized using the specified serializer.
  2. The partitioner determines which partition the message should be routed to (see the sketch after this list).
  3. Internally, Kafka keeps message buffers; there is one buffer for each partition, and each buffer can hold many batches of messages.
  4. Finally, the I/O threads pick up these batches and send them over to the brokers.
    At this point, our messages are in flight from the client to the brokers. The brokers have send/receive network buffers; the network threads pick up the messages and hand them over to an I/O thread to actually persist them on disk.
  5. On the leader broker, the messages are appended to the partition's log and sent to the followers for replication. One thing to note here is that the messages are first written to the page cache and only periodically flushed to disk.
    (Note: losing messages that sit in the page cache but have not yet been flushed to disk is an extreme case of message loss, but you should still be aware of it.)
  6. The followers (in-sync replicas) store the message and send an acknowledgment back once they have replicated it.
  7. A RecordMetadata response is sent back to the client.
  8. If a failure occurred and we didn't receive an acknowledgment, the client checks whether retries are enabled and, if so, resends the message.
  9. The client receives the response.
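To make step 2 a bit more concrete, this is roughly what the default partitioner does when a record has a key (a conceptual sketch; the real implementation lives inside the Kafka client and also covers records without a key):

```kotlin
import org.apache.kafka.common.utils.Utils

// Conceptual sketch of keyed partitioning: hash the serialized key with murmur2
// and take the result modulo the number of partitions, so the same userId
// always lands on the same partition.
fun partitionFor(keyBytes: ByteArray, numPartitions: Int): Int =
    Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions
```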

Tradeoffs between Latency and Throughput

In the distributed systems world most things come with tradeoffs, and it's up to the developer to find the "sweet spot" between them; thus it's important to understand how things work.

One important aspect is tuning between throughput and latency. Two key configurations for that are batch.size and linger.ms.

Having a small batch size and linger set to zero can reduce latency, since messages are sent as soon as possible — but it might reduce throughput. Configuring for low latency is also useful in slow produce rate scenarios: if fewer records than the specified batch size have accumulated, the client will wait up to linger.ms for more records to arrive.

On the other hand, a larger batch size might use more memory (as we will allocate buffers for the specified batch size), but it can increase throughput. Other configuration options like compression.type, max.in.flight.requests.per.connection, and max.request.size can help here as well.
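As a reference point, the defaults already lean towards low latency. The hypothetical helper below shows what a latency-oriented configuration looks like, applied to the same Properties object as in the producer sketch above:

```kotlin
import org.apache.kafka.clients.producer.ProducerConfig
import java.util.Properties

// Latency-oriented settings (these are in fact the producer defaults):
// ship records out as soon as they arrive.
fun tuneForLatency(props: Properties) {
    props.setProperty(ProducerConfig.LINGER_MS_CONFIG, "0")            // don't wait to fill batches
    props.setProperty(ProducerConfig.BATCH_SIZE_CONFIG, "16384")       // 16 KB, the default batch size
    props.setProperty(ProducerConfig.COMPRESSION_TYPE_CONFIG, "none")  // no compression
}
```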

Let’s better illustrate this with an example.

Our event data is stored in CSV files that we want to ingest into Kafka. Since this is not real-time data ingestion, we don't really care about latency here; we care about good throughput, so we can ingest the files fast. Using the default configurations, ingesting 1,000,000 messages takes around 184 seconds.

Setting batch.size to 64 KB (the default is 16 KB), linger.ms to a value greater than 0, and compression.type to gzip has the following impact on the ingestion time.
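A sketch of such a throughput-oriented configuration, again as a hypothetical helper on the same Properties object; the exact linger.ms value here is just an example:

```kotlin
import org.apache.kafka.clients.producer.ProducerConfig
import java.util.Properties

// Throughput-oriented settings for the bulk CSV ingestion.
fun tuneForThroughput(props: Properties) {
    props.setProperty(ProducerConfig.BATCH_SIZE_CONFIG, "65536")       // 64 KB batches
    props.setProperty(ProducerConfig.LINGER_MS_CONFIG, "50")           // wait up to 50 ms for batches to fill (illustrative)
    props.setProperty(ProducerConfig.COMPRESSION_TYPE_CONFIG, "gzip")  // compress batches before sending
}
```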

From around 184 seconds, the ingestion time dropped down to around 18 seconds. In both cases acks=1.

I will leave it up to you to experiment with different configuration options and see how they come in handy for your use case.

Finally, if you are concerned with exactly-once semantics, set enable.idempotence to true, which also implies acks set to all.
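As a sketch, again via a hypothetical helper on the same Properties object:

```kotlin
import org.apache.kafka.clients.producer.ProducerConfig
import java.util.Properties

// Enable the idempotent producer; this goes hand in hand with acks=all.
fun enableIdempotence(props: Properties) {
    props.setProperty(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true")
    props.setProperty(ProducerConfig.ACKS_CONFIG, "all")
}
```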

Wrapping Up

So far we saw how the producing side of Kafka works. Some takeaways when creating producer applications:

  • Think of the requirements and try to tune between throughput and latency.
  • Think of the guarantees you need your producers to provide; e.g., for exactly-once semantics, idempotence and/or transactions might be your friends.
  • One detail not mentioned before, but good to know, is that if you want to create multi-threaded applications, it's best to create one producer instance and share it among threads (see the sketch below).
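Since KafkaProducer is thread-safe, a single instance can safely be shared by multiple worker threads. A minimal sketch, with an illustrative thread count and the same assumed user-id column index as before:

```kotlin
import org.apache.kafka.clients.producer.KafkaProducer
import org.apache.kafka.clients.producer.ProducerRecord
import java.util.concurrent.Executors

// KafkaProducer is thread-safe: create it once and share it across worker threads.
fun ingestInParallel(producer: KafkaProducer<String, String>, chunks: List<List<String>>) {
    val pool = Executors.newFixedThreadPool(4)   // illustrative thread count
    chunks.forEach { chunk ->
        pool.execute {
            chunk.forEach { line ->
                producer.send(ProducerRecord("ecommerce.events", line.split(",")[7], line))
            }
        }
    }
    pool.shutdown()
}
```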

Check Next: Part 2 The Rise of the Consumers.
