Apache Kafka Guide #3 Producers and Message Keys
Hi, this is Paul, and welcome to the third part of my Apache Kafka guide. Today we’re gonna talk about how Producers and Message Keys work.
Producers
So far we’ve explored topics and the data inside them, but we still need a Kafka producer to write data to those topics. Producers write to topic partitions, such as partition 0 of topic A, followed by partitions 1 and 2. Within each partition, data is written sequentially and marked by offsets. Your producer sends data to Kafka topic partitions, knowing in advance which partition and which Kafka broker (server) to use. We’ll delve into Kafka brokers soon.
It’s a misconception that Kafka servers decide which partition receives the data; that choice is made in advance by the producer, which we’ll explore further. If a Kafka server holding a specific partition fails, producers can recover automatically. There are many intricate workings in Kafka that we’ll gradually uncover.
Kafka achieves load balancing as producers distribute data across all partitions using certain mechanisms. This scalability is due to multiple partitions within a topic, each receiving messages from one or more producers.
Producers: Message Keys
Producers can attach keys to their messages. The message contains the data, and you can add an optional key, which can be of any type, like a string, a number, or binary. There are two scenarios here.
In this example, a producer sends data to a topic with two partitions. If the key is null, data is sent in a round-robin fashion to partitions, ensuring load balancing. A null key indicates no key in the producer’s message. But if the key exists, it holds some value, which again could be of various types.
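Here is a minimal sketch of the round-robin behavior for null keys. This is a simulation, not the real client, and the partition count and messages are made up. (As a side note, recent Kafka clients actually use a “sticky” partitioner for null keys, batching messages to one partition at a time, but round-robin is the classic behavior described here.)

```python
# Simulate round-robin distribution of keyless messages (illustrative only;
# the real Kafka client handles this internally).
NUM_PARTITIONS = 2  # hypothetical topic with partitions 0 and 1

def assign_round_robin(messages):
    """Assign each keyless message to partitions 0, 1, 0, 1, ... in turn."""
    return [(i % NUM_PARTITIONS, msg) for i, msg in enumerate(messages)]

assignments = assign_round_robin(["m0", "m1", "m2", "m3"])
# Messages alternate between the two partitions, balancing the load.
print(assignments)  # [(0, 'm0'), (1, 'm1'), (0, 'm2'), (1, 'm3')]
```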
Kafka producers have a crucial feature: messages with the same key always go to the same partition due to a hashing strategy. This is essential in Apache Kafka for maintaining order in messages related to a specific field.
If tracking each car’s position in the sequence is important, use the car ID as the message key. For instance, car ID 123 would always go to partition 0, allowing you to track that car’s data in order. Similarly, car ID 234 goes to partition 0. Which key goes to which partition is determined by a hashing technique, which will be explained later. Other car IDs, like 345 or 456, would consistently end up in partition 1 of your topic A.
- The producer has the option to include a key with the message (such as a string, a number, or binary).
- If the key is null, data is distributed evenly.
- If the key is not null, then messages with that key always go to the same partition due to hashing.
- A key is usually included if message sequencing is required for a certain field, like car_id.
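The car example above can be sketched in a few lines. Note the hash used here (crc32) is just a stand-in for Kafka’s real murmur2 algorithm, and the car IDs are invented for illustration:

```python
# Illustrative sketch: messages with the same key always land in the same
# partition, so per-key ordering is preserved. crc32 is a stand-in for
# Kafka's actual murmur2 hash.
import zlib

NUM_PARTITIONS = 2

def partition_for(key: str) -> int:
    """Deterministically map a key to one of the topic's partitions."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS

for car_id in ["123", "234", "345", "456"]:
    # Every message keyed by this car_id goes to the same partition,
    # so that car's positions can be read back in order.
    print(car_id, "->", partition_for(car_id))
```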
Message Anatomy
This is what a Kafka message looks like when created by the producer.
It has a key, which might be null, and is in binary format. The message’s value, or content, can also be null but usually isn’t; it holds your message’s payload. You can compress your messages to make them smaller using codecs like gzip, snappy, lz4, or zstd. Messages can also include optional headers (a list of key-value pairs), and a timestamp, set by either the system or the user, is added. This forms a Kafka message, which is then stored in Apache Kafka.
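These fields can be sketched as a simple data structure. The field names here are illustrative, not the client library’s actual API:

```python
# A rough sketch of the parts of a Kafka producer message. Field names are
# made up for illustration; the real record type lives in the client library.
from dataclasses import dataclass, field
import time

@dataclass
class KafkaMessage:
    key: "bytes | None"                       # optional, binary after serialization
    value: "bytes | None"                     # the payload, usually not null
    compression: str = "none"                 # e.g. "gzip", "snappy", "lz4", "zstd"
    headers: list = field(default_factory=list)  # optional key-value pairs
    timestamp_ms: int = field(default_factory=lambda: int(time.time() * 1000))

msg = KafkaMessage(key=b"123", value=b"Hello world",
                   headers=[("trace-id", b"abc")])
```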
Kafka Message Serializer
Kafka accepts only byte sequences from producers and sends byte sequences to consumers.
However, our messages aren’t initially in bytes. So, we perform message serialization, converting your data or objects into bytes. It’s a straightforward process, and I’ll demonstrate it now.
Serializers are used for both the message’s key and value. For instance, consider a key object, like a car ID — let’s say 123. The value might simply be a text like “Hello world”. These aren’t bytes yet but objects in our programming language. We’ll designate an integer Serializer for the key. The Kafka producer is then able to convert this key object, 123, into a byte sequence, creating a binary representation of the key.
For the value, we’ll specify a string Serializer. The Serializers for the key and value are separate, so the producer can convert the text “Hello world” into a byte sequence for the value independently of the key.
With both key and value in binary form, the message is ready to be sent to Apache Kafka. Kafka producers include common Serializers like string (including JSON), integer, float, Avro, Protobuf, and more, aiding in this data transformation. There are many message Serializers available.
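The serialization step can be mimicked in a few lines. This is a hedged sketch of what the integer and string serializers do to the key 123 and the value “Hello world”; the real serializers live in the Kafka client library:

```python
# Sketch of key/value serialization: turn language-level objects into bytes.
def serialize_int(key: int) -> bytes:
    # Kafka's IntegerSerializer writes a 4-byte big-endian integer.
    return key.to_bytes(4, byteorder="big")

def serialize_string(value: str) -> bytes:
    # Kafka's StringSerializer encodes the text as UTF-8.
    return value.encode("utf-8")

key_bytes = serialize_int(123)                 # b'\x00\x00\x00{'
value_bytes = serialize_string("Hello world")  # b'Hello world'
```

With both in binary form, the message is ready to be handed to the broker.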
Key Hashing
In Kafka, there’s a component called a partitioner. It’s like a set of rules that decides which partition a message goes to. When we send a message, the partitioner looks at it, picks a parking space, says “space one”, and then the message is sent there in Apache Kafka.
The way a message key is matched to a parking space involves key hashing. In Kafka’s default system, they use a method called the murmur2 algorithm:
targetPartition = Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions
Imagine this algorithm as a formula that looks at the key, processes it, and then decides which parking space it should go to.
The important thing to understand is that the producers, or the ones sending the messages, use the key to decide where the message will end up. They do this by hashing the key. That’s the main idea.
This explanation is for those interested in more advanced details. The main thing to remember is how messages are directed to their places in Kafka.
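For the curious, the formula above can be reproduced in pure Python. This is a sketch of the murmur2 hash and the default partitioning rule as implemented in Kafka’s Java client, written here for illustration rather than production use:

```python
# Pure-Python port of Kafka's default key-to-partition logic: murmur2-hash
# the key bytes, force the result positive, take it modulo numPartitions.
def murmur2(data: bytes) -> int:
    """32-bit murmur2 hash, mirroring Kafka's Java Utils.murmur2."""
    m, r = 0x5BD1E995, 24
    h = 0x9747B28C ^ len(data)
    i = 0
    while len(data) - i >= 4:
        # Mix four bytes at a time into the hash (little-endian).
        k = data[i] | (data[i+1] << 8) | (data[i+2] << 16) | (data[i+3] << 24)
        k = (k * m) & 0xFFFFFFFF
        k ^= k >> r
        k = (k * m) & 0xFFFFFFFF
        h = ((h * m) & 0xFFFFFFFF) ^ k
        i += 4
    # Mix in the remaining 1-3 bytes.
    rem = len(data) - i
    if rem >= 3:
        h ^= data[i+2] << 16
    if rem >= 2:
        h ^= data[i+1] << 8
    if rem >= 1:
        h = ((h ^ data[i]) * m) & 0xFFFFFFFF
    # Final avalanche.
    h ^= h >> 13
    h = (h * m) & 0xFFFFFFFF
    h ^= h >> 15
    return h

def target_partition(key_bytes: bytes, num_partitions: int) -> int:
    # Same shape as the Java client's formula:
    # toPositive(murmur2(keyBytes)) % numPartitions
    return (murmur2(key_bytes) & 0x7FFFFFFF) % num_partitions
```

Running `target_partition(b"123", 3)` will always return the same partition for that key, which is exactly the ordering guarantee discussed above.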
In the next article, #4, we’ll take a deep dive into how Consumers work in Apache Kafka.
See you in the next part of the guide!
Paul Ravvich
Thank you for reading until the end. Before you go:
- Please consider clapping and following the writer! 👏
- Follow us on Twitter(X), LinkedIn