Apache Kafka Guide #2: Topics, Partitions and Offsets
Hi, this is Paul, and welcome to the second part of my Apache Kafka guide. Today we’re gonna talk about how to work with topics, partitions, and offsets in Apache Kafka.
Kafka Topics
Kafka topics represent specific data streams within a Kafka cluster. A cluster can contain many topics, and you can name them almost anything: logs, purchases, twitter_tweets, cars_gps, and so on. If you compare it to databases, a Kafka topic is somewhat like a database table but without the usual constraints: Kafka itself does not verify the data you send to a topic. This concept will be elaborated on later. Because the payload is not checked, a topic can carry any message format, including JSON, Avro, plain text, binary, and more. The ordered sequence of messages in a topic forms a data stream, which is why Kafka is known as a data streaming platform. Unlike database tables, topics cannot be queried; data is added to a topic by Kafka Producers and read from it by Kafka Consumers.
Topics: stream of data
- Like a table in a DB (without constraints)
- No limit on the number of topics
- A topic is identified by its name
- Any message format
- The sequence of messages is called a data stream
- You can’t query a topic; instead, use Producers to send data and Consumers to read it
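To make the "any message format, no verification" point concrete, here is a minimal sketch in plain Python. It is only a toy model: the `topic` list and the `produce` and `consume` helpers are names I made up for illustration, not the Kafka client API. The idea it shows is that the stream just carries bytes in order, and it is up to the reader to know how to decode them.

```python
import json

# Toy stand-in for a topic: an ordered list of raw byte payloads.
# Hypothetical helper names; this is NOT the Kafka client API.
topic = []


def produce(payload: bytes) -> None:
    """Append a payload to the end of the stream; nothing is validated."""
    topic.append(payload)


def consume():
    """Yield payloads in the order they were written."""
    for payload in topic:
        yield payload


# Three different formats go into the same stream without complaint.
produce(json.dumps({"car_id": 7, "lat": 52.52, "lon": 13.40}).encode("utf-8"))
produce(b"plain text log line")
produce(bytes([0x00, 0xFF, 0x10]))  # arbitrary binary data

received = list(consume())

# The reader has to know that the first payload is JSON in order to decode it.
first = json.loads(received[0].decode("utf-8"))
```

Notice there is no `query` function here: the only ways in and out are writing at the end and reading in order.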
Partitions and offsets
Topics in Kafka are broad categories that can be split into smaller sections called partitions. For instance, a single topic might contain 100 partitions. In this example, we’re focusing on a Kafka topic that has three partitions: partitions zero, one, and two.
Messages sent to a Kafka topic are distributed across these partitions. Each message in a partition receives a unique identifier, called ID, which starts at 0 and increments by one with each new message. So, in partition zero, the first few messages have IDs 0, 1, 2, and so on. As more messages are added, this ID continues to increase.
This incrementing ID is known as the Kafka partition offset.
Throughout this guide, you’ll often hear me mention ‘offsets.’ Each partition maintains its own set of offsets.
It’s important to note that data in Kafka topics is immutable. This means that once a message is written to a partition, it cannot be altered. You can’t update or delete individual messages in Kafka; instead, you keep appending new ones to the partitions.
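If it helps, you can picture a partition as an append-only list. The sketch below is a toy model under that assumption; the `Partition` class is hypothetical and does not come from any Kafka library. It shows two things from this section: each partition hands out its own offsets starting at 0, and there is simply no method for updating or deleting a record.

```python
class Partition:
    """Toy append-only log; hypothetical class, not part of any Kafka library."""

    def __init__(self):
        self._records = []

    def append(self, value):
        """Store value at the end and return the offset it was given."""
        offset = len(self._records)  # the first record gets offset 0
        self._records.append(value)
        return offset

    def read(self, offset):
        """Return the record stored at the given offset."""
        return self._records[offset]

    # Deliberately no update() or delete(): written records stay as they are.


# A topic with three partitions: 0, 1 and 2.
partitions = [Partition() for _ in range(3)]

offsets_p0 = [partitions[0].append(v) for v in ("a", "b", "c")]
offsets_p1 = [partitions[1].append(v) for v in ("x", "y")]

# The same offset number points at different messages in different partitions.
message_p0 = partitions[0].read(1)
message_p1 = partitions[1].read(1)
```

Here `offsets_p0` and `offsets_p1` both start from 0, because every partition counts independently.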
Example
Imagine you manage a group of cars, each equipped with a GPS. This GPS regularly updates its location to Kafka. Every car sends its position to Kafka approximately every 20 seconds. Each update includes details like the car’s ID and its exact location, given as latitude and longitude.
So, we have several cars acting as data sources. They feed information into a Kafka ‘topic’ (a kind of data channel) named ‘cars_gps’, which stores all the cars’ locations.
We’ve set up this ‘cars_gps’ topic to have multiple sections, known as partitions — in this case, 10 of them. The number of partitions is chosen based on specific needs, which I’ll explain later in the guide.
Once Kafka has this topic ready, we can start using the data. For instance, we might have a dashboard that shows where each car is in real time. Or, we might use the same data for a notification system. This system could alert customers when their delivery is almost there.
The great thing about Kafka is that it lets different services use the same data stream simultaneously.
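Here is a rough sketch of why two services can share one stream. It is a toy model, not real consumer code: the `Reader` class, the sample coordinates, and the `max_records` parameter are all made up for illustration, and I'm assuming each service remembers its own read position. Reading does not remove anything, so the dashboard and the notification system never get in each other's way.

```python
# Toy model: one partition of the cars_gps topic as a plain list.
# The records are invented sample data.
cars_gps_partition = [
    {"car_id": "car-1", "lat": 40.7128, "lon": -74.0060},
    {"car_id": "car-2", "lat": 34.0522, "lon": -118.2437},
    {"car_id": "car-1", "lat": 40.7130, "lon": -74.0055},
]


class Reader:
    """Hypothetical reader that remembers the next offset it wants to read."""

    def __init__(self, name):
        self.name = name
        self.next_offset = 0

    def poll(self, log, max_records):
        """Return up to max_records starting at next_offset, then advance."""
        batch = log[self.next_offset:self.next_offset + max_records]
        self.next_offset += len(batch)
        return batch


dashboard = Reader("dashboard")
notifications = Reader("notifications")

# The dashboard catches up on everything; notifications read more slowly.
dashboard_batch = dashboard.poll(cars_gps_partition, max_records=10)
notification_batch = notifications.poll(cars_gps_partition, max_records=1)
```

After these two calls the readers sit at different positions, and the partition itself still holds all three records.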
Conclusion
Okay, so let’s summarize some key points about topics, partitions, and offsets in Kafka.
- Firstly, once data is written to a partition, it cannot be changed. This principle is known as immutability and is crucial to understand. Kafka stores data for a limited period, one week by default, but this retention period is configurable. After this period, the data is deleted.
- Offsets are only meaningful within their own partition. For instance, offset three in partition zero and offset three in partition one refer to different messages. Offsets keep increasing and are never reused, even if previous messages have been deleted.
- It’s essential to note that message order is guaranteed only within a partition, not across multiple partitions. We’ll explore strategies for keeping related messages in order later in the guide.
- When you send data to a Kafka topic without a key, the producer picks the partition for you, so the message can land in any of them, like zero, one, or two. When a key is specified, messages with the same key go to the same partition. Topics in Kafka can have numerous partitions, varying from a few to hundreds, and we’ll discuss how to choose the right number for our topic.
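To illustrate what specifying a key changes, here is a small sketch of key-based partition selection. Treat it as an approximation: `choose_partition` is a hypothetical helper, and I'm using CRC32 from Python's standard library as a stand-in hash, which is not the hash function Kafka's own partitioner uses. The principle is what matters: the same key always maps to the same partition, so the messages for one car stay in order.

```python
import zlib

NUM_PARTITIONS = 3  # assumed partition count for this sketch


def choose_partition(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition number.

    CRC32 is a stand-in hash chosen for this sketch; Kafka's own
    partitioner uses a different hash function.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Toy partitions: partition number -> list of (key, value) in arrival order.
partitions = {p: [] for p in range(NUM_PARTITIONS)}

events = [
    ("car-1", "pos-A"),
    ("car-2", "pos-B"),
    ("car-1", "pos-C"),
    ("car-1", "pos-D"),
]
for key, value in events:
    partitions[choose_partition(key)].append((key, value))

# Every car-1 message is in one partition, in the order it was sent.
car1_partition = choose_partition("car-1")
car1_values = [v for k, v in partitions[car1_partition] if k == "car-1"]
```

Which partition number a given key ends up in depends on the hash, so the sketch does not rely on it; it only relies on the mapping being stable.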
In summary, we’ve covered Kafka topics, partitions, and offsets, and touched on some specific Kafka features. I hope this information was helpful, and I look forward to our next article.
See you in the next part of the guide!
Paul Ravvich