Head First Kafka: The basics of consuming data from Kafka explained using a conversation. Part 1

Tim van Baarsen
5 min read · Aug 27, 2019


Last time in the fireside chat, Bill & Jay discussed ‘The basics of producing data to Kafka’. This blog is inspired by the ‘fireside chats’ sections in the Head First books, and this time we will talk you through the basics of consuming data from an Apache Kafka topic.

In today’s Fireside Chat:

Bill 🤓 = Enthusiastic junior software developer who has built some applications and has just heard about Kafka.

Jay 😎 = Senior software developer who knows his way around Apache Kafka and has been running applications leveraging Kafka in production for years.

🤓: Hi Jay, nice to see you again. I’m curious to learn how I can consume data from the ‘stocks’ Kafka topic we produced data to last time.
😎: It’s a pleasure to talk again, Bill. Indeed, last time we produced stock price updates to our ‘stocks’ topic. Let’s take a quick look again.

Last time we were producing price updates on stocks to a Kafka topic. A topic is a distributed append-only log, and every record produced to it gets a unique ‘index number’. We call this index number an offset. For every new record, the offset is incremented.

Stock price updates published to a Kafka topic named ‘stocks’. Every record has a unique offset in the partition.

🤓: Thanks for the recap. What do I need to do to start consuming data from our topic?
😎: It’s pretty easy: you can use the Kafka command-line tools or a Kafka client library for your favorite programming language. In the latter case, you need to write some code and add some configuration (e.g. the hostname or IP address of your Kafka broker, the topic name, etc.) to start consuming records from the topic.
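To make this concrete, here is a minimal sketch of a Java consumer for our ‘stocks’ topic using the Apache Kafka client library. The broker address localhost:9092 and the group id are assumptions for illustration:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class StocksConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Hostname/port of a Kafka broker to bootstrap from (assumption: local broker).
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            // Consumers sharing this group id also share the committed offsets.
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "stocks-consumer");
            // Our example records have String keys and values.
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("stocks"));
                while (true) {
                    // Fetch the next batch of records, waiting up to 500 ms.
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("offset=%d key=%s value=%s%n",
                                record.offset(), record.key(), record.value());
                    }
                }
            }
        }
    }

The command-line alternative is the console consumer tool that ships with Kafka: kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic stocks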

🤓: Ok, so every single record that has been produced by the producer will be read by my consumer?
😎: No, by default a new consumer will start reading from the latest produced record on the topic. If a new consumer should start reading from the earliest available record instead, you have to configure this explicitly.

From the Apache Kafka documentation: “What to do when there is no initial offset in Kafka or if the current offset does not exist any more on the server.”

auto.offset.reset: earliest, latest or none (default: latest)
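In the Java client from the sketch above, this is a single extra consumer property, for example:

    // Start from the earliest available record when no committed offset exists yet.
    // The default is "latest"; "none" makes the consumer throw an exception instead.
    props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");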

Stock price updates produced by a producer and consumed by one consumer.

🤓: Ok great. So when I consume a record from the topic, it will be removed from the topic, right? Similar to a ‘traditional’ message queue?
😎: No, remember: a Kafka topic is a distributed append-only log. Consuming a record doesn’t remove it; the record will stay on the topic after your consumer has read it.

🤓: How about this scenario: at some point in time my application has consumed many price updates from the ‘stocks’ topic and I restart my application. Does this mean that I have to consume all the records again because the records are still there?
😎: Good question! No, you will not receive the already consumed records again. Your consumer will start receiving messages exactly where it left off. An important thing to remember here is that your consumer knows exactly which records it has read. It keeps track of this with its consumer offset, which points to the last consumed record in the Kafka partition. Once in a while (or after consuming each record, depending on your configuration), your consumer commits this offset back to Kafka. Once your application is running again, it reads the consumer offset back from Kafka and continues consuming!
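A sketch of how that could look with the Java client, building on the consumer from earlier (handlePriceUpdate is a hypothetical processing method; by default the client auto-commits every 5 seconds):

    // Take control of committing instead of the default auto-commit.
    props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
            handlePriceUpdate(record); // hypothetical business logic
        }
        // Commit the offsets of the records returned by the last poll() back to Kafka,
        // so that after a restart the consumer resumes exactly where it left off.
        consumer.commitSync();
    }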

Let’s imagine we start consuming from the earliest record:

The consumer starts consuming from the earliest record.

After consuming more records, the consumer offset (think of it as a pointer) has moved forward. You can compare this to reading a book: last night you finished page 8, so the next day you start reading from page 9 onwards.

More records consumed. Consumer offset: 8

🤓: Is it possible for a completely different application to consume the same data from the topic?
😎: Yes, that’s possible: the other application can use the same data from the topic for a different purpose. That other consumer will maintain its own offset.

Two different consumers. Both maintaining their own offset.

Both individual consumers will keep consuming from the topic.
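In the Java client, ‘maintaining its own offset’ boils down to each application using its own group.id; Kafka stores committed offsets per group id, topic, and partition. The group names and variable names below are made up for illustration:

    // Application 1: a dashboard showing the latest prices.
    propsDashboard.put(ConsumerConfig.GROUP_ID_CONFIG, "stocks-dashboard");

    // Application 2: an audit service reading the very same 'stocks' topic.
    propsAudit.put(ConsumerConfig.GROUP_ID_CONFIG, "stocks-audit");

    // Because committed offsets are stored per (group id, topic, partition),
    // committing an offset in one group never affects the other group.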

🤓: We already talked a couple of times about records that can only be appended to the log. Does that mean all the records in all topics will be stored in Kafka until the end of time?
😎: That’s possible if you need it, but by default records in a topic are available for consumption for 7 days.

Both the cluster and the topic have a configurable retention policy.
This policy can be either:

  • time-based: configured in milliseconds (cluster level & topic level), minutes (cluster level) or hours (cluster level)

Example: if the retention policy is set to 24 hours, a record is available for consumption for one day after it is published, after which it is discarded to free up space.

  • size-based: configured in bytes (cluster level & topic level).

Example: if the retention policy is set to 500 MB, then once the topic log reaches that size, the oldest records are deleted from the log. With this size-based configuration, it can be hard to predict how long a record will stay available for consumption!

For more information about the Kafka broker / cluster retention settings log.retention.hours, log.retention.minutes, log.retention.ms and log.retention.bytes, see the Apache Kafka broker configuration documentation.

For more information about the topic retention settings retention.ms and retention.bytes that can be configured on a topic, see the Apache Kafka topic configuration documentation.
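As a sketch, here is how you could set a 24-hour / 500 MB retention policy on our ‘stocks’ topic programmatically with the Java AdminClient (incrementalAlterConfigs requires Kafka 2.3 or newer; the broker address is again an assumption):

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    public class SetStocksRetention {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker

            try (AdminClient admin = AdminClient.create(props)) {
                ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "stocks");
                admin.incrementalAlterConfigs(Map.of(topic, List.of(
                        // 24 hours, expressed in milliseconds
                        new AlterConfigOp(new ConfigEntry("retention.ms", "86400000"), AlterConfigOp.OpType.SET),
                        // 500 MB, expressed in bytes
                        new AlterConfigOp(new ConfigEntry("retention.bytes", "524288000"), AlterConfigOp.OpType.SET)
                ))).all().get();
            }
        }
    }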

😎: Bill, that’s it for today. There is a lot more to talk about, so in part 2 of the series ‘consuming data from a Kafka topic’ we will take a look at and learn about:

  • how records are consumed from multiple partitions
  • how multiple consumers can work together in a consumer group to share the load
  • what happens when a consumer joins or leaves the consumer group
  • what stand-by consumers are and why they can be useful

🤓: Thanks Jay, I’m looking forward to part 2!
