Kafka — Why does Kafka use a pull-based message consumer?

While push-based message consumption seems state-of-the-art and pull looks more traditional, let's see why Kafka goes for the traditional approach instead.

Abhinav Kapoor
CodeX
6 min read · Aug 13, 2023



If you think Kafka’s real-time data streams and receiving events on an endpoint as they happen must be based on a design that pushes the events all the way through, that's not true. Kafka’s consumer mechanics are still based on pull; the abstractions that frameworks and client libraries provide over vanilla Kafka may just give that feeling. More on this at the end of this write-up.

If you are already familiar with Push-based and Pull-based message consumption, please feel free to skip the next 2 sections. Otherwise, here is a short intro.

Push-based message/data consumption

Push-based message/data consumption is a data consumption model where the data source/message broker actively delivers messages to consumers without the consumers explicitly requesting them. Consumers only have to express interest by subscribing.

Push-based delivery is associated with the following benefits:

  1. Real-time message delivery — Typically, the data is pushed to the consumers as soon as it's available.
  2. A simpler consumer with less resource consumption — Consumers don't need to worry about latency or implementing an efficient polling loop, as the broker handles delivery.

On the other hand, it has some challenges:

  1. Consumers must be able to keep up with the rate of messages being produced.
  2. Consumers must be available to receive messages.
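To make the contrast concrete, here is a minimal sketch of what push-based consumption looks like from the consumer's side. The `Broker` and `MessageListener` types are hypothetical, not a real API: the consumer registers a callback, and the broker decides when to invoke it.

```java
// Hypothetical push-style API: the consumer only registers interest;
// the broker decides when (and how fast) to deliver.
interface Broker {
    void subscribe(String topic, MessageListener listener);
}

interface MessageListener {
    void onMessage(String message); // invoked by the broker at its own pace
}

class OrderHandler {
    void attach(Broker broker) {
        // Once subscribed, the delivery rate is out of the consumer's hands:
        // if processing falls behind, messages still keep arriving.
        broker.subscribe("orders", message -> process(message));
    }

    void process(String message) { /* business logic */ }
}
```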

Pull-based message/data consumption

Pull-based message consumption is a data consumption model where the consumer polls (actively requests data from) the data source rather than having the data automatically pushed to it.

In this model, the consumer initiates the retrieval of data when it is ready to process it, and it can control the rate at which it consumes the data.
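A minimal sketch of the same consumer in a pull-based model, again with a hypothetical `DataSource` type rather than a real API:

```java
import java.util.List;

// Hypothetical pull-style API: the consumer asks for data when it is ready.
interface DataSource {
    List<String> fetch(int maxRecords); // returns whatever is available, possibly empty
}

class PollingConsumer {
    void run(DataSource source) {
        while (true) {
            // The consumer controls both the timing and the batch size,
            // so it can never be handed more than it asked for.
            for (String message : source.fetch(100)) {
                process(message);
            }
        }
    }

    void process(String message) { /* business logic */ }
}
```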

Let's now return to the primary topic of this article: how does this traditional polling help Kafka?

Why does Kafka use a pull-based message consumer model?

Let's first refresh the fundamentals of Kafka.

Kafka is not a queue; it's a log — a log of messages/events. Producers append to the log and consumers read from the log.

Kafka is a log where events stay until the retention period expires.

The log is immutable, but it can't store an infinite amount of data, so there is a configured time to live for the records in the log. In a queue, by contrast, messages are deleted once they are picked up and processed by a consumer.
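For illustration, here is how a topic's retention could be configured with Kafka's Java AdminClient. The broker address, topic name, and retention value are placeholders:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicWithRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, replication factor 1; records live for 7 days,
            // after which the broker deletes them regardless of consumption.
            NewTopic topic = new NewTopic("demo-topic", 3, (short) 1)
                    .configs(Map.of("retention.ms", "604800000"));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```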

Kafka Producers and Consumers. Highlighting poll loop.

Producers send a produce request with records to the log, and each record, as it arrives, is given a special number called an offset, which is the logical position of that record in the log. Consumers send a fetch request to read records, and they use offsets as bookmarks (placeholders) for their position in the log.
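This maps directly onto Kafka's Java consumer API. A minimal sketch (broker address, group id, and topic name are placeholders):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class OffsetDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("group.id", "demo-group");              // placeholder group
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("demo-topic"));
            // poll() issues fetch requests under the hood, starting from the
            // group's current offset.
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                // The offset is the record's logical position within its partition.
                System.out.printf("partition=%d offset=%d value=%s%n",
                                  record.partition(), record.offset(), record.value());
            }
        }
    }
}
```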

Now let’s see how the pull model helps Kafka.

1. Diverse consumers

The log semantics allow multiple autonomous consumer groups to read from the log at a pace feasible for each consumer group.

Kafka is deployed with consumers varying in processing capacity and requirements. For simplicity, the concept of partitions is not shown in this image.

As shown in the image above, the Pull-based consumer model allows diverse consumers with varying consumption rates.

A pull-based consumer can fall behind and catch up when it can (known as backpressure handling). In a push-based consumption model, because the broker controls the rate of message transfer, it can overwhelm the consumer when data is produced faster than it can be consumed.
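In practice, the autonomy comes from the `group.id` setting: each consumer group tracks its own offsets, so two groups can read the same topic completely independently. A minimal sketch with hypothetical group names:

```java
import java.util.Properties;

public class GroupConfigs {
    // Each group.id tracks its own committed offsets in the log, so the two
    // groups below consume the same topic entirely independently.
    static Properties configFor(String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("group.id", groupId);
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        return props;
    }

    public static void main(String[] args) {
        Properties alerting = configFor("alerting");   // fast consumer group
        Properties analytics = configFor("analytics"); // slow group, may lag and catch up
        // Each group would create its own KafkaConsumer from its Properties
        // and read the same topic at whatever pace it can sustain.
    }
}
```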

2. Dumb Pipes and Smart Endpoints Principle

As Kai Waehner states: “Intentionally, Kafka was built on the same principles as modern microservices using the ‘dumb pipes and smart endpoints’ principle. That’s why Kafka scales so well compared to traditional message brokers.”

https://www.kai-waehner.de/blog/2022/05/30/error-handling-via-dead-letter-queue-in-apache-kafka/

Filtering, error handling & batching happens in client applications instead of the broker.

Kafka’s pull-based consumption model provides consumers with finer control over message processing, error handling, acknowledgement, and offset management, as well as the possibility to re-consume data if need be.

Having such fine control also enables transactions, which Kafka provides when consuming from one Kafka topic and producing to another.
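A minimal sketch of that fine control, with auto-commit disabled so the application acknowledges offsets only after it has successfully processed a batch (group and topic names are placeholders):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ManualOffsets {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("group.id", "billing");                 // hypothetical group
        props.put("enable.auto.commit", "false");         // client owns offset management
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("invoices")); // placeholder topic
            // To re-consume older data, the client could rewind explicitly, e.g.
            // consumer.seek(new TopicPartition("invoices", 0), earlierOffset);
            while (true) {
                for (ConsumerRecord<String, String> record :
                         consumer.poll(Duration.ofSeconds(1))) {
                    process(record); // if this throws, the offset is not committed
                }
                consumer.commitSync(); // acknowledge only after successful processing
            }
        }
    }

    static void process(ConsumerRecord<String, String> record) { /* business logic */ }
}
```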

The other advantage of a pull-based system is that it lends itself to efficient batching of data sent to the consumer. A push-based system must choose to either send a request immediately or accumulate more data and then send it later without knowledge of whether the downstream consumer will be able to process it immediately.
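On the Kafka consumer, this batching is tunable by the client through fetch settings. A minimal sketch with illustrative values:

```java
import java.util.Properties;

public class BatchingConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Illustrative values: the consumer, not the broker, shapes its batches.
        props.put("fetch.min.bytes", "65536");   // let the broker accumulate ~64 KB...
        props.put("fetch.max.wait.ms", "500");   // ...but wait at most 500 ms
        props.put("max.poll.records", "500");    // cap on records returned per poll()
        // These would be merged with the usual settings (bootstrap.servers,
        // group.id, deserializers) before constructing a KafkaConsumer.
        System.out.println(props);
    }
}
```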

Note — Error handling in client applications is not a universal virtue of pull-based message consumption; it's how Kafka is designed. For example, in the case of invalid messages, the Kafka client application decides whether they should be sent to a dead letter queue. In contrast, AWS SQS, which is a very sophisticated pull-based message queue, can automatically send messages to a dead letter queue if consumers are unable to process them after a configured number of attempts.
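A sketch of what such client-side dead-letter routing could look like in Kafka; the `orders` and `orders.dlq` topic names are hypothetical, and the routing is entirely application code, not a broker feature:

```java
import java.time.Duration;
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ClientSideDlq {
    // The client application, not the broker, decides what counts as a poison
    // message and routes it to a dead-letter topic.
    static void consume(KafkaConsumer<String, String> consumer,
                        KafkaProducer<String, String> dlqProducer) {
        consumer.subscribe(Collections.singletonList("orders")); // hypothetical topic
        while (true) {
            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                try {
                    process(record);
                } catch (Exception e) {
                    // Kafka does not dead-letter automatically (unlike SQS's redrive
                    // policy); this routing is plain application logic.
                    dlqProducer.send(new ProducerRecord<>("orders.dlq",
                                                          record.key(), record.value()));
                }
            }
        }
    }

    static void process(ConsumerRecord<String, String> record) { /* may throw */ }
}
```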

Why does it appear that Kafka has a push-based message consumer?

As already stated, the Kafka consumer works by issuing “fetch” requests to the Kafka brokers. The consumer specifies its offset in the log with each request and receives back a chunk of the log beginning at that position.

The general deficiency of a naive pull-based system is that if the broker has no data the consumer may end up polling in a tight loop wasting resources while waiting for data to arrive. To avoid this, Kafka has parameters in its fetch request that allow the consumer request to block in a “long poll”, waiting until data is available or until time expires.
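A minimal sketch of that long poll from the client's side; the timeout value is illustrative:

```java
import java.time.Duration;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class LongPollLoop {
    static void run(KafkaConsumer<String, String> consumer) {
        while (true) {
            // poll() blocks for up to the given timeout when no data is available,
            // so an idle consumer does not spin in a tight loop. On the wire, the
            // fetch.max.wait.ms and fetch.min.bytes settings shape how long the
            // broker itself holds the underlying fetch request open.
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            records.forEach(record -> System.out.println(record.value()));
        }
    }
}
```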

But this low-level polling loop may not be visible to many applications, because the frameworks and libraries built over Kafka consumers abstract it away, giving the impression of push-style notification.

Kafka Connect, Kafka Streams, ksqlDB queries, and Spring Kafka are some examples of abstractions over the vanilla Kafka producer and Kafka consumer.

Kafka Streams simplifies application development by building on the Apache Kafka® producer and consumer APIs, and leveraging the native capabilities of Kafka to offer data parallelism, distributed coordination, fault tolerance, and operational simplicity. https://docs.confluent.io/platform/current/streams/architecture.html

If such abstractions are not available in the application's programming language, it's possible to build a custom one. In principle, a poll loop can read a message (or messages) from Kafka and publish it to an application delegate or endpoint (on a separate thread), creating a subscription-based push notification for the application.
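A minimal sketch of such a custom abstraction, assuming a single dedicated polling thread (KafkaConsumer instances are not thread-safe, so the consumer is used only on that thread):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.function.Consumer;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// A push-style facade over the pull-based consumer: a background thread
// runs the poll loop and hands each record's value to a callback.
public class PushFacade {
    public static Thread subscribe(KafkaConsumer<String, String> consumer,
                                   String topic,
                                   Consumer<String> onMessage) {
        Thread poller = new Thread(() -> {
            consumer.subscribe(Collections.singletonList(topic));
            while (!Thread.currentThread().isInterrupted()) {
                consumer.poll(Duration.ofMillis(500))
                        .forEach(record -> onMessage.accept(record.value()));
            }
        });
        poller.start();
        return poller;
    }
}

// Usage: PushFacade.subscribe(consumer, "orders", msg -> System.out.println(msg));
// The application now "receives" messages, but under the hood it is still a poll loop.
```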

Summary

Pull-based message consumption allows Kafka to stay simple and scale well.

Push/callback-style abstractions over the consumer are helpful so that developers can focus on delivering business value rather than worrying about low-level details. And these can be implemented if not already available for the programming language.

References and Further Study

  1. Kai Waehner, Error Handling via Dead Letter Queue in Apache Kafka: https://www.kai-waehner.de/blog/2022/05/30/error-handling-via-dead-letter-queue-in-apache-kafka/
  2. Confluent documentation, Kafka Streams Architecture: https://docs.confluent.io/platform/current/streams/architecture.html
