Resolving the “The coordinator is not aware of this member” Error in Kafka Consumers

Joao Furtado · Published in Mercafacil · Jul 25, 2023

At Mercafacil, most of our microservices rely on Kafka to deliver asynchronous events to one another. Recently we ran into the “The coordinator is not aware of this member” Kafka consumer error, which led to messages being processed repeatedly in a loop. Such behavior not only strains resources but can also cause detrimental data inconsistencies, since some of our events trigger actions like sending messages to customers or crediting cashback to customers’ wallets. In this article, we detail the issue, what caused it, and our approach to solving it.

TLDR

For organizations that rely on each Kafka message being processed exactly once, it’s imperative to:

  • Monitor system logs continuously for the “The coordinator is not aware of this member” error and create automated alerts around it.
  • Ensure that session.timeout.ms is large enough to fit the slowest message-processing time, and that heartbeat.interval.ms is adjusted accordingly (to at most 1/3 of the session timeout).

Issue Breakdown

The aforementioned error was primarily triggered by delayed message processing in one of our NestJS microservices, which heavily relied on an external API that was subject to high latency from time to time. This disruption led Kafka to redeliver a batch of about 20 messages in a perpetual loop, during which the Kafka consumer was constantly disconnected by the error and forced to reconnect.

Technical Solution

To address the recurring processing of messages and the consequent error, we adopted a two-pronged approach:

  1. Ensuring Message Uniqueness: By leveraging UUID-based deduplication, we ensured that each message was processed only once, even if Kafka delivered it multiple times.
  2. Kafka Configuration Refinement: To address the root of the error, we fine-tuned specific Kafka consumer configurations, thus optimizing the message acknowledgment mechanism and preventing the error in the first place.

For a deeper understanding of these Kafka configuration parameters, one can refer to the official Kafka documentation.

UUID-based Deduplication

The deduplication mechanism was implemented using a key-value database, with logic along these lines:

// Look up the message's UUID in the key-value store before processing.
const msgUUID = kafkaMessage.uuid;
if (!(await keyValDB.exists(msgUUID))) {
  await processMessage(kafkaMessage);
  // Record the UUID only after processing has succeeded.
  await keyValDB.store(msgUUID);
}
/*
The message is also acked if it was already processed, preventing
the reprocess loop
*/
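
One caveat: the check above is not atomic, so two consumer instances could both see the UUID as missing and process the same message twice. A minimal sketch of an atomic variant, assuming Redis via the ioredis client (the key prefix, retention period, and processOnce helper are illustrative, not our production code):

import Redis from 'ioredis';

declare function processMessage(msg: { uuid: string }): Promise<void>;

const redis = new Redis('redis://localhost:6379');

async function processOnce(kafkaMessage: { uuid: string }): Promise<void> {
  // SET ... NX succeeds only if the key does not exist yet, so the
  // existence check and the store happen as a single atomic step.
  const firstSeen = await redis.set(
    `processed:${kafkaMessage.uuid}`, // illustrative key prefix
    '1',
    'EX', 7 * 24 * 3600, // illustrative 7-day retention for processed UUIDs
    'NX'
  );
  if (firstSeen === 'OK') {
    await processMessage(kafkaMessage);
  }
  // Already-processed messages fall through and are simply acked.
}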

Kafka Configuration Refinement

To prevent the original error from happening, we needed to ensure that the session.timeout.ms configuration is longer than the processing time of the slowest message the microservice handles. Following the recommended configuration, heartbeat.interval.ms then needs to be adjusted to at most 1/3 of the new session timeout value. These parameters are described in the official documentation as:

heartbeat.interval.ms: The expected time between heartbeats to the consumer coordinator when using Kafka’s group management facilities. Heartbeats are used to ensure that the consumer’s session stays active and to facilitate rebalancing when new consumers join or leave the group. The value must be set lower than session.timeout.ms, but typically should be set no higher than 1/3 of that value. It can be adjusted even lower to control the expected time for normal rebalances.

session.timeout.ms: The timeout used to detect client failures when using Kafka’s group management facility. The client sends periodic heartbeats to indicate its liveness to the broker. If no heartbeats are received by the broker before the expiration of this session timeout, then the broker will remove this client from the group and initiate a rebalance.

The Kafka consumer was created with something similar to:

import { KafkaClient } from 'nestjs-kafka';

const kafkaConfig = new KafkaClient({
  clientId: 'client-id',
  brokers: ['broker:9092'],
  consumer: {
    groupId: 'group-id',
    sessionTimeout: 90000, // large enough to fit any message being processed
    heartbeatInterval: 30000, // 1/3 of the session timeout
  },
});
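
With these values, the broker tolerates up to 90 seconds without a heartbeat before evicting the consumer and triggering a rebalance, while the client sends heartbeats every 30 seconds, giving it three chances to check in within each session window.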

Alternative Solutions

Another strategy involves manually invoking the heartbeat while the message is being processed. Although this ensures that the Kafka coordinator is constantly updated about the consumer’s status, it can be complex to implement since your code can block in unpredictable ways.
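
To illustrate, here is a hedged sketch of that approach using kafkajs directly (the client the NestJS Kafka transport is built on); the topic name and callSlowExternalApi are illustrative assumptions:

import { Kafka } from 'kafkajs';

declare function callSlowExternalApi(value: Buffer | null): Promise<void>;

const kafka = new Kafka({ clientId: 'client-id', brokers: ['broker:9092'] });
const consumer = kafka.consumer({ groupId: 'group-id', sessionTimeout: 90000 });

async function runWithManualHeartbeats(): Promise<void> {
  await consumer.connect();
  await consumer.subscribe({ topics: ['events'] }); // illustrative topic
  await consumer.run({
    eachMessage: async ({ message, heartbeat }) => {
      // Tell the coordinator we are alive before the slow step...
      await heartbeat();
      await callSlowExternalApi(message.value);
      // ...and again afterwards, in case the call took a long time.
      await heartbeat();
    },
  });
}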

A Warning on Kafka Consumer’s Prefetch Algorithm

While the above solutions address the specific error, it’s important to stay vigilant about the Kafka consumer’s prefetch algorithm. In certain configurations, this algorithm can fetch multiple messages simultaneously; if not handled carefully, this could lead to inadvertent parallel processing and consequent message duplication.
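
One way to keep prefetched messages from being processed in parallel, sketched here with the kafkajs batch API and reusing the consumer from the previous sketch (handleMessage is an illustrative placeholder), is to walk each batch strictly one message at a time:

declare function handleMessage(msg: { offset: string; value: Buffer | null }): Promise<void>;

async function runBatchesSequentially(): Promise<void> {
  await consumer.run({
    eachBatch: async ({ batch, resolveOffset, heartbeat }) => {
      for (const message of batch.messages) {
        await handleMessage(message); // strictly sequential: no Promise.all
        resolveOffset(message.offset); // mark as consumed before moving on
        await heartbeat(); // keep the session alive between messages
      }
    },
  });
}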

Some references about this issue can be found in kafkajs/issues/1325 and Apache Mailing List Thread.


Joao Furtado
Mercafacil

Passionate developer, enthusiast of product management, and dedicated manager of people.