Sleeping Well at Night: Kafka Configuration Tweaks

Avshalom Orenstein
BigPanda Engineering
4 min read · Nov 3, 2021

My name is Avshalom and I am a Senior Backend Engineer at BigPanda.
I also lead BigPanda’s Scala Guild.

BigPanda’s pipeline handles millions of events per second using a microservices architecture that depends heavily on Kafka. We use Kafka as an event streaming platform that lets our microservices “talk” to each other.

In this article I would like to share a few tips about Kafka configs.

Important Concepts

lag = the difference between the offset of the last produced message and the consumer’s last committed offset.

rebalance = the process in which Kafka reassigns partitions among the consumers in a group. It is triggered, for example, when one or more consumers are considered dead.

heartbeat = a sign of liveness. Kafka expects each consumer to send a heartbeat within a defined period of time, or else it considers the consumer dead.

poll = the method a Kafka consumer calls to retrieve records from the topics it is subscribed to.
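
To make the lag and poll definitions concrete, here is a minimal sketch in Python. It uses a toy in-memory partition rather than a real Kafka client, so the class name, its methods and the record counts are all made up for illustration.

```python
class InMemoryPartition:
    """Toy stand-in for one topic partition; this is not a real Kafka client."""

    def __init__(self, records):
        self.records = list(records)
        self.committed = 0  # offset of the next record to consume

    def poll(self, max_records):
        """Return up to max_records records, starting at the committed offset."""
        return self.records[self.committed:self.committed + max_records]

    def commit(self, count):
        """Advance the committed offset once a batch has been processed."""
        self.committed += count

    def lag(self):
        """Last produced offset minus last committed offset."""
        return len(self.records) - self.committed


partition = InMemoryPartition(range(250))  # hypothetical backlog of 250 records
lag_before = partition.lag()

polls = 0
while partition.lag() > 0:
    batch = partition.poll(max_records=100)
    # Real processing would happen here, between two consecutive polls.
    partition.commit(len(batch))
    polls += 1

print(lag_before, polls, partition.lag())
```

In this toy setup the consumer only catches up when it polls and commits faster than new records arrive; lag grows whenever it does not.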

The Interesting Stuff

It was a sunny day. I had a great day at work and got done what I wanted to do. I also sat and laughed on the balcony with my teammates… What could possibly go wrong on such a beautiful day? Apparently, my night’s sleep :((

I went to sleep around 10:30 PM with my phone next to me… I had only good dreams until I got a call at 1:00 AM from PagerDuty, telling me “Something’s broken, it’s your fault, are you gonna fix it?”

I got up and saw that one of our services was lagging for most of the orgs. I restarted the service, monitored it for a few minutes, saw that everything was okay, and decided to check again in the morning. But for the rest of the night, the same thing happened every three hours. It happened again the night after. So we decided to apply a temporary band-aid: auto-remediation that restarts the service whenever this scenario repeats itself.

After a few weeks it got worse. We had to auto-remediate almost every day (5–6 times a day). We decided that we had to investigate the issue and find a real solution.

We allotted 3 days for investigating and implementing a solution, and I took the mission upon myself. As always when I investigate, I paid the most attention to the logs and the monitors. I took a few cases and looked for what they had in common. I found two main problems, both Kafka related.

The first problem was that we were entering a loop of recurring rebalances that continued until we restarted the service. Going deeper, we found out that once we entered a rebalance we couldn’t get out of it, because our heartbeats kept failing. We kept receiving the error message:
Attempt to heartbeat failed for since member id *** is not valid.

The second problem was that when we had data spikes or bigger events, we started lagging.

The solution to the first problem was to change one consumer config that was causing rebalances and to add one consumer config that controls the heartbeat. We changed max-poll-interval-ms (max.poll.interval.ms in Kafka’s own naming) from 2 seconds to 20 seconds. This config sets the maximum time allowed between two consecutive calls to poll, that is, the longest a batch may take to process, before the consumer is declared dead. We added the session.timeout.ms config with a value of 30 seconds. This config sets the maximum time the broker waits for a heartbeat from the consumer before considering it dead.
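
As a rough sketch of why the first change matters, the two settings can be written as a plain mapping and checked against a batch processing time. The keys use Kafka’s dotted property names, the helper function is hypothetical, and the 5-second batch is an invented figure, not a measurement from our service.

```python
# Settings before and after the change, keyed by Kafka's consumer property names.
old_config = {"max.poll.interval.ms": 2_000}
new_config = {
    "max.poll.interval.ms": 20_000,  # raised from 2 seconds
    "session.timeout.ms": 30_000,    # newly added
}


def exceeds_poll_interval(batch_processing_ms, config):
    """True if processing one batch takes longer than the allowed gap between polls."""
    return batch_processing_ms > config["max.poll.interval.ms"]


slow_batch_ms = 5_000  # hypothetical: a batch that takes 5 seconds to process

dead_with_old = exceeds_poll_interval(slow_batch_ms, old_config)
dead_with_new = exceeds_poll_interval(slow_batch_ms, new_config)
print(dead_with_old, dead_with_new)
```

Under these assumptions, a single slow batch is enough for the consumer to be declared dead with the old value, which triggers a rebalance, while the new value leaves room for it.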

The solution to the second problem was to change two consumer configs that affect the amount and size of the data we consume. We changed max-poll-records (max.poll.records in Kafka’s own naming) from 100 to 1000. This config controls the maximum number of records the consumer can receive in a single poll. We also changed max-partition-fetch-bytes (max.partition.fetch.bytes) from 1MB to 10MB. This config controls the maximum number of bytes the server will return per partition.

Those four configurations finally did the job! Five months later, those problems no longer interfere with our sleep.
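
A back-of-the-envelope sketch shows the effect of the larger poll size on a backlog. The backlog size is an invented number, the helper is hypothetical, and the byte value assumes that "10MB" means binary megabytes.

```python
import math

# Settings after the change, keyed by Kafka's consumer property names.
lag_related_config = {
    "max.poll.records": 1000,                       # raised from 100
    "max.partition.fetch.bytes": 10 * 1024 * 1024,  # raised from 1 MB (assumed binary MB)
}


def polls_to_drain(backlog_records, max_poll_records):
    """Minimum number of polls needed to consume a backlog of the given size."""
    return math.ceil(backlog_records / max_poll_records)


backlog = 50_000  # hypothetical backlog after a data spike

polls_before = polls_to_drain(backlog, 100)
polls_after = polls_to_drain(backlog, lag_related_config["max.poll.records"])
print(polls_before, polls_after)
```

Larger batches also take longer to process between two polls, so the maximum records per poll and the maximum poll interval are best tuned together.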

Takeaways

  • When you have rebalance or lag problems, always suspect your Kafka configurations
  • Tuning max-poll-records and max-partition-fetch-bytes can help with lag issues
  • Tuning max-poll-interval-ms and session.timeout.ms can help with rebalance issues
  • Always tune your configurations based on your use case and test them well before applying them to production

So, get to know the consumer and producer configurations, unless you would rather stay awake at night :)

Avshalom Orenstein
BigPanda Engineering

Staff Engineer/Tech Lead, Specializing in Scala, Kafka and Akka