Lessons after running Kafka in production

Hardik Taneja
Sep 3, 2022 · 5 min read


After running self-hosted Kafka in production for more than 2 years, I have compiled a list of tips and pitfalls.

Background: We have used Kafka as our primary pub-sub to set up multiple pipelines, with close to a hundred topics and over 5000 pods (clients) connecting to it, producing and consuming over 1 crore (10 million) messages daily on one of the busiest topics.

1. What is Kafka?

<Skip if you are not a beginner>
Kafka is a message broker that helps two pieces of software communicate. Producers publish messages on “topics” that consumers listen to. A great resource for learning the basics of Kafka is TutorialsPoint; the important concepts to understand are topics, groups, offsets, partitions, and Kafka’s log-based design. I’ll explain Kafka groups here as they’re a prerequisite for a later section in this article.
There are two modes in which consumers can subscribe to a topic:

Scenario: You have 10 consumers listening to a single topic, say topic A.

  1. All consumers have different group ids: In this case, every single message on topic A will reach all 10 consumers.
  2. All consumers have the same group id: In this case, messages will be divided equally among the consumers (provided the topic has an appropriate number of partitions); see the sketch below.
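
Here is a minimal sketch of both modes using the kafka-python client (the broker address, topic, and group ids are placeholders):

```python
# pip install kafka-python
from kafka import KafkaConsumer

# Same group id: the 10 consumers share topic A's partitions, so each
# message is processed by exactly one of them. Run this script once per
# consumer to reproduce the scenario above.
consumer = KafkaConsumer(
    "topicA",
    bootstrap_servers="localhost:9092",  # placeholder broker address
    group_id="workers",                  # same id across all consumers
    auto_offset_reset="earliest",
)

# Different group ids: give each process a unique group_id instead
# (e.g. group_id="worker-1", "worker-2", ...) and every consumer
# receives every message published on the topic.

for message in consumer:
    print(message.partition, message.offset, message.value)
```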

2. Where Kafka shines

  • Kafka works great even with a huge volume of data, especially if the individual messages are small.
  • It ensures “at least once” delivery, meaning no messages get lost in transit (though a message may occasionally be delivered more than once).
  • Once you get Kafka running properly with ample RAM and disk, it will run for a really long time and is quite low-maintenance; we have had fewer than 5 Kafka-related incidents in over 2 years of production use.
  • It can handle a huge number of connections, with LinkedIn reporting over 10k connections per node.
  • Even our paltry single-node, self-hosted Kafka handles itself well with 5k connections and hundreds of topics, several of them with 1000 partitions each.

3. Where Kafka falls short

  • For self-hosted Kafka, the initial setup is definitely not easy. New versions come out frequently, many guides are outdated, and there’s a myriad of errors that will require you to deep-dive into the Kafka logs. See the “Common Pitfalls” section of this article for more info.
  • Big messages: If your message size is over 1 MB, configuration changes are needed, and even with the correct configuration I found Kafka struggling with sub-10 MB messages: throughput was quite low, and an unfeasible number of consumers was required to operate without lag (see the config sketch after this list).
  • * IMPORTANT * Uneven distribution of lag: We hit a weird scenario running an ML model. We were running close to 100 pods in GKE consuming from a Kafka topic and doing CPU-intensive work. At times, post-peak load, 30–40 of the hundred pods were at expected CPU usage levels while the remaining 60–70 sat close to 0% usage, even though Kafka lag was in the 1000s, meaning there were still ample messages to process. Our scaling was based on Kafka lag, so the workload wouldn’t scale down and shut off pods until the lag dropped to the required level. So the majority of the workers were sitting idle while there was still a lot of work to do. Digging further, we found that Kafka assigns partitions to consumers in a round-robin way, and a single consumer can hold more than one partition; the pods at 0% usage had simply been assigned fewer partitions than the pods at high usage. The solution to this problem is described in the “Choosing the number of partitions for a topic” section, and it’s what prompted me to write this article.
  • Speed: The speed at which consumers can consume messages, and how quickly you can clear lag, isn’t the fastest on the market. To prove this we ran the following test: on one of our critical topics, we replaced Kafka with GCP Pub/Sub and tracked the average time from the publisher publishing to the consumer consuming. This metric went down from 8–9 minutes to 50 seconds.
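
For reference, here is a sketch of the client-side settings that have to change for larger messages, using kafka-python (the topic name and sizes are illustrative; the broker’s message.max.bytes and the topic’s max.message.bytes must be raised to match):

```python
from kafka import KafkaProducer, KafkaConsumer

# Producer side: raise the maximum request size above the 1 MB default.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    max_request_size=10 * 1024 * 1024,  # 10 MB
)

# Consumer side: allow fetching larger batches from each partition.
consumer = KafkaConsumer(
    "big-messages",                      # hypothetical topic name
    bootstrap_servers="localhost:9092",
    max_partition_fetch_bytes=10 * 1024 * 1024,
)
```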

4. Choosing the number of partitions for a topic

The “Uneven distribution of lag” issue described in the section above can be partially avoided by choosing an appropriate number of partitions. It depends on the maximum number of consumers you plan to run, say N: the number of partitions should always be a multiple of N, to prevent uneven distribution of partitions among the consumers. One problem that will continue to happen is that if you run a setup that autoscales based on load, the number of consumers cannot always be in the correct proportion, and some consumers will hold extra partitions. Since Kafka has no way to redistribute messages among partitions, this problem can’t be fully solved within Kafka. Other pub-subs, like GCP Pub/Sub, do solve this problem completely.

If the number of partitions is too large, it can cause issues when the number of consumers is very small; a huge number of partitions also requires Kafka to create, open connections to, and write to more log files. This is the problem described in the “Common Pitfalls” section: the open-files limit needs to be set appropriately in this case.

Generally, if you google “how to decide the number of partitions”, you get answers like “if you want more parallelism, you should have more partitions”, which to a novice like me sounded like: if you want more scalability, pick an obnoxiously large number so that it isn’t a bottleneck later. But “more parallelism” actually means having at least as many partitions as consumers.

TLDR: Have at least as many partitions (P) as the maximum number of consumers (C), and the number of partitions should always be a multiple of the number of consumers: P = n × C, n = 1, 2, 3…
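
Applying the rule when creating a topic, sketched with kafka-python’s admin client (the topic name, consumer count, and replication factor are assumptions):

```python
from kafka.admin import KafkaAdminClient, NewTopic

MAX_CONSUMERS = 100             # C: the most consumers we plan to run
PARTITIONS = 2 * MAX_CONSUMERS  # P = n * C, here with n = 2

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([
    NewTopic(name="ml-jobs",    # hypothetical topic name
             num_partitions=PARTITIONS,
             replication_factor=1)
])
```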

5. Curated list of commonly used Kafka commands

I have created a list of my most used command-line Kafka commands.
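
These are a few I reach for most often (assuming the stock distribution scripts under bin/ and a broker at localhost:9092; topic and group names are placeholders, and older versions use --broker-list instead of --bootstrap-server for the console producer):

```bash
# List all topics
bin/kafka-topics.sh --bootstrap-server localhost:9092 --list

# Describe a topic (partition count, leaders, in-sync replicas)
bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic my-topic

# Check per-partition lag for a consumer group
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group my-group

# Tail a topic from the beginning
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic my-topic --from-beginning

# Produce messages from stdin
bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic my-topic
```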

6. Infamous “Kafka is rebalancing” issue

This was one of the most confusing issues we faced: it wasn’t actually caused by Kafka, but this info log made us think so. Whenever a new consumer joins a consumer group and starts listening to a topic, Kafka assigns it partitions; likewise, when a consumer leaves a group, its assigned partitions need to be reassigned. So if you constantly see “Kafka is rebalancing” in your logs, it means consumers are constantly leaving or joining your consumer group, and the underlying issue might be the application/script running the Kafka consumer crashing prematurely.
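
Besides fixing premature crashes, note that slow message processing can also trigger rebalances: if a consumer takes longer than max.poll.interval.ms between polls, the broker assumes it has died and evicts it from the group. Here is a sketch of the relevant kafka-python settings (the values are illustrative, not recommendations):

```python
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "topicA",
    bootstrap_servers="localhost:9092",
    group_id="workers",
    session_timeout_ms=30_000,     # heartbeat window before the broker evicts us
    max_poll_interval_ms=600_000,  # allow up to 10 min of processing between polls
    max_poll_records=10,           # smaller batches finish before the deadline
)
```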

7. Common Pitfalls

  • Open file limit: Since Kafka is a log-based message broker, it writes to log files and opens file handles and connections in proportion to the number of topics you have, and further to how many partitions those topics have. Make sure the OS open-files limit is raised accordingly; see the sketch below.
  • Log retention: If this is set incorrectly, the disk on the server running Kafka can get exhausted, leading to downtime. Estimate how much data you will send through Kafka in an hour/day/week and set retention accordingly.
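
As a starting point, retention is controlled by broker settings in server.properties (the values below are illustrative, not recommendations), while the open-files limit is raised at the OS level, e.g. via ulimit -n or systemd’s LimitNOFILE:

```properties
# server.properties — time- and size-based retention (illustrative values)
log.retention.hours=168            # delete data older than 7 days
log.retention.bytes=107374182400   # or cap each partition's log at ~100 GB
log.segment.bytes=1073741824       # roll log segments at 1 GB
```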

Thank you for reading this; I hope you gained something.

If you have any questions or feedback please feel free to reach out.

Email: hardik.taneja700@gmail.com

LinkedIn: www.linkedin.com/in/hardik-taneja
